Running complex test suites against a financial transaction system produces a huge number of responses, both expected and unexpected. In this article, we describe our experience of using machine learning to reliably and automatically extract "that" unexpected response from a large number of same-type messages produced by a system under test. We describe the classification approaches and data manipulations we tried and explain our final choices; we also outline the business constraints and the design decisions behind the resulting tool. In addition, we address the task of classifying difference patterns between expected and actual responses in an attempt to provide an automated pre-judgement on the reason for a test failure, and we outline our clustering considerations and the results achieved.