Author: Adam G. Dobrakowski
Editing: Zuzanna Kwiatkowska
This post is the third and last in a series of posts about A/B testing. The others are:
- A/B Testing in Machine Learning. Part 1: How to prepare the A/B tests?
- A/B Testing in Machine Learning. Part 2: Most common problems
Here, I will show you common mistakes that inexperienced and even advanced ML engineers struggle with.
Usually, many people are involved in conducting A/B tests: ML engineers and data analysts, but also people responsible for deploying or operating a given component on the client’s side, and sometimes people who work with the model day to day. Most of them may not know the details of the methodology behind A/B testing.
Therefore, even for an experienced engineer, reliable A/B testing can be a challenge. Remember that you are the person most responsible for the correct methodology and interpretation of A/B tests, and you must be sensitive to possible errors.
No clear hypothesis, duration, or scope of the experiment
Remember that the entire experiment is based on a hypothesis. You can’t run an A/B test just to “see what happens”, or decide only after launching it how you are going to evaluate the algorithm’s results. Before starting, you should clearly state which metric you want to check and how long the experiment will run.
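One way to make the scope concrete is to compute, before launch, how many observations you need to detect the effect you care about. Below is a minimal sketch using statsmodels; the baseline and hoped-for click-through rates are hypothetical numbers you would replace with your own.

```python
# A minimal sketch of pre-registering an experiment before launch.
# The metric and the rates below are hypothetical examples.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.050   # current click-through rate (assumed)
expected_ctr = 0.055   # smallest improvement worth detecting (assumed)

# Sample size per group needed to detect this effect at the usual
# 5% significance level with 80% power.
effect_size = proportion_effectsize(expected_ctr, baseline_ctr)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:.0f}")
```

Dividing the required sample size by your daily traffic then gives you a defensible experiment duration, agreed on before anyone sees the first results.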
Testing many things at once
This problem arises when you test many different optimisations in one experiment (e.g. deploying several models at once for different use cases), but also when you track multiple metrics at once. When you test too many things simultaneously, it is hard to tell which one caused the success or failure. As a result, prioritising what to test is critical to the success of A/B testing.
To solve this problem, you can either limit yourself to the most important aspects first or consider A/B/C testing, where you test two changes against a shared baseline, so every pairwise comparison has exactly one element in common.
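If you go the A/B/C route, it helps to assign users to variants deterministically, so each user stays in the same group for the whole experiment. A minimal sketch is below; the variant names and the (roughly) equal split are illustrative assumptions.

```python
# A minimal sketch of deterministic A/B/C variant assignment.
# Variant names and the split are illustrative assumptions.
import hashlib

VARIANTS = ["A_control", "B_model_v1", "C_model_v2"]

def assign_variant(user_id: str) -> str:
    # A stable hash of the user ID maps to a bucket in [0, 100),
    # so the same user always lands in the same variant.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return VARIANTS[bucket * len(VARIANTS) // 100]

print(assign_variant("user-42"))  # stable across sessions
```

Hashing on the user ID means there is no assignment table to store, and the split is reproducible if you ever need to re-derive who saw what.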
Ignoring statistical significance
It is all too easy to believe in the success or failure of an experiment without waiting for it to finish within the agreed scope. This is a big mistake.
Remember that randomness is inherent in most phenomena, so if your test is not statistically significant, the observed difference in metrics does not matter. Don’t let critical decisions about your model be made without statistical significance.
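Checking significance does not require much code. Here is a minimal sketch for comparing two conversion rates with a two-proportion z-test; the counts are made-up illustrative numbers.

```python
# A minimal sketch of a significance check for two conversion rates.
# The counts below are made-up illustrative numbers.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 480]       # successes in groups A and B (assumed)
observations = [10000, 10000]  # sample size per group (assumed)

stat, p_value = proportions_ztest(conversions, observations)
if p_value < 0.05:
    print(f"Statistically significant difference (p = {p_value:.3f})")
else:
    print(f"Not significant (p = {p_value:.3f}); don't act on raw numbers")
```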
Uncertainty about the correctness of the experiment
In the previous article, I pointed out various problems in A/B testing. You must rule them out before starting the experiment (e.g. make sure that one group does not interfere with the other). You also need to verify that the model works correctly, that the results are collected correctly, and so on.
Maybe you feel I cover this too superficially, but believe me, this is a very common problem. Most failed A/B tests (at least in their first iteration) are probably caused by issues such as a non-working model, predictions that are delivered incorrectly or not used at all, or poorly collected results, not by the model itself being weak.
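Before analysing any results, it is worth running a few cheap sanity checks on the logged data. The sketch below assumes a hypothetical log format with "group", "prediction", and "outcome" columns; adapt the names to your own pipeline.

```python
# A minimal sketch of pre-analysis sanity checks on experiment logs.
# The file name and column names ("group", "prediction", "outcome")
# are hypothetical; adapt them to your own pipeline.
import pandas as pd

logs = pd.read_csv("experiment_logs.csv")

# 1. Both groups actually received traffic, in roughly the planned split.
print(logs["group"].value_counts(normalize=True))

# 2. The treatment group really received model predictions.
missing = logs.loc[logs["group"] == "B", "prediction"].isna().mean()
print(f"Missing predictions in treatment group: {missing:.1%}")

# 3. Outcomes were recorded for both groups.
print(logs.groupby("group")["outcome"].count())
```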
Practical advice
If you want to make sure that your A/B testing environment is set up correctly and that the two groups are indeed equivalent, you can run the same model in both groups (or even collect results for some time without any model) and check whether the results of the two groups converge. If they don’t, there is little chance you will get anything meaningful out of A/B testing different models.
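This is often called an A/A test. A minimal sketch of the idea is below, with simulated data standing in for real logs: since both groups run the same model, any significant difference points to a broken setup rather than a real effect.

```python
# A minimal sketch of an A/A check with simulated data: both groups
# run the same "model", so a significant difference signals a broken setup.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
group_a = rng.binomial(1, 0.05, size=10_000)  # simulated conversions
group_b = rng.binomial(1, 0.05, size=10_000)  # same underlying rate

_, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print(f"Groups differ (p = {p_value:.3f}) - investigate your setup!")
else:
    print(f"Groups look equivalent (p = {p_value:.3f})")
```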
Summary
I have presented the most common mistakes I have encountered in my work, and unfortunately they come up often. I hope this post will help you avoid them. Or maybe you have a different experience? Have you encountered other errors?
I’ll be happy to hear from you and talk!