Introducing new technical solutions can be risky, because predicting the consequences of such changes is often difficult. In this article, we will show how A/B testing, a technique typically used to validate UI and UX changes, can also be used to evaluate technical implementations.
The B2B chat application we are developing has strong support for offline mode: users can compose messages while offline, and they are synchronized as soon as a connection is available. This is crucial because we target emerging markets. However, there was no equivalent functionality for sending attachments (photos, videos, files, etc.).
In addition, the technical solution we had been using to send photos had not been updated for many years, although it was stable. We could either improve the existing solution or make a bigger change and adopt WorkManager, the library recommended by the Android team, which can automatically schedule and retry tasks until connectivity returns.
However, WorkManager's documentation is rather imprecise about the limits that apply to an application using it: it mentions that limits exist but does not specify exactly what they are, which makes it hard to judge whether they are sufficient. Additionally, in its early versions we ran into problems on some devices, although over time the library has proven stable.
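To give a concrete sense of what the WorkManager-based approach involves, here is a minimal sketch of scheduling an attachment upload that waits for connectivity and retries with backoff. The worker, the upload call, and the input key are hypothetical placeholders rather than our production code:

```kotlin
import android.content.Context
import androidx.work.*
import java.util.concurrent.TimeUnit

// Hypothetical worker: uploads a single attachment and lets WorkManager
// handle retries and rescheduling until connectivity is available.
class AttachmentUploadWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val uri = inputData.getString(KEY_URI) ?: return Result.failure()
        return try {
            uploadAttachment(uri)   // placeholder for the real network call
            Result.success()
        } catch (e: Exception) {
            Result.retry()          // WorkManager retries with backoff
        }
    }

    private suspend fun uploadAttachment(uri: String) {
        // Placeholder for the real upload logic (HTTP client, chunking, etc.).
    }

    companion object {
        const val KEY_URI = "attachment_uri"
    }
}

// Enqueue an upload that waits until the device is back online.
fun enqueueAttachmentUpload(context: Context, uri: String) {
    val request = OneTimeWorkRequestBuilder<AttachmentUploadWorker>()
        .setConstraints(
            Constraints.Builder()
                .setRequiredNetworkType(NetworkType.CONNECTED)
                .build()
        )
        .setBackoffCriteria(
            BackoffPolicy.EXPONENTIAL,
            WorkRequest.MIN_BACKOFF_MILLIS,
            TimeUnit.MILLISECONDS
        )
        .setInputData(workDataOf(AttachmentUploadWorker.KEY_URI to uri))
        .build()

    WorkManager.getInstance(context).enqueue(request)
}
```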
At the same time, delegating sending, retrying, and scheduling to an external tool was very tempting: we are a small team that prefers to focus its efforts on the most critical areas rather than on maintaining technical debt.
However, "we think" and "it seems to me" are just opinions that are not supported by facts. We decided to test which solution would be better. We decided to create an A/B test where:
- variant “A” would be the current implementation,
- variant “B” would be the new implementation using WorkManager.
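As a rough illustration (not our actual experimentation framework), variant assignment can be as simple as a deterministic bucket derived from the user ID; the experiment name and the 50/50 split below are hypothetical:

```kotlin
// Which upload path a given user gets.
enum class AttachmentUploadVariant { A_CURRENT, B_WORK_MANAGER }

// Hypothetical, deterministic 50/50 split: the same user always lands in the
// same variant. A real experimentation platform (Firebase A/B Testing, an
// in-house system, etc.) would replace this, but the idea is the same.
fun assignVariant(
    userId: String,
    experimentName: String = "attachment_upload_workmanager"
): AttachmentUploadVariant {
    val bucket = (userId + experimentName).hashCode().mod(100)
    return if (bucket < 50) AttachmentUploadVariant.A_CURRENT
    else AttachmentUploadVariant.B_WORK_MANAGER
}
```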
From the user interface perspective, there were practically no differences between the two versions of the application.
The key metrics we chose were the following (a sketch of how send attempts are reported appears after the list):
- Success ratio for sending an attachment.
- Number of attachments sent per person.
- Percentage of users who successfully sent at least one attachment.
- Crash ratio.
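To make these metrics measurable, every send attempt has to be reported to analytics together with the variant. A simplified sketch, where the analytics facade and the event name are hypothetical:

```kotlin
// Hypothetical analytics facade; in practice this would be Firebase Analytics,
// Amplitude, an in-house event pipeline, etc.
interface Analytics {
    fun track(event: String, properties: Map<String, Any>)
}

class AttachmentMetrics(private val analytics: Analytics) {

    // One event per send attempt. The success ratio is then
    // successful attempts / all attempts, split by variant.
    fun trackSendResult(variant: String, success: Boolean, attachmentType: String) {
        analytics.track(
            "attachment_send_result",
            mapOf(
                "variant" to variant,       // "A" or "B"
                "success" to success,
                "type" to attachmentType    // photo, video, file, ...
            )
        )
    }
}
```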
After implementation, the first A/B test quickly revealed a critical bug in variant “B”: a crash that had not been detected during testing and occurred only on specific Android versions. The A/B test allowed us to revert all users to variant “A”, so the bug affected only two users, and only a few times.
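This is where the experiment setup also pays off operationally: variant resolution can be gated by a remote flag, so every user can be pushed back to variant “A” without shipping a new release. Building on the assignment sketch above (the flag name is hypothetical):

```kotlin
// Hypothetical remote kill switch on top of the local assignment above:
// if the experiment is disabled remotely, every user falls back to variant A.
fun resolveVariant(
    userId: String,
    remoteFlags: Map<String, Boolean>
): AttachmentUploadVariant {
    val enabled = remoteFlags["attachment_upload_workmanager_enabled"] ?: false
    return if (enabled) assignVariant(userId) else AttachmentUploadVariant.A_CURRENT
}
```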
After fixing the bug and releasing a new version, we ran the A/B test again.
After two weeks, we gathered the following results:
Metric | A | B | Description |
---|---|---|---|
Success ratio for sending an attachment | 88% | 94% | Increase of 6.74%. Statistically significant at the 99% confidence level. |
Number of attachments sent per person | 1.83 | 1.96 | Increase of 7%, in line with the previous result. Statistically significant at the 99% confidence level. |
Percentage of users who successfully sent at least one attachment | 9.09% | 9.11% | No statistically significant difference; the sample size was too small to detect one for this metric. |
Crash ratio | Baseline | -0.078% | Statistically insignificant decrease in crash ratio. |
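For reference, “statistically significant at the 99% confidence level” for the success ratio can be checked with a standard two-proportion z-test. A minimal sketch, with made-up sample sizes since only the ratios are reported above:

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt

// Two-proportion z-test: is the difference between two success ratios
// statistically significant?
fun twoProportionZ(successA: Int, totalA: Int, successB: Int, totalB: Int): Double {
    val pA = successA.toDouble() / totalA
    val pB = successB.toDouble() / totalB
    val pooled = (successA + successB).toDouble() / (totalA + totalB)
    val standardError = sqrt(pooled * (1 - pooled) * (1.0 / totalA + 1.0 / totalB))
    return abs(pA - pB) / standardError
}

fun main() {
    // Hypothetical sample sizes with the observed 88% vs 94% success ratios.
    val z = twoProportionZ(successA = 880, totalA = 1000, successB = 940, totalB = 1000)
    // |z| > 2.576 corresponds to significance at the 99% confidence level.
    println("z = ${"%.2f".format(z)}, significant at 99%: ${z > 2.576}")
}
```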
Summary of the experiment:
We see a significant increase in both the success ratio and the number of attachments sent per person, by 6.74% and 7% respectively, while the application's crash ratio did not increase. We do not observe a significant increase in the percentage of users who managed to send at least one attachment. This is because users whose attachment failed to send usually sent other attachments successfully, so they are still counted in the success group for this metric. That is why, even if we extended the experiment, we would probably only collect more users who give up on this functionality after a single failed attachment attempt.
However, the key metrics were met, so we decided to roll out the new implementation to all users.
What we gain by A/B testing technical implementations:
- We can quickly roll back a faulty technical implementation without affecting the experience of current users.
- We don't have to rely on a gut feeling that the new technical implementation is better; the data confirms that it is.
- We can be sure that the functionality works, not just during internal testing, but with real users.