By Pavel Dmitriev
The presenters' idea is to build a large experimentation platform for A/B testing features, in order to decide whether a feature should be included in the software or not. It sounds quite simple, but A/B testing turns out to be very difficult, because random is not necessarily random. Therefore, they first run an A/A test and then an A/B test, just to make sure the experiment setup is sound. However, this is not ideal. A better solution is to run the A/A test retrospectively, i.e., do the A/B test, and then split the results of the A group into an A/A comparison to validate the results. An even better solution is SeedFinder: perform the split a thousand times with different seeds, and then use the best split. This introduced quite some complications for the architecture. For example, reproducing the offline splitting of users is not a trivial task.
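As a rough illustration of the SeedFinder idea described above, the sketch below tries many candidate seeds and keeps the one whose A/B split is most balanced on a pre-experiment metric. The hash scheme, the simple mean-difference imbalance measure, and the user data are assumptions for the example, not the actual implementation from the talk (which would use proper statistical tests).

```python
import hashlib
import random
from statistics import mean

def assign(user_id: str, seed: int) -> str:
    """Deterministically assign a user to 'A' or 'B' for a given seed."""
    digest = hashlib.md5(f"{seed}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def seed_imbalance(users: dict, seed: int) -> float:
    """Difference in a pre-experiment metric between the two groups.

    A good seed, in the SeedFinder spirit, is one where the groups
    look indistinguishable before the treatment starts.
    """
    a = [m for uid, m in users.items() if assign(uid, seed) == "A"]
    b = [m for uid, m in users.items() if assign(uid, seed) == "B"]
    return abs(mean(a) - mean(b))

def find_best_seed(users: dict, candidates: int = 1000) -> int:
    """Try many candidate seeds and keep the one with the smallest imbalance."""
    return min(range(candidates), key=lambda s: seed_imbalance(users, s))

# Hypothetical usage: users mapped to a pre-experiment metric (e.g. sessions/week).
users = {f"user-{i}": random.gauss(10, 2) for i in range(10_000)}
best = find_best_seed(users)
print(f"best seed: {best}, imbalance: {seed_imbalance(users, best):.4f}")
```

Because the assignment is a deterministic function of the seed and the user ID, the chosen split can be reproduced offline later, which is exactly the part the presenters flagged as non-trivial.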
Another problem is variant assignment: how do you split the users? Flipping a coin does not work, because you want a persistent assignment; people hop on and off the application. The solution was to hash user IDs into buckets on a number line, with each bucket assigned to an A or B group. Another aspect they considered is overlapping experiments: do you want interactions between experiments? For example, two feature experiments may influence each other's outcomes. In the system this is implemented via the isolation-group principle: if two experiments share an isolation group, they will be non-overlapping.
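A minimal sketch of how hash-based bucketing and isolation groups could fit together is shown below. The number of buckets, the experiment names, the claimed bucket ranges, and the even/odd variant split are all illustrative assumptions; the point is that assignment is stable per user and that experiments sharing an isolation group draw from disjoint bucket ranges, so they never overlap.

```python
import hashlib

NUM_BUCKETS = 1000  # size of the number line; an assumed value for this sketch

def bucket(user_id: str, isolation_group: str) -> int:
    """Hash a user into a stable bucket on the number line, per isolation group."""
    digest = hashlib.md5(f"{isolation_group}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Hypothetical experiments. Those in the same isolation group claim
# disjoint bucket ranges, so no user is ever in two of them at once.
EXPERIMENTS = {
    "new-ranker":   {"isolation_group": "search", "buckets": range(0, 100),   "variants": ("A", "B")},
    "new-snippets": {"isolation_group": "search", "buckets": range(100, 200), "variants": ("A", "B")},
}

def variant(user_id: str, experiment: str):
    """Return the user's variant, or None if the user falls outside the experiment."""
    cfg = EXPERIMENTS[experiment]
    b = bucket(user_id, cfg["isolation_group"])
    if b not in cfg["buckets"]:
        return None
    # Within the claimed range, alternate buckets between the variants.
    return cfg["variants"][b % len(cfg["variants"])]

# The same user gets a persistent variant in one experiment and is excluded
# from the other, because the two experiments own different bucket ranges.
print(variant("user-42", "new-ranker"), variant("user-42", "new-snippets"))
```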
They did this on a large scale, across different Microsoft products.