How often can Thomas Bayes check the results of his A/B test?

Stopping your A/B test once you reach significance is a great way to find bogus results…if you’re a frequentist.  Checking before you have the statistical power to detect the phenomenon will often lead to false positives if you rely on classical/frequentist methods.  A Bayesian with an informative null-result prior can avoid these problems.  Let’s think about why.

Suppose we are testing a difference of means* on the time a user spends on a site.  To the frequentist, the test result is entirely a function of the data.  This means that as we run the test and check as we go, the first time the data (randomly) suggests significance, we probably don’t really have a significant result.  A threshold of p < .05 means that when there is no effect, any single look has a 5% chance of falsely declaring significance; by checking regularly we take many such looks, and we can easily turn what should be a null result into what appears to be a significant one.  Such is folly.
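
To see this concretely, here is a minimal R simulation (the setup, a two-sample t-test peeked at every 10 users from 50 to 500, is my own illustrative choice, not from any particular test):

set.seed(1)
n.sims  <- 1000
n.max   <- 500
peek.at <- seq(50, n.max, by = 10)

false.positive <- replicate(n.sims, {
  a <- rnorm(n.max)  # no true effect: both groups drawn from the same distribution
  b <- rnorm(n.max)
  # declare "significance" if any peek gives p < .05
  any(sapply(peek.at, function(n) t.test(a[1:n], b[1:n])$p.value < .05))
})
mean(false.positive)  # comes out well above the nominal .05

Even with no effect at all, peeking at every interim look drives the overall false positive rate far above 5%.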

Now consider the Bayesian using an informative prior that the difference of means is zero.  The posterior is a balance between the prior and the data.  If there is no effect, the prior pulls the randomly fluctuating data back toward the null result, so the test fluctuates less as we check it and is much less likely to give a false positive.
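
To make that balance explicit, here is a sketch under a conjugate-normal assumption (a normal prior on the difference and a normal likelihood with known standard error; this particular setup is my illustration): the posterior mean is a precision-weighted average of the prior mean, zero, and the observed difference.

posterior.mean <- function(observed.diff, se, prior.sd) {
  prior.precision <- 1 / prior.sd^2
  data.precision  <- 1 / se^2
  # precision-weighted average of the prior mean (zero) and the observed difference
  data.precision * observed.diff / (prior.precision + data.precision)
}

posterior.mean(observed.diff = 2, se = 2,   prior.sd = 1)  # noisy data: shrunk hard toward zero (0.4)
posterior.mean(observed.diff = 2, se = 0.1, prior.sd = 1)  # precise data: the prior barely matters (~1.98)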

Note that the prior has to be informative to protect us from false positives.  If we use a flat prior, there is nothing to balance out the fluctuations in the data, and we just recover the (false-positive-prone) frequentist result.

For my A/B tests of the difference of means* I like to use a normal prior on the difference with mean zero and a standard deviation chosen so that there is only a 5% chance of the effect being larger in magnitude than the minimum meaningful effect.  R code to calculate that standard deviation might look like

# puts 5% of the prior's mass outside ±MinimumMeaningfulEffect, since qnorm(.05/2) ≈ -1.96
prior.sd <- - MinimumMeaningfulEffect / qnorm(.05/2)
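
As a quick sanity check (using a hypothetical MinimumMeaningfulEffect of 5; any value works), the prior's two-sided tail mass beyond the minimum meaningful effect comes out to exactly 5%:

MinimumMeaningfulEffect <- 5  # hypothetical value, purely for illustration
prior.sd <- - MinimumMeaningfulEffect / qnorm(.05/2)
2 * pnorm(-MinimumMeaningfulEffect, mean = 0, sd = prior.sd)  # returns .05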

What’s the downside?  Your results are the weighted average of the data and the prior, so it will take more data to get a positive result.  If you have very little data, this is not the approach for you: design your study well using power calculations and be patient.  However, if you have a lot of data, this can be just the ticket.
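
If you are in the little-data regime, base R's power.t.test will tell you how many users you need per group before you start; the delta and sd below are made-up numbers for illustration:

# users per group to detect a difference of 5 (sd = 20) with 80% power at the .05 level
power.t.test(delta = 5, sd = 20, sig.level = .05, power = .80)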

Trust your A/B tests and check them as you go — but do it right.

* Or medians — they’re just as easy to model in a Bayesian context.
