A good A/B test tool should be able to reach the following conclusions:
1. A beat B or B beat A, so you can stop.
2. Neither A nor B beat the other, so you can stop.
3. We can’t conclude #1 or #2 but you’ll need about m more data points to conclude one of them.
The tools I’ve found for analyzing A/B tests can all answer #1. Some of the better ones can answer #3. None of the tools I’ve seen will answer #2 and tell you that A and B are not meaningfully different and that you have enough data to be pretty sure about that.
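This has to do with the way most people use hypothesis testing. Stats students are taught to test the simple hypothesis “Is the amount B improves over A positive?” They get a p-value (roughly, how surprising the data would be if the true effect were zero or negative) and go from there. The problem is that the probability that the effect is precisely zero is zero. A test framed this way can never conclude that A and B are the same: failing to find a difference only means you don’t have enough data yet, and with enough data some difference, however trivial, will eventually turn up as significant.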
Here’s a fix for that: Choose a minimum meaningful effect and test the hypothesis that the absolute value of the effect is smaller than that value.
Here’s an example. Recently I was testing the hypothesis that version B of an app improved daily listening times over version A of the same app. Daily listening times for this app are around 40 minutes and the cost difference for implementing B over A is small, so our product owner and I decided that any change of less than 30 seconds was not meaningful. This left me with three hypotheses to test:
- Listening times for B are more than 30 seconds longer per day than for A (B wins).
- Listening times for A are more than 30 seconds longer per day than for B (A wins).
- Listening times for B and A are within 30 seconds of each other (tie).
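As a rough illustration of the three hypotheses above (not the R/shiny code from this post), here is a minimal Python sketch. It assumes the samples are large enough for a normal approximation, treats the "tie" case as two one-sided tests, and uses hypothetical names (`three_way_test`, `margin`, `alpha`) and made-up data.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def three_way_test(a, b, margin, alpha=0.05):
    """Return (verdict, p_values) for samples a and b.

    margin is the minimum meaningful effect, in the same units as the data.
    Uses a large-sample z approximation (an assumption of this sketch).
    """
    diff = mean(b) - mean(a)  # estimated effect of B over A
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist()

    p_values = {
        # H1: diff > margin, tested against H0: diff <= margin
        "B wins": 1 - z.cdf((diff - margin) / se),
        # H1: diff < -margin, tested against H0: diff >= -margin
        "A wins": z.cdf((diff + margin) / se),
        # H1: |diff| < margin, via two one-sided tests; both must reject,
        # so the overall p-value is the larger of the two
        "tie": max(z.cdf((diff - margin) / se),
                   1 - z.cdf((diff + margin) / se)),
    }
    for verdict, p in p_values.items():
        if p < alpha:
            return verdict, p_values
    return "undecided", p_values


if __name__ == "__main__":
    # Made-up listening times in minutes per day; margin of 0.5 = 30 seconds
    a = [40 + (i % 11) - 5 for i in range(2000)]
    b = [x + 2 for x in a]
    verdict, p_values = three_way_test(a, b, margin=0.5)
    print(verdict, p_values)
```

For any `alpha` below 0.5, at most one of the three tests can reject, so the order in which the verdicts are checked does not matter.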
I can test all of these hypotheses with standard hypothesis tests. If the data don’t yet support any of them, I can take the observed mean difference in times as the true one (it is our best estimate of the mean, given the data) and do a power calculation (not the standard calculation, but pretty straightforward) to estimate how much more data I need.
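One way such a power calculation might look is sketched below in Python. This is a guess at the idea, not the calculation the tool actually uses. It assumes equal group sizes, a normal approximation, and that the observed difference is the true one, and it only considers the nearest boundary (+margin or -margin). For a tie, where both one-sided tests must pass, that makes the estimate optimistic. The function names are hypothetical.

```python
from math import ceil, inf
from statistics import NormalDist


def n_per_group_needed(diff, var_a, var_b, margin, alpha=0.05, power=0.8):
    """Approximate observations per group needed to reach a conclusion.

    diff is the observed mean difference (B minus A), taken as the truth.
    """
    # Distance from the assumed true effect to the nearest decision boundary
    gap = abs(abs(diff) - margin)
    if gap == 0:
        return inf  # an effect sitting exactly on the boundary is never resolved
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha) + z.inv_cdf(power)
    # Solve gap / sqrt((var_a + var_b) / n) = z_sum for n
    return ceil(z_sum ** 2 * (var_a + var_b) / gap ** 2)


def more_data_needed(n_current, diff, var_a, var_b, margin, **kwargs):
    """Additional observations per group beyond what is already collected."""
    n_needed = n_per_group_needed(diff, var_a, var_b, margin, **kwargs)
    return max(0, n_needed - n_current)
```

The closer the observed difference sits to the 30-second threshold, the smaller `gap` becomes and the faster the required sample size grows.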
All three questions answered.
I implemented this in R using the ‘shiny’ package to make it an interactive Web-based tool. Sample code is here on GitHub. You’ll need a server with shiny-server installed to use and test it; I found it trivial to install on my Ubuntu server.
After gradually migrating most of my workflow from Subversion to GitHub I discovered an itty, bitty, tiny, huge freakin’ problem. Part of my old workflow involved me storing code I would use again and again in a public repository then source-ing the code directly into R as needed. It also made this code easy for me to share with others, especially students and collaborators. No problem.
GitHub is superior to Subversion in notable ways, but that’s not our topic here. GitHub does make it easy to read source code directly from the site as plain text. Here’s an example of an address for a bit of code I use almost daily to give me a clean R session. Continue reading ‘Make GitHub R Code Available within R’ »
Political scientist Andrew Hacker recently asked “Is Algebra Necessary?” and the response has, unfortunately, been predictable.
Those in society’s minority who did well in math courses are “shocked” at the suggestion that we change the typical math curriculum. The teaching may be “dismal” but algebra is a “foundation stone” in developing critical thinking skills. “It teaches one how to think.” It’s a little amusing but mostly disheartening to see folks who claim to support more challenging math standards fall back on strawman arguments, condescension, sarcasm and, my favorite, math errors in their arguments.
Those in society’s majority who did poorly in math tended to respond with relief at the suggestion of dropping algebra, although there are a few PMSD (post-mathematics stress disorder) victims whose career paths were altered by failing math and who still carry the associated baggage and resentment.
Let’s set aside the hysterics (“We are breeding a nation of morons”) and give both sides of this debate a fair shake, shall we? Continue reading ‘Is Algebra Necessary? Yes and No.’ »
Having recently committed myself to earning my living as a Data Scientist, I’ve been reading anything I can find to guide my self-education. So I just spent the last hour reading and mulling over DJ Patil’s article/report Building Data Science Teams (BDST henceforth) which is available free from various outlets; I read the Kindle version. (Disclaimer: DJ is a friend and occasional drinking buddy.) Continue reading ‘Review of “Building Data Science Teams” by DJ Patil’ »
Yesterday I got back from a great APSA in Seattle. My undergraduate students were despondent at me having to cancel class Thursday so I could attend. A few were curious about what happens at a scientific conference and asked about the structure. I explained that there would be several thousand political scientists at this conference and that most of the planned interaction would take place in panels. Continue reading ‘Planned Serendipity’ »
Next time you see someone “misinterpret” a confidence interval, wait a second. They’re actually probably okay. Continue reading ‘Regain your confidence (intervals)’ »
What is a methods-careful practitioner to do when the number of observations ($n$) is small? I don’t know how many times I’ve been told by a well-meaning Bayesian some variation of:
Bayesian estimation addresses the “small $n$ problem”
This is right and wrong. Continue reading ‘Bayes fixes small n, doesn’t it?’ »
How do we show a statement about politics is true? Analytic formal modelers suggest one way:
Continue reading ‘Truth and Choices: Computational v. Analytical formal models’ »
I served in the US Navy for a few months in 1986, five years in the early 90s, and another year and a half in the reserves. I was never asked to shoot someone. I never pulled a trigger when the weapon was aimed at a person. I served during, but not “in” the first Gulf War. I served during “peacetime”, or at least that’s how I thought about it. However, over the last few months I have been thinking more about my time in uniform, realizing the lasting and deep effects that experience had on me. Continue reading ‘We all carry the scars’ »
My dad and I went to the recent Brown/Whitman California gubernatorial debate here at UC Davis. It was fun seeing “democracy” live and up close. One of the candidates twice repeated an old saw:
One definition of insanity is doing the same thing over and over and expecting different results.
Continue reading ‘Change of Intuition about the Definition of Insanity’ »