Getting Past Statistical Significance: Foundations of AB Testing and Experimentation


How often is AB Testing reduced to the following question:  ‘what sample size do I need to reach statistical significance for my AB Test?’  On the face of it, this question sounds reasonable. However, unless you know why you want to run a test at a particular significance level, or what the relationship is between sample size and that significance level, then you are most likely missing some basic concepts that will help you get even more value out of your testing programs.

There are also a fair number of questions around how to run AB Tests; what methods are best; and the various ‘gotchas’ to look out for.  In light of this, I thought it might be useful to step back, and just review some of the very basics of experimentation and why we are running hypothesis tests in the first place. This is not a how-to guide, nor a collection of different types of tests to run, nor even a list of best practices.

What is the problem we are trying to solve with experiments?

We are trying to isolate the effect on some objective result, if any, of taking some action in our website (mobile apps, call centers, etc.). For example, if we change the button color to blue rather than red, will that increase conversions, and if so, by how much?

What is an AB test?

AB and multivariate tests are versions of randomized controlled trials (RCT). An RCT is an experiment where we take a sample of users and randomly assign them to control and treatment groups. The experimenter then collects performance data, for example conversions or purchase values, for each of the groups (control, treatment).

I find it useful to think of RCTs as having three main components:  1) data collection; 2) estimating effect sizes; and 3) assessing our uncertainty of the effect size and mitigating certain risks around making decisions based on these estimates.

 

Collection of the Data

What do you mean by sample?

A sample is a subset of the total population under investigation. Keep in mind that in most AB testing situations, while we randomly assign users to treatments, we don’t randomly sample. This may seem surprising, but in the online situation the users present themselves to us for assignment (e.g. they come to the home page).  This can lead to selection bias if we don’t try to account for this nonrandom sampling in our data collection process. Selection bias will make it more difficult, if not impossible, to draw conclusions about the population we are interested in from our test results. One often effective way to mitigate this is by running our experiments over full weeks, or months etc. to try to ensure that our samples look as much as possible like our user/customer population.

Why do we use randomized assignments?

Because of “Confounding”. I will repeat this several times, but confounding is the single biggest issue in establishing a causal relation between our treatments and our performance measure.

What is Confounding?

Confounding is when the treatment effect gets mixed together with the effects from any other outside influence. For example, suppose we are interested in the treatment effect of Button Color (or Price, etc.) on conversion rate (or average order size, etc). When assigning users to button color we give everyone who visits on Sunday the ‘Blue’ button treatment, and everyone on Monday the ‘Red’ button treatment. But now the ‘Blue’ group combines Sunday users with the Blue Button, and the ‘Red’ group combines Monday users with the Red Button. Our data looks like this:

Sunday:Red 0%   Monday:Red 100%
Sunday:Blue 100%   Monday:Blue 0%

We have mixed together the data such that any of the user effects related to day are tangled together with the treatment effects of button color.

What we want is for each of our groups to both: 1) look like one another except for the treatment selection (no confounding); and 2) to look like the population of interest (no selection bias).

If we randomly assign the treatments to users, then we should on average get data that looks like this:

Sunday:Red 50%   Monday:Red 50%
Sunday:Blue 50%   Monday:Blue 50%

Where each day we have a 50/50 split of button color treatments.  Here the relationship between day and button assignment is broken, and we can estimate the average treatment effects without having to worry as much about influences of outside effects (this isn’t perfect of course, since it holds only on average – it is possible due to sampling error that for any given sample we don’t have a balanced sample over all of the cofactors/confounders.)
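As a rough illustration of the two tables above, here is a small Python simulation. The function names, the user counts, and the seed are all made up for the example; it only shows how a coin-flip assignment breaks the link between day and button color, whereas assigning by day ties them together completely.

```python
import random
from collections import Counter

def assignment_shares(assign, n_per_day=10_000, seed=7):
    """Share of each day's users that lands in each button group."""
    rng = random.Random(seed)
    counts = Counter()
    for day in ("Sunday", "Monday"):
        for _ in range(n_per_day):
            counts[(day, assign(day, rng))] += 1
    return {key: count / n_per_day for key, count in counts.items()}

def by_day(day, rng):
    # Confounded: the day fully determines the button color.
    return "Blue" if day == "Sunday" else "Red"

def randomized(day, rng):
    # Randomized: a coin flip that ignores the day.
    return "Blue" if rng.random() < 0.5 else "Red"

confounded = assignment_shares(by_day)
balanced = assignment_shares(randomized)
print("assigned by day:", confounded)
print("randomized:     ", balanced)
```

The randomized shares will not be exactly 50% in any one run, which is the sampling error mentioned above; they only balance on average.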

Of course, this mixing need not be this extreme – it is often much more subtle. When Ronny Kohavi advises to be alert to ‘sample ratio mismatch’, (See: https://exp-platform.com/hbr-the-surprising-power-of-online-experiments/), it is because of confounding. For example, say a bug in the treatment arm breaks the experience in such a way that some users don’t get assigned. If this happens only for certain types of users, perhaps just for users on old browsers, then we no longer have a fully randomized assignment.  The bug breaks randomization and lets the effect of old browsers leak in and mix with the treatment effect.
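One common way to check for a sample ratio mismatch is a chi-square goodness-of-fit test on the observed group sizes. The sketch below is a minimal version using only the standard library; the function name and the user counts are hypothetical, and it assumes a planned two-way split.

```python
import math

def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on the split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for a planned 50/50 split.
print(srm_p_value(50_000, 50_300))  # small imbalance
print(srm_p_value(50_000, 48_500))  # larger imbalance, e.g. users lost to a bug
```

A very small p-value says the split is unlikely under the planned ratio, which is a cue to look for an assignment bug before trusting the treatment effect.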

Confounding is the main issue one should be concerned about in AB Testing.  Get this right and you are most of the way there – everything else is secondary IMO.

Estimating Treatment Effects

We made sure that we got random assignments, now what?

Well, one thing we might want to do is use our data to get an estimate of the conversion rate (or AOV etc.) for each group in our experiment.  The estimate from our sample will be our best guess of what the true treatment effect will be for the population under study.

For most simple experiments we usually just calculate the treatment effect using the sample mean from each group and subtract the control from the treatment:  (Treatment Conversion Rate) – (Control Conversion Rate) = Treatment Effect.  For example, if Blue Button is the treatment with an estimated conversion rate of 0.10 (10%) and Red Button is the control with an estimated conversion rate of 0.11 (11%), then the estimated treatment effect is 0.10 – 0.11 = -0.01.

Estimating Uncertainty

Of course the goal isn’t to calculate the sample conversion rates, the goal is to make statements about the population conversion rates.  Our sample conversion rates are based on the particular sample we drew. We know that if we were to have drawn another sample, we almost certainly would have gotten different data, and would calculate a different sample mean (if you are not comfortable with sampling error, please take a look at https://conductrics.com/pvalues).

One way to assess uncertainty is by estimating a confidence interval for each treatment and control group’s conversion rate. The main idea is that we construct an interval using a procedure that is guaranteed to contain, or trap, the true population conversion rate with a frequency that is determined by the confidence level. So if we repeated the experiment many times, the 95% confidence intervals would contain the true population conversion rate in 95% of those repetitions.  We could also calculate the difference in conversion rates between the treatment and control groups and calculate a confidence interval around this difference.
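To make this concrete, here is a minimal sketch of both intervals. It uses the normal-approximation (Wald) interval, which is only one of several ways to build a confidence interval. The function names and the counts (Blue as treatment with 1,000 conversions from 10,000 users, Red as control with 1,100 from 10,000) are hypothetical.

```python
from statistics import NormalDist

def proportion_ci(conversions, users, confidence=0.95):
    """Normal-approximation (Wald) interval for one conversion rate."""
    rate = conversions / users
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * (rate * (1 - rate) / users) ** 0.5
    return rate - half_width, rate + half_width

def difference_ci(conv_t, n_t, conv_c, n_c, confidence=0.95):
    """Wald interval for (treatment rate) - (control rate)."""
    rate_t, rate_c = conv_t / n_t, conv_c / n_c
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (rate_t * (1 - rate_t) / n_t + rate_c * (1 - rate_c) / n_c) ** 0.5
    diff = rate_t - rate_c
    return diff - z * se, diff + z * se

# Hypothetical counts: Blue (treatment) vs Red (control).
print("Blue rate:", proportion_ci(1_000, 10_000))
print("Red rate: ", proportion_ci(1_100, 10_000))
print("Blue - Red:", difference_ci(1_000, 10_000, 1_100, 10_000))
```

The interval around the difference is usually the more useful of the two, since it speaks directly to the size of the treatment effect.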

Notice that so far we have been able to: 1) calculate the treatment effect; and 2) get a measure of uncertainty in the size of the treatment effect with no mention of hypothesis testing.

Mitigating Risk

If we can estimate our treatment effect sizes and get a measure of our uncertainty around the size, why bother running a test in the first place? Good question.  One reason to run a test is to control for two types of error we can make when taking an action on the basis of our estimated treatment effect.

Type 1 Error – a false positive.

  1. One Tail: We conclude that the treatment has a positive effect size (it is better than the control) when it doesn’t have a real positive effect (it really isn’t any better).
  2. Two Tail: We conclude that the treatment has a different effect than the control (it is either strictly better or strictly worse) when it doesn’t really have a different effect than the control.

Type 2 Error – a false negative.

  1. One Tail: We conclude that the treatment does not have a positive effect size (it isn’t better than the control) when it does have a real positive effect (it really is better).
  2. Two Tail: We conclude that the treatment does not have a different effect than the control (it isn’t either strictly better or strictly worse) when it really does have a different effect than the control.

How to specify and control the probability of these errors?

Controlling Type 1 errors – the probability that our test will make a Type 1 error is called the significance level of the test. This is the alpha level you have probably encountered. An alpha of 0.05 means that we want to run the test so that we only make Type 1 errors up to 5% of the time when there is no real effect. You are of course free to pick whatever alpha you like – perhaps an alpha of 1% may make more sense for your use case, or maybe an alpha of 0.1%. It is totally up to you! It all depends on how damaging it would be for you to take some action based on a positive result, when the effect doesn’t exist. Also keep in mind that this does NOT mean that if you get a significant result, that only 5% (or whatever your alpha is) of the time it will be a false positive.   The rate that a significant result is a false positive will depend on how often you run tests that have real effects.  For example, if you never run any experiments where the treatments are actually any better than the control, then all of your significant results will be false positives.  In this worst case situation, you should expect to see significant results in up to 5% (alpha%) of your tests, and all of them will be false positives (Type 1 errors).
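A little arithmetic shows how much the share of false positives among significant results depends on how often your treatments have real effects. The sketch below assumes every test runs at the same alpha and power; the function name and the example shares are made up for illustration.

```python
def false_positive_share(alpha, power, share_with_real_effect):
    """Expected share of significant results that are false positives."""
    false_positives = alpha * (1 - share_with_real_effect)
    true_positives = power * share_with_real_effect
    return false_positives / (false_positives + true_positives)

# Alpha of 0.05 and power of 0.8, with different (hypothetical) shares
# of tests where the treatment really is better.
for share in (0.0, 0.1, 0.5):
    print(share, round(false_positive_share(0.05, 0.8, share), 3))
```

When no treatment has a real effect the share is 100%, which is the worst case described above. When one test in ten has a real effect, about 36% of significant results are still false positives, even though alpha is 5%.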

You should spend as much time as needed to grok this idea, as it is the single most important idea you need to know in order to thoughtfully run your AB Tests.

Controlling Type 2 errors – this is based on the power of the test, which is in turn based on the beta (power = 1 – beta). For example, a beta of 0.2 (Power of 0.8) means that of all of the times that the treatment is actually superior to the control, your test would fail, on average, to discover this up to 20% of the time. Of course, like the alpha, it is up to you, so maybe power of 0.95 makes more sense, so that you make a Type 2 error only up to 5% of the time.  Again, this will depend on how costly you consider this type of mistake.  This is also important to understand well, so spend some time thinking about what this means.  If this isn’t totally clear, see a more detailed explanation of Type 1 and Type 2 errors here https://conductrics.com/do-no-harm-or-ab-testing-without-p-values/.

What is amazing, IMO, about hypothesis tests is that, assuming that you collect the data correctly, you are guaranteed to limit the probability of making these two types of errors based on the alpha and beta you pick for the test. Assuming we are mindful about confounding, all we need to do is collect the correct amount of data. When we run the test after we have collected our pre-specified sample, we can be assured that we will control these two errors at our specified levels.

 


 

What about Sample Size?

There is a relationship between alpha, beta, and the associated sample size. In a very real way, the sample size is the payment we must make in order to control Type 1 and Type 2 errors. Increasing the error control on one means you either have to lower the control on the other or increase the sample size.  This is what power calculators are doing under the hood — calculating the sample size needed, based on a minimum treatment effect size, and desired alpha and beta.
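As a rough sketch of what a power calculator might be doing, here is the textbook normal-approximation formula for comparing two conversion rates with equal group sizes. Real calculators can differ in the details (pooled variance, continuity corrections), so treat the function name, defaults, and output as illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline_rate, min_effect, alpha=0.05, power=0.8,
                          two_tailed=True):
    """Users needed in each group to detect min_effect (an absolute difference)."""
    p1 = baseline_rate
    p2 = baseline_rate + min_effect
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    z_beta = z(power)  # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)

# Hypothetical inputs: 10% baseline, looking for a lift of 1 percentage point.
print(sample_size_per_group(0.10, 0.01))                # alpha 0.05, power 0.8
print(sample_size_per_group(0.10, 0.01, alpha=0.01))    # tighter Type 1 control
print(sample_size_per_group(0.10, 0.01, power=0.95))    # tighter Type 2 control
```

With these inputs the first call comes out at roughly 14,700 users per group, and tightening either alpha or beta pushes the number up. That is the payment in concrete terms.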

 

What about Sample Size for continuous conversion values, like average order value?

Calculating sample sizes for continuous conversion variables is really the same as for conversion rates/proportions. For both we need to have some guess of both the mean of the treatment effect and the standard deviation of the effect.  However, because the standard deviation of a proportion is determined by its mean, we don’t need to bother to provide it for most calculators. For continuous conversion variables, on the other hand, we need to have an explicit guess of the standard deviation in order to conduct the sample size calculation.
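The sketch below shows both halves of that point under the usual normal approximation: the standard deviation of a conversion (0/1) variable follows directly from its rate, while for a continuous variable it has to be supplied. The function names and the order-value numbers are hypothetical, and equal variance in both groups is assumed.

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportion_sd(rate):
    """Standard deviation of a 0/1 conversion variable with the given rate."""
    return sqrt(rate * (1 - rate))

def sample_size_for_means(std_dev, min_effect, alpha=0.05, power=0.8):
    """Users per group for a two-tailed comparison of two means."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * std_dev / min_effect) ** 2)

print(proportion_sd(0.10))  # fixed by the rate, so no separate guess is needed

# Hypothetical: order values with a guessed standard deviation of 50,
# looking for a difference of 5 in average order value.
print(sample_size_for_means(std_dev=50.0, min_effect=5.0))
```

Because the standard deviation enters squared, doubling the guess roughly quadruples the required sample size, so it is worth spending some effort on that guess.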

What if I don’t know the standard deviation?

This isn’t exact, and in fact it might not be that close, but in a pinch, you can use the range rule as a hack for the standard deviation.  If you know the minimum and maximum values that the conversion variable can take (or some reasonable guess), you can use standard deviation ≈ (Max-Min)/4 as a rough guess.
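Here is the range rule as a tiny Python function, compared against the sample standard deviation of a small, made-up list of order values. The data and the function name are purely illustrative; on skewed data, such as order values with a few very large purchases, the guess can be much further off.

```python
from statistics import stdev

def range_rule_sd(minimum, maximum):
    """Rough standard deviation guess: a quarter of the range."""
    return (maximum - minimum) / 4

# Hypothetical order values.
order_values = [12, 25, 30, 18, 45, 60, 22, 38, 27, 33]
guess = range_rule_sd(min(order_values), max(order_values))
actual = stdev(order_values)
print(f"range-rule guess: {guess:.1f}, sample standard deviation: {actual:.1f}")
```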

What if I make a decision before I collected all of the planned data?

You are free to do whatever you like. Trust me, there is no bolt of lightning that will come out of the sky if you stop a test early, or make a decision early. However, it also means that the Type 1 and Type 2 risk guarantees that you were looking to control for will no longer hold. So to the degree that they were important to you and the organization, that will be the cost of taking an early action.
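A small simulation can show how the Type 1 guarantee erodes. The sketch below runs A/A tests, where both groups share the same true conversion rate, so any significant result is a Type 1 error. It compares testing once at the planned sample size against checking at every interim look and acting on the first significant one. All names and settings (number of looks, users per look, seed) are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic with n users in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled <= 0 or pooled >= 1:
        return 0.0
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return ((conv_b - conv_a) / n) / se

def simulate_peeking(n_tests=1000, looks=10, users_per_look=100,
                     rate=0.10, alpha=0.05, seed=1):
    """Return (error rate when peeking, error rate when testing once)."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_errors = fixed_errors = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        significant_at_any_look = False
        significant_at_end = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            significant = abs(z_stat(conv_a, conv_b, look * users_per_look)) > critical
            significant_at_any_look = significant_at_any_look or significant
            significant_at_end = significant  # overwritten until the final look
        peeking_errors += significant_at_any_look
        fixed_errors += significant_at_end
    return peeking_errors / n_tests, fixed_errors / n_tests

peeking_rate, fixed_rate = simulate_peeking()
print(f"act on first significant look: {peeking_rate:.3f}")
print(f"single test at planned sample: {fixed_rate:.3f}")
```

The single test should land near the chosen alpha, while acting on any significant interim look produces a noticeably higher false positive rate.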

What about early stopping with sequential testing?

Yes, there are ways to run experiments in a sequential way.  That said, remember how online testing works. Users present themselves to the site (or app or whatever), and then we randomly assign them to treatments. That is not the same as random selection.

Why does that matter?

Firstly, because of selection bias. If users are self-selecting when they present themselves to us, and if there is some structure to when different types of users arrive, then our treatment effect will be a measure of only the users who we have seen, and won’t be a valid measure of the population we are interested in. As mentioned earlier, often the best way to deal with this is to sample in the natural period of your data – normally this is weekly or monthly.

Secondly, while there are certain types of sequential tests that don’t bias our Type 1 and 2 errors, they do, ironically, bias our estimated treatment effect – especially when stopping early, which is the very reason you would run a sequential test in the first place. Early stopping can lead to a type of magnitude bias – where the absolute value of the reported treatment effects will tend to be too large.  There are ways to try to adjust for this, but doing so adds even more approximation and complexity to the process.
See https://www.ncbi.nlm.nih.gov/pubmed/22753584  and http://journals.sagepub.com/doi/abs/10.1177/1740774516649595?journalCode=ctja
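To get a feel for this magnitude bias, here is a simulation sketch with a small real effect (control at 10%, treatment at 11%). It uses a constant stopping boundary of 2.413, which is approximately the Pocock value for five looks at a two-sided alpha of 0.05; that boundary, the function name, and every other setting are assumptions for illustration, not a recommendation for any particular sequential design.

```python
import random
from statistics import mean

def early_stopping_estimates(n_tests=400, looks=5, users_per_look=1000,
                             control_rate=0.10, treatment_rate=0.11,
                             boundary_z=2.413, seed=3):
    """Return (estimates at the stopping look, estimates at the full sample)."""
    rng = random.Random(seed)
    stopped, full_sample = [], []
    for _ in range(n_tests):
        conv_c = conv_t = 0
        stopped_estimate = None
        for look in range(1, looks + 1):
            conv_c += sum(rng.random() < control_rate for _ in range(users_per_look))
            conv_t += sum(rng.random() < treatment_rate for _ in range(users_per_look))
            n = look * users_per_look
            estimate = (conv_t - conv_c) / n
            pooled = (conv_t + conv_c) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            crossed = se > 0 and abs(estimate) / se > boundary_z
            if stopped_estimate is None and crossed:
                stopped_estimate = estimate  # what an early stopper would report
        if stopped_estimate is not None:
            stopped.append(stopped_estimate)
        full_sample.append(estimate)  # estimate after all planned users
    return stopped, full_sample

stopped, full_sample = early_stopping_estimates()
print("true effect:                   0.010")
print(f"mean |estimate| when stopped:  {mean(abs(e) for e in stopped):.3f}")
print(f"mean estimate at full sample:  {mean(full_sample):.3f}")
```

The estimates reported at the moment of stopping have to be large enough to cross the boundary, so on average they overstate the true effect, while the full-sample estimates average out close to it.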

So the fix for dealing with the bias in Type 1 error control due to early stopping/peeking CAUSES bias in the estimated treatment effects, which, presumably, are also of importance to you and your organization.

The Waiting Game

However, if all we do is just wait –  c’mon, it’s not that hard 😉 – and run the test after we collect our data based on the pre-specified sample size, and in weekly or monthly blocks, then we don’t have to deal with any issues of selection bias or biased treatment effects. This is one of those cases where just doing the simplest thing possible gets you the most robust estimation and risk control.

What if I have more than one treatment?

If you have more than one treatment you may want to adjust your Type 1 error control to ‘know’ that you will be making several tests at once. Think of each test as a lottery ticket. The more tickets you buy, the greater the chance you will win the lottery, where ‘winning’ here means making a Type 1 error.

The chance of making at least one Type 1 error over all of the treatments is called the Familywise Error Rate (FWER). The more tests you run, the more likely you are to make at least one Type 1 error at a given significance level (alpha). I won’t get into the detail here, but to ensure that the FWER is not greater than your alpha, you can use any of the following methods:  Bonferroni; Sidak; Dunnett’s etc. Bonferroni is the least powerful (in the Type 2 error sense), but is the simplest and makes the fewest assumptions, so it is a good safe bet, esp. if Type 1 error is a very real concern. One can argue which is best, but it will depend, and for just a handful of comparisons it won’t really matter what correction you use IMO.
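The lottery-ticket idea is easy to put into numbers. The sketch below assumes the tests are independent, which is a simplification (treatments compared against a shared control are not), and the function names are mine. It shows how the FWER grows with the number of tests and what per-test alpha the Bonferroni and Sidak corrections would use.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one Type 1 error across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha under the Bonferroni correction."""
    return alpha / n_tests

def sidak_alpha(alpha, n_tests):
    """Per-test alpha under the Sidak correction (assumes independence)."""
    return 1 - (1 - alpha) ** (1 / n_tests)

for m in (1, 3, 5, 10):
    print(m,
          round(familywise_error_rate(0.05, m), 3),
          round(bonferroni_alpha(0.05, m), 4),
          round(sidak_alpha(0.05, m), 4))
```

For a handful of comparisons the two per-test alphas are nearly identical, which fits the point that the choice of correction matters little in that case.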

Another error measure for multiple comparisons is the False Discovery Rate (FDR) (See https://en.wikipedia.org/wiki/False_discovery_rate).  To control for FDR, you could do something like the Benjamini–Hochberg procedure.  While controlling for the FDR means a more powerful test (less Type 2 error), there is no free lunch, and it is at the cost of allowing more Type 1 errors. Because of this, researchers often use the FDR as a first step to screen for possibly interesting treatments in cases where there are many (thousands of) independent tests. Then, from the set of significant results, more rigorous follow up testing occurs.  Claims about preference around controlling for either FDR or FWER are really implicit statements about relative risk of Type 1 and Type 2 error.
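For reference, here is a minimal sketch of the Benjamini–Hochberg step-up procedure, with a Bonferroni comparison on the same made-up p-values. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the test is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    # Everything ranked at or below the cutoff is declared significant.
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten treatments.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh = benjamini_hochberg(p_values, fdr=0.05)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print("Benjamini-Hochberg discoveries:", sum(bh))
print("Bonferroni discoveries:        ", sum(bonferroni))
```

On these p-values Benjamini–Hochberg flags two treatments and Bonferroni flags one, which is the extra power, and the extra Type 1 risk, described above.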

Wrapping it up

The whole point of the test is to control for risk – you don’t have to run any tests to get estimates of treatment effects, or a measure of uncertainty around those effects. However, it is often a good idea to control for these errors, so the more you understand their relative costs, the better you can determine how much you are willing to pay to reduce the chances of making them. Rather than look at the sample size question as a hassle, perhaps look at it as an opportunity for you and your organization to take stock and discuss what the goals, assumptions, and expectations are for the new user experiences that are under consideration.

 

