What is the value of my AB Testing Program?

Occasionally we are asked by companies how they should best assess the value of running their AB testing programs. I thought it might be useful to put down in writing some of the points to consider if you find yourselves asked this question.

With respect to hypothesis tests, there are two main sources of value:
1) The Upside – reducing Type 2 error. 
This is implicitly what people tend to think about in Conversion Rate Optimization (CRO) – the gains view of testing. When valuing a testing program, they tend to ask something along the lines of ‘What gains would we miss if we didn’t have a testing program in place?’ One possible approach is to reach back into the testing toolkit and create a test program control group. The users assigned to this control group are shielded from all of the changes made based on outcomes from the testing process. This control group is then used to estimate a global treatment effect for the bundle of changes over some reasonable time horizon (6 months, a year, etc.). The calculation looks something like:

(Overall Site Conversion – Control Group Conversion) × Value per Conversion – Cost of the Testing Program.

Of course, this may not be so easy to do: forming a clean control group is often complicated, if not impossible, and determining how to value the lift in conversion can also be tricky in most non-commerce applications.
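As a sketch of what that holdout calculation might look like in practice – every number here (conversion rates, traffic split, dollar value per conversion, program cost) is a hypothetical assumption for illustration, not a benchmark:

```python
import random

random.seed(42)

# Hypothetical inputs – replace with your own org's numbers.
BASELINE_CR = 0.050          # assumed conversion rate without testing-driven changes
TESTED_CR = 0.055            # assumed conversion rate with the bundle of shipped changes
VALUE_PER_CONVERSION = 40.0  # assumed dollar value of one conversion
PROGRAM_COST = 50_000.0      # assumed annual cost of the testing program

def simulate_group(n, cr):
    """Count conversions for n visitors converting independently at rate cr."""
    return sum(random.random() < cr for _ in range(n))

# 10% of traffic held out from all testing-driven changes for a year.
n_holdout, n_treated = 100_000, 900_000
holdout_conv = simulate_group(n_holdout, BASELINE_CR)
treated_conv = simulate_group(n_treated, TESTED_CR)

# Global treatment effect of the bundle of changes, net of program cost.
lift = treated_conv / n_treated - holdout_conv / n_holdout
annual_value = lift * (n_holdout + n_treated) * VALUE_PER_CONVERSION
net_value = annual_value - PROGRAM_COST
print(f"estimated lift: {lift:.4f}")
print(f"net program value: ${net_value:,.0f}")
```

Note that the holdout group needs to stay clean for the whole horizon – any leakage of tested changes into it biases the lift toward zero.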

2) The Downside – mitigating Type 1 loss.
A major limitation of the above approach is that it ignores a major reason for testing – the mitigation of Type 1 errors. In this context, a Type 1 error means shipping a change that in fact causes harm, or loss. To estimate the value of mitigating this loss, we would need to expose our control group to the changes that we WOULD have made on the site had they not been rejected. In other words, we would need to make all the changes that, based on our test results, we believe will harm the customer experience. This is almost certainly a bad idea, and it highlights why certain types of questions are not amenable to the randomized controlled trials (RCTs) that are the backbone of AB Testing (anyone out there using instrumental variables or other counterfactual methods? Please comment if you are).

(for a refresher on Type 1 and Type 2 errors please see https://conductrics.com/do-no-harm-or-ab-testing-without-p-values/)

But even if we did go down this route (bad idea), it still wouldn’t give us a proper estimate of the true value of testing: even when no harmful events occur, we still had protection against them. For example, you may have collision car insurance but have had no accidents over the course of a year. Was the value of the collision insurance zero? The value of insurance isn’t equal to what it pays out. Insurance doesn’t work like that – it is a good that is consumed whether or not it pays out. What you are paying for is the reduction in downside risk – and that is something testing provides whether or not an adverse event ever occurs. The difficult part is assessing the probability (is this risk, or Knightian uncertainty?) and severity of the potential negative results.

The value of testing lies both in optimization (finding the improvements) and in mitigating downside risk. To value the latter, we need to price what is essentially a form of insurance against whatever the organization considers an intolerable downside risk. So it is a bit like asking what the value of insurance, or privacy policies, or security policies is. You may get by in any given year without them, but as you scale up, the risk of downside events grows, making it more and more likely that a significant adverse event will occur.
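One way to make this concrete is a back-of-the-envelope price for the expected downside avoided in a year. All four inputs below are illustrative assumptions your org would need to estimate for itself:

```python
# A minimal sketch of pricing testing's "insurance" value: the expected
# downside loss avoided by catching harmful changes before full rollout.
# Every number is an illustrative assumption, not a benchmark.

N_CHANGES_PER_YEAR = 100      # candidate changes evaluated per year (assumed)
P_HARMFUL = 0.15              # assumed share of candidates that would hurt
AVG_LOSS_IF_SHIPPED = 25_000  # assumed annual loss per harmful change shipped
TEST_POWER_VS_HARM = 0.80     # assumed chance a test catches a harmful change

expected_loss_avoided = (
    N_CHANGES_PER_YEAR * P_HARMFUL * TEST_POWER_VS_HARM * AVG_LOSS_IF_SHIPPED
)
print(f"expected annual downside avoided: ${expected_loss_avoided:,.0f}")
# prints: expected annual downside avoided: $300,000
```

Like insurance, this value accrues even in a year when no harmful change happens to come through the pipeline; what you are buying is the reduction in risk, not the payout.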

One last thing. Testing programs tend to jumble together the related, but separate, concepts of hypothesis testing – the binary Reject/Fail to Reject outcome – and the estimation of effect sizes – the best guess of the ‘true’ population conversion rates. I mention this because we often think only about the value of the actions taken based on the hypothesis tests, rather than also considering the value of robust effect-size estimates for forecasting, ideation, and allocating future resources. (As an aside, one can run an experiment with a valid hypothesis test that still yields a biased estimate of the effect size – a magnitude error. Sequential testing, I’m looking at you!)
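A quick simulation can illustrate that magnitude error from naive sequential peeking. This is a deliberately simplified sketch (fixed peeking schedule, an unadjusted z-threshold at every look), not a model of any particular sequential testing procedure:

```python
import math
import random

random.seed(0)

# True conversion rates: B beats A by a real but small lift.
TRUE_LIFT = 0.01
CR_A, CR_B = 0.10, 0.10 + TRUE_LIFT
PEEK_EVERY, MAX_N = 500, 5_000  # peek at the data every 500 visitors per arm
Z_CRIT = 1.96                   # naive fixed-sample threshold, reused at each peek

def run_one_test():
    """Run one A/B test, stopping the first time the z-score looks 'significant'."""
    conv_a = conv_b = n = 0
    while n < MAX_N:
        for _ in range(PEEK_EVERY):
            conv_a += random.random() < CR_A
            conv_b += random.random() < CR_B
        n += PEEK_EVERY
        pa, pb = conv_a / n, conv_b / n
        se = math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)
        if se > 0 and (pb - pa) / se > Z_CRIT:
            return pb - pa  # stopped early: record the observed lift
    return None             # never crossed the threshold

results = [run_one_test() for _ in range(500)]
winners = [est for est in results if est is not None]
avg_winner = sum(winners) / len(winners)
print(f"true lift: {TRUE_LIFT:.3f}")
print(f"average recorded lift among significant stops: {avg_winner:.4f}")
```

Because every recorded estimate had to clear the significance threshold at some peek, the lifts written down for the ‘winners’ are systematically larger than the true lift – the hypothesis-test decision can be defensible while the effect-size estimate is not.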

Ultimately, testing can be seen as both a profit center (upside discovery) AND a form of cost avoidance (downside mitigation). Focusing on just one will lead to underestimating the value your testing program can provide to the organization. That said, it is a fair question to ask, and one that hopefully helps lead to extracting even more value from your experimentation efforts.

What are your thoughts? Is there anything we are missing, or should consider? Please feel free to comment and let us know how you value your testing program.

