**1) Do No Harm**

A ‘Do No Harm’ problem involves:

- an existing process, or customer experience, that the organization is counting on for at least a certain minimum level of performance.
- a direct cost associated with implementing a change.

**2) Go For It**

A good example is headline optimization for news articles. Each news article is, by definition, novel, as are the associated headlines. Assuming one has already decided to run headline optimization (which is itself a ‘Do No Harm’ question), there is no added cost or risk to selecting one headline or the other when there is no real difference in the conversion metric between them. The objective of this type of problem is to maximize the chance of finding the best option, if there is one. If there isn’t one, then there is no cost or risk to just selecting randomly between them (since they perform equally well and have the same cost to deploy). As it turns out, Go For It problems are also good candidates for Bandit methods.

**State of the World**

- There isn’t the expected effect/difference between the options
- There is the expected effect/difference between the options

**Beliefs** about the world are not consistent with the true state of the world.

**Payoff: Do No Harm** (columns give the true state of the world, which is unknown)

| Decision | Expected Effect | No Expected Effect |
| --- | --- | --- |
| Pick A | Opportunity Cost | No Cost |
| Pick B | No Opportunity Cost | Cost |

**Payoff: Go For It** (columns give the true state of the world, which is unknown)

| Decision | Expected Effect | No Expected Effect |
| --- | --- | --- |
| Pick A | Opportunity Cost | No Cost |
| Pick B | No Opportunity Cost | No Cost |

Notice that the payoff tables for Do No Harm and Go For It are the same when the true state of the world is that there is an effect. But they differ when there is no effect: in that case, there is NO relative cost to selecting either A or B.

**Why is this way of thinking about optimization problems useful?**

It also means that if we have a ‘Go For It’ problem and there is no effect, we don’t really care how we make our selections. Picking randomly when there is no effect is fine, since each of the options has the same value. It is in this case that our simple test of just picking the highest-value option makes sense.

## Go For It: Tests with no P-Values

1) Calculate the sample size

2) Collect the data

3) **Pick whichever option has the highest raw conversion value.** If a tie, flip a coin.
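Step 3 is the entire decision rule. As a sketch (in Python, with hypothetical function and argument names), it is nothing more than a comparison plus a coin flip:

```python
import random

def pick_winner(conversions_a, conversions_b):
    """Pick the arm with the higher raw conversion count;
    break ties with a coin flip. No p-value is ever computed."""
    if conversions_a == conversions_b:
        return random.choice(["A", "B"])
    return "A" if conversions_a > conversions_b else "B"

pick_winner(97, 112)  # returns "B"
```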

We calculate the sample size almost exactly as in a standard test: 1) pick a minimum detectable effect (MDE) – this is our minimum desired lift; 2) select the power of the test.

Ah, I hear you asking, ‘What about the alpha, don’t we need to select a confidence level?’ Here is the trick: we want to select randomly when there is no effect. By setting alpha to 0.5, the test rejects the null 50% of the time when the null is true (no effect).

Let’s go through a simple example to make this clear. Let’s say your landing page tends to have a conversion rate of around 4%. You are trying out a new alternative offer, and a meaningful improvement for your business would be a lift to a 5% conversion rate. So the minimum detectable effect (MDE) for the test is 0.01 (one percentage point).

We then estimate the sample size needed to find the MDE if it exists. Normally we pick an alpha of 0.05, but now we are instead going to use an alpha of 0.5. The final step is to pick the power of the test; let’s use a high one, 0.95 (often folks pick 0.8, but for this case we will use 0.95).

You can now use your favorite sample size calculator (for Conductrics users this is part of the setup workflow).

If you use R, this will look something like:

```r
power.prop.test(n = NULL, p1 = 0.04, p2 = 0.05, sig.level = 0.5,
                power = 0.95, alternative = "one.sided", strict = FALSE)
```

This gives us a sample size of 2,324 per option, or 4,648 in total. If we were to run this test with a confidence of 95% (alpha = 0.05), we would need almost four times the traffic: 9,299 per option, or 18,598 in total.
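For readers who prefer Python, here is a rough equivalent of the `power.prop.test` call, a sketch using only the standard library’s `statistics.NormalDist` and the usual normal-approximation sample size formula (the function name is mine, and the results can differ from R’s by a few observations):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha, power):
    """Normal-approximation sample size per arm for a one-sided,
    two-sample test of proportions (no continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # 0.0 when alpha = 0.5
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

n_go_for_it = sample_size_per_arm(0.04, 0.05, alpha=0.5, power=0.95)   # ~2,325 per arm
n_standard  = sample_size_per_arm(0.04, 0.05, alpha=0.05, power=0.95)  # ~9,297 per arm
```

Note how setting alpha to 0.5 zeroes out the `z_alpha` term, which is exactly where the roughly four-fold saving in sample size comes from.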

The following is a simulation of 100K experiments, where each experiment selected each arm 2,324 times. The conversion rate for B was set to 5% and for A to 4%. The chart below plots the difference in conversion rates between B and A. Not surprisingly, it is centered on the true difference of 0.01. The main thing to notice is that if we pick the option with the highest conversion rate, we pick B 95% of the time, which is exactly the power we used to calculate the sample size!
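You can reproduce the simulation’s headline number yourself. Here is a sketch using NumPy (the seed and variable names are my own; drawing binomial conversion counts directly is equivalent to simulating each visitor):

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_arm, n_sims = 2324, 100_000

# Total conversions per simulated experiment for each arm
conv_a = rng.binomial(n_per_arm, 0.04, size=n_sims)  # A converts at 4%
conv_b = rng.binomial(n_per_arm, 0.05, size=n_sims)  # B converts at 5%

# Decision rule: pick B if it has more conversions; flip a coin on ties
pick_b = (conv_b > conv_a) | ((conv_b == conv_a) & (rng.random(n_sims) < 0.5))
print(pick_b.mean())  # ~0.95, matching the power of the test
```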

**What about Bandits?**

### The Weeds

You may be hesitant to believe that such a simple rule can accurately help detect an effect. I checked in with Alex Damour, the Neyman Visiting Assistant Professor over at UC Berkeley and he pointed out that this simple approach is equivalent to running a standard t-test of the following form. From Alex:

“Find N such that P(meanA-meanB < 0 | A = B + MDE) < 0.05. This is equal to the N needed to have 95% power for a one-sided test with alpha = 0.5.

Proof: Setting alpha = 0.5 sets the rejection threshold at 0. So a 95% power means that the test statistic is greater than zero 95% of the time under the alternative (A = B + MDE). The test statistic has the same sign as meanA-meanB. So, at this sample size, P(meanA – meanB > 0 | A = B + MDE) = 0.95.”
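Restating Alex’s argument in symbols (my notation): for a one-sided z-test, setting $\alpha = 0.5$ makes the rejection threshold $z_{1-\alpha} = z_{0.5} = 0$, so the standard sample size formula per arm reduces to

```latex
n \;=\; \frac{\left(z_{1-\alpha} + z_{1-\beta}\right)^{2}\,\bigl[p_A(1-p_A) + p_B(1-p_B)\bigr]}{(p_B - p_A)^{2}}
\;\xrightarrow{\;\alpha \,=\, 0.5\;}\;
\frac{z_{1-\beta}^{2}\,\bigl[p_A(1-p_A) + p_B(1-p_B)\bigr]}{\mathrm{MDE}^{2}}
```

Plugging in $p_A = 0.04$, $p_B = 0.05$, and $z_{0.95} \approx 1.645$ recovers the roughly 2,324 per arm quoted above.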

### A Few Final Thoughts

**HARD** – but that is what makes all the effort required to learn worth it!

## 5 Comments

The key point about A/B testing not mentioned is GENERALIZATIONS. When you run many experiments, you want to learn to generalize. When you just ship low confidence winners, it’s hard to generalize the learnings. Even in go-for-it scenarios like picking between two news article titles, there’s room to learn which style of headlines works better.

Thanks for the comment Ronny. Yep, of course there are many different use cases for running experiments. I wouldn’t suggest using this simple approach for most problems. I do think that it is interesting that by just appealing to the NP-lemma, and then using a trivial decision rule one can get similar results (at least in the simple two arm cases) to near optimal bandit methods (optimal wrt regret).

Nice job! This post is great at explaining that not all tests are equal towards risk management.

But I found it quite complex to explain to non data-scientist people, this is due to the pValue concept which is counter intuitive.

Why don’t you explain this idea in a Bayesian framework ?

The provided confidence interval around the lift, is a more intuitive way to understand risk/gain trade off.

Hi Hubert – thanks for the kind words.

Re Bayesian: I did run some simulations using MC comparisons of the predicted posterior distributions of each arm. I didn’t include them bc: 1) I felt that the post was already pretty dense; 2) We are really asking a frequentist question here (I think it is fair to say we are implicitly invoking the Neyman-Pearson lemma); 3) I wasn’t sure what Bayesian framework to use – I was just estimating the probability that B>A, but I guess one could argue a Bayesian framework should use Bayes Factors (BF). I think that introducing BFs would just muddy the waters more.

I guess I could have another short post, and just walk through the results of the Posterior simulations – but again, since we are asking a freq question, the results don’t really differ, but maybe that would be of interest.

Putting aside the question of what the goal of experimentation is – yes/no decisions or estimation – I think the above approach has several issues.

First, it is not “without p-values”. It uses a p-value of 0.5, just as Prof. Damour confirms in the article itself.

Once we have that down, it becomes clear that type I and type II errors are reversed – type I > type II. Type I should be the error we most want to avoid, and type II a less severe error, so the hypotheses should be defined such that the null covers “A > B”, while the alternative covers “A <= B”. Thus we explicitly state that the error we want to avoid the most is “failure to implement a superior solution”, not the usual “implementing a non-superior solution”. I’m not sure many people would subscribe to such an approach, for obvious reasons.

I have written about non-inferiority tests which are suitable in situations called "go for it" above, and "easy decisions" in my work. In it, however, the primary error is changed from "implementing a non-superior solution" to "implementing an inferior solution", which I believe better addresses the scenario posed in the article.

The major issue with the proposed approach is that, when viewed from the point of balancing risk and reward, I could not identify a combination of sample size (duration), prior expectation of the result, and duration of exploitation of the test outcome for which using 50% significance and 95% power as defined above is optimal. Not even an unrealistic one. There could be one, but despite my efforts, I could not find a situation in which using this rule leads to a risk/reward ratio that is optimal. I’ve used my A/B testing ROI calculator to calculate the risk and reward with $0 costs to perform the test itself, $0 implementation, $0 maintenance costs, and a symmetrical prior distribution for the expected effect, reflecting the “go for it” scenario. If anyone has a better idea of how to calculate risk and reward, I’d be happy to see it.

The claim “…we still get all of our power goodness! And we only needed about a fourth of the data to reach that power.” can be misleading. Since power is derived from the significance threshold and sample size, claiming that you reach the same power level with a lower sample size is not warranted. Power is defined as the probability of detecting a true effect at a given significance level. Having 95% power after dropping significance from 95% to 50% is not the same as having 95% power at 95% significance. It becomes an apples-to-oranges comparison.

"This test guarantees that if there is a true difference of the minimum discernible effect (MDE), or larger, one will choose the better-performing arm X% of the time, where X is the power of the test." – should be modified slightly to avoid confusion: either drop the "or larger" part, or replace "arm X% of the time" with "arm at least X% of the time".