**1) Do No Harm**

In a ‘Do No Harm’ problem there is:

- an existing process, or customer experience, that the organization is counting on for at least a certain minimum level of performance; and
- a direct cost associated with implementing a change.

**2) Go For It**

A good example is headline optimization for news articles. Each news article is, by definition, novel, as are the associated headlines. Assuming that one has already decided to run headline optimization (which is itself a ‘Do No Harm’ question), there is no added cost or risk in selecting one headline over the other when there is no real difference in the conversion metric between them. The objective of this type of problem is to maximize the chance of finding the best option, if there is one. If there isn’t one, then there is no cost or risk in just selecting randomly between them (since they perform equally well and have the same cost to deploy). As it turns out, Go For It problems are also good candidates for Bandit methods.

**State of the World**

- There isn’t the expected effect/difference between the options
- There is the expected effect/difference between the options

The risk is that our **beliefs** about the world are not consistent with the true state of the world.

**Payoff: Do No Harm**

*The True State of the World (unknown)*

| Decision | Expected Effect | No Expected Effect |
| --- | --- | --- |
| Pick A | Opportunity Cost | No Cost |
| Pick B | No Opportunity Cost | Cost |

**Payoff: Go For It**

*The True State of the World (unknown)*

| Decision | Expected Effect | No Expected Effect |
| --- | --- | --- |
| Pick A | Opportunity Cost | No Cost |
| Pick B | No Opportunity Cost | No Cost |

Notice that the payoff tables for Do No Harm and Go For It are the same when the true state of the world is that there is an effect. But they differ when there is no effect: in the Go For It case, there is NO relative cost to selecting either A or B.

**Why is this way of thinking about optimization problems useful?**

It also means that if we have a ‘Go For It’ problem and there is no effect, we don’t really care how we make our selections. Picking randomly when there is no effect is fine, since each of the options has the same value. It is in this case that our simple test of just picking the highest-value option makes sense.

## Go For It: Tests with no P-Values

1) Calculate the sample size

2) Collect the data

3) **Pick whichever option has the highest raw conversion value.** If a tie, flip a coin.
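The three-step rule above can be sketched in a few lines of Python (a minimal sketch; `pick_winner` is a hypothetical helper name, not part of any library):

```python
import random

def pick_winner(conversions_a, conversions_b):
    """Step 3: pick whichever option has the highest raw conversion value;
    if there is a tie, flip a coin."""
    if conversions_a > conversions_b:
        return "A"
    if conversions_b > conversions_a:
        return "B"
    return random.choice(["A", "B"])  # tie: random coin flip
```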

We calculate the sample size almost exactly as in a standard test: 1) pick a minimum detectable effect (MDE), our minimum desired lift; and 2) select the power of the test.

Ah, I hear you asking, ‘What about the alpha? Don’t we need to select a confidence level?’ Here is the trick: we want to select randomly when there is no effect. By setting alpha to 0.5, the test rejects the null 50% of the time when the null is true (no effect).
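You can see why this works by checking the critical value directly: at a one-sided alpha of 0.5, the rejection threshold for the z-statistic is exactly zero, so ‘reject the null’ reduces to ‘the observed difference is positive’. A quick check using only Python’s standard library:

```python
from statistics import NormalDist

# The one-sided critical value at alpha = 0.5 is the 50th percentile of
# the standard normal distribution, which is exactly 0.
z_crit = NormalDist().inv_cdf(1 - 0.5)

# "Reject if z > z_crit" therefore reduces to "reject if the observed
# difference in conversion rates is greater than zero" -- i.e. just
# pick whichever option is ahead.
print(z_crit)  # 0.0
```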

Let’s go through a simple example to make this clear. Let’s say your landing page tends to have a conversion rate of around 4%. You are trying out a new alternative offer, and a meaningful improvement for your business would be a lift to a 5% conversion rate. So the minimum detectable effect (MDE) for the test is 0.01 (1 percentage point).

We then estimate the sample size needed to find the MDE if it exists. Normally we pick an alpha of 0.05, but here we are instead going to use an alpha of 0.5. The final step is to pick the power of the test; let’s use a high one, 0.95 (often folks pick 0.8, but for this case we will use 0.95).

You can now use your favorite sample size calculator (for Conductrics users, this is part of the setup workflow).

If you use R, this will look something like:

```r
power.prop.test(n = NULL, p1 = 0.04, p2 = 0.05, sig.level = 0.5,
                power = 0.95, alternative = "one.sided", strict = FALSE)
```

This gives us a sample size of 2,324 per option, or 4,648 in total. If we were to run this test with a confidence of 95% (alpha = 0.05), we would need almost four times the traffic: 9,299 per option, or 18,598 in total.
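If you prefer Python, the same arithmetic can be reproduced with just the standard library (a sketch; the function name and the formula below are my own rendering of the usual normal-approximation sample-size calculation for a one-sided two-proportion test, not a Conductrics or R API):

```python
from math import sqrt, ceil
from statistics import NormalDist

def n_per_arm(p1, p2, alpha, power):
    """Approximate sample size per arm for a one-sided two-sample test of
    proportions, using the standard normal approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # 0.0 when alpha = 0.5
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

print(n_per_arm(0.04, 0.05, alpha=0.5, power=0.95))   # ~2,324 per arm, up to rounding
print(n_per_arm(0.04, 0.05, alpha=0.05, power=0.95))  # ~9,299 per arm, up to rounding
```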

The following is a simulation of 100K experiments, where each experiment selected each arm 2,324 times. The conversion rate was set to 5% for B and 4% for A. The chart below plots the difference in conversion rates between A and B. Not surprisingly, it is centered on the true difference of 0.01. The main thing to notice is that if we pick the option with the highest conversion rate, we pick B 95% of the time, which is exactly the power we used to calculate the sample size!
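A scaled-down version of that simulation is easy to run yourself. The sketch below uses 1,000 simulated experiments rather than 100K, plain standard-library Python, and a fixed seed; the win rate for B should land close to the 0.95 power:

```python
import random

random.seed(42)

N = 2324               # visitors per arm, from the sample-size calculation
P_A, P_B = 0.04, 0.05  # true conversion rates
SIMS = 1000            # scaled down from the 100K experiments described above

def conversions(p, n):
    """Count conversions among n simulated Bernoulli(p) visitors."""
    return sum(random.random() < p for _ in range(n))

b_wins = 0
for _ in range(SIMS):
    a = conversions(P_A, N)
    b = conversions(P_B, N)
    if b > a or (b == a and random.random() < 0.5):  # coin flip on ties
        b_wins += 1

print(b_wins / SIMS)  # close to 0.95
```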

**What about Bandits?**

### The Weeds

You may be hesitant to believe that such a simple rule can reliably help detect an effect. I checked in with Alex Damour, the Neyman Visiting Assistant Professor at UC Berkeley, and he pointed out that this simple approach is equivalent to running a standard t-test of the following form. From Alex:

“Find N such that P(meanA-meanB < 0 | A = B + MDE) < 0.05. This is equal to the N needed to have 95% power for a one-sided test with alpha = 0.5.

Proof: Setting alpha = 0.5 sets the rejection threshold at 0. So a 95% power means that the test statistic is greater than zero 95% of the time under the alternative (A = B + MDE). The test statistic has the same sign as meanA-meanB. So, at this sample size, P(meanA – meanB > 0 | A = B + MDE) = 0.95.”

### A Few Final Thoughts

**HARD** – but that is what makes all the effort required to learn it worth it!