## AB Testing: Ruling Out To Conclude

The seemingly simple ideas underpinning AB Testing can be confusing. Rather than getting into the weeds around the definitions of p-values and significance, AB Testing might be easier to understand if we reframe it as a simple ruling-out procedure.

#### Ruling Out What?

There are two things we are trying to rule out when we run AB Tests:

1. Confounding
2. Sampling Variability/Error

#### Confounding is the Problem; Random Selection is a Solution

What is Confounding?

Confounding is when unobserved factors that can affect our results are mixed in with the treatment we wish to test. A classic example of potential confounding is the effect of education on future earnings. While people who have more years of education tend to have higher earnings, a question economists like to ask is whether the extra education drives earnings, or whether natural ability, which is unobserved, determines both how many years of education people receive and how much they earn. Here is a picture of this:

We want to be able to test if there is a direct causal relationship between education and earnings, but what this simple DAG (Directed Acyclic Graph) shows is that education and earnings might be jointly determined by ability – which we can’t directly observe. So we won’t know if it is education that is driving earnings or if earnings and education are both just outcomes of ability.

The general picture of confounding looks like this:

What we want is a way to break the connection between the potential confounder and the treatment.

#### Randomization to the Rescue

Amazingly, if we are able to randomize which subjects are assigned to each treatment, we can break, or block, the effect of unobserved confounders and make causal statements about the effect of the treatment on the outcome of interest.

Why? Since assignment is based on a random draw, the subject, and hence any potential confounder, is no longer mixed in with the treatment assignment. You might say that the confounder no longer gets to choose its treatment. For example, if we were able to randomly assign people to education, then high- and low-ability students would each be equally likely to be in the low- and high-education groups, and their effect on earnings would balance out, on average, leaving just the direct effect of education on earnings. Random assignment lets us rule out potential confounders, allowing us to focus just on the causal relationship between treatment and outcomes*.
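A toy simulation makes the balancing argument concrete. Here "ability" plays the role of an unobserved confounder, and treatment is assigned by coin flip rather than by ability, so average ability ends up nearly identical in the two groups. All numbers and names are illustrative only:

```javascript
// Toy simulation: random assignment balances an unobserved confounder.

// Small deterministic PRNG so the example is reproducible.
function lcg(seed) {
  let s = seed >>> 0;
  return () => (s = (1664525 * s + 1013904223) >>> 0) / 2 ** 32;
}

const rand = lcg(42);
const n = 10000;

let sumHigh = 0, nHigh = 0, sumLow = 0, nLow = 0;
for (let i = 0; i < n; i++) {
  const ability = rand();       // unobserved confounder, uniform on [0, 1)
  const treated = rand() < 0.5; // "education" assigned by coin flip, NOT by ability
  if (treated) { sumHigh += ability; nHigh++; }
  else         { sumLow  += ability; nLow++; }
}

// Difference in average ability between the two groups:
const gap = Math.abs(sumHigh / nHigh - sumLow / nLow);
console.log(gap.toFixed(3)); // small: ability is balanced across groups, on average
```

If instead the `treated` flag depended on `ability`, the two group means would differ systematically, and any observed difference in outcomes would mix the treatment effect with the confounder.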

So are we done? Not quite. We still have to deal with uncertainty that is introduced whenever we try to learn from sample observations.

#### Sampling Variation and Uncertainty

Analytics is about making statements about the larger world via induction – the process of observing samples from the environment, then applying the tools of statistical inference to draw general conclusions. One aspect of this that often goes underappreciated is that there is always some inherent uncertainty due to sampling variation. Since we never observe the world in its entirety, but only finite, random samples, our view of it will vary based on the particular sample we use. This is the reason for the tools of statistical inference – to account for this variation when we try to draw conclusions.

A central idea behind induction/statistical inference is that we are only able to make statements about the truth within some bound, or range, and that bound only holds in probability.

For example, the true value is represented as the little blue dot. But this is hidden from us.

Instead what we are able to learn is something more like a smear.

The smear tells us that the true value of the thing we are interested in will lie somewhere between x and x’ with some probability P. So there is some probability 1 − P, perhaps 0.05, that our smear won’t cover the true value.

This means that there are actually two interrelated sources of uncertainty:

1) the width, or precision, of the smear (more formally called a bound)

2) the probability that the true value will lie within the smear rather than outside of its upper and lower range.

Given a fixed sample (and a given estimator), we can reduce the width of the smear (make it tighter, more precise) only by reducing the probability that the truth will lie within it – and vice versa, we can increase the probability that the truth will lie in the smear only by increasing its width (making it looser, less precise). This is the more general concept that the confidence interval is an example of – we say the treatment effect lies within some interval (bound) with a given probability (say 0.95). We will always be limited in this way. Yes, we can decrease the width, and increase the probability that it holds, by increasing our sample size, but always with diminishing returns, on the order of O(1/√n).
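The width/probability trade-off, and the diminishing returns from sample size, can be seen directly in the standard normal-approximation formula for a proportion's interval. This is a generic textbook sketch with made-up numbers, not a Conductrics calculation:

```javascript
// Half-width of the "smear" (confidence interval) for a conversion rate p:
// roughly z * sqrt(p * (1 - p) / n). Tighter bounds cost either more samples
// or a lower coverage probability.

function halfWidth(p, n, z) {
  return z * Math.sqrt((p * (1 - p)) / n);
}

const p = 0.1;    // assumed conversion rate
const z95 = 1.96; // ~95% coverage
const z80 = 1.28; // ~80% coverage

const w1 = halfWidth(p, 1000, z95);
const w2 = halfWidth(p, 4000, z95); // 4x the sample...
const w3 = halfWidth(p, 1000, z80); // ...or accept less coverage

console.log((w1 / w2).toFixed(2)); // 2.00: quadrupling n only halves the width
console.log(w3 < w1);              // true: lower coverage buys a tighter bound
```

The `sqrt(n)` in the denominator is exactly the O(1/√n) diminishing return: each halving of the width requires four times the data.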

#### AB Tests and P-values To Rule Out Sampling Variation

Assuming we have collected the samples appropriately, and certain assumptions hold, by removing potential confounders there will now be just two potential sources of variation between our A and B interventions:

1) the inherent sampling variation that is always part of sample based inference that we discussed earlier; and
2) a causal effect – the effect on the world that we hypothesize exists when doing B vs A.

AB tests are a simple, formal process to rule out, in probability, the sampling variability. Through the process of elimination, if we rule out sampling variation as the main source of the observed effect (with some probability), then we might conclude the observed difference is due to a causal effect. The p-value (the probability of seeing the observed difference, or one greater, due just to random sampling) relates to the probability we will tolerate in order to rule out sampling variation as a likely source of the observed difference.
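As a rough sketch of what "ruling out sampling variation" computes, here is a standard two-sided two-proportion z-test using a normal approximation. The conversion counts are made up, and real AB testing tools may use different tests:

```javascript
// Approximate the standard normal CDF (Zelen & Severo / Abramowitz-Stegun).
function normCdf(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const d = 0.3989423 * Math.exp((-x * x) / 2);
  const p = d * t * (0.319381530 + t * (-0.356563782 +
            t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return x >= 0 ? 1 - p : p;
}

// Two-sided p-value for the difference between two conversion rates.
function twoProportionPValue(convA, nA, convB, nB) {
  const pA = convA / nA, pB = convB / nB;
  const pPool = (convA + convB) / (nA + nB); // pooled rate under the null
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  const z = (pB - pA) / se;
  return Math.min(1, 2 * (1 - normCdf(Math.abs(z))));
}

// 10.0% vs 12.0% conversion, 5000 visitors per arm:
const pValue = twoProportionPValue(500, 5000, 600, 5000);
console.log(pValue < 0.05); // true: sampling variation alone is an unlikely source
```

A small p-value does not prove a causal effect; it only says sampling variation alone is an unlikely explanation for a difference this large, which (with confounding already blocked by randomization) leaves the causal effect as the remaining candidate.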

For example, in the first case we might not be willing to rule out sampling variability, since our smears overlap with one another – indicating that the true value of each might well be covered by either smear.

However, in this case, where our smears are mostly distinct from one another, we have little evidence that sampling variability is enough to lead to such a difference between our results, and hence we might conclude the difference is due to a causal effect.

So we look to rule out in order to conclude**

To summarize: AB Tests/RCTs randomize treatment assignment to generate random samples from each treatment, blocking confounding so that we can safely use the tools of statistical inference to make causal statements.

* RCTs are not the only way to deal with confounding. When studying the effect of education on earnings, unable to run RCTs, economists have used the method of instrumental variables to try to deal with confounding in observational data.

** Technically ‘reject the null’ – think of tennis if ‘null’ trips you up – it’s like zero. We ask, ‘Is there evidence, after we account for the likely difference due to sampling, to reject that the difference we see (e.g. the observed difference in conversion rate between B and A) is just due to sampling variation?’

***If you want to learn about other ways of dealing with confounding beyond RCTs a good introduction is Causal Inference: The Mixtape – by Scott Cunningham.

## Some are Useful: AB Testing Programs

As AB testing becomes more commonplace, companies are moving beyond thinking about how to best run experiments to how to best set up and run experimentation programs. Unless the required time, effort, and expertise are invested in designing and running the AB Testing program, experimentation is unlikely to be useful.

Interestingly, some of the best guidance for getting the most out of experimentation can be found in a paper published almost 45 years ago by George Box. If that name rings a bell, it is because Box is credited with coining the phrase “All models are wrong, but some are useful”. In fact, from the very same paper this phrase comes from, we can discover some guiding principles for running a successful experimentation program.

In 1976 Box published Science and Statistics in the Journal of the American Statistical Association. In it he discusses what he considers to be the key elements of successfully applying the scientific method. Why might this be useful for us? Because in a very real sense, experimentation and AB Testing programs are the way we apply the scientific method to business decisions. They are how companies DO science. So learning how best to employ the scientific method directly translates to how we should best set up and run our experimentation programs.

Box argues that the scientific method is made up, in part, of the following:
1) Motivated Iteration
2) Flexibility
3) Parsimony
4) Selective Worry

According to Box, the attributes of the scientific method can best be thought of as “motivated iteration in which, in succession, practice confronts theory, and theory, practice.” He goes on to say that, “Rapid progress requires sufficient flexibility to profit from such confrontations, and the ability to devise parsimonious but effective models [and] to worry selectively …”.

Let’s look at what he means in a little more detail and how it applies to experimentation programs.

### Learning and Motivated Iteration

Box argues that learning occurs through the iteration between theory and practice. Experimentation programs formalize the process for continuous learning about marketing messaging, customer journeys, product improvements, or any other number of ideas/theories.

Box: “[L]earning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice. Matters of fact can lead to a tentative theory. Deductions from this tentative theory may be found to be discrepant with certain known or specially acquired facts. These discrepancies can then induce a modified, or in some cases a different, theory. Deductions made from the modified theory now may or may not be in conflict with fact, and so on.”

As part of the scientific method, experimentation of ideas naturally requires BOTH a theory about how things work AND the ability to collect facts/evidence that may or may not support that theory. By theory, in our case, we could mean an understanding of what motivates your customer, why they are your customer and not someone else’s, and what you might do to ensure that they stay that way.

Many times marketers purchase technology and tools in an effort to better understand their customers. However, without a formulated experimentation program, they are missing one half of the equation. The main takeaway is that just having AB Testing and other analytics tools is not going to be sufficient for learning. It is vital for YOU to also have robust theories about customer behavior, what customers care about, and what is likely to motivate them. The theory is the foundation and drives everything else. It is through the iterative process of guided experimentation, which feeds back on the theory and so on, that we establish a robust and useful system for continuous learning.

### Flexibility

Box: “On this view efficient scientific iteration evidently requires unhampered feedback. In any feedback loop it is … the discrepancy between what tentative theory suggests should be so and what practice says is so that can produce learning. The good scientist must have the flexibility and courage to seek out, recognize, and exploit such errors … . In particular, using Bacon’s analogy, he must not be like Pygmalion and fall in love with his model.”

Notice the words that Box uses here: “unhampered” and “courage”. Just as inflexible thinkers are unable to consider alternative ways of thinking, and hence never learn, so it is with inflexible experimentation programs. Just having a process for iterative learning is not enough. It must also be flexible. By flexible Box doesn’t only mean it must be efficient in terms of throughput. It must also allow for ideas and experiments to flow unhampered, where neither influential stakeholders nor the data science team holds too dearly to any pet theory. People must not be afraid of creating experiments that seek to contradict existing beliefs, nor should they fear reporting any results that do.

### Parsimony

Box: “Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary, following William of Occam [we] should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so over elaboration and overparameterization is often the mark of mediocrity.”

This is where the “All Models are Wrong” saying comes from! I take this to mean that rather than spending effort seeking the impossible, we should instead seek what is most useful and actionable – “how useful is this model or theory in helping to make effective decisions?”

In addition, we should try to keep analysis and experimental methods as simple as the problem requires. Often companies can get distracted, or worse, seduced by a new technology or method that adds complexity without advancing the cause. This is not to say that more complexity is always bad, but whatever the solution is, it should be the simplest one that can do the job. That said, the ‘job’ may really be signaling/optics rather than solving a specific task – for example, differentiating a product or service as more ‘advanced’ than the competition, regardless of whether it actually improves outcomes. It is not for me to say if those are good enough reasons for making something more complex, but I do suggest being honest about it and going forward forthrightly and with eyes wide open.

### Worry Selectively

Box: “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.”

This is my favorite line from Box. Being “alert to what is importantly wrong” is perhaps the most fundamental and yet underappreciated analytic skill. It is so vital not just in building an experimentation program but for any analytics project to be able to step back and ask “while this isn’t exactly correct, will it matter to the outcome, and if so by how much?” Performing this type of sensitivity analysis, even if informally in your own mind’s eye, is an absolutely critical part of good analysis. You don’t have to be an economist to think, and decide at the margin.

Of course if something is a mouse or a tiger will depend on the situation and context. That said, in general, at least to me, the biggest tiger in AB Testing is fixating on solutions or tools before having defined the problem properly. Companies can easily fall into the trap of buying, or worse, building a new testing tool or technology without having thought about: 1) exactly what they are trying to achieve; 2) the edge cases and situations where the new solution may not perform well; and 3) how the solution will operate within the larger organizational framework.

As for the mice, they are legion. They have nests in all the corners of any business and, whenever spotted, cause people to rush from one approach to another in the hopes of not being caught out. Here are a few of the ‘mice’ that have scampered around AB Testing:

• One Tail vs Two Tails (eek! A two tailed mouse – sounds horrible)
• Bayes vs Frequentist AB Testing
• Fixed vs Sequential designs
• Full Factorial Designs vs Taguchi designs

There is a pattern here. All of these mice tend to be features or methods that were introduced by vendors or agencies as new and improved, frequently over-selling their importance, and implying that some existing approach is ‘wrong’.  It isn’t that there aren’t often principled reasons for preferring one approach over the other. In fact, often, all of them can be useful (except for maybe Taguchi MVT – I’m not sure that was ever really useful for online testing) depending on the problem. It is just that none of them, or others, will be what makes or breaks a program’s usefulness.

The real value in an experimentation program is the people involved, and the process and culture surrounding it – not some particular method or software. Don’t get me wrong, selecting the software and statistical methods that are most appropriate for your company matters, a lot, but it isn’t sufficient. I think what Box says about the value of the statistician should be top of mind for any company looking to run experimentation at scale:

“… the statistician’s job did not begin when all the work was over – it began long before it started. … [The Statistician’s] responsibility to the scientific team was that of the architect with the crucial job of ensuring that the investigational structure of a brand new experiment was sound and economical.”

So too for companies looking to include experimentation into their workflow. It is the experimenter’s responsibility to ensure that the experiment is both sound and economical and it is the larger team’s responsibility to provide an environment and process, in part by following Box, that will enable their success.

## Conductrics and ITP

What’s the impact on your Conductrics implementation?

As you are likely aware, many browsers have begun to restrict the ability to “track” visitor behavior, in an effort to protect visitor privacy. The focus is especially on third-party scripts that could track users as they move between totally unrelated sites on the Internet.

Apple’s efforts in this regard are particularly well-known, with the introduction of their Intelligent Tracking Prevention (ITP) policies.

• ITP has gone through several revisions, each placing additional restrictions on how visitor information can be stored and for how long.
• While ITP is most well-known for affecting Safari users on iPhone and iPad, it also affects other browsers on those devices, such as Firefox for iOS / iPadOS. Safari users on the Mac are also affected.

While Conductrics has never engaged in any sort of visitor-tracking outside of your own agents/tests, ITP and other similar restrictions are now a fact of life. There will be some impact on what you can do with Conductrics (or other similar service).

## When Using Client-Side Conductrics in Web Pages

When you use Conductrics Express or our Local JavaScript API in your pages, your Conductrics tests/agents do their work locally in the browser. They also store their “visitor state” information locally in the browser.

This visitor state information includes the variation assignments per agent (such as whether the visitor has been selected for the “A” or “B” variation for each of your tests), and some other information such as which goals/conversions have been reached and whether any visitor traits have been set.

ITP says that this visitor state information will be cleared between a user’s visits to your pages if more than 7 days have passed since their last visit. For some visit types, the visitor state will be cleared after just one day (the one-day rule is triggered when visitors arrive at your site via link decoration, which is common in social media campaigns).

What does this all mean?

• If a visitor goes 7 days (*) or more between visits, your client-side Conductrics implementation will see the visitor as a “new” visitor.
• So, after 7 days (*), the visitor might get a different variation than on their previous visit(s). They would also be counted again in the Conductrics reporting.
• If the visitor hits a goal/conversion event, but it’s been 7 days (*) since they were exposed to a variation for a given test/agent, the goal/conversion will not be counted.
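Conceptually, the rule behaves like the sketch below. This is an illustration of the policy's effect, not Conductrics or WebKit code, and the function name is made up:

```javascript
// Sketch of how the ITP caps effectively expire client-side visitor state.

const DAY_MS = 24 * 60 * 60 * 1000;

function stateSurvives(lastVisitMs, nowMs, arrivedViaLinkDecoration) {
  // ITP caps script-writable storage at ~7 days, or ~1 day when the visit
  // arrived via link decoration (e.g. click-ids appended by ad/social platforms).
  const capDays = arrivedViaLinkDecoration ? 1 : 7;
  return nowMs - lastVisitMs < capDays * DAY_MS;
}

const now = Date.now();
console.log(stateSurvives(now - 3 * DAY_MS, now, false)); // true: within 7 days
console.log(stateSurvives(now - 8 * DAY_MS, now, false)); // false: seen as "new"
console.log(stateSurvives(now - 3 * DAY_MS, now, true));  // false: 1-day cap applies
```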

How should we change our testing or optimization practice?

• For A/B or MVT testing, try to focus on tests that would reasonably be expected to lead to a conversion event within 7 days (*).
• Generally speaking, focus on tests or optimizations where it wouldn’t be “absolutely terrible” if the visitor were to be re-exposed to a test, possibly getting a different variation.
• You could consider using rules/conditions within your Conductrics agents such that Safari browsers are excluded from your tests. However, that will likely reduce the number of visitors that you do expose (particularly on mobile), probably requiring you to run the test for longer, and also possibly “skewing” your results since you’d be likely excluding most visitors on Apple devices.
• You could consider looking at ITP vs non-ITP browsers (or iOS / iPadOS vs non) when evaluating the results of a test, to see if there are any important, noticeable differences between the two groups for your actual visitors.
• Conversely, Conductrics could be configured to treat all visitors as if they were subject to ITP’s “7 day rule” (or one day), even on browsers that don’t currently impose such restrictions by default (thus leveling the field between ITP and non-ITP browsers). Contact Conductrics to discuss.

(*) The “7 day rule” might actually be one day, depending on whether the visitor lands on your site via “link decoration” (that is, via a URL that has identifiers included in query parameters or similar). See this article from the WebKit team for details.

Q: How is the Visitor State information stored?

A: By default, client-side Conductrics stores its visitor state information in the browser’s “Local Storage” area. Alternatively, it can be configured to instead store the information as a cookie. The main reason to choose the cookie option is to allow the visitor state to be shared between subdomains. ITP is in effect in equal measure regardless of whether your Conductrics implementation is set to use Local Storage or Cookies.
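To illustrate the difference between the two options (the key name and state shape below are hypothetical, not Conductrics' actual schema): Local Storage is scoped per origin, while a cookie scoped to the parent domain can be shared across subdomains.

```javascript
// Hypothetical visitor state, for illustration only.
const visitorState = { assignments: { 'my-agent': 'B' }, goals: ['signup'] };

// Option 1: Local Storage. Per-origin, so NOT shared across subdomains.
// (The storage object is passed in so this sketch also runs outside a browser.)
function saveToLocalStorage(storage, state) {
  storage.setItem('visitor-state', JSON.stringify(state));
}

// Option 2: a first-party cookie scoped to the parent domain, so that
// www.example.com and shop.example.com see the same visitor state.
function toCookie(state, domain) {
  return 'visitor-state=' + encodeURIComponent(JSON.stringify(state)) +
         '; Domain=' + domain + '; Path=/; Max-Age=' + 7 * 24 * 60 * 60;
}

// Either way, ITP applies equally: both are script-writable storage.
const cookie = toCookie(visitorState, '.example.com');
console.log(cookie.indexOf('Domain=.example.com') !== -1); // true
```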

Q: What does Conductrics do to work around ITP?

A: We don’t try to defeat or “work around” ITP or similar policies. The WebKit team has made it very clear that ITP is a work in progress and will continue to address any loopholes. Rather than try to implement workarounds that would likely be short-lived, we think our customers will be better served if we focus on helping you make sense of your test results in an ITP world.

Q: Which browsers are affected by ITP?

A: Technically, ITP is implemented in WebKit browsers, which includes Safari on mobile and desktop. That said, other browsers such as Firefox have similar policies, so often the term “ITP” is used colloquially to refer to any browser restrictions on cookies and other browser-based storage and tracking.

## When Using the Conductrics REST API

All of the above is about using client-side Conductrics. If you use the Conductrics REST-style API, the “visitor state” information is retained on the Conductrics side, using an identifier that you pass in when calling the API.

Because the identifier is “on your side” conceptually, whether your REST API tests are affected by ITP will depend on how you store the identifier that you provide to Conductrics.

• If the identifier is associated with a “logged in” or “authenticated” session, it is probably stored in some way on your side such that it can live for a long time, and thus would not be affected by ITP.
• If the identifier is being stored as a first-party cookie by your client-side code, it is also subject to ITP or similar policies, so your Conductrics tests will be affected to more or less the same degree as discussed above. However, if the identifier is being stored as “server cookie” only (with the HttpOnly and Secure flags on the cookie itself), then it is probably not affected by ITP.
• For native mobile apps (or other devices or systems such as kiosks, set-tops, or IVRs), you probably have an identifier that is stored “natively”, without having anything to do with web pages, so your implementation would probably not be affected by ITP.
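As a sketch of the server-cookie option (the cookie name and lifetime are illustrative, not Conductrics-prescribed), your backend might issue the identifier like this:

```javascript
// Build a Set-Cookie header for the visitor identifier. Because it is set by
// the server with HttpOnly, it is not script-writable and so sits outside
// ITP's caps on script-writable storage (as of current policies).
function visitorIdCookieHeader(visitorId) {
  return [
    'visitor-id=' + visitorId,
    'Max-Age=' + 365 * 24 * 60 * 60, // long-lived, controlled by your server
    'Path=/',
    'HttpOnly',                      // invisible to document.cookie
    'Secure',                        // HTTPS only
    'SameSite=Lax',
  ].join('; ');
}

// Your server sets this header on its responses, then passes the same id
// to the Conductrics REST API on each request made on the visitor's behalf.
const header = visitorIdCookieHeader('abc-123');
console.log(header.includes('HttpOnly')); // true
```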

### Questions?

As always, feel free to contact Conductrics regarding ITP, visitor data, or any other questions you may have. We’re here to help!

## Conductrics Announces Search Discovery as a Premier Partner

December 1, 2020 – Austin, Texas –

Conductrics, a digital experimentation and artificial intelligence (AI) SaaS company, announces its partnership with Search Discovery, a premier data transformation company. Search Discovery will now offer Conductrics optimization technology along with industry-leading optimization consulting support.

Together, the two companies will offer clients a superior integrated solution that satisfies a market need: Clients across industries are searching for both technological solutions and strategic guidance to help drive their internal innovation and growth. This partnership will make it simple for clients to work smarter and faster, use better experimentation techniques, and leverage both Conductrics’ and Search Discovery’s core competencies to build best-in-class optimization and experimentation programs.

Conductrics offers robust and flexible experimentation software that supports the specific requirements of Marketing, Product, and IT departments. Teams are able to seamlessly manage and quickly deploy their experiments using Conductrics’ integrated communication and implementation tools.

Search Discovery provides strategic consulting for clients to manage and run successful experimentation and personalization programs at scale.

“Aside from the natural business fit, our two teams work well together. The expert team at Search Discovery has an impressive track record of helping A-list clients build and grow world-class optimization programs,” comments Matt Gershoff, Conductrics’ co-founder and CEO. “This new partnership will enable us to provide clients with the optimal combination of technology and experimentation expertise.”

“The Conductrics platform has optimal flexibility, transparency, and the power needed to help us support our clients’ data-driven decision-making across every digital experience—even in today’s increasingly complex privacy environment,” says Kelly Wortham, Search Discovery’s Senior Optimization Director. “The Conductrics team’s ability to quickly customize the platform with our clients’ rapidly changing requirements makes this partnership even more exciting for Search Discovery.”

In 2010, Conductrics released one of the industry’s first REST APIs for delivering AB Testing, multi-arm bandits, and predictive targeting to empower both Marketing and IT professionals. With Conductrics, marketers, product managers, and consumer experience stakeholders can quickly and easily optimize the customer journey, while IT departments benefit from the platform’s simplicity, ease of use, and integration with existing technology stacks. Visit Conductrics at www.conductrics.com

Search Discovery is a data transformation company that helps organizations use their data with purpose to drive measurable business impact. Their services and solutions help global organizations at every stage of data transformation, including strategy, implementation, optimization, and organizational change management. Search Discovery delivers efficient operations, deeper insights, and improved decision-making across marketing, sales, finance, operations, and human resources. Visit searchdiscovery.com.

## Video: Conductrics Platform Overview

##### The new features that are demonstrated include:
• A revamped and simplified user experience
• Streamlined A/B Testing/Optimization workflows
• Improved Program Management tools
• Upgraded API and mobile developer libraries

## Conductrics Announces Updated Release of its SaaS Experimentation Platform

SEPTEMBER 24, 2020 – AUSTIN, TX – Digital experimentation and Artificial Intelligence (AI) SaaS company Conductrics today announced the latest major release of its experience optimization platform built expressly for marketers, developers, and IT professionals. The updated platform, which shares the company’s name, is a cloud-based A/B testing and decision optimization engine. This latest release focuses on streamlining and improving the user experience for faster and easier execution of optimization programs.

These upgrades make it easier for clients to scale optimization programs across different organizational and functional teams in order to deliver ideal digital experiences for their customers. The new features include:

• A revamped and simplified user experience (UX),
• Streamlined A/B Testing and Optimization workflows,
• Improved Program Management tools,
• Upgraded API and mobile developer libraries.

“Since our start in 2010, our goal has been to make it faster and easier for developers and marketers to work together in order to confidently discover and deliver the best customer experiences,” says Matt Gershoff, Conductrics’ co-founder and CEO. “As other technologies morph and become increasingly more complex, we remain focused on developing accessible, leading-edge optimization and experimentation technology.”

The new release will be available in mid-October. Current clients will have the option to use the legacy platform or the new platform – no action is needed on their part. A webinar will be held on October 13th to demonstrate the new features and benefits – a link to register is on the company website.

In 2010, Conductrics released one of the industry’s first REST APIs for delivering AB Testing, multi-arm bandits and predictive targeting to empower both Marketing and IT professionals. With Conductrics, marketers, product managers, and consumer experience stakeholders can quickly and easily optimize the customer journey, while IT departments benefit from the platform’s simplicity, ease of use, and integration with existing technology stacks. Visit Conductrics at www.conductrics.com.

## Headline Optimization at Scale: Conductrics Macros

Conductrics has a history of providing innovative approaches to experimentation, testing, and optimization. In 2010, we introduced one of the industry’s first REST APIs for experimentation, which also supported multi-arm bandits and predictive targeting. Continuing our goal to provide innovative solutions, we’ve developed Express Macros and Templates to safely and easily build tests at scale.

To illustrate, let’s say you frequently run headline tests on a news page, or change certain areas of a landing page over and over. In such situations, you don’t want test authors to have full page-editing capabilities – you just want them to have access to specific sections of the page or the site. Additionally, because variations of the same basic test will be conducted over and over, it is imperative that the test setup is simple, easy to repeat, and scalable.

### Conductrics Templates and Express Macros

Conductrics Templates and Express Macros are the answer to this. They are an easy way to create reusable forms that let your team set up and run multiple versions of similar tests simply by filling out a form and clicking a button.

### EXAMPLE: When to Use Conductrics Express Macros

One of our national media clients wanted to optimize News headlines for their news homepage. This meant that rather than just provide a single Headline for each news article, the client wanted to try out several potential Headlines for each story, see which worked best, and then use the winning headline going forward. To do this, they wanted to take advantage of Conductrics multi-armed bandits, which automatically discover and deploy the most effective Headline for each story.

However, they publish scores of articles each day, so they needed to be able to run hundreds of these tests every month. In addition, these tests needed to be set up safely by multiple, non-technical editors, so as not to risk breaking the homepage.

This is where Express Macros helped. Macros let the client extend and customize the Conductrics test creation process by:

1) creating simple custom forms to make it easy to set up each test; and

2) applying custom JavaScript to that form in order for the test to execute properly.

### How the Express Macro Works

For example, this macro will create a form with two input fields, “Post Id” and “Article Headline”. You can, of course, create any number of fields that are needed.

Now that we have specified what data to request from the test creator, we now just need to provide the JavaScript that will use the values taken in by the Macro’s form to run the test. In this case we will want to use the ‘Post Id’ (alternatively, this could be a URL or some other identifier) to tell Conductrics on which article to run the test. We also include in our JavaScript logic to swap in the alternative Headline(s) for the test.

Here is an example of what that might look like:

While this might look complicated if you don’t know JavaScript, don’t worry – this is something any front-end developer can do easily (or you can just ask us for help).

All that is left to do is to name and save it. I have named it ‘Headline Optimization’.

There is just one last step before we can let our Headline editors start to run tests, and that is to assign the Macro to a Template.

### Template Example

Express Macros was developed to bridge the workflows of programmers and non-technical users. Now that the Macro has been created by the programmer, it is assigned/converted to a Template for use by non-technical users. This makes the process easy to use, scalable, reproducible, and secure.

Creating a Template is just like setting up any Conductrics Agent. The only difference is that by assigning the Agent to a Macro, it will become a Template that can be used to generate new tests easily.

For example, here I have created an Agent named ‘Headline Optimization’. In the bottom portion of the set-up page, I select ‘Template’. This brings up a list of all of the Macros I am authorized to use. In this case, there is just the ‘Headline Optimization’ Macro we just created. By selecting this Macro, the Agent will be converted into a Template for all of the ‘Headline Optimization’ tests going forward.

Now comes the amazing part. All the test creator needs to do is go to the Conductrics Agent List Page, and they will see the custom button created for Headline Tests.

Clicking the ‘Headline Optimization’ button will bring up the custom form. For our simple ‘Headline Optimization’ example, it looks like this:

Notice that it asks for two pieces of information, the Post Id and the alternative Article Headline to test (you can add multiple alternative headlines to each test using the ‘Add Another’ option).

Once the Post Id and the alternative headline are entered, the test author just clicks ‘Create’ and that’s it! The test will be scheduled and pushed live.

Not only does this make it super simple for non-technical users to set up hundreds of these tests, it also provides guard rails against accidentally selecting, or unintentionally changing, the wrong page sections.

Communication of test results is automated with Conductrics notification streams. Users receive the top-level results of each ‘Headline Optimization’ test directly in their Slack channel, including company members who are not Conductrics users. This way, all relevant stakeholders can be part of the discussion around what types of headlines seem to be most compelling and effective.

Here is a simple example – once the Conductrics bandit algorithm has selected a winner, a notification is sent to the client’s Slack with the following summary information.

The winning variation is noted, along with the number of visitors and the click through rate for each headline.

### Conclusion

In this example, the client was able to scale from a handful of tests per month to hundreds of tests per month, and the guard rails allowed multiple non-technical users to have more control over the testing while freeing the developers to work on more complex problems.

Express Macros and Templates are the ideal solution for digital marketers and CX professionals who have multiple, repeatable versions of a particular test design. They streamline the process, allow for set up in an easy-to-use form, and ensure compliance by controlling what can be modified. Express Macros solve the problem of so many ideas, so little time. If you would like to learn more about Conductrics Express Macros and Templates, please contact us.

## Getting Past Statistical Significance: Foundations of AB Testing and Experimentation

How often is AB Testing reduced to the following question: ‘what sample size do I need to reach statistical significance for my AB Test?’ On the face of it, this question sounds reasonable. However, unless you know why you want to run a test at a particular significance level, or what the relationship is between sample size and that significance level, you are most likely missing some basic concepts that would help you get even more value out of your testing programs.

There are also a fair number of questions around how to run AB Tests, what methods are best, and the various ‘gotchas’ to look out for. In light of this, I thought it might be useful to step back and review some of the very basics of experimentation, and why we are running hypothesis tests in the first place. This is not a how-to guide, a collection of different types of tests to run, or even a list of best practices.

What is the problem we are trying to solve with experiments?

We are trying to isolate the effect, if any, on some objective result of taking some action on our website (or mobile app, call center, etc.). For example, if we change the button color to blue rather than red, will that increase conversions, and if so, by how much?

What is an AB test?

AB and multivariate tests are versions of randomized controlled trials (RCTs). An RCT is an experiment where we take a sample of users and randomly assign them to control and treatment groups. The experimenter then collects performance data, for example conversions or purchase values, for each of the groups (control, treatment).

I find it useful to think of RCTs as having three main components:  1) data collection; 2) estimating effect sizes; and 3) assessing our uncertainty of the effect size and mitigating certain risks around making decisions based on these estimates.

## Collection of the Data

What do you mean by sample?

A sample is a subset of the total population under investigation. Keep in mind that in most AB testing situations, while we randomly assign users to treatments, we don’t randomly sample. This may seem surprising, but in the online situation the users present themselves to us for assignment (e.g. they come to the home page). This can lead to selection bias if we don’t try to account for this nonrandom sampling in our data collection process. Selection bias will make it more difficult, if not impossible, to draw conclusions about the population we are interested in from our test results. One often-effective way to mitigate this is to run our experiments over full weeks or months, to try to ensure that our samples look as much as possible like our user/customer population.

Why do we use randomized assignments?

Because of “Confounding”. I will repeat this several times, but confounding is the single biggest issue in establishing a causal relation between our treatments and our performance measure.

What is Confounding?

Confounding is when the treatment effect gets mixed together with the effects from any other outside influence. For example, consider we are interested in the treatment effect of Button Color (or Price, etc.) on conversion rate (or average order size, etc). When assigning users to button color we give everyone who visits on Sunday the ‘Blue’ button treatment, and everyone on Monday the ‘Red’ button treatment. But now the ‘Blue’ group is comprised of both Sunday users and the Blue Button, and the ‘Red’ group is both Monday users and the Red Button. Our data looks like this:

| Treatment | Sunday | Monday |
|-----------|--------|--------|
| Red       | 0%     | 100%   |
| Blue      | 100%   | 0%     |

We have mixed together the data such that any of the user effects related to day are tangled together with the treatment effects of button color.

What we want is for each of our groups to both: 1) look like one another except for the treatment selection (no confounding); and 2) to look like the population of interest (no selection bias).

If we randomly assign the treatments to users, then we should on average get data that looks like this:

| Treatment | Sunday | Monday |
|-----------|--------|--------|
| Red       | 50%    | 50%    |
| Blue      | 50%    | 50%    |

Where each day we have a 50/50 split of button color treatments.  Here the relationship between day and button assignment is broken, and we can estimate the average treatment effects without having to worry as much about influences of outside effects (this isn’t perfect of course, since it holds only on average – it is possible due to sampling error that for any given sample we don’t have a balanced sample over all of the cofactors/confounders.)

Of course, this mixing need not be this extreme – it is often much more subtle. When Ronny Kohavi advises to be alert to ‘sample ratio mismatch’, (See: https://exp-platform.com/hbr-the-surprising-power-of-online-experiments/), it is because of confounding. For example, say a bug in the treatment arm breaks the experience in such a way that some users don’t get assigned. If this happens only for certain types of users, perhaps just for users on old browsers, then we no longer have a fully randomized assignment.  The bug breaks randomization and lets the effect of old browsers leak in and mix with the treatment effect.
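As an illustration, a sample ratio mismatch check can be as simple as asking whether the observed assignment counts are consistent with the intended split. Here is a minimal sketch, using a normal approximation to the binomial (one common quick check, not the only one; the function name is just for illustration):

```python
from statistics import NormalDist

def srm_pvalue(n_a, n_b, expected_share=0.5):
    """Two-sided p-value for a sample ratio mismatch check: how surprising
    are the observed assignment counts if the split really is 50/50?
    Uses a normal approximation to the binomial."""
    n = n_a + n_b
    se = (n * expected_share * (1 - expected_share)) ** 0.5
    z = abs(n_a - n * expected_share) / se
    return 2 * (1 - NormalDist().cdf(z))

# A 5200/4800 split on 10,000 users is already very unlikely under a true
# 50/50 assignment, and is a red flag that randomization may be broken.
```

A tiny p-value here doesn’t tell you what broke – only that assignment is unlikely to be truly random, so the results may be confounded.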

Confounding is the main issue one should be concerned about in AB Testing.  Get this right and you are most of the way there – everything else is secondary IMO.

## Estimating Treatment Effects

We made sure that we randomized our assignments, now what?

Well, one thing we might want to do is use our data to get an estimate of the conversion rate (or AOV etc.) for each group in our experiment.  The estimate from our sample will be our best guess of what the true treatment effect will be for the population under study.

For most simple experiments, we usually just calculate the treatment effect by taking the sample mean from each group and subtracting the control from the treatment: (Treatment Conversion Rate) – (Control Conversion Rate) = Treatment Effect. For example, if we estimate that the Blue button has a conversion rate of 0.1 (10%) and the Red button has a conversion rate of 0.11 (11%), then the estimated treatment effect is -0.01.

## Estimating Uncertainty

Of course the goal isn’t to calculate the sample conversion rates, the goal is to make statements about the population conversion rates.  Our sample conversion rates are based on the particular sample we drew. We know that if we were to have drawn another sample, we almost certainly would have gotten different data, and would calculate a different sample mean (if you are not comfortable with sampling error, please take a look at https://conductrics.com/pvalues).

One way to assess uncertainty is by estimating a confidence interval for each treatment and control group’s conversion rate. The main idea is that we construct an interval using a procedure that is guaranteed to contain, or trap, the true population conversion rate with a frequency determined by the confidence level. So intervals constructed at a 95% confidence level will contain the true population conversion rate 95% of the time. We could also calculate the difference in conversion rates between the treatment and control groups, and calculate a confidence interval around this difference.
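Using the example rates from above (0.10 vs 0.11), a minimal sketch of the effect estimate and a Wald (normal-approximation) confidence interval for the difference might look like this; the function name and sample sizes are illustrative:

```python
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Treatment effect (p_b - p_a) with a Wald confidence interval for the
    difference in conversion rates. Normal approximation; fine for large n."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    effect = p_b - p_a
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return effect, (effect - z * se, effect + z * se)

# Red (control): 1,100 of 10,000 convert; Blue (treatment): 1,000 of 10,000.
effect, (lo, hi) = diff_ci(1_100, 10_000, 1_000, 10_000)
# effect is -0.01; the interval tells us how much sampling error to allow for.
```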

Notice that so far we have been able to: 1) calculate the treatment effect; and 2) get a measure of uncertainty in the size of the treatment effect with no mention of hypothesis testing.

## Mitigating Risk

If we can estimate our treatment effect sizes and get a measure of our uncertainty around the size, why bother running a test in the first place? Good question.  One reason to run a test is to control for two types of error we can make when taking an action on the basis of our estimated treatment effect.

Type 1 Error – a false positive.

1. One Tail: We conclude that the treatment has a positive effect size (it is better than the control) when it doesn’t have a real positive effect (it really isn’t any better).
2. Two Tail: We conclude that the treatment has a different effect than the control (it is either strictly better or strictly worse) when it doesn’t really have a different effect than the control.

Type 2 Error – a false negative.

1. One Tail: We conclude that the treatment does not have a positive effect size (it isn’t better than the control) when it does have a real positive effect (it really is better).
2. Two Tail: We conclude that the treatment does not have a different effect than the control (it isn’t either strictly better or strictly worse) when it really does have a different effect than the control.

How to specify and control the probability of these errors?

Controlling Type 1 errors – the probability that our test will make a Type 1 error is called the significance level of the test. This is the alpha level you have probably encountered. An alpha of 0.05 means that we want to run the test so that we only make Type 1 errors up to 5% of the time. You are of course free to pick whatever alpha you like – perhaps an alpha of 1% makes more sense for your use case, or maybe an alpha of 0.1%. It is totally up to you! It all depends on how damaging it would be for you to take some action based on a positive result when the effect doesn’t exist. Also keep in mind that this does NOT mean that if you get a significant result, it will be a false positive only 5% (or whatever your alpha is) of the time. The rate at which a significant result is a false positive will depend on how often you run tests that have real effects. For example, if you never run any experiments where the treatments are actually any better than the control, then all of your significant results will be false positives. In this worst-case situation, you should expect to see significant results in up to 5% (alpha%) of your tests, and all of them will be false positives (Type 1 errors).

You should spend as much time as needed to grok this idea, as it is the single most important idea you need to know in order to thoughtfully run your AB Tests.
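To make the ‘all false positives’ point concrete, here is a small simulation sketch (parameters are arbitrary): every test below compares two arms with the SAME true conversion rate, so every significant result is, by construction, a Type 1 error, and the significant fraction should land near alpha:

```python
import random

def null_experiment_sig_rate(n_tests=500, n_per_arm=2000, p=0.10,
                             z_crit=1.96, seed=7):
    """Simulate AB tests where A and B share one true conversion rate.
    Returns the fraction flagged 'significant' by a two-sided z-test; with
    alpha = 0.05 it should hover around 5%, and all are false positives."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_tests):
        a = sum(rng.random() < p for _ in range(n_per_arm))
        b = sum(rng.random() < p for _ in range(n_per_arm))
        pooled = (a + b) / (2 * n_per_arm)
        se = (2 * pooled * (1 - pooled) / n_per_arm) ** 0.5
        if abs(a - b) / n_per_arm / se > z_crit:
            significant += 1
    return significant / n_tests
```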

Controlling Type 2 errors – this is based on the power of the test, which in turn is based on beta. For example, a beta of 0.2 (power of 0.8) means that, of all the times the treatment is actually superior to the control, your test would fail to discover this up to 20% of the time, on average. Of course, like alpha, it is up to you – maybe a power of 0.95 makes more sense, so that you make a Type 2 error only up to 5% of the time. Again, this will depend on how costly you consider this type of mistake. This is also important to understand well, so spend some time thinking about what it means. If this isn’t totally clear, see a more detailed explanation of Type 1 and Type 2 errors here: https://conductrics.com/do-no-harm-or-ab-testing-without-p-values/.

What is amazing, IMO, about hypothesis tests is that, assuming that you collect the data correctly, you are guaranteed to limit the probability of making these two types of errors based on the alpha and beta you pick for the test. Assuming we are mindful about confounding, all we need to do is collect the correct amount of data. When we run the test after we have collected our pre-specified sample, we can be assured that we will control these two errors at our specified levels.

### “The sample size is the payment we must make to control Type 1 and Type 2 errors.”

There is a relationship between alpha, beta, and the associated sample size. In a very real way, the sample size is the payment we must make in order to control Type 1 and Type 2 errors. Increasing the error control on one means you either have to lower the control on the other or increase the sample size.  This is what power calculators are doing under the hood — calculating the sample size needed, based on a minimum treatment effect size, and desired alpha and beta.
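As a sketch of what a power calculator is doing under the hood, here is the standard normal-approximation formula for two proportions, assuming a two-sided test (exact calculators may differ slightly; the function name is mine):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, beta=0.20):
    """Approximate per-arm sample size for a two-sided, two-proportion z-test.
    alpha controls Type 1 error; beta controls Type 2 error (power = 1 - beta)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a lift from 10% to 11% takes roughly 15,000 users per arm;
# tightening either alpha or beta raises that price.
```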

What about Sample Size for continuous conversion values, like average order value?

Calculating sample sizes for continuous conversion variables is really the same as for conversion rates/proportions. For both, we need some guess of both the mean of the treatment effect and the standard deviation of the effect. However, because the standard deviation of a proportion is determined by its mean, we don’t need to provide it for most calculators. For continuous conversion variables, on the other hand, we need an explicit guess of the standard deviation in order to conduct the sample size calculation.

What if I don’t know the standard deviation?

This isn’t exact, and in fact it might not be that close, but in a pinch you can use the range rule as a hack for the standard deviation. If you know the minimum and maximum values that the conversion variable can take (or some reasonable guess), you can use standard deviation ≈ (Max − Min)/4 as a rough estimate.
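Here is a sketch of the range rule feeding a continuous-metric sample size calculation (again the normal approximation; the names and dollar figures are illustrative assumptions):

```python
import math
from statistics import NormalDist

def sd_from_range(min_val, max_val):
    """Range rule of thumb: sd is roughly (max - min) / 4. Rough, but usable."""
    return (max_val - min_val) / 4

def sample_size_continuous(min_detectable_diff, sd, alpha=0.05, beta=0.20):
    """Approximate per-arm n to detect a difference in means of
    `min_detectable_diff`, for a two-sided test (normal approximation)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)
    return math.ceil(2 * (sd * z_total / min_detectable_diff) ** 2)

# Order values between $0 and $200 give an sd guess of $50; to detect a $5
# shift in average order value we'd need on the order of 1,600 users per arm.
sd_guess = sd_from_range(0, 200)
```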

What if I make a decision before I collected all of the planned data?

You are free to do whatever you like. Trust me, there is no bolt of lightning that will come out of the sky if you stop a test early, or make a decision early. However, it also means that the Type 1 and Type 2 risk guarantees that you were looking to control will no longer hold. So, to the degree that they were important to you and your organization, that will be the cost of taking an early action.

What about early stopping with sequential testing?

Yes, there are ways to run experiments in a sequential way.  That said, remember how online testing works. Users present themselves to the site (or app or whatever), and then we randomly assign them to treatments. That is not the same as random selection.

Why does that matter?

Firstly, because of selection bias. If users are self-selecting when they present themselves to us, and there is some structure to when different types of users arrive, then our treatment effect will be a measure of only the users we have seen, and won’t be a valid measure of the population we are interested in. As mentioned earlier, often the best way to deal with this is to sample over the natural period of your data – normally weekly or monthly.

Secondly, while there are certain types of sequential tests that don’t bias our Type 1 and Type 2 error control, they do, ironically, bias our estimated treatment effect – especially when stopping early, which is the very reason you would run a sequential test in the first place. Early stopping can lead to a type of magnitude bias, where the absolute value of the reported treatment effects will tend to be too large. There are ways to try to adjust for this, but it adds even more approximation and complexity to the process.
See https://www.ncbi.nlm.nih.gov/pubmed/22753584  and http://journals.sagepub.com/doi/abs/10.1177/1740774516649595?journalCode=ctja

So the fix for dealing with the bias in Type 1 error control due to early stopping/peeking CAUSES bias in the estimated treatment effects, which, presumably, are also of importance to you and your organization.

The Waiting Game

However, if all we do is just wait –  c’mon, it’s not that hard 😉 – and run the test after we collect our data based on the pre-specified sample size, and in weekly or monthly blocks, then we don’t have to deal with any issues of selection bias or biased treatment effects. This is one of those cases where just doing the simplest thing possible gets you the most robust estimation and risk control.

What if I have more than one treatment?

If you have more than one treatment, you may want to adjust your Type 1 error control to account for the fact that you will be making several comparisons at once. Think of each test as a lottery ticket. The more tickets you buy, the greater the chance you will win the lottery – where ‘winning’ here means making a Type 1 error.

The chance of making at least one Type 1 error over all of the treatments is called the Familywise Error Rate (FWER). The more tests you run, the more likely you are to make a Type 1 error at a given significance level (alpha). I won’t get into the details here, but to ensure that the FWER is not greater than your alpha, you can use any of the following methods: Bonferroni, Sidak, Dunnett’s, etc. Bonferroni is the least powerful (in the Type 2 error sense), but it is the simplest and makes the fewest assumptions, so it is a good safe bet, especially if Type 1 error is a very real concern. One can argue over which is best, but it will depend, and for just a handful of comparisons it won’t really matter which correction you use, IMO.
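The lottery-ticket intuition and the Bonferroni correction are both one-liners; here is a minimal sketch (the function names are mine, not a standard API):

```python
def fwer(alpha, m):
    """Chance of at least one Type 1 error across m independent tests, each
    run at level alpha -- the more lottery tickets, the more 'wins'."""
    return 1 - (1 - alpha) ** m

def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: call a result significant only if its p-value
    clears alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Ten uncorrected tests at alpha = 0.05 give about a 40% chance of at least
# one false positive; Bonferroni instead tests each at 0.005.
```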

Another measure of familywise error is the False Discovery Rate (FDR) (see https://en.wikipedia.org/wiki/False_discovery_rate). To control the FDR, you could use something like the Benjamini–Hochberg procedure. While controlling the FDR means a more powerful test (less Type 2 error), there is no free lunch: it comes at the cost of allowing more Type 1 errors. Because of this, researchers often use the FDR as a first step to screen for possibly interesting treatments in cases where there are many (thousands of) independent tests. Then, from the set of significant results, more rigorous follow-up testing occurs. Claims about preferences for controlling either FDR or FWER are really implicit statements about the relative risk of Type 1 and Type 2 errors.
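For reference, here is a sketch of the Benjamini–Hochberg step-up procedure (assuming independent tests; this implementation is mine, simplified for clarity):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: sort the p-values, find the largest rank k with
    p_(k) <= (k / m) * alpha, and reject the k smallest. Controls the FDR
    at level alpha for independent tests."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject
```

Note the step-up shape: a middling p-value can be rejected because the smaller ones cleared their thresholds, which is exactly where the extra power over Bonferroni comes from.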

Wrapping it up

The whole point of the test is to control risk – you don’t have to run any tests to get estimates of treatment effects, or a measure of uncertainty around those effects. However, it is often a good idea to control for these errors, so the more you understand their relative costs, the better you can determine how much you are willing to pay to reduce the chances of making them. Rather than look at the sample size question as a hassle, perhaps look at it as an opportunity for you and your organization to take stock and discuss the goals, assumptions, and expectations for the new user experiences under consideration.

## What is the value of my AB Testing Program?

Occasionally we are asked by companies how they should best assess the value of running their AB testing programs. I thought it might be useful to put down in writing some of the points to consider if you find yourselves asked this question.

With respect to hypothesis tests, there are two main sources of value:
1) The Upside – reducing Type 2 error.
This is implicitly what people tend to think about in Conversion Rate Optimization (CRO) – the gains view of testing. When looking to value testing programs, they tend to ask something along the lines of ‘What gains would we miss if we didn’t have a testing program in place?’ One possible approach to measuring this is to reach back into the testing toolkit and create a test program control group. The users assigned to this control group are then shielded from any of the changes made based on outcomes from the testing process. This control group is then used to estimate a global treatment effect for the bundle of changes over some reasonable time horizon (6 months, a year, etc.). The calculation looks something like:

Total Site Conversion – Control Group Conversion – cost of the testing program.

You can think of this as a sort of meta AB Test.

Of course, in reality this isn’t going to be easy to do, as forming a clean global control group will often be complicated, if not impossible, and determining how to value the lift over the various conversion measures each individual test may have used can be tricky – especially in non-commerce applications.

2) The Downside – mitigating Type 1 loss.
However, if we only consider the explicit gains from our testing program, we ignore another reason for testing – the mitigation of Type 1 errors. Type 1 errors lead us to make changes that cause harm, or loss. To estimate the value of mitigating this possible loss, we would need to expose our control group to the changes that we WOULD have made on the site had they not been rejected by our testing. That means we would need to make changes to the meta control group’s experience that we have strong reason to think would harm, and degrade, that experience. Of course, this is almost certainly a bad idea, let alone potentially unethical, and it highlights why certain types of questions are not amenable to randomized controlled trials (RCTs) – the backbone of AB Testing.

(Anyone out there using instrumental variables or other counterfactual methods? Please comment if you are).

(for a refresher on Type 1 and Type 2 errors please see https://conductrics.com/do-no-harm-or-ab-testing-without-p-values/)

But even if we did go down this route (a bad idea), it still wouldn’t get us a proper estimate of the true value of testing, since even if we don’t encounter harmful events, we were still protected against them. For example, you may have collision car insurance but have had no accidents over the past year. What was the value of the collision insurance – zero? You sure? The value of insurance isn’t equal to the amount that ultimately gets paid out. Insurance doesn’t work like that – it is a good that is consumed regardless of whether it pays out. What you are paying for is the reduction in downside risk, and that is something testing provides whether or not the adverse event occurs. The difficult part for you is to assess the probability (risk, or maybe Knightian uncertainty) and severity of the potential negative results.

The main take away is that the value of testing is in both optimization (finding the improvements); and in mitigating downside risk. To value the latter, we need to be able to price what is essentially a form of insurance against whatever the org considers to be intolerable downside risk. It is like asking what is the value of insurance, or privacy policies, or security policies. You may get by in any given year without them, but as you scale up, the risks of downside events grow, making it more and more likely that a significant adverse event will occur.

One last thing. Testing programs tend to jumble together the related, but separate, concepts of hypothesis testing – the binary decision to Reject/Fail to Reject – and the estimation of effect sizes – the best guess of the ‘true’ population conversion rates. I mention this because we often think only about the value of the actions taken based on the hypothesis tests, rather than also considering the value of robust estimates of the effect sizes for forecasting, ideation, and helping to allocate future resources. (As an aside, one can run an experiment that has a robust hypothesis test but also yields a biased estimate of the effect size – a magnitude error. Sequential testing, I’m looking at you!)

Ultimately, testing can be seen as both a profit (upside discovery) AND cost (downside mitigation) center.  Just focusing on one will lead to underestimating the value your testing program can provide to the organization.  That said, it is a fair question to ask, and one that hopefully will help lead to extracting even more value from your experimentation efforts.

What are your thoughts? Is there any thing we are missing, or should consider? Please feel free to comment and let us know how you value your testing program.

## Do No Harm or AB Testing without P-Values

A few weeks ago I was talking with Kelly Wortham during her excellent AB Testing webinar series. During the conversation, one of the attendees asked: if they just wanted to pick between A and B, did they really need to run standard significance tests at 90% or 95% confidence levels?

The simple answer is no.  In fact, in certain cases, you can avoid dealing with p-values (or priors and posteriors) altogether and just pick the option with the highest conversion rate.

Even more interesting, at least to me, is that this simple approach can be viewed either as a form of classical hypothesis testing or as an epsilon-first solution to the multi-armed bandit problem.

Before we get into our simple testing trick, it might be helpful to first revisit a few important concepts that underpin why we are running tests in the first place.

The reason we run experiments is to help determine how different marketing interventions will affect the user experience.  The more data we collect, the more information we have, which reduces our uncertainty about the effectiveness of each possible intervention.  Since data collection is costly, the question that always comes up is ‘how much data do I really need to collect?’

In a way, every time we run an experiment, we are trying to balance the following: 1) the cost of making a Type 1 error; 2) the cost of making a Type 2 error; and 3) the cost of data collection to help reduce our risk of making either of these errors.
To help answer this question, I find it helpful to organize optimization problems into two high-level classes:

1) Do No Harm
These are problems where there is either:
1. an existing process, or customer experience, that the organization is counting on for at least a certain minimum level of performance.
2. a direct cost associated with implementing a change.
For example, while it would be great if we could increase conversions from an existing checkout process, it may be catastrophic if we accidentally reduced the conversion rate. Or perhaps we want to use data for targeting offers, but there is a real, direct cost we have to pay in order to use the targeting data. If it turns out that there is no benefit to the targeting, we will incur the additional data cost without any upside, resulting in a net loss.
So, for the ‘Do No Harm’ type of problem we want to be pretty sure that if we do make a change, it won’t make things worse. For these problems we want to stay the current course unless we have strong evidence to take an alternative action.

2) Go For It
In the ‘Go For It’ type of problem, there often is no existing course to follow. Here we are selecting between two or more novel choices AND we have symmetric costs, or losses, if we make a Type I error (reviewed below).

A good example is headline optimization for news articles. Each news article is, by definition, novel, as are the associated headlines. Assuming one has already decided to run headline optimization (which is itself a ‘Do No Harm’ question), there is no added cost or risk to selecting one headline over another when there is no real difference in the conversion metric between them. The objective of this type of problem is to maximize the chance of finding the best option, if there is one. If there isn’t one, then there is no cost or risk in just randomly selecting between them (since they perform equally well and have the same cost to deploy). As it turns out, Go For It problems are also good candidates for Bandit methods.

State of the World
Now that we have our two types of problems defined, we can ask under what situations we might find ourselves when we finally make a decision (i.e. select ‘A’ or ‘B’).  There are two possible states of the world when we make our decisions:
1. There isn’t the expected effect/difference between the options
2. There is the expected effect/difference between the options
It is important to keep in mind that in almost all cases we won’t be entirely certain what the true state of the world is, even after we run our experiment (you can thank David Hume for this). This is where our two error types, Type I and Type II, come into play. You can think of these two error types as two situations where our beliefs about the world are not consistent with the true state of the world.
A Type I error is when we pick the alternative option (the ‘B’ option) because we mistakenly believe the true state of the world is ‘Effect’, when in fact there is no effect. Alternatively, a Type II error is when we pick ‘A’ (stay the course), thinking that there is no effect, when the true state of the world is ‘Effect’.

The difference between the ‘Do No Harm’ and ‘Go For It’ problems is in how costly it is to make a Type I error.
The table below is the payoff matrix for each error for ‘Do No Harm’ problems
Payoff: Do No Harm – the columns are the true state of the world (unknown):

| Decision | Expected Effect | No Expected Effect |
|----------|-----------------|--------------------|
| Pick A   | Opportunity Cost | No Cost |
| Pick B   | No Opportunity Cost | Cost |
Notice that if we pick B when there is no effect, we make a Type I error and suffer a cost.  Now let's look at the payoff table for the 'Go For It' problem.
Payoff: Go For It – the columns are the true state of the world (unknown):

| Decision | Expected Effect | No Expected Effect |
|----------|-----------------|--------------------|
| Pick A   | Opportunity Cost | No Cost |
| Pick B   | No Opportunity Cost | No Cost |

Notice that the payoff tables for Do No Harm and Go For It are the same when the true state of the world is that there is an effect.  But they differ when there is no effect: in the Go For It case, there is NO relative cost to selecting either A or B.

Why is this way of thinking about optimization problems useful?
Because it can help us choose what type of approach to take based on the problem.
In the Do No Harm problem we need to be mindful of Type I errors, because they are costly, so we need to factor in the risk of making them when we design our experiments.  Managing this risk is exactly what classical hypothesis testing does.
That is why for 'Do No Harm' problems it is best practice to run a classic, robust AB test.  This is because we care more about minimizing our risk of doing harm (the cost of a Type I error) than about any benefit we might get from rushing through the experiment (the cost of information).

However, it also means that if we have a 'Go For It' problem and there is no effect, we don't really care how we make our selection.  Picking randomly when there is no effect is fine, since each of the options has the same value.  It is in this case that our simple test of just picking the highest-value option makes sense.

## Go For It: Tests with no P-Values

Finally we can get to the simple, no p-value test.  This test guarantees that if there is a true difference of the minimum detectable effect (MDE), or larger, one will choose the better-performing arm X% of the time, where X is the power of the test.
Here are the steps:

1) Calculate the sample size
2) Collect the data
3) Pick whichever option has the highest raw conversion value. If a tie, flip a coin.
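The decision rule in step 3 is nearly trivial to implement. As a sketch (in Python, with function and argument names of my own choosing):

```python
import random

def pick_winner(conv_rate_a, conv_rate_b):
    """Step 3 of the rule: take whichever option has the highest raw
    conversion value; flip a coin on an exact tie."""
    if conv_rate_a == conv_rate_b:
        return random.choice(["A", "B"])
    return "A" if conv_rate_a > conv_rate_b else "B"

print(pick_winner(0.041, 0.052))  # prints B
```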

Calculating the sample size is almost exactly the same as in a standard test:  1) pick a minimum detectable effect (MDE) – this is our minimum desired lift; 2) select the power of the test.

Ah, I hear you asking, 'What about the alpha, don't we need to select a confidence level?' Here is the trick. We want to select randomly when there is no effect. By setting alpha to 0.5, the test rejects the null 50% of the time when the null is true (no effect).

Let's go through a simple example to make this clear.  Let's say your landing page tends to have a conversion rate of around 4%.  You are trying out a new alternative offer, and a meaningful improvement for your business would be a lift to a 5% conversion rate. So the minimum detectable effect (MDE) for the test is 0.01 (1 percentage point).

We then estimate the sample size needed to find the MDE if it exists. Normally we would pick an alpha of 0.05, but now we are instead going to use an alpha of 0.5.  The final step is to pick the power of the test; let's use a good one, 0.95 (often folks pick 0.8, but for this case we will use 0.95).

You can now use your favorite sample size calculator (for Conductrics users this is part of the set-up workflow).

If you use R, this will look something like:

```r
power.prop.test(n = NULL, p1 = 0.04, p2 = 0.05, sig.level = 0.5,
                power = 0.95, alternative = "one.sided", strict = FALSE)
```

This gives us a sample size of 2,324 per option, or 4,648 in total.  If we were to run this test with a confidence of 95% (alpha = 0.05), we would need almost four times the traffic: 9,299 per option, or 18,598 in total.
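If you don't have R handy, the same arithmetic can be sketched in plain Python with the standard normal-approximation formula for two proportions (this is an illustrative sketch of my own, not the Conductrics implementation; results land within a few observations of `power.prop.test` because of rounding differences):

```python
from math import ceil
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.5, power=0.95):
    """Per-arm n for a one-sided two-proportion z-test
    (normal approximation, no continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # 0.0 when alpha = 0.5
    z_beta = NormalDist().inv_cdf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2)

print(sample_size(0.04, 0.05, alpha=0.5))   # about 2,325 per arm
print(sample_size(0.04, 0.05, alpha=0.05))  # about 9,297 per arm
```

Note how setting alpha to 0.5 zeroes out the `z_alpha` term entirely; only the power term is left driving the sample size.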

The following is a simulation of 100K experiments, where each experiment selected each arm 2,324 times.  The conversion rate for B was set to 5% and for A to 4%. The chart below plots the difference in the conversion rates between A and B.  Not surprisingly, it is centered on the true difference of 0.01.  The main thing to notice is that if we pick the option with the highest conversion rate, we pick B 95% of the time, which is exactly the power we used to calculate the sample size!

Notice – no p-values, just a simple rule to pick whatever is performing best, yet we still get all of our power goodness! And we only needed about a fourth of the data to reach that power.
Now let's see what our simulation looks like when both A and B have the same conversion rate of 4% (the null is true).
Notice that the difference between A and B is centered at '0', as we would expect. Using our simple decision rule, we pick B 50% of the time and A 50% of the time.
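Both of these simulations are easy to reproduce at a smaller scale. A minimal sketch in plain Python (standard library only, with names of my own choosing), running a few hundred simulated experiments per scenario and tallying how often the pick-the-highest rule chooses B:

```python
import random

def pick_b_rate(p_a, p_b, n=2324, sims=500, seed=7):
    """Fraction of simulated experiments in which arm B shows the higher
    raw conversion count (ties broken by a coin flip)."""
    rng = random.Random(seed)
    picks_b = 0
    for _ in range(sims):
        conv_a = sum(rng.random() < p_a for _ in range(n))
        conv_b = sum(rng.random() < p_b for _ in range(n))
        if conv_b > conv_a or (conv_b == conv_a and rng.random() < 0.5):
            picks_b += 1
    return picks_b / sims

print(pick_b_rate(0.04, 0.05))  # close to 0.95 when the 1% lift is real
print(pick_b_rate(0.04, 0.04))  # close to 0.50 under the null
```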

Now, if we had a Do No Harm problem, this would be a terrible way to make our decisions, because half the time we would select B over A and incur a cost. So you still have to do the work and determine your relative costs around data collection, Type I, and Type II errors.
While I was doing some research on this, I came across Georgi Z. Georgiev's Analytics Toolkit. It has a nice calculator that lets you select your optimal risk balance between these three factors. He also touches on running tests with an alpha of 0.5 in this blog post. Go check it out.

As I mentioned above, we can also think of our Go For It problems as a bandit.  Bandit solutions that first randomly collect data and then apply the 'winner' are known as epsilon-first (to be fair, all AB Testing for decision making can be thought of as epsilon-first). Epsilon stands for how much of your time you spend in the data collection phase.  In this way of looking at the problem, the sample size output from our sample size calculation (based on MDE and power) is our epsilon – how long we let the bandit collect data in order to learn.
What is interesting is that, at least in the two-option case, this easy method gives us roughly the same results as an adaptive Bandit method.  Google has a nice blog post on Thompson Sampling, which is a near-optimal way to dynamically solve bandit problems. We also use Thompson Sampling here at Conductrics, so I thought it might be interesting to compare their results on the same problem.
In one example, they run a Bandit with two arms, one with a 4% conversion rate, and the other with a 5% conversion rate – just like our example. While they show the Bandit performing well, needing only an average of 5,120 samples, you will note that that is still slightly higher than the fixed amount we used (4,648 samples) in our super simple method.
This doesn't mean that we don't ever want to use Thompson Sampling for bandits. As we increase the number of possible options, and many of those options are strictly worse than the others, running Thompson Sampling or another adaptive design can make a lot of sense. (That said, by using a multiple comparison adjustment, like the Šidák correction, I found that one can include K>2 arms in the simple epsilon-first method and still get Type II power guarantees. But, as I mentioned, this becomes a much less competitive approach if there are arms that are much worse than the MDE.)
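The Šidák correction itself is a one-line formula. As a hedged sketch (my own framing of the adjustment; `sidak_alpha` is a name I made up), the per-comparison alpha can be set so the family-wise alpha stays at 0.5 across the K-1 comparisons against the best arm:

```python
def sidak_alpha(alpha_family=0.5, comparisons=2):
    """Per-comparison alpha so the family-wise error rate stays at
    alpha_family across `comparisons` independent tests (Sidak correction)."""
    return 1 - (1 - alpha_family) ** (1 / comparisons)

# With 3 arms there are 2 comparisons against the best arm:
print(round(sidak_alpha(0.5, 2), 4))  # 0.2929
```

A smaller per-comparison alpha means a nonzero critical value again, and hence a larger sample size per arm, which is part of why the approach loses its edge as K grows.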

### The Weeds

You may be hesitant to believe that such a simple rule can accurately help detect an effect.  I checked in with Alex Damour, the Neyman Visiting Assistant Professor over at UC Berkeley and he pointed out that this simple approach is equivalent to running a standard t-test of the following form. From Alex:

“Find N such that P(meanA-meanB < 0 | A = B + MDE) < 0.05. This is equal to the N needed to have 95% power for a one-sided test with alpha = 0.5.

Proof: Setting alpha = 0.5 sets the rejection threshold at 0. So a 95% power means that the test statistic is greater than zero 95% of the time under the alternative (A = B + MDE). The test statistic has the same sign as meanA-meanB. So, at this sample size, P(meanA – meanB > 0 | A = B + MDE) = 0.95.”
To help visualize this, we can rerun our simulation, but run our test using the above formulation.
Under the Null (No Effect) we have the following result
We see that the T-scores are centered around ‘0’.  At alpha=0.5, the critical value will be ‘0’.  So any T-score greater than ‘0’ will lead to a ‘Rejection’ of the null.
If there is an effect, then we get the following result.
Our distribution of T-scores is shifted to the right, such that only 5% of them (on average) are below ‘0’.
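Alex's equivalence is easy to check numerically: at alpha = 0.5 the one-sided critical value is exactly zero, so 'reject when the statistic exceeds the critical value' collapses to 'pick whichever arm has the higher observed rate.' A small sketch using a pooled two-proportion z-statistic and made-up conversion counts (the counts and function names here are illustrative, not from the simulation above):

```python
from math import sqrt
from statistics import NormalDist

z_crit = NormalDist().inv_cdf(1 - 0.5)  # one-sided critical value at alpha = 0.5
print(z_crit)  # 0.0

def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z-statistic; its sign matches (rate_b - rate_a)."""
    pa, pb = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return (pb - pa) / se

# Rejecting when z > 0 is the same decision as picking the higher raw rate:
z = z_stat(conv_a=93, conv_b=116, n=2324)  # made-up counts near 4% vs 5%
print(z > z_crit, 116 / 2324 > 93 / 2324)  # both True
```

Swapping the counts flips the sign of the statistic and the decision, which is the whole point: with the critical value at zero, the test reduces to a comparison of raw rates.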

### A Few Final Thoughts

Interesting, at least to me, is that the alpha=0.5 way to solve 'Go For It' problems straddles two of the main approaches in our optimization toolkit.  Depending on how you look at it, it can be seen as either: 1) a standard t-test (albeit one with the critical value set to '0'); or 2) an epsilon-first approach to solving a multi-arm bandit.
Looking at it this way, the 'Go For It' way of thinking about optimization problems can help bridge our understanding between the two primary ways of solving optimization problems. It also hints that as one moves away from Go For It toward Do No Harm (higher Type I costs), classic, robust hypothesis testing is probably the best approach. As we move toward Go For It, one might want to rethink the problem as a multi-arm bandit.
Have fun, don’t get discouraged, and remember optimization is HARD – but that is what makes all the effort required to learn worth it!