Video: Conductrics Platform Overview

Nate Weiss, Conductrics CTO, demonstrates key new features and functionalities of the latest software release.  This latest release focuses on streamlining and improving the user experience for faster and easier execution of optimization programs.
The new features that are demonstrated include: 
  • A revamped and simplified user experience
  • Streamlined A/B Testing/Optimization workflows
  • Improved Program Management tools
  • Upgraded API and mobile developer libraries

Posted in Uncategorized | Leave a comment

Conductrics Announces Updated Release of its SaaS Experimentation Platform

SEPTEMBER 24, 2020 – AUSTIN, TX – Digital experimentation and Artificial Intelligence (AI) SaaS company Conductrics today announced the latest major release of its experience optimization platform, built expressly for marketers, developers, and IT professionals. The updated platform, which shares the company’s name, is a cloud-based A/B testing and decision optimization engine. This latest release focuses on streamlining and improving the user experience for faster and easier execution of optimization programs.

These upgrades make it easier for clients to scale optimization programs across different organizational and functional teams in order to deliver ideal digital experiences for their customers. The new features include: 

  • A revamped and simplified user experience (UX), 
  • Streamlined A/B Testing and Optimization workflows, 
  • Improved Program Management tools,  
  • Upgraded API and mobile developer libraries.

“Since our start in 2010, our goal has been to make it faster and easier for developers and marketers to work together in order to confidently discover and deliver the best customer experiences,” says Matt Gershoff, Conductrics’ co-founder and CEO. “As other technologies morph and become increasingly more complex, we remain focused on developing accessible, leading-edge optimization and experimentation technology.” 

The new release will be available in mid-October.  Current clients will have the option to use the legacy platform or the new platform – no action is needed on their part. A webinar will be held on October 13th to demonstrate the new features and benefits – a link to register is on the company website.

About Conductrics

In 2010, Conductrics released one of the industry’s first REST APIs for delivering AB Testing, multi-arm bandits and predictive targeting to empower both Marketing and IT professionals. With Conductrics, marketers, product managers, and consumer experience stakeholders can quickly and easily optimize the customer journey, while IT departments benefit from the platform’s simplicity, ease of use, and integration with existing technology stacks. Visit Conductrics at 

For more information, contact


Headline Optimization at Scale: Conductrics Macros

Conductrics has a history of providing innovative approaches to experimentation, testing and optimization.  In 2010, we introduced one of the industry’s first REST APIs for experimentation, which also supported multi-arm bandits and predictive targeting.  Continuing our goal to provide innovative solutions, we’ve developed Express Macros and Templates to safely and easily build tests at scale.

To illustrate, let’s say you frequently run headline testing on a news page or change certain areas of a landing page over and over. In such situations, you don’t want test authors to have full page-editing capabilities – you just want them to have access to specific sections of the page or the site. Additionally, because variations of the same basic test will be conducted over and over, it is imperative that the test setup is simple, easy to repeat and scalable.

Conductrics Templates and Express Macros

Conductrics Templates and Express Macros are an answer to this. They are an easy way to create reusable forms that allow your team to set up and run multiple versions of similar tests just by filling out a simple form and clicking a button.

EXAMPLE: When to Use Conductrics Express Macros

One of our national media clients wanted to optimize News headlines for their news homepage. This meant that rather than just provide a single Headline for each news article, the client wanted to try out several potential Headlines for each story, see which worked best, and then use the winning headline going forward. To do this, they wanted to take advantage of Conductrics multi-armed bandits, which automatically discover and deploy the most effective Headline for each story.

However, they have scores of articles each day, so they needed to be able to run hundreds of these tests every month. In addition, these tests needed to be set up by multiple, non-technical editors safely, so as to not risk breaking the homepage. 

This is where Express Macros helped. Macros let the client extend and customize the Conductrics test creation process by:

1) creating simple custom forms to make it easy to set up each test; and

2) applying custom JavaScript to that form in order for the test to execute properly. 

How the Express Macro Works

For example, this macro will create a form with two input fields, “Post Id” and “Article Headline”. You can, of course, create any number of fields that are needed.

Now that we have specified what data to request from the test creator, we now just need to provide the JavaScript that will use the values taken in by the Macro’s form to run the test. In this case we will want to use the ‘Post Id’ (alternatively, this could be a URL or some other identifier) to tell Conductrics on which article to run the test. We also include in our JavaScript logic to swap in the alternative Headline(s) for the test. 

Here is an example of what that might look like: 

While this might look complicated if you don’t know JavaScript, don’t worry, this is something any front-end developer can do easily (or you can just ask us for help). 

All that there is left to do is to name and save it. I have named it ‘Headline Optimization’.

There is just one last step before we can let our Headline editors start to run tests, and that is to assign the Macro to a Template. 

Template Example

Express Macros were developed to bridge the workflows of programmers and non-technical users.  Now that the Macro has been created by the programmer, it can be assigned to a Template for use by non-technical users. This makes the process easy to use, scalable, reproducible, and secure.

Creating a Template is just like setting up any Conductrics Agent. The only difference is that by assigning the Agent to a Macro, it will become a Template that can be used to generate new tests easily. 

For example, here I have created an Agent named ‘Headline Optimization’. In the bottom portion of the set-up page, I select ‘Template’. This brings up a list of all of the Macros I am authorized to use. In this case, there is just the ‘Headline Optimization’ Macro we just created. By selecting this Macro, the Agent will be converted into a Template for all of the ‘Headline Optimization’ tests going forward.

Now comes the amazing part. All the test creator needs to do is go to the Conductrics Agent List Page, and they will see the custom button created for Headline Tests. 

Clicking the ‘Headline Optimization’ button will bring up the custom form. For our simple ‘Headline Optimization’ example, it looks like this:

Notice that it asks for two pieces of information, the Post Id and the alternative Article Headline to test (you can add multiple alternative headlines to each test using the ‘Add Another’ option).

Once the Post Id and the alternative headline are entered, the test author just clicks ‘Create’ and that’s it! The test will be scheduled and pushed live. 

Not only does this make it super simple for non-technical users to set up hundreds of these tests easily, but it also provides guard rails to prevent accidental, or unintended, selections of erroneous page sections. 


Communication of test results is automated with Conductrics notification streams.  Top-level results of each ‘Headline Optimization’ test are sent directly to the team’s Slack channel, where they also reach company members who are not Conductrics users.  This way all relevant stakeholders can be part of the discussion around what types of Headlines seem to be most compelling and effective.

Here is a simple example – once the Conductrics bandit algorithm has selected a winner, a notification with summary information like the following is sent to the client’s Slack.

The winning variation is noted, along with the number of visitors and the click through rate for each headline. 


In this example, the client was able to scale from a handful of tests per month to hundreds of tests per month, and the guard rails allowed multiple non-technical users to have more control over the testing while freeing the developers to work on more complex problems. 

Express Macros and Templates are the ideal solution for digital marketers and CX professionals who have multiple, repeatable versions of a particular test design. They streamline the process, allow for set up in an easy-to-use form, and ensure compliance by controlling what can be modified. Express Macros solve the problem of so many ideas, so little time. If you would like to learn more about Conductrics Express Macros and Templates, please contact us.


Getting Past Statistical Significance: Foundations of AB Testing and Experimentation

How often is AB Testing reduced to the following question:  ‘what sample size do I need to reach statistical significance for my AB Test?’  On the face of it, this question sounds reasonable. However, unless you know why you want to run a test at a particular significance level, or what the relationship is between sample size and that significance level, then you are most likely missing some basic concepts that will help you get even more value out of your testing programs.

There are also a fair number of questions around how to run AB Tests; what methods are best; and the various ‘gotchas’ to look out for.  In light of this, I thought it might be useful to step back, and just review some of the very basics of experimentation and why we are running hypothesis tests in the first place. This is not a how-to guide, nor a collection of different types of tests to run, or even a list of best practices.

What is the problem we are trying to solve with experiments?

We are trying to isolate the effect on some objective result, if any, of taking some action in our website (mobile apps, call centers, etc.). For example, if we change the button color to blue rather than red, will that increase conversions, and if so, by how much?

What is an AB test?

AB and multivariate tests are versions of randomized controlled trials (RCT). An RCT is an experiment where we take a sample of users and randomly assign them to control and treatment groups. The experimenter then collects performance data, for example conversions or purchase values, for each of the groups (control, treatment).

I find it useful to think of RCTs as having three main components:  1) data collection; 2) estimating effect sizes; and 3) assessing our uncertainty of the effect size and mitigating certain risks around making decisions based on these estimates.


Collection of the Data

What do you mean by sample?

A sample is a subset of the total population under investigation. Keep in mind that in most AB testing situations, while we randomly assign users to treatments, we don’t randomly sample. This may seem surprising, but in the online situation the users present themselves to us for assignment (e.g. they come to the home page).  This can lead to selection bias if we don’t try to account for this nonrandom sampling in our data collection process. Selection bias will make it more difficult, if not impossible, to draw conclusions about the population we are interested in from our test results. One often effective way to mitigate this is by running our experiments over full weeks, or months etc. to try to ensure that our samples look as much as possible like our user/customer population.

Why do we use randomized assignments?

Because of “Confounding”. I will repeat this several times, but confounding is the single biggest issue in establishing a causal relation between our treatments and our performance measure.

What is Confounding?

Confounding is when the treatment effect gets mixed together with the effects from any other outside influence. For example, consider we are interested in the treatment effect of Button Color (or Price, etc.) on conversion rate (or average order size, etc). When assigning users to button color we give everyone who visits on Sunday the ‘Blue’ button treatment, and everyone on Monday the ‘Red’ button treatment. But now the ‘Blue’ group is comprised of both Sunday users and the Blue Button, and the ‘Red’ group is both Monday users and the Red Button. Our data looks like this:

Sunday:Red 0%   Monday:Red 100%
Sunday:Blue 100%   Monday:Blue 0%

We have mixed together the data such that any of the user effects related to day are tangled together with the treatment effects of button color.

What we want is for each of our groups to both: 1) look like one another except for the treatment selection (no confounding); and 2) look like the population of interest (no selection bias).

If we randomly assign the treatments to users, then we should on average get data that looks like this:

Sunday:Red 50%   Monday:Red 50%
Sunday:Blue 50%   Monday:Blue 50%

Where each day we have a 50/50 split of button color treatments.  Here the relationship between day and button assignment is broken, and we can estimate the average treatment effects without having to worry as much about influences of outside effects (this isn’t perfect of course, since it holds only on average – it is possible due to sampling error that for any given sample we don’t have a balanced sample over all of the cofactors/confounders.)
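To make the Sunday/Monday example concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the day-level conversion rates (10% on Sunday, 14% on Monday), the function names, and the fact that button color has no real effect at all. It compares the estimated effect under day-based assignment with the estimate under random assignment.

```python
import random

def simulate(assign, n=100_000, seed=0):
    """Return estimated (Blue - Red) conversion rate difference for an assignment rule."""
    rng = random.Random(seed)
    base_rate = {"Sunday": 0.10, "Monday": 0.14}   # hypothetical day effect
    conversions = {"Blue": 0, "Red": 0}
    visitors = {"Blue": 0, "Red": 0}
    for _ in range(n):
        day = rng.choice(["Sunday", "Monday"])
        color = assign(day, rng)
        # Button color has NO true effect here: conversion depends on day only.
        conversions[color] += rng.random() < base_rate[day]
        visitors[color] += 1
    blue = conversions["Blue"] / visitors["Blue"]
    red = conversions["Red"] / visitors["Red"]
    return blue - red

def assign_by_day(day, rng):
    # Confounded: everyone on Sunday gets Blue, everyone on Monday gets Red.
    return "Blue" if day == "Sunday" else "Red"

def assign_randomly(day, rng):
    # Randomized: the day has no influence on which button a user sees.
    return rng.choice(["Blue", "Red"])

confounded_effect = simulate(assign_by_day)
randomized_effect = simulate(assign_randomly)
print(f"confounded: {confounded_effect:+.3f}  randomized: {randomized_effect:+.3f}")
```

Under day-based assignment the estimate picks up the whole day effect (about -0.04 with these made-up rates) even though the button does nothing, while the randomized estimate stays close to zero.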

Of course, this mixing need not be this extreme – it is often much more subtle. When Ronny Kohavi advises us to be alert to ‘sample ratio mismatch’, it is because of confounding. For example, say a bug in the treatment arm breaks the experience in such a way that some users don’t get assigned. If this happens only for certain types of users, perhaps just for users on old browsers, then we no longer have a fully randomized assignment.  The bug breaks randomization and lets the effect of old browsers leak in and mix with the treatment effect.
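One common way to check for a sample ratio mismatch is a chi-square goodness-of-fit test on the assignment counts. The sketch below is only an illustration under stated assumptions: the function name, the intended 50/50 split, and the example counts are all made up, and this is not a description of any particular tool's check.

```python
from math import erfc, sqrt

def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) on assignment counts."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total * (1 - expected_control_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

balanced = srm_p_value(5000, 5000)      # counts match the intended split
suspicious = srm_p_value(5000, 4700)    # treatment arm is missing users
print(f"balanced p = {balanced:.3f}, suspicious p = {suspicious:.4f}")
```

A very small p-value says the observed split is unlikely under the intended assignment ratio, which is a cue to look for a bug before trusting the treatment effect.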

Confounding is the main issue one should be concerned about in AB Testing.  Get this right and you are most of the way there – everything else is secondary IMO.

Estimating Treatment Effects

We made sure that we got random assignments, now what?

Well, one thing we might want to do is use our data to get an estimate of the conversion rate (or AOV etc.) for each group in our experiment.  The estimate from our sample will be our best guess of what the true treatment effect will be for the population under study.

For most simple experiments we usually just calculate the treatment effect using the sample mean from each group and subtract the control from the treatment:  (Treatment Conversion Rate) – (Control Conversion Rate) = Treatment Effect.  For example, with Blue Button as the treatment and Red Button as the control, if we estimate that Blue Button has a conversion rate of 0.10 (10%) and Red Button has a conversion rate of 0.11 (11%), then the estimated treatment effect is 0.10 – 0.11 = -0.01.
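As a small worked sketch of that subtraction, assuming Blue is the treatment, Red is the control, and using made-up counts of 1,000 visitors per group (the function names are illustrative only):

```python
def conversion_rate(conversions, visitors):
    """Sample conversion rate for one group."""
    return conversions / visitors

def treatment_effect(treatment_rate, control_rate):
    """Estimated effect: treatment rate minus control rate."""
    return treatment_rate - control_rate

blue = conversion_rate(100, 1000)   # treatment group: 0.10
red = conversion_rate(110, 1000)    # control group: 0.11
effect = treatment_effect(blue, red)
print(f"estimated treatment effect: {effect:+.2f}")
```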

Estimating Uncertainty

Of course the goal isn’t to calculate the sample conversion rates, the goal is to make statements about the population conversion rates.  Our sample conversion rates are based on the particular sample we drew. We know that if we were to have drawn another sample, we almost certainly would have gotten different data, and would calculate a different sample mean (if you are not comfortable with sampling error, please take a look at

One way to assess uncertainty is by estimating a confidence interval for each treatment and control group’s conversion rate. The main idea is that we construct an interval using a procedure that is guaranteed to contain, or trap, the true population conversion rate with a frequency that is determined by the confidence level. So if we were to repeat the experiment many times, 95% confidence intervals would contain the true population conversion rate 95% of the time.  We could also calculate the difference in conversion rates between our treatment and control groups and calculate a confidence interval around this difference.
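Here is a minimal sketch of a confidence interval around the difference in conversion rates. It assumes the simple normal-approximation (Wald) interval, which is only one of several ways to build such an interval, and the counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_t, n_t, conv_c, n_c, level=0.95):
    """Normal-approximation (Wald) interval for treatment rate minus control rate."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    z = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for 95%
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    diff = p_t - p_c
    return diff - z * se, diff + z * se

# Hypothetical data: 10% vs 11% conversion with 10,000 users per group.
low, high = diff_confidence_interval(1000, 10_000, 1100, 10_000)
print(f"95% CI for the difference: ({low:.4f}, {high:.4f})")
```

With these made-up counts the interval runs from roughly -0.018 to -0.002, so it conveys both the size of the estimated effect and how uncertain that estimate is.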

Notice that so far we have been able to: 1) calculate the treatment effect; and 2) get a measure of uncertainty in the size of the treatment effect with no mention of hypothesis testing.

Mitigating Risk

If we can estimate our treatment effect sizes and get a measure of our uncertainty around the size, why bother running a test in the first place? Good question.  One reason to run a test is to control for two types of error we can make when taking an action on the basis of our estimated treatment effect.

Type 1 Error – a false positive.

  1. One Tail: We conclude that the treatment has a positive effect size (it is better than the control) when it doesn’t have a real positive effect (it really isn’t any better).
  2. Two Tail: We conclude that the treatment has a different effect than the control (it is either strictly better or strictly worse) when it doesn’t really have a different effect than the control.

Type 2 Error – a false negative.

  1. One Tail: We conclude that the treatment does not have a positive effect size (it isn’t better than the control) when it does have a real positive effect (it really is better).
  2. Two Tail: We conclude that the treatment does not have a different effect than the control (it isn’t either strictly better or strictly worse) when it really does have a different effect than the control.

How to specify and control the probability of these errors?

Controlling Type 1 errors – the probability that our test will make a Type 1 error is called the significance level of the test. This is the alpha level you have probably encountered. An alpha of 0.05 means that we want to run the test so that, when there is no real effect, we only make Type 1 errors up to 5% of the time. You are of course free to pick whatever alpha you like – perhaps an alpha of 1% may make more sense for your use case, or maybe an alpha of 0.1%. It is totally up to you! It all depends on how damaging it would be for you to take some action based on a positive result, when the effect doesn’t exist. Also keep in mind that this does NOT mean that if you get a significant result, only 5% (or whatever your alpha is) of the time it will be a false positive.   The rate at which a significant result is a false positive will depend on how often you run tests that have real effects.  For example, if you never run any experiments where the treatments are actually any better than the control, then all of your significant results will be false positives.  In this worst-case situation, you should expect to see significant results in up to 5% (alpha%) of your tests, and all of them will be false positives (Type 1 errors).

You should spend as much time as needed to grok this idea, as it is the single most important idea you need to know in order to thoughtfully run your AB Tests.
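The difference between alpha and the share of significant results that are false positives can be sketched with a little arithmetic. The function below is an illustration only; it assumes every real effect is detected with the same power, and the parameter name share_real (the fraction of tests that have a real effect) is made up for this example.

```python
def false_positive_share(share_real, alpha=0.05, power=0.8):
    """Fraction of significant results that are false positives."""
    false_positives = alpha * (1 - share_real)   # no real effect, but significant
    true_positives = power * share_real          # real effect, and detected
    return false_positives / (false_positives + true_positives)

never_real = false_positive_share(0.0)    # no test ever has a real effect
rarely_real = false_positive_share(0.1)   # 1 in 10 tests has a real effect
often_real = false_positive_share(0.5)    # half of the tests have a real effect
print(never_real, round(rarely_real, 2), round(often_real, 3))
```

With alpha fixed at 0.05, the share of significant results that are false positives ranges from 100% (no real effects ever) down to about 6% (half of all tests have real effects), which is why alpha alone does not tell you how trustworthy a given significant result is.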

Controlling Type 2 errors – this is based on the power of the test, which is in turn based on the beta. For example, a beta of 0.2 (Power of 0.8) means that of all of the times that the treatment is actually superior to the control, your test would fail, on average, to discover this up to 20% of the time. Of course, like the alpha, it is up to you, so maybe power of 0.95 makes more sense, so that you make a type 2 error only up to 5% of the time.  Again, this will depend on how costly you consider this type of mistake.  This is also important to understand well, so spend some time thinking about what this means.  If this isn’t totally clear, see a more detailed explanation of Type 1 and Type 2 errors here

What is amazing, IMO, about hypothesis tests is that, assuming that you collect the data correctly, you are guaranteed to limit the probability of making these two types of errors based on the alpha and beta you pick for the test. Assuming we are mindful about confounding, all we need to do is collect the correct amount of data. When we run the test after we have collected our pre-specified sample, we can be assured that we will control these two errors at our specified levels.


“The sample size is the payment we must make to control Type 1 and Type 2 errors.”


What about Sample Size?

There is a relationship between alpha, beta, and the associated sample size. In a very real way, the sample size is the payment we must make in order to control Type 1 and Type 2 errors. Increasing the error control on one means you either have to lower the control on the other or increase the sample size.  This is what power calculators are doing under the hood — calculating the sample size needed, based on a minimum treatment effect size, and desired alpha and beta.
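As a rough sketch of what such a calculator does for conversion rates, here is the standard normal-approximation formula for a two-tailed test with equal group sizes. The function name and the example rates are assumptions for illustration; real calculators may use slightly different formulas and so give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_treatment vs p_control."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed Type 1 control
    z_beta = NormalDist().inv_cdf(power)            # Type 2 control (power = 1 - beta)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

n_default = sample_size_per_group(0.10, 0.11)                  # alpha 0.05, power 0.8
n_more_power = sample_size_per_group(0.10, 0.11, power=0.95)   # tighter Type 2 control
n_smaller_effect = sample_size_per_group(0.10, 0.105)          # smaller minimum effect
print(n_default, n_more_power, n_smaller_effect)
```

Tightening either error rate, or asking to detect a smaller effect, raises the required sample size, which is the 'payment' described above.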


What about Sample Size for continuous conversion values, like average order value?

Calculating sample sizes for continuous conversion variables is really the same as for conversion rates/proportions. For both we need to have some guess of both the mean of the treatment effect and the standard deviation of the effect.  Because the standard deviation of a proportion is determined by its mean, we don’t need to bother to provide it for most calculators. For continuous conversion variables, however, we need to have an explicit guess of the standard deviation in order to conduct the sample size calculation.

What if I don’t know the standard deviation?

This isn’t exact, and in fact it might not be that close, but in a pinch, you can use the range rule as a hack for the standard deviation.  If you know the minimum and maximum values that the conversion variable can take (or some reasonable guess), you can use standard deviation ≈ (Max-Min)/4 as a rough guess.
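Here is a minimal sketch that combines the range rule with a sample size calculation for a continuous variable such as order value. It assumes the usual normal-approximation formula for comparing two means with equal group sizes and equal standard deviations; the order value range ($0 to $200) and the minimum effect ($5) are invented for illustration.

```python
from math import ceil
from statistics import NormalDist

def range_rule_sd(minimum, maximum):
    """Rough standard deviation guess: a quarter of the range."""
    return (maximum - minimum) / 4

def sample_size_for_means(sd, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / min_effect ** 2)

sd_guess = range_rule_sd(0, 200)                               # hypothetical order values
n_per_group = sample_size_for_means(sd_guess, min_effect=5)    # detect a $5 difference
print(sd_guess, n_per_group)
```

Because the standard deviation enters the formula squared, a guess that is off by a factor of two changes the required sample size by a factor of four, so it is worth replacing the range rule with an estimate from historical data when you have it.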

What if I make a decision before I collected all of the planned data?

You are free to do whatever you like. Trust me, there is no bolt of lightning that will come out of the sky if you stop a test early, or make a decision early. However, it also means that the Type 1 and Type 2 risk guarantees that you were looking to control for will no longer hold. So to the degree that they were important to you and the organization, that will be the cost of taking an early action.
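To see how the Type 1 guarantee erodes, here is a small simulation sketch of A/A tests (no real difference between arms). All settings are assumptions chosen for illustration: a 10% conversion rate, ten looks at the data, 100 users per arm between looks, and a simple pooled z-test at each look.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def false_positive_rates(sims=1000, looks=10, batch=100, rate=0.10,
                         alpha=0.05, seed=1):
    """Return (rate with peeking at every look, rate with one test at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for sim in range(sims):
        conv_a = conv_b = n = 0
        significant_at_some_look = False
        for look in range(looks):
            # Both arms share the same true rate, so any 'winner' is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if abs(z_stat(conv_a, conv_b, n)) > z_crit:
                significant_at_some_look = True
        peeking_hits += significant_at_some_look
        fixed_hits += abs(z_stat(conv_a, conv_b, n)) > z_crit
    return peeking_hits / sims, fixed_hits / sims

peeking_rate, fixed_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.3f}  fixed sample: {fixed_rate:.3f}")
```

Testing once at the planned sample size keeps the false positive rate near the chosen alpha, while stopping at the first significant look pushes it well above alpha.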

What about early stopping with sequential testing?

Yes, there are ways to run experiments in a sequential way.  That said, remember how online testing works. Users present themselves to the site (or app or whatever), and then we randomly assign them to treatments. That is not the same as random selection.

Why does that matter?

Firstly, because of selection bias. If users are self selecting when they present themselves to us, and if there is some structure to when different types of users arrive, then our treatment effect will be a measure of only the users who we have seen, and won’t be a valid measure of the population we are interested in. As mentioned earlier, often the best way to deal with this is to sample in the natural period of your data – normally this is weekly or monthly.

Secondly, while there are certain types of sequential tests that don’t bias our Type 1 and 2 errors, they do, ironically, bias our estimated treatment effect – especially when stopping early, which is the very reason you would run a sequential test in the first place. Early stopping can lead to a type of magnitude bias – where the absolute value of the reported treatment effects will tend to be too large.  There are ways to try to adjust for this, but it adds even more approximation and complexity into the process.

So the fix for dealing with the bias in Type 1 error control due to early stopping/peeking CAUSES bias in the estimated treatment effects, which, presumably, are also of importance to you and your organization.

The Waiting Game

However, if all we do is just wait –  c’mon, it’s not that hard 😉 – and run the test after we collect our data based on the pre-specified sample size, and in weekly or monthly blocks, then we don’t have to deal with any issues of selection bias or biased treatment effects. This is one of those cases where just doing the simplest thing possible gets you the most robust estimation and risk control.

What if I have more than one treatment?

If you have more than one treatment you may want to adjust your Type 1 error control to ‘know’ that you will be making several tests at once. Think of each test as a lottery ticket. The more tickets you buy, the greater the chance you will win the lottery, where ‘winning’ here means making a Type 1 error.

The chance of making at least one Type 1 error over all of the treatments is called the Familywise Error Rate (FWER). The more tests, the more likely you are to make a Type 1 error at a given significance level (alpha). I won’t get into the detail here, but to ensure that the FWER is not greater than your alpha, you can use any of the following methods:  Bonferroni; Sidak; Dunnett’s etc. Bonferroni is the least powerful (in the Type 2 error sense), but is the simplest with the fewest assumptions, so a good safe bet, especially if Type 1 error is a very real concern. One can argue which is best, but it will depend, and for just a handful of comparisons it won’t really matter what correction you use IMO.
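The lottery ticket intuition and two of the corrections can be sketched in a few lines. The FWER formula used here assumes the tests are independent, and the function names are just for illustration.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one Type 1 error across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test alpha under the Bonferroni correction."""
    return alpha / m

def sidak_alpha(alpha, m):
    """Per-test alpha under the Sidak correction (assumes independent tests)."""
    return 1 - (1 - alpha) ** (1 / m)

uncorrected = familywise_error_rate(0.05, 10)
per_test_bonferroni = bonferroni_alpha(0.05, 10)
per_test_sidak = sidak_alpha(0.05, 10)
print(round(uncorrected, 3), per_test_bonferroni, round(per_test_sidak, 5))
```

With ten treatments each tested at alpha 0.05, the chance of at least one false positive is about 40%. Both corrections bring it back to 5% or less, and for ten tests the two per-test thresholds (0.005 vs roughly 0.0051) are nearly identical.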

Another way to measure error across a family of tests is the False Discovery Rate (FDR).  To control for FDR, you could do something like the Benjamini–Hochberg procedure.  While controlling for the FDR means a more powerful test (less Type 2 error), there is no free lunch, and it is at the cost of allowing more Type 1 errors. Because of this, researchers often use the FDR as a first step to screen for possibly interesting treatments in cases where there are many (thousands of) independent tests. Then, from the set of significant results, more rigorous follow-up testing occurs.  Claims about preference around controlling for either FDR or FWER are really implicit statements about relative risk of Type 1 and Type 2 error.
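For reference, here is a minimal sketch of the Benjamini–Hochberg step-up procedure. The p-values are made up for illustration, and q is the FDR level you want to control.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank whose p-value is under (rank / m) * q.
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p_values)
print(decisions)
```

In this example only the two smallest p-values are rejected. Note the step-up behaviour: every hypothesis ranked at or below the cutoff is rejected, even if its own p-value missed its own threshold.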

Wrapping it up

The whole point of the test is to control for risk – you don’t have to run any tests to get estimates of treatment effects, or a measure of uncertainty around those effects. However, it is often a good idea to control for these errors, so the more you understand their relative costs, the better you can determine how much you are willing to pay to reduce the chances of making them. Rather than look at the sample size question as a hassle, perhaps look at it as an opportunity for you and your organization to take stock and discuss what the goals, assumptions, and expectations are for the new user experiences that are under consideration.



What is the value of my AB Testing Program?

Occasionally we are asked by companies how they should best assess the value of running their AB testing programs. I thought it might be useful to put down in writing some of the points to consider if you find yourselves asked this question.

With respect to hypothesis tests, there are two main sources of value:
1) The Upside – reducing Type 2 error. 
This is implicitly what people tend to think about in Conversion Rate Optimization (CRO) – the gains view of testing. When looking to value testing programs, they tend to ask something along the lines of ‘What are the gains that we would miss if we didn’t have a testing program in place?’ One possible approach to measure this is to reach back into the testing toolkit and create a test program control group.  The users assigned to this control group are then shielded from any of  the changes made based on outcomes from the testing process. This control group is then used to estimate a global treatment effect for the bundle of changes over some reasonable time horizon (6 months, a year etc.)  The calculation looks something like:

Total Site Conversion – Control Group Conversion – cost of the testing program.

You can think of this as a sort of meta AB Test.

Of course, in reality, this isn’t going to be easy to do, as forming a clean global control group will often be complicated, if not impossible, and determining how to value the lift over the various possible conversion measures each individual test may have used can be tricky – especially in non-commerce applications.

2) The Downside – mitigating Type 1 loss.
However, if we only consider the explicit gains from our testing program, we ignore another reason for testing – the mitigation of Type 1 errors. Type 1 errors are changes in behaviors that lead to harm, or loss. To estimate the value of mitigating this possible loss, we would need to expose to our control group the changes that we WOULD have made on the site had they not been rejected by our testing. That means that we would need to make changes to the meta control group’s experiences that we have strong reason to think would harm and degrade their experience. Of course this is almost certainly a bad idea, let alone potentially unethical, and it highlights why certain types of questions are not amenable to randomized controlled trials (RCT) – the backbone of AB Testing.

(Anyone out there using instrumental variables or other counterfactual methods? Please comment if you are).

(for a refresher on Type 1 and Type 2 errors please see

But even if we did go down this route (bad idea), it still doesn’t get us a proper estimate of the true value of testing, since even if we don’t encounter harmful events, we still were protected against them.  For example, you may have collision car insurance but have had no accidents over the past year. What was the value of the collision insurance, zero? You sure?  The value of insurance isn’t equal to the amount that ultimately gets paid out. Insurance doesn’t work like that – it is a good that is consumed regardless of whether it pays out or not. What you are paying for is the reduction in downside risk – and that is something that testing provides regardless of whether the adverse event occurs or not.  The difficult part for you is to assess the probability (risk or maybe Knightian uncertainty), and severity of the potential negative results.

The main take away is that the value of testing is in both optimization (finding the improvements); and in mitigating downside risk. To value the latter, we need to be able to price what is essentially a form of insurance against whatever the org considers to be intolerable downside risk. It is like asking what is the value of insurance, or privacy policies, or security policies. You may get by in any given year without them, but as you scale up, the risks of downside events grow, making it more and more likely that a significant adverse event will occur.

One last thing. Testing programs tend to jumble together the related, but separate, concepts of hypothesis testing (the binary decision to Reject/Fail to Reject) and the estimation of effect sizes (the best guess for the ‘true’ population conversion rates).  I mention this because often we just think about the value of the actions taken based on the hypothesis tests, rather than also considering the value of robust estimates of the effect sizes for forecasting, ideation, and for helping allocate future resources.  (As an aside, one can run an experiment that has a robust hypothesis test but also yields a biased estimate of the effect size (magnitude error). Sequential testing, I’m looking at you!)

Ultimately, testing can be seen as both a profit (upside discovery) AND cost (downside mitigation) center.  Just focusing on one will lead to underestimating the value your testing program can provide to the organization.  That said, it is a fair question to ask, and one that hopefully will help lead to extracting even more value from your experimentation efforts.

What are your thoughts? Is there anything we are missing, or should consider? Please feel free to comment and let us know how you value your testing program.


Do No Harm or AB Testing without P-Values

A few weeks ago I was talking with Kelly Wortham during her excellent AB Testing webinar series.  During the conversation, one of the attendees asked whether, if they just wanted to pick between A and B, they really needed to run standard significance tests at a 90% or 95% confidence level.

The simple answer is no.  In fact, in certain cases, you can avoid dealing with p-values (or priors and posteriors) altogether and just pick the option with the highest conversion rate.

Even more interesting, at least to me, is that this simple approach can be viewed either as a form of classical hypothesis testing or as an epsilon-first solution to the multi-arm bandit problem.

Before we get into our simple testing trick, it might be helpful to first revisit a few important concepts that underpin why we are running tests in the first place.

The reason we run experiments is to help determine how different marketing interventions will affect the user experience.  The more data we collect, the more information we have, which reduces our uncertainty about the effectiveness of each possible intervention.  Since data collection is costly, the question that always comes up is ‘how much data do I really need to collect?’


In a way, every time we run an experiment, we are trying to balance the following: 1) the cost of making a Type I error; 2) the cost of making a Type II error; and 3) the cost of data collection to help reduce our risk of making either of these errors.
To help answer this question, I find it helpful to organize optimization problems into two high level classes of problems:


1) Do No Harm
These are problems where there is either:
  1. an existing process, or customer experience, that the organization is counting on for at least a certain minimum level of performance.
  2. a direct cost associated with implementing a change.
For example, while it would be great if we could increase conversions from an existing check out process, it may be catastrophic if we accidentally reduced the conversion rate. Or, perhaps we want to use data for targeting offers, but there is a real direct cost we have to pay in order to use the targeting data.  If it turns out that there is no benefit to the targeting, we will incur the additional data cost without any upside, resulting in a net loss.
So, for the ‘Do No Harm’ type of problem we want to be pretty sure that if we do make a change, it won’t make things worse. For these problems we want to stay the current course unless we have strong evidence to take an alternative action.


2) Go For It
In the ‘Go For It’ type of problem there often is no existing course to follow.  Here we are selecting between two or more novel choices AND we have symmetric costs, or loss, if we make a Type I error (reviewed below).

A good example is headline optimization for news articles.  Each news article is, by definition, novel, as are the associated headlines.  Assuming that one has already decided to run headline optimization (which is itself a ‘Do No Harm’ question), there is no added cost, or risk to selecting one or the other headlines when there is no real difference in the conversion metric between them. The objective of this type of problem is to maximize the chance of finding the best option, if there is one. If there isn’t one, then there is no cost or risk to just randomly select between them (since they perform equally as well and have the same cost to deploy).  As it turns out, Go For It problems are also good candidates for Bandit methods.

State of the World
Now that we have our two types of problems defined, we can ask under what situations we might find ourselves when we finally make a decision (i.e. select ‘A’ or ‘B’).  There are two possible states of the world when we make our decisions:
  1. There isn’t the expected effect/difference between the options
  2. There is the expected effect/difference between the options
It is important to keep in mind that in almost all cases we won’t be entirely certain what the true state of the world is, even after we run our experiment (you can thank David Hume for this).  This is where our two error types, Type I and Type II come into play. You can think of these two error types as really just two situations where our Beliefs about the world are not consistent with the true state of the world.
A Type I error is when we pick the alternative option (the ‘B’ option), because we mistakenly believe the true state of the world is ‘Effect’.  Alternatively, a Type II error is when we pick ‘A’ (stay the course), thinking that there is no effect, when the true state of the world is ‘Effect’.


The difference between the ‘Do No Harm’ and ‘Go For It’ problems is in how costly it is to make a Type I error.
The table below is the payoff matrix for each error for ‘Do No Harm’ problems.

Payoff: Do No Harm (columns are the true, unknown, state of the world)
Decision | Expected Effect     | No Expected Effect
Pick A   | Opportunity Costs   | No Cost
Pick B   | No Opportunity Cost | Cost
Notice that if we pick B when there is no effect, we make a Type I error and suffer a cost.  Now let’s look at the payoff table for the ‘Go For It’ problem.
Payoff: Go For It (columns are the true, unknown, state of the world)
Decision | Expected Effect     | No Expected Effect
Pick A   | Opportunity Costs   | No Cost
Pick B   | No Opportunity Cost | No Cost

Notice that the payoff tables for Do No Harm and Go For It are the same when the true state of the world is that there is an effect.  But they differ when there is no effect. In a Go For It problem, when there is no effect, there is NO relative cost in selecting either A or B.

Why is this way of thinking about optimization problems useful? 
Because it helps us decide which type of approach to take for a given problem.
In the Do No Harm problem we need to be mindful about Type I errors, because they are costly, so we need to factor in the risk of making them when we design our experiments.  Managing this risk is exactly what classical hypothesis testing does.
That is why for ‘Do No Harm’ problems, it is best practice to run a classic, robust, AB Test.  This is because we care more about minimizing our risk of doing harm (the cost of Type I error) than any benefit we might get from rushing through the experiment (cost of information).

However, it also means that if we have a ‘Go For It’ problem, if there is no effect, we don’t really care how we make our selections.  Picking randomly when there is no effect is fine, as each of the options have the same value.  It is this case where our simple test of just picking the highest value option makes sense.

Go For It: Tests with no P-Values

Finally we can get to the simple, no p-value test.  This test guarantees that if there is a true difference of the minimum detectable effect (MDE), or larger, one will choose the better-performing arm X% of the time, where X is the power of the test.
Here are the steps:

1) Calculate the sample size
2) Collect the data
3) Pick whichever option has the highest raw conversion value. If a tie, flip a coin.

Calculate the sample size almost exactly the same as in a standard test:  1) pick a minimum detectable effect (MDE) – this is our minimum desired lift; 2) select the power of the test.

Ah, I hear you asking ‘What about the alpha, don’t we need to select a confidence level?’ Here is the trick. We want to select randomly when there is no effect. By setting alpha to 0.5, the test will reject the null 50% of the time when the null is true (no effect).

Let’s go through a simple example to make this clear.  Let’s say your landing page tends to have a conversion rate of around 4%.  You are trying out a new alternative offer, and a meaningful improvement for your business would be a lift to a 5% conversion rate. So the minimum detectable effect (MDE) for the test is 0.01 (one percentage point).

We then estimate the sample size needed to find the MDE if it exists. Normally, we pick an alpha of 0.05, but now we are instead going to use an alpha of 0.5.  The final step is to pick the power of the test. Let’s use a good one, 0.95 (often folks pick 0.8, but for this case we will use 0.95).

You can now use your favorite sample size calculator (for Conductrics users this is part of the set up work flow).

If you use R, this will look something like:

power.prop.test(n = NULL, p1 = 0.04, p2 = 0.05, sig.level = 0.5, power = 0.95,
                alternative = "one.sided", strict = FALSE)

This gives us a sample size of 2,324 per option, or 4,648 in total.  If we were to run this test with a confidence of 95% (alpha=0.05), we would need about four times the traffic: 9,299 per option, or 18,598 in total.
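If you don’t have R handy, the same arithmetic can be sketched in a few lines of Python. This is a minimal sketch that assumes the usual normal-approximation formula for a one-sided two-proportion test (which, as far as I can tell, is what power.prop.test uses when strict = FALSE); the function name sample_size_per_arm is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def sample_size_per_arm(p1, p2, alpha, power):
    """Per-arm sample size for a one-sided two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value; exactly 0 when alpha = 0.5
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))        # spread assumed under the null
    alt_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))   # spread assumed under the alternative
    return ((z_alpha * null_sd + z_power * alt_sd) / (p2 - p1)) ** 2


n_go_for_it = sample_size_per_arm(0.04, 0.05, alpha=0.5, power=0.95)
n_classic = sample_size_per_arm(0.04, 0.05, alpha=0.05, power=0.95)
print(round(n_go_for_it), round(n_classic))
```

Notice that with alpha = 0.5 the z_alpha term drops out entirely, which is where the savings in traffic come from. After rounding, the two results match the per-option figures quoted above.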

The following is a simulation of 100K experiments, where each experiment selected each arm 2,324 times.  The conversion rate for B was set to 5% and for A to 4%. The chart below plots the difference in the conversion rates between A and B.  Not surprisingly, it is centered on the true difference of 0.01.  The main thing to notice is that if we pick the option with the highest conversion rate, we pick B 95% of the time, which is exactly the power we used to calculate the sample size!

Notice – no p-values, just a simple rule to pick whatever is performing best, yet we still get all of our power goodness! And we only needed about a fourth of the data to reach that power.
Now let’s see what our simulation looks like when both A and B have the same conversion rate of 4% (Null is true).
Notice that the difference between A and B is centered at ‘0’, as we would expect. Using our simple decision rule, we pick B 50% of the time and A 50% of the time.
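If you want to try this yourself, here is a rough sketch of both simulations in plain Python. It is not the code behind the charts: to keep the run time short it uses 1,000 experiments per scenario rather than 100K, and the function name and seed are arbitrary choices of mine.

```python
import random


def simulate_pick_rate(p_a, p_b, n_per_arm, n_experiments, seed=0):
    """Fraction of simulated experiments in which the simple rule picks B."""
    rng = random.Random(seed)
    picks_b = 0
    for _ in range(n_experiments):
        # Simulate the raw conversion counts for each arm.
        conv_a = sum(rng.random() < p_a for _ in range(n_per_arm))
        conv_b = sum(rng.random() < p_b for _ in range(n_per_arm))
        # Pick the arm with the highest raw conversion count; flip a coin on a tie.
        if conv_b > conv_a or (conv_b == conv_a and rng.random() < 0.5):
            picks_b += 1
    return picks_b / n_experiments


effect_rate = simulate_pick_rate(0.04, 0.05, n_per_arm=2324, n_experiments=1000)
null_rate = simulate_pick_rate(0.04, 0.04, n_per_arm=2324, n_experiments=1000)
print(f"picked B when B is truly better: {effect_rate:.3f}")
print(f"picked B under the null:         {null_rate:.3f}")
```

With only 1,000 experiments the two rates will wobble by a percentage point or two around the 95% and 50% seen in the full simulation.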
Now, if we had a Do No Harm problem, this would be a terrible way to make our decisions because half the time we would select B over A and incur a cost. So you still have to do the work and determine your relative costs around data collection, Type I errors, and Type II errors.
While I was doing some research on this, I came across Georgi Z. Georgiev’s Analytics tool kit. It has a nice calculator that lets you select your optimal risk balance between these three factors. He also touches on running tests with an alpha of 0.5 in this blog post. Go check it out.

What about Bandits?

As I mentioned above, we can also think of our Go For It problems as a bandit.  Bandit solutions that first randomly collect data and then apply the ‘winner’ are known as epsilon-first (To be fair, all AB Testing for decision making can be thought of as Epsilon-first). Epsilon stands for how much of your time you spend during the data collection phase.  In this way of looking at the problem, the sample size output from our sample size calculation (based on MDE and Power), is our Epsilon – how long we let the bandit collect data in order to learn.
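To make the epsilon-first framing concrete, here is a minimal sketch of such a bandit. The class name, method names and the 6,000-visitor loop are all made up for illustration; this is not how Conductrics implements it. The bandit spins a fair wheel until it has spent its learning budget (the sample size from the calculation above), then locks in whichever arm has the highest raw conversion rate.

```python
import random


class EpsilonFirst:
    """Explore uniformly for a fixed budget, then always serve the leader."""

    def __init__(self, arms, n_per_arm, seed=0):
        self.arms = list(arms)
        self.budget = n_per_arm * len(self.arms)  # total selections spent learning
        self.shown = {arm: 0 for arm in self.arms}
        self.converted = {arm: 0 for arm in self.arms}
        self.winner = None
        self.rng = random.Random(seed)

    def _rate(self, arm):
        return self.converted[arm] / self.shown[arm] if self.shown[arm] else 0.0

    def select(self):
        if sum(self.shown.values()) < self.budget:
            return self.rng.choice(self.arms)  # learning phase: fair wheel
        if self.winner is None:
            # Shuffle first so that ties are broken by a coin flip.
            shuffled = self.rng.sample(self.arms, len(self.arms))
            self.winner = max(shuffled, key=self._rate)
        return self.winner

    def record(self, arm, did_convert):
        if self.winner is None:  # only learn during the exploration phase
            self.shown[arm] += 1
            self.converted[arm] += int(did_convert)


true_rates = {"A": 0.04, "B": 0.05}  # the conversion rates from the example
bandit = EpsilonFirst(true_rates, n_per_arm=2324)
world = random.Random(1)
for _ in range(6000):
    arm = bandit.select()
    bandit.record(arm, world.random() < true_rates[arm])
print(bandit.winner)
```

Because the wheel is fair, each arm gets roughly, not exactly, 2,324 selections during the learning phase; a strict round-robin would make the counts exact.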
What is interesting is that, at least in the two option case, this easy method gives us roughly the same results as an adaptive Bandit method will.  Google has a nice blog post on Thompson Sampling, which is a near optimal way to dynamically solve bandit problems. We also use Thompson Sampling here at Conductrics, so I thought it might be interesting to compare their results on the same problem.
In one example, they run a Bandit with two arms, one with a 4% conversion rate, and the other with a 5% conversion rate – just like our example. While they show the Bandit performing well, needing only an average of 5,120 samples, you will note that that is still slightly higher than the fixed amount we used (4,648 samples) in our super simple method.
This doesn’t mean that we don’t ever want to use Thompson Sampling for bandits. As we increase the number of possible options, and many of those options are strictly worse than the others, running Thompson Sampling or another adaptive design can make a lot of sense. (That said, by using a multiple comparison adjustment, like the Šidák correction, I found that one can include K>2 arms in the simple epsilon-first method and still get Type 2 power guarantees. But, as I mentioned, this becomes a much less competitive approach if there are arms that are much worse than the MDE.)

The Weeds

You may be hesitant to believe that such a simple rule can accurately help detect an effect.  I checked in with Alex Damour, the Neyman Visiting Assistant Professor over at UC Berkeley and he pointed out that this simple approach is equivalent to running a standard t-test of the following form. From Alex:

“Find N such that P(meanA-meanB < 0 | A = B + MDE) < 0.05. This is equal to the N needed to have 95% power for a one-sided test with alpha = 0.5.

Proof: Setting alpha = 0.5 sets the rejection threshold at 0. So a 95% power means that the test statistic is greater than zero 95% of the time under the alternative (A = B + MDE). The test statistic has the same sign as meanA-meanB. So, at this sample size, P(meanA – meanB > 0 | A = B + MDE) = 0.95.”
To help visualize this, we can rerun our simulation, but run our test using the above formulation.
Under the Null (No Effect) we have the following result
We see that the T-scores are centered around ‘0’.  At alpha=0.5, the critical value will be ‘0’.  So any T-score greater than ‘0’ will lead to a ‘Rejection’ of the null.
If there is an effect, then we get the following result.
Our distribution of T-scores is shifted to the right, such that only 5% of them (on average) are below ‘0’.
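Here is a small sketch of that equivalence. It swaps in a pooled two-proportion z-statistic for the t-score (an assumption on my part, reasonable at these sample sizes), and the conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic for the difference B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


# One-sided critical value at alpha = 0.5: the median of the standard normal.
critical_value = NormalDist().inv_cdf(1 - 0.5)

# Hypothetical results: 93 conversions for A and 116 for B, 2,324 visitors each.
z = z_statistic(conv_a=93, n_a=2324, conv_b=116, n_b=2324)
reject_null = z > critical_value
pick_b = (116 / 2324) > (93 / 2324)
print(critical_value, reject_null, pick_b)
```

Since the standard error is always positive, the statistic has the same sign as the raw difference in conversion rates, so ‘reject the null’ and ‘B has the higher conversion rate’ are the same decision.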

A Few Final Thoughts

What is interesting, at least to me, is that the alpha=0.5 way to solve the ‘Go For It’ problems straddles two of the main approaches in our optimization toolkit.  Depending how you look at it, it can be seen as either: 1) a standard t-test (albeit one with the critical value set to ‘0’); or 2) an epsilon-first approach to solving a multi-arm bandit.
Looking at it this way, the  ‘Go For It’ way of thinking about optimization problems can help bridge our understanding between the two primary ways of solving our optimization problems. It also hints that as one moves away from Go For It into Do No Harm (higher Type 1 costs), perhaps classic, robust hypothesis testing is the best approach. As we move toward Go For It, one might want to rethink the problem as a multi-arm bandit.
Have fun, don’t get discouraged, and remember optimization is HARD – but that is what makes all the effort required to learn worth it!

Thompson Sampling or how I learned to love Roulette

Multi-armed bandits, Bayesian statistics, machine learning, AI, predictive targeting blah blah blah. So many technical terms, morphing into buzz words, that it gets confusing to understand what is going on when using these methods for digital optimization.  Hopefully this post will give you a basic idea of how adaptive learning works, at least here at Conductrics, without getting stuck in the AI hype rabbit hole.

Forget Dice, Let’s Play Roulette

tl;dr  Select higher value options more often than lower valued ones, and adjust the relative frequency depending on the user.

What is the difference between basic AB Testing and Adaptive selection?  Adaptive selection differs from AB Test selection by dynamically weighting the probability of selection to favor the higher value decision options. This has the effect of selecting options with higher predicted values more often, and selecting the lower valued options less often.  In AB Testing each experience has an equal chance to be selected (or if not equal, the chance of selection has been fixed by the test designer).

To help visualize this, imagine a roulette wheel. Under a “fair” random policy, the roulette wheel assigns an equal area to each option.  For three options, our roulette wheel looks like this:

We spin the roulette wheel and select the option that the wheel stops on. Because each option has an equal share of the wheel, each option has an equal chance to be selected from each spin.

In adaptive/predictive mode, however, Conductrics increases the probability of selecting the options that have higher predicted conversion values, and lowers the probability of options with lower predicted conversion values. This is equivalent to Conductrics constructing a biased roulette wheel to spin for the selections. For example, the roulette wheel below is biased.

If we were to spin this wheel repeatedly, option ‘A’ would, on average, be selected 67% of the time, option ‘B’ 8%, and option ‘C’ only 25%.
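Spinning a biased wheel is just weighted random sampling. A minimal sketch, using the shares from the example wheel (the seed and the number of spins are arbitrary):

```python
import random
from collections import Counter

# Share of the wheel assigned to each option.
wheel = {"A": 0.67, "B": 0.08, "C": 0.25}

rng = random.Random(42)
spins = rng.choices(list(wheel), weights=list(wheel.values()), k=10_000)

# Observed share of spins that landed on each option.
shares = {option: count / len(spins) for option, count in Counter(spins).items()}
print(shares)
```

Setting all three weights equal gives you back the fair wheel used for a plain AB Test.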
Of course, this begs the question, ‘how does Conductrics decide how to weight each option?’

From Bayesian AB Testing to Bandits: How Conductrics calculates the probabilities

Conductrics uses a modified version of Thompson Sampling to make adaptive selections. Thompson sampling is an efficient way to make selections based on the probability that an option is the best one. These selection probabilities are analogous to the areas we assign to the roulette wheel.

Before we take draws using the Thompson sampling method, we first need to construct a probability distribution over the possible conversion values for each option. How do we do that?
For those of you who already know about Bayesian AB Testing, this section will be familiar. Without getting too much into the weeds (we will skip over the use and selection of the prior distribution), we will estimate both the conversion values and a measure of uncertainty (similar to standard error), and construct a probability distribution based on those values. For example, let’s say we were using conversion rate as a goal, and we had two options, Grey and Blue, both with a 5% conversion rate. However, for the Grey option we have 3,000 samples, but for the Blue one we only have 300.  The predicted conversion rate of the Grey option has less uncertainty than the Blue one because we have 10 times more experience with it.

The uncertainty of our estimates is encoded in the width, or spread, of our distribution of the predicted values. Notice that the values for Grey are clustered mostly between 4% and 6%, whereas Blue, while also centered at 5%, has a spread between 2% and 8%.

Now let’s say that rather than having the same average conversion rate, we estimate a conversion rate of 5.5% for our Blue option. Now our distribution for Blue has been shifted slightly to the right – centered at 5.5%, rather than at 5%.

Just by looking at the two distributions, we can see that while Blue has a higher estimated conversion rate, there is so much uncertainty in Blue (as well as some in Grey), that we can’t really be too certain which option really will be the best one going forward.

One simple approach to get an idea of how likely Blue is to be better than Grey is to take many pairs of random draws from each distribution, compare the results, and then mark how often the result from Blue was greater than the one from Grey.

You can think of Grey and Blue’s conversion rate distributions (those curves in the chart above), as custom random number generators.  You can use them just like you would call Excel’s rand() function. Except unlike rand(), where we get back values uniformly distributed between zero and one, calling Dist(Grey) we get back values that are near 0.05, plus or minus a small amount. If we call Dist(Blue), we will get back values near 0.055,  plus or minus a larger amount than Grey – reflecting the greater uncertainty in Blue’s predicted conversion rate.

If we called Dist(Grey) and Dist(Blue) 10,000 times, we would get a good idea of how much more likely Blue is better than Grey – in this case Blue comes up higher approx. 62% of the time. We could then conclude, based on the data so far, that Blue has an approximately 62% chance of being the higher valued option and Grey has a 38% chance to be the best option.
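Here is a sketch of that procedure. The big assumption is the shape of the distributions: I use Beta distributions parameterized directly by conversions and non-conversions, with no additional prior, which is a common textbook choice and not the empirical Bayes approach Conductrics uses. The seed is arbitrary.

```python
import random

rng = random.Random(1)

# (conversions, non-conversions) implied by the example:
grey = (150, 2850)    # 5% of 3,000 samples
blue = (16.5, 283.5)  # 5.5% of 300 samples (fractional counts are fine for a Beta)

draws = 10_000
# Take paired draws and count how often Blue's draw beats Grey's.
blue_wins = sum(rng.betavariate(*blue) > rng.betavariate(*grey) for _ in range(draws))
p_blue_best = blue_wins / draws
print(f"Blue beat Grey in {p_blue_best:.1%} of paired draws")
```

The estimate should land in the same neighborhood as the figure quoted above, though the exact value depends on the prior, which this sketch ignores.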

Advanced: Priors and AB Testing

If you have experience with Bayesian AB Testing you may be wondering about the selection of the prior distribution. Conductrics uses an approximate empirical Bayesian approach to estimate the prior distributions. The choice of prior can lead to different results than those found in many online testing calculators (which tend to use a uniform prior, whereas Conductrics uses the empirical grand mean for shrinkage/regularization). However, the basic approach is similar to the calculation in many Bayesian AB Testing reports (please see our posts on Shrinkage if you would like to learn more ).

Note: While there is reasonable debate on the best approach for AB Testing, Conductrics recommends the use of error-statistical methods (e.g. t-tests) when the objective is statistical error control (see: Mayo ).  Consider applying Bayesian methods, for example the use of posterior distributions discussed in this post, for when the objective is estimation and prediction (of course, regularization and shrinkage aren’t just the purview of Bayesian statistics, so use what you like and works for your problem 🙂 ).

Thompson Sampling and Bandit Selection

Rather than just report on how likely each option is to be the best one, we can modify this approach to make the selections for each user.  This modified approach is called Thompson sampling. The way it works is to take a random draw from each option’s associated distribution. This is just as we did before, but instead of marking down the option with the highest drawn value, we go ahead and select that option and serve it to the user. That is all there is to it.

To make this clearer, let’s take a look at an example. Below we have three options, each with their associated posterior distributions over their predicted conversion value. Thompson Sampling implicitly makes draws based on the probability that each is best by taking random draws from each of the three options, and then selecting the option with the highest valued draw.

Let’s now take a random draw from the distributions of each option. Notice that even though ‘A’ has a higher predicted value (‘A’ is shifted to the right of both ‘B’ and ‘C’), some of the left hand side of the distribution overlaps with the right hand sides of both ‘B’ and ‘C’. Since the width of the distribution indicates our uncertainty around the predicted value (the more uncertain, the wider the distribution), it isn’t inconceivable that either option B or C is the best.

In this case we see that our draw from ‘B’ happened to have the highest value (0.51 vs. 0.49 and 0.46). So we would select ‘B’ for this particular request.

This process is repeated with each selection request. If we were to get another request we might get draws such that ‘A’ was best.

Over repeated draws the frequency that each option is selected will be consistent with the probability that the option is the best one.
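A minimal sketch of the selection step, again assuming simple Beta posteriors. The counts below are invented so that ‘A’ has the highest predicted value; they are not the data behind the charts, so the resulting frequencies will not match the wheel in the example.

```python
import random
from collections import Counter

# Hypothetical (conversions, non-conversions) for each option.
posteriors = {"A": (52, 948), "B": (45, 955), "C": (48, 952)}


def thompson_select(posteriors, rng):
    """Take one draw from each option's posterior and serve the highest draw."""
    draws = {opt: rng.betavariate(a, b) for opt, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)


rng = random.Random(7)
selections = Counter(thompson_select(posteriors, rng) for _ in range(10_000))
for option in sorted(selections):
    print(option, selections[option] / 10_000)
```

The printed shares are the areas of the roulette wheel: each option is served about as often as it comes out on top in the draws.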

In our example we would select ‘A’ approximately 67% of the time, ‘B’ 8% of the time, and ‘C’ 25% of the time. Our roulette selection wheel looks like this:

As we collect more data we can update our distributions to reflect the information in the new data. For example, if as new data streams in, option ‘A’ continues to have higher conversion rates, then we would recreate our wheel to reflect this (note: warnings about confounding apply).

Where can I see these results in the Reporting?

What is neat is that Conductrics provides reporting and data visualizations based on data from your own experiments and adaptive projects.
If you head over to the audience report you will see by default both the predicted value of each option and, represented by the heights of the bars, the probability that the option is the best. For example, the report below shows that after 10,000 visitors, option ‘A’ has a predicted value of 4.96% and a 67% chance of being a better option than both ‘B’ and ‘C’.

But wait, there is more. We even provide a data visualization view that displays the posterior distributions that Conductrics uses in the Thompson Sampling draws.

This is the same data as before, but with a slightly different view. Now, instead of the bar chart showing the chance that each option is the best, there are the actual distributions that are used to make the adaptive selections. This view can be useful to help understand why, even though an option has a higher predicted value than the others, it still won’t always be selected. The width of the distributions gives you an idea of the relative uncertainty in the data around the predicted values, and the greater degree of overlap indicates how much uncertainty there is between which option to select.

Adaptive Selection with Predictive Targeting

Conductrics also uses the same Thompson Sampling approach for targeting. When Conductrics determines that a user feature or trait is useful, it recalculates both the predicted values as well as the probabilities that each option is best. In the example below visitors on Apple Devices have a much higher predicted value under the ‘A’ option than either the ‘B’ or ‘C’ options. So much so that if Conductrics were to select options for users on Apple Devices, the ‘A’ option would always get selected.

However, for users not on Apple Devices, option ‘A’ performs much worse than either option ‘B’ or ‘C’, and hence will not be selected (‘A’ Best-Pick Prob=0%).
You will notice that you still have access to the ‘Everyone’ audience, so that you can see how well each option performs overall, and what the overall probability of selection would be without any targeting.
If you would like, you can also see the visualization of the posterior distributions for each option, in each audience.

Now that you understand Thompson sampling, it’s obvious why for Apple Device users option ‘A’ has essentially a 100% chance of being selected. Even if we were to always pick pessimistically from the very left hand side (low value side) of option ‘A’, and always picked optimistically from the right hand side (high value side) of both ‘B’ and ‘C’, option ‘A’ would still, essentially, always have a higher value.
For visitors not on Apple Devices, the reverse is true. Even if we were to use the most optimistic draws from option ‘A’ and the most pessimistic draws from ‘B’ and ‘C’, option ‘A’ would still produce draws that were less than those from ‘B’ and ‘C’.  As an aside, pruning out very poor options is often where much of the value of adaptive selection comes from.

If Thompson sampling and posterior distributions are tripping you up, you can also just think of it as Conductrics creating a different roulette wheel for each audience. When a member of that audience enters into an agent, Conductrics will first check which wheel to use, and then spin that wheel in order to make a selection for them. So based on the audience report for Apple and non Apple Device users, the logic would look like:

We can apply this logic to any number of audiences, such that there can be 10, 20, 100 etc. such audiences, each with their own custom roulette wheel.
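In code, the ‘check which wheel, then spin it’ logic might be sketched like this. The audience keys and function name are made up, and the even split between ‘B’ and ‘C’ for non Apple users is a placeholder, since only the probabilities for ‘A’ are given above.

```python
import random

# One roulette wheel per audience.
wheels = {
    "apple": {"A": 1.0, "B": 0.0, "C": 0.0},
    "non_apple": {"A": 0.0, "B": 0.5, "C": 0.5},  # B/C split is hypothetical
}


def select_for(audience, rng):
    """Look up the audience's wheel, then spin it once."""
    wheel = wheels[audience]
    return rng.choices(list(wheel), weights=list(wheel.values()), k=1)[0]


rng = random.Random(3)
print(select_for("apple", rng), select_for("non_apple", rng))
```

Adding another audience is just another entry in the dictionary.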

In a way, when you peel back all of the clever machine learning, and multi-armed bandit logic, all Conductrics is doing is constructing these targeted roulette wheels, and using them to select experiences for your users.

One last thing: Uniform Selection

Even when the agent is set to make adaptive selections, there is still a portion of the traffic that is assigned by our fair roulette wheel.  The reason for doing this is twofold:

  1) To ensure a random baseline. In order to evaluate the efficacy of adaptive learning, it is important to ensure there is an unbiased comparison group to compare the results against.
  2) To ensure that Conductrics still tries out all of the options in case there is some external change, or seasonality effect, that results in a change in the conversion rate/value of the decision options.
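Mixing the fair wheel with the adaptive one can be sketched as follows. The 10% uniform share and the Beta posteriors are made-up values for illustration; the actual share Conductrics uses isn’t stated here.

```python
import random
from collections import Counter

# Hypothetical (conversions, non-conversions) for each option.
posteriors = {"A": (52, 948), "B": (45, 955), "C": (48, 952)}


def select(posteriors, rng, uniform_share=0.1):
    """Return (option, mode). uniform_share is an illustrative value only."""
    if rng.random() < uniform_share:
        # Fair wheel: every option has an equal chance.
        return rng.choice(list(posteriors)), "uniform"
    # Otherwise make a Thompson sampling selection.
    draws = {opt: rng.betavariate(a, b) for opt, (a, b) in posteriors.items()}
    return max(draws, key=draws.get), "adaptive"


rng = random.Random(11)
modes = Counter(select(posteriors, rng)[1] for _ in range(10_000))
print(modes)
```

Keeping track of which mode produced each selection is what makes the unbiased comparison possible later on.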

One last, last thing:

There are actually many different ways to tackle adaptive selection. Along with Thompson Sampling (TS), we have also used UCB methods, Boltzmann, simple e-greedy (those with sharp eyes will have noticed that what I have described in the post should be called an e-Thompson method), and even mixtures of them.   In terms of performance, all of them are probably fine to use. We found that TS works better for us because it:

  1. uses similar ideas as Bayesian AB Testing, hence it is perhaps easier to understand by our clients than some of the other methods (“What, what?! What is this temperature parameter?”).
  2. fits gracefully with the approximate empirical Bayesian methods that Conductrics uses in its machine learning.
  3. is simple(ish) to implement.

So if you are using some other method, and it works, then great.  Assuming you are paying attention to the possibility of non-stationary data and confounding, I am not convinced that for most use cases there is much extra juice to squeeze between one method or another.

* I actually don’t like gambling and am a total buzz kill at casinos.


Going from AB Testing to AI: Optimization as Reinforcement Learning

In this post we are going to introduce an optimization approach from artificial intelligence: Reinforcement Learning (RL).

Hopefully we will convince you that it is both a powerful conceptual framework to organize how to think about digital optimization, as well as a set of useful computational tools to help us solve online optimization problems.

Video Games as Digital Optimization

Here is a fun example of RL from Google’s DeepMind. In the video below, an RL agent learned how to play the classic Atari 2600 game of Breakout. To create the RL agent, DeepMind used a blend of Deep Learning (to map the raw game pixels into a useful feature space) and a type of RL method called Temporal-Difference learning.

The object of Breakout is to remove all of the bricks in the wall by bouncing the ball off of a paddle while ensuring that you don’t miss the ball and let it pass the paddle (lose a life).  The only control, or action, to take is to move the paddle left or right.  Notice that at first, the RL agent is terrible. It is just making random movements left or right. However, after only a few hours of play it begins to learn, based on the position of the ball, how to move the paddle to take out the bricks.  After even more play, the RL agent learns a higher level strategy to remove bricks such that the ball can pass through behind the wall, a more efficient way to clear the wall of bricks. This higher level strategy emerges from the agent factoring in the long term effects from each decision on how to move the paddle.

This is the same type of behavior that we want to learn for our multi-touch optimization problems. We want to learn not just what the direct (or last touch) effects are, but also the longer term impact across the set of relevant marketing touch-points.

AB Testing as Building Block to AI

To make sure we are all on the same page, let’s start with something we are already familiar with from optimization, AB Testing.  AB Testing is often used as a way to decide which is the best experience to present to your customers so that they have the highest probability to achieve a particular objective.   In order to smooth the transition from AB testing to thinking about RL, it’s going to be helpful to think about AB testing as having the following elements:

  1. A touch-point – the place, or state, where we need to make the decision, for example on a web page;
  2. A decision – this is the set of possible competing experiences to present to the customer (the ‘A’ and ‘B’ in AB Testing); and
  3. A payoff or objective – this is our goal, or reward, it is often an action we would like the customer to take (buy a product, sign-up etc.)

In the image below we have a decision on a web page where we can either present ‘A’ or ‘B’ to the customer, and the customer can either convert or not convert after being exposed to either ‘A’ or ‘B’.

So far, nothing new.  Now, let’s make this a little more complicated.  Instead of making a decision at just one touch-point, let’s add another page where we want to add another set of experiences to test. We now have the following:

From each of the pages we see that some of the customers will convert (represented by the path through the green circle with ‘$$$’ signs) before exiting the site, while others will directly exit the site without converting (represented by the direct paths to the ‘Exit Site’ node).  We could just treat this as two separate AB tests, and just evaluate the conversion from each of our options (‘A’ and ‘B’ on Page 1, and ‘C’ and ‘D’ on Page 2).

Attribution = Dynamics

However, as we go from one touch-point to multiple touch-points we add additional complexity to our problem.  Now, not only do we need to keep track of how often users convert before exiting the site after each of our touch-points, but we also need to account for how often they transition from one touch-point to another.

Here, the red lines represent the users that transition from one touch-point to another after being exposed to one of our treatment experiences.  These transitions from touch-point to touch-point represent how our experiments are not only affecting the conversion rates, but also how they affect the larger dynamics of our marketing systems.

Accounting for the impact of these changes to the user dynamics is the attribution problem. Attribution is really about accounting for the non-direct impact of our marketing interventions when we no longer just have single decisions, but a system of sequential marketing decisions to consider.
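
To make the bookkeeping concrete, here is a minimal Python sketch of the two things we now need to track for each option: how often users convert, and how often they transition to another touch-point. The event log, the outcome labels, and the `rate` helper are all hypothetical, made up for illustration only.

```python
from collections import Counter

# Hypothetical log: (touch-point, option shown, what the user did next).
# "convert" and "exit" end the visit; a page name means a transition.
events = [
    ("Page1", "A", "convert"),
    ("Page1", "A", "exit"),
    ("Page1", "A", "Page2"),
    ("Page1", "B", "exit"),
    ("Page2", "C", "Page1"),
    ("Page2", "C", "convert"),
]

outcomes = Counter(events)
exposures = Counter((touch_point, option) for touch_point, option, _ in events)

def rate(touch_point, option, outcome):
    """Share of exposures to an option that ended in the given outcome."""
    shown = exposures[(touch_point, option)]
    return outcomes[(touch_point, option, outcome)] / shown if shown else 0.0

conversion_rate_a = rate("Page1", "A", "convert")  # direct effect of 'A'
transition_rate_a = rate("Page1", "A", "Page2")    # how often 'A' hands off to Page 2
```

A plain AB test only looks at the first number. The attribution problem is about putting a value on the second.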

Reinforcement learning

One simple way to handle this is to combine both tests into one multivariate test. In our example, we would have four treatment options: 1) ‘AC’; 2) ‘AD’; 3) ‘BC’; and 4) ‘BD’. However, since not all users will wind up going to each touch-point, users that we assign to  ‘AC’ for example, will really be comprised of three groups of users, those exposed to ‘AC’, ‘Aω’, and ‘ωC’, where ‘ω’ represents a null decision.   This inefficiency is because this approach doesn’t easily let us take into account the sequential and dynamic nature of our problem, since users may not even wind up being exposed to certain treatments based on how they flow through the process.
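
As a rough illustration of why the multivariate framing is inefficient, the short Python sketch below enumerates the four combined treatments and then shows what a user assigned to ‘AC’ is actually exposed to, depending on which pages they reach. The `exposure` helper and the variable names are assumptions for this example, not part of any Conductrics API.

```python
from itertools import product

page1_options = ["A", "B"]
page2_options = ["C", "D"]
NULL = "ω"  # null decision: the user never reached that touch-point

# The four multivariate cells: AC, AD, BC, BD
cells = ["".join(pair) for pair in product(page1_options, page2_options)]

def exposure(cell, saw_page1, saw_page2):
    """What a user assigned to `cell` was really exposed to."""
    first = cell[0] if saw_page1 else NULL
    second = cell[1] if saw_page2 else NULL
    return first + second

# Users assigned to 'AC' split into three groups depending on their path.
groups_for_ac = [
    exposure("AC", saw_page1, saw_page2)
    for saw_page1, saw_page2 in [(True, True), (True, False), (False, True)]
]
```

Here `groups_for_ac` comes out as ‘AC’, ‘Aω’, and ‘ωC’, the three groups described above.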

This is where reinforcement learning can help us.  Realizing that we have sequential decisions, we can recast our AB testing problem as a Reinforcement Learning Problem. From Sutton and Barto:

“Reinforcement learning is learning what to do—how to map situations to
actions—so as to maximize a numerical reward signal. The learner is not
told which actions to take, as in most forms of machine learning, but instead
must discover which actions yield the most reward by trying them. In the most
interesting and challenging cases, actions may affect not only the immediate
reward but also the next situation and, through that, all subsequent rewards.
These two characteristics—trial-and-error search and delayed reward—are the
two most important distinguishing features of reinforcement learning.” (Page 4)

Mapping situations to actions so as to maximize reward by trial and error learning is the marketing optimization problem. RL is so powerful not only as a machine learning approach, but also because it gives us a concise and unified framework to think about experimentation, personalization, and attribution.

Coordinated Bandits through TD-Learning

In our two touch-point problem we have the standard conversion behavior, just like in AB Testing. In addition, we have the transition behavior, where a user can go from one touch-point to another. What would make our attribution problem much easier to solve is if we could just treat the transition behavior like the standard conversion behavior. I like to think of this as a sort of lead-gen model, where each touch-point can either try to make the sale directly or pass the customer on to another touch-point in order to close the sale.  Just like any other lead-gen approach, each decision agent needs to communicate with the others, and credit the leads that are sent its way.

What is cool is that we can use an approach similar to the one DeepMind uses, and treat the transition events like conversion events. This will let us hack our AB Testing, or bandit approach, to solve the multi-touch, credit attribution problem.


One version of TD-Learning, which is what DeepMind used for Breakout, is Q-Learning.  Q-learning uses both the explicit rewards (e.g. the points earned after removing a brick) and an estimated value of transitioning from one touch-point to a new touch-point.

A simple version of the Q-learning (TD(0)) conversion reward looks like:

Reward(t+1) + γ * max_a Q(s(t+1), a)

Don’t let the math throw you.  In words,

Reward(t+1) is just the value of the conversion event after the user is exposed to a treatment. This is exactly the same thing we measure when we run an AB Test.

max_a Q(s(t+1), a) – this is a little trickier, but actually not too tricky. This bit is how we will make the long term attribution calculation.  It just says: find the highest valued option at the NEW touch-point, and use its value as the credit to attribute back to the selected action from the originating touch-point.

γ – this is really a technicality; it is just a discount rate (it’s the same type of calculation that banks use in calculating the present value of payments over time). For our example below we will just set γ=1, but normally γ is set between 0 and 1.
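
For readers who prefer code to notation, the three pieces can be sketched as a single small Python function. The name `td_target` and the option values passed to it are illustrative assumptions. The function simply adds the immediate reward to the discounted value of the best option at the next touch-point, and treats an exit from the site as having no future value.

```python
def td_target(reward, next_values, gamma=1.0):
    """Immediate reward plus the discounted value of the best next option.

    next_values holds the current value estimates of the options at the
    touch-point the user moved to; it is empty if the user left the site.
    """
    best_next = max(next_values, default=0.0)
    return reward + gamma * best_next

# Illustrative values only: the next touch-point has options worth 0.60 and 0.20.
no_discount = td_target(0.0, [0.60, 0.20], gamma=1.0)  # full credit for the best option
discounted = td_target(0.0, [0.60, 0.20], gamma=0.9)   # the same credit, discounted
direct_sale = td_target(1.0, [])                       # conversion, then exit
```

With γ=1 the full value of the best next option is passed back as credit. A γ below 1 shrinks that credit a little more for every step it has to travel.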

Let’s run through a quick example to nail this down.  Let’s go back to our simple two touch-point example.  On Page 1 we select either ‘A’ or ‘B’ and on Page 2 we select between ‘C’ or ‘D’.  For simplicity’s sake, let’s say the first 10 customers go only to Page 1 during their visit.  Half of them get exposed to ‘A’ and the other half get exposed to ‘B’. Let’s say three customers convert after ‘A’ and only one customer converts after ‘B’. Let’s also assume a conversion is worth $1. Based on this traffic and these conversions, the estimated values are Page1:A=$3.0/5, or $0.60, and Page1:B=$1.0/5, or $0.20. So far nothing new. These are exactly the same types of conversion calculations we make all of the time with simple AB Tests (or bandits).

Now, let’s say customer 11 comes in, but this time the customer hits Page 2 first.  We then randomly pick experience ‘C’. After being exposed to ‘C’, the customer, rather than converting, goes to Page 1. This is where the magic happens.  Page 2 now ‘asks’ Page 1 for a reward credit for sending it a customer lead. Page 1 then calculates the owed credit as the value of its highest valued option, which is ‘A’, with a value of $0.60.   The value of Page2:C is now equal to $0.60/1, since, remember, the TD-Reward is calculated as Reward(t+1) + γ * max_a Q(s(t+1), a) = 0 + 1*0.60 = 0.60.
So now we have:
Page1:A=0.60
Page1:B=0.20
Page2:C=0.60
Page2:D=0.0 (no data yet)

Let’s say customer 11, now that they are on Page 1, is exposed to ‘A’, but they don’t convert and leave the site.  We then update the value of Page1:A to $3/6=$0.50.  What is interesting is that even though customer 11 didn’t convert, we were still able to increase the value of Page2:C, which makes sense, since in the long term, getting a user to Page 1 is worth something (certainly more than ‘0’, as would be the case with a first click attribution method).

While there is more detail, this is mostly all there is to it. By continuously updating the values of each option in this way, our estimates will tend to converge towards the true long term values.
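
The walk-through above can be replayed in a few lines of Python. This is only a sketch under simplifying assumptions: the `TouchPoint` class is hypothetical, and each option’s value is kept as a plain running average of its TD-rewards (as in the example), whereas Q-learning implementations more commonly use a step-size parameter.

```python
GAMMA = 1.0  # discount rate, set to 1 as in the example

class TouchPoint:
    """Hypothetical helper: running-average value estimate per option."""

    def __init__(self, options):
        self.total = {option: 0.0 for option in options}
        self.count = {option: 0 for option in options}

    def value(self, option):
        n = self.count[option]
        return self.total[option] / n if n else 0.0

    def best_value(self):
        return max(self.value(option) for option in self.total)

    def update(self, option, reward, next_touch_point=None):
        # Credit is the best option's value at the next touch-point,
        # or zero if the user left the site.
        credit = next_touch_point.best_value() if next_touch_point else 0.0
        self.total[option] += reward + GAMMA * credit
        self.count[option] += 1

page1 = TouchPoint(["A", "B"])
page2 = TouchPoint(["C", "D"])

# Customers 1-10 only visit Page 1: three convert on 'A', one on 'B'.
for converted in [1, 1, 1, 0, 0]:
    page1.update("A", converted)
for converted in [1, 0, 0, 0, 0]:
    page1.update("B", converted)

# Customer 11: sees 'C' on Page 2, moves to Page 1 without converting ...
page2.update("C", 0.0, next_touch_point=page1)
# ... then sees 'A' on Page 1 and leaves without converting.
page1.update("A", 0.0)

values = {
    "Page1:A": page1.value("A"),
    "Page1:B": page1.value("B"),
    "Page2:C": page2.value("C"),
    "Page2:D": page2.value("D"),
}
```

After the replay, Page1:A has dropped to 0.50 while Page2:C still holds the 0.60 credit it received for the hand-off, matching the numbers in the text.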

What is awesome is that RL/TD-Learning lets us: 1) blend Optimization with Attribution; 2) calculate a version of time-decay attribution, but only needing to use an augmented version of a last click approach; and 3) interpret the transitions from one touch-point to another as just a sort of intermediate, or internal conversion event.

In a follow up post, we will cover how to include Targeting, so that we can learn the long term value of each touch-point option/action by customer.

If you would like to learn more, please review our Datascience Resources 2  blog post. If you would like to learn more about Conductrics please feel free to reach out to us.

Posted in Uncategorized | 1 Comment

Machine Learning and Human Interpretability

The key idea behind Conductrics is that marketing optimization is really a reinforcement learning problem, a class of machine learning problem, rather than an AB testing problem. Framing optimization as a reinforcement learning problem allowed us to provide, from the very beginning, not just AB and multivariate testing tools, but also multi-armed bandits, predictive targeting, and a type of multi-touch decision attribution using Q-learning.  Our first release back in 2010 was in many ways more advanced than what the rest of the industry is providing today.

However, we discovered that no matter how accurate the machine learning might be, many clients and potential clients were uncertain about ceding control, even of low risk aspects of the user experience, to automated systems if they couldn’t understand how the machine learning was making decisions.

This led us to reconsider what is a good ML solution, especially for customer facing applications, and to redesign our machine learning engine to both solve for accuracy and also be human interpretable.

What is Machine Learning?

Machine learning can be thought of as sitting at the intersection of computer science and statistics. ‘Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves …’ – Tom Mitchell.

For many tasks it makes relatively little difference if these programs are opaque to human introspection. We are almost exclusively concerned with performance for some task (‘is this a cat?’; how to win at Donkey Kong, etc.). Here, high capacity models, like deep learning, suffer little penalty for marginal increases in representational complexity.

However, for several valid reasons, marketers tend to be wary about ceding control of their customers’ experiences to black box methods. Firstly, they need assurances that they can trust automation to make reasonable decisions over a large space of possible environments. Secondly, companies often have internal and legal regulatory requirements around how, and which, data may be used for making marketing decisions.  With the EU’s GDPR coming into effect in May 2018, this will be even more of a concern.

Why Human Understanding matters

1)  Trust – People often need to understand something before they can trust it.  This is true not only for marketers, but also for doctors using ML diagnostic systems.  Doctors are much more likely to consider the recommendation/decision from an automated system if the reasoning can be explained.

2) Insights –  Not only do people need to trust these systems, but they also want to be able to glean insights from them – ‘I didn’t realize this type of offer would convert better on the weekends’.

3) Review and Accountability – will this decision from our machine learning be consistent with our stated data policies? Can we be sure this is fair, or non-discriminatory? Is this even legal?

4) Explanation – can you communicate and explain the prediction, or decisions, to users if they ask?  You may be thinking, ‘who cares if you can explain it to users?’ Well, as of May 2018, the General Data Protection Regulation (GDPR) goes into effect in the EU. Not only does this change the regulations on how personal data can be used, stored etc., it also places regulation on automated decision systems based on Machine Learning.

Under Article 22 of the GDPR, EU citizens will have the right of explanation for any ‘significant’ decision that was made via an automated ML system. While there is still ambiguity around what ‘significant’ will mean, and what will be a satisfactory explanation,  the main takeaway is that Automated Decisions are contestable.   The penalty for being in breach is up to 4% of a company’s worldwide gross revenue.  So it is a huge risk.

Human readable machine learning representation

While lacking the expressiveness, or capacity, of more complex representations, sparse decision trees have many appealing properties for the marketing use cases:

1) Human Readable – You can see, based on the features of the users, what action or output the system will take.

2) Discrete Rule set – the rules cover all possible use cases. This is useful for organizational review/legal before deploying since one can predetermine all outcomes for all inputs.

3) Loggable – since the decision policy is represented as rules, rather than as a function, as in Neural nets (deep or shallow) or regression, each decision can be logged with the exact policy/rule that was used. This lets the organization recall, for customer support or for a GDPR challenge for explanation, the exact reason for any particular decision that was made.

Conductrics’ Machine Learning Rules

The following is a simple example of a tree view for a set of decision rules that select between three possible user experiences (the ‘A’, ‘B’, and ‘C’ experiences are represented as bars in the audience nodes) generated by Conductrics’ predictive targeting engine:

The model begins with everyone in a single group (the Everyone Audience on the far left) – as you would with simple AB testing. Then, one by one, Conductrics finds the user features that are most useful to discriminate between the user experiences (tech note: unlike standard tree algorithms, that build decision rules from the data, Conductrics unpacks a sparse set of rules that are implicitly represented within a predictive model (function approximation)).  This is very similar to the game of 20 questions you may have played as a child.  Conductrics tries to ‘ask’ the fewest questions possible, while still finding the best solution. In this case, we only need to include three user features: Rural or not; Mobile Device or not; and Registered or not.

It is also useful to see each targeted audience side by side, with additional details, such as audience size, as well as the predicted values of each option and the probability that each option is the best one.


When Conductrics is making automated decisions, it will follow the rules implied by this report, which as a program looks something like:
if ( Rural ) { return A; }
else if ( Mobile ) { return A; }
else if ( Registered ) { return C; }
else { return B; }
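
To illustrate why a rule-based policy is easy to log, here is a minimal Python sketch of the same rules that returns the matching rule alongside the chosen experience. The function names, the feature keys (`rural`, `mobile`, `registered`), and the log format are assumptions for illustration, not the Conductrics implementation.

```python
def choose_experience(user):
    """Apply the rules in order; return (experience, rule that fired)."""
    if user.get("rural"):
        return "A", "Rural"
    if user.get("mobile"):
        return "A", "not Rural, Mobile"
    if user.get("registered"):
        return "C", "not Rural, not Mobile, Registered"
    return "B", "not Rural, not Mobile, not Registered"

decision_log = []

def decide_and_log(user_id, user):
    """Pick an experience and record exactly which rule produced it."""
    experience, rule = choose_experience(user)
    decision_log.append({"user_id": user_id, "experience": experience, "rule": rule})
    return experience

# Hypothetical users
decide_and_log("u1", {"rural": True, "mobile": True})
decide_and_log("u2", {"registered": True})
decide_and_log("u3", {})
```

Because every log entry carries the rule that fired, a later question about why a given user saw a given experience can be answered by looking up one record.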

But how do you know if the targeting rules are working?  Well, we can evaluate our little targeting program by running a type of AB Test, but where our predictive program is one of our test’s options. By default, Conductrics will take a random sample to use as a baseline to compare against the predictive targeting rules. For example, our targeting program returned about $2.77 per user, whereas a simple random play of the four base options returned just $2.10, for a lift of 31.4%.


Conductrics also includes each of the individual options under random selection, each of which can also be used as a baseline (just keep in mind to specify sample sizes beforehand).

Using the decision tree representation for the predictive control logic enables the machine learning to be both machine and human readable. As we see ever increasing demand for customer facing applications that embed AI and machine learning automation, expect to see a greater focus on the human interpretability of these systems.

If you would like to learn more about how we use ML at Conductrics, please get in touch.


Posted in Analytics, Testing and Data Science, Uncategorized | Leave a comment

Conductrics 3.0 Release

Today I am happy to announce Conductrics 3.0, our third major release of our universal optimization platform. Conductrics 3.0 represents the next generation of personalized optimization technology, blending experimentation with machine learning to help deliver the best customer experiences across every Marketing channel. Conductrics 3.0 highlights include:

Conductrics Express – You asked and we listened. While many of you were happy using our APIs, agencies and front-line marketers often wanted a more robust point-and-click tool to help set up tests and personalized experiences for web sites. Conductrics 3.0 introduces Express, a novel, self-hosted WYSIWYG test creation tool that lets non-developers easily create advanced web optimization campaigns. Following our deep belief that long-term optimization requires Marketing and IT to work in harmony, we have designed Express so that it can be self-hosted, without the need for third party tags, allowing IT to ensure proper Q/A and management of the overall operation of your digital properties.

Interpretable AI – Conductrics 3.0 was designed so that clarity is now a core element of our machine learning and automation algorithms.  Our new platform converts complex optimization logic into easily digestible, human readable decision rules. Conductrics is not alone in applying machine learning to optimize clients’ Marketing applications. However, to be truly effective, marketers must also be able to quickly understand the who and why of predictive analytics. This will be especially true in European markets covered under the upcoming GDPR regulations.

New Flavor of our API – Conductrics has provided APIs as a web service since our first release, back in 2010. Conductrics 3.0 adds an even-faster JavaScript API to our classic web service API. Now it is even easier to integrate across almost any channel, either server or client side (or both) with Conductrics’ APIs.

We have been looking forward to this day for a while now, but what we are most excited about is to see all of the original and amazing experiences you will discover and provide for your customers.  We can’t wait to be part of that journey with you. Please reach out if you would like to learn more or just chat about the future of optimization.

For more on the Conductrics 3.0 platform features, visit

Posted in Uncategorized | Leave a comment