## Segmentation and Shrinkage

In our last post, we introduced the idea of shrinkage. In this post we are going to extend that idea to improve our results when we segment our data by customer.

Often what we really want is to discover which digital experience is working best for each customer. A major problem is that as we segment our customers into finer and finer audiences, we have less and less data to use for each segment. In a way, segmentation is the opposite of pooling: we want to analyze the data in small chunks, rather than analyzing it as one big blob.

In the previous post we showed how partial-pooling can help improve our estimates of individual items by adjusting them back toward a pooled, or global mean.

We noted that this is especially effective when we don’t have much data to estimate each individual mean.

## Segmentation: Unpooled

For our segmentation use case we will keep it simple and assume we know just two things about our users. They are either a new or repeat customer, and they either live in a rural or suburban neighborhood.

We want to estimate the conversion rate for each test option in our AB Test for each of the four possible user segments (Repeat+Suburban; Repeat+Rural; New+Suburban; New+Rural). This is going to give us eight total conversion rates – two conversion rates for each segment.  So you can see, even in a super simple case, we already have a bunch of individual means we need to estimate.

In our little scenario, the null hypothesis is actually going to be true, so there is no difference in conversion rate between A and B.  In addition, neither of the user features (New/Repeat; Suburban/Rural) will have any impact on conversion rate – they are going to be useless.

We set the conversion rate for both A and B to be 20% (0.2) and then draw 100 random samples and we get the following results:

The unsegmented conversion rates are A=16.7% and B=23.1%. If we just stopped at 100 observations, we would find that B has a 78% probability of being better than A. Not conclusive by any means, but it might be enough to lead some astray into concluding that B is best.
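The post does not say how the "probability B is better" figures were computed. As a minimal sketch, assuming uniform Beta(1, 1) priors and Monte Carlo draws from each option's Beta posterior, a simulation along these lines could look as follows. The function names, seeds, and draw count are illustrative assumptions, and the numbers it prints will not match the ones quoted above.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under assumed Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each option: Beta(1 + conversions, 1 + non-conversions).
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws

def simulate_ab(n=100, true_rate=0.2, seed=1):
    """Randomly assign n visitors to A or B; both options convert at true_rate."""
    rng = random.Random(seed)
    counts = {"A": [0, 0], "B": [0, 0]}  # [conversions, visitors]
    for _ in range(n):
        option = rng.choice("AB")
        counts[option][1] += 1
        counts[option][0] += rng.random() < true_rate
    return counts

counts = simulate_ab()
(conv_a, n_a), (conv_b, n_b) = counts["A"], counts["B"]
p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
print(f"A: {conv_a}/{n_a}  B: {conv_b}/{n_b}  P(B > A) = {p:.2f}")
```

Rerunning with different seeds shows how often 100 observations produce a lopsided probability even though A and B are identical.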

When we segment our results, we find the conversion rates are all over the place:
1) Repeat+Suburban A=10.4%, B=17.2%; Probability B is best 72%
2) Repeat+Rural A=17.1%, B=31.7%; Probability B is best 88%
3) New+Suburban A=17.2%, B=15.8%; Probability B is best 49%
4) New+Rural A=23.9%, B=30.3%; Probability B is best 51%

Just looking at these numbers we might be tempted to think that Rural customers have higher conversion rates and that option B is probably best for Repeat+Rural customers.

Remember, the real conversion rate for both options, A and B, regardless of segment is 20%.

## Segmentation: Partial Pooling

Now let’s rerun our simulation, but this time we will calculate the partial pooled values by shrinking our option results back to the grand/pooled average.

Notice that our unsegmented A and B estimates are pulled toward 20%, with A=20.6% and B=19.2%. After 100 trials, the probability that A is best is just 60%.

We can add another level of partial pooling by shrinking each of the segment results back toward the respective option mean:
1) Repeat+Suburban A=20.4%, B=19.0%
2) Repeat+Rural A=20.7%, B=19.2%
3) New+Suburban A=20.5%, B=19.2%
4) New+Rural A=20.8%, B=19.4%

Now, each of the options for each segment has a 50%/50% probability of being the best, which is what we would expect given that there really is no difference in the conversion rates.

To recap this section: we first shrank our unsegmented A and B estimates toward the average conversion rate over both options. Then we used these partial-pooled unsegmented option values as the grand average for the segments, and shrank each segment's value toward them.
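The two-level recap can be sketched in a few lines. The post does not give the exact shrinkage formula, so this assumes a simple pseudo-count weight: each estimate is a weighted average of its own rate and its target, with `K` standing in for a hypothetical "prior strength". The counts in `data` are made up for illustration.

```python
def shrink(unpooled, n, target, prior_strength):
    """Weighted average of an unpooled rate and a target mean.

    prior_strength is a hypothetical tuning constant: the number of
    pseudo-observations we credit to the target.
    """
    weight = n / (n + prior_strength)
    return weight * unpooled + (1 - weight) * target

# Hypothetical counts: (conversions, visitors) per option and segment.
data = {
    "A": {"Repeat+Rural": (3, 12), "New+Rural": (2, 13)},
    "B": {"Repeat+Rural": (5, 14), "New+Rural": (1, 11)},
}
K = 50  # assumed prior strength, the same at both levels

total_conv = sum(c for segs in data.values() for c, _ in segs.values())
total_n = sum(n for segs in data.values() for _, n in segs.values())
grand_mean = total_conv / total_n

for option, segs in data.items():
    opt_conv = sum(c for c, _ in segs.values())
    opt_n = sum(n for _, n in segs.values())
    # Level 1: shrink the unsegmented option rate toward the grand mean.
    option_est = shrink(opt_conv / opt_n, opt_n, grand_mean, K)
    for seg, (c, n) in segs.items():
        # Level 2: shrink each segment rate toward its option's estimate.
        seg_est = shrink(c / n, n, option_est, K)
        print(f"{option} {seg}: unpooled={c / n:.3f} shrunk={seg_est:.3f}")
```

A larger `K` pulls everything harder toward the center; in a full hierarchical model that strength would be estimated from the data rather than fixed by hand.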

## Shrinkage, Targeting, and Multi-Armed Bandits

Where this really gets useful is when you are running a targeted optimization problem with an adaptive method, such as contextual bandits/RL with function approximation.

With multi-armed bandits, rather than selecting options with equal probability, as we do with AB testing, we can instead make our selections based on the probability that each option is best. So, for example, based on the unpooled results we would select B 88% and A 12% of the time for Repeat+Rural customers. In this use-case it doesn’t really matter, but as you can imagine, in a real-world scenario any intelligent optimization algorithm will need to sort through hundreds of possible segment and option combinations. What we don’t want is to confuse the learning system with all of the noise that is bound to creep into the results when there are so many combinations under consideration. By feeding the bandit algorithm the partial pooled estimates, rather than the unpooled ones, we are able to stabilize the optimization process in the early, most uncertain portion of the learning.

Maybe even more valuable is when we introduce new options into an ongoing optimization effort. So let’s say we added a ‘C’ option. By using shrinkage, we automatically get a reasonable initial estimate for our new option. We just shrink it to the current grand mean of the problem, which is almost certainly a better initial guess than zero. This way the new option has a better chance of being selected by our bandit method (where we make draws from the posterior distributions).
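To make the "new option" case concrete, here is a minimal sketch of posterior-draw selection (Thompson sampling) in which a brand-new arm starts from a Beta prior centered on the grand mean. The posterior parameters for A and B, the grand mean of 0.2, and the `prior_strength` of 20 pseudo-observations are all illustrative assumptions, not values from a real system.

```python
import random

def thompson_pick(arms, rng):
    """Draw once from each arm's Beta posterior and pick the largest draw."""
    draws = {name: rng.betavariate(a, b) for name, (a, b) in arms.items()}
    return max(draws, key=draws.get)

def new_arm_prior(grand_mean, prior_strength):
    """Beta parameters centered on the grand mean (prior_strength pseudo-observations)."""
    return (grand_mean * prior_strength, (1 - grand_mean) * prior_strength)

rng = random.Random(42)
# Hypothetical posteriors, stored as (alpha, beta) pairs.
arms = {"A": (21.0, 81.0), "B": (19.0, 83.0)}
grand_mean = 0.2

# Option C arrives mid-test with no data: start it at the grand mean, not at zero.
arms["C"] = new_arm_prior(grand_mean, prior_strength=20)

picks = [thompson_pick(arms, rng) for _ in range(10000)]
share = {name: picks.count(name) / len(picks) for name in arms}
print(share)
```

Because C's prior is centered where the other options sit but is wider, it gets a healthy share of early traffic and can quickly confirm or lose that standing as its own data arrives.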

## Prediction, Pooling, and Shrinkage

As some of you may have noticed, little skirmishes occasionally break out in digital testing and optimization. There is the AB test vs multi-armed bandits debate (both are good, depending on task), standard vs multivariate testing (same, both good), and the Frequentist vs. Bayesian testing argument (also, both good). In the spirit of bringing folks together, I want to introduce the concept of shrinkage. It has Frequentist and Bayesian interpretations, and is useful in sequential, Bayesian, and bandit style testing. It is also useful for building predictive models that work better in the real world.

Before we get started talking about shrinkage, let’s first step back and revisit our coffee example from our post on P-Values (http://conductrics.com/pvalues). In one of our examples, we wanted to know which place in town has the hotter coffee: McDonald’s or Starbucks. In order to answer that, we needed to estimate the average temperatures and variances in temperature for both McDonald’s and Starbucks coffees. After we measured the temperature of our samples of cups of coffee from each store, we got something like this:

(Perhaps because McDonald’s got burned (heh) in the past for serving scalding coffee, they are now serving lower average coffee temps than Starbucks.)

In most A/B testing situations we calculate unpooled averages. Unpooled means that we calculate a separate set of statistics based only on the data from each collection.  In our coffee example, we have two collections – Starbucks and McDonald’s. Our unpooled estimates don’t co-mingle the data between stores. One downside of unpooled averages is that each estimated average is based on just a portion of the data. And if you remember, the sampling error (how good/bad our estimate is of the true mean/average) is a function of  how much data we have. So less data means much less certainty around our estimated value.

## Data Pooling

We could ignore that the coffee comes from two different stores and pool the data together. Here is what that looks like:

The pooled estimate gives us the grand average temperature for any cup of coffee in town.  You may wonder why this pooled average is interesting to us, since what we really care about is finding the average coffee temperature at each store.

As it turns out, we can use our pooled average to help us reduce uncertainty and get better estimates for each store, especially when we have only a small amount of data with which to make our calculations. This is often the case when we have many different collections that each need an individual estimate.

## Partial Pooling

You can think of partial pooling as a way of averaging together our pooled and our unpooled estimates.  This lets us share information across all of the collections, while also being able to estimate individual results.  By using partial pooling, we get the pooled benefit of tapping into all of the coffee data to calculate our store level coffee temperatures, and the unpooled advantage of having a unique temperature value for each store. Since we are effectively averaging the pooled and unpooled values together, this has the effect of pulling each estimated value back toward each other – toward the grand/pooled mean.

When we take our average of the pooled and unpooled values, we factor in how much data we have for each component. So it is a weighted average. If we have lots of data on the unpooled portion, then we weigh the unpooled estimate more, and the pooled value has very little impact. If, on the other hand, we have very little data for the unpooled estimate, then the pooled value makes up a larger share of our partial pooled estimate.

The pooled value is like the gravitational center of your data. Collections with sparse data don’t have much energy to pull away from that center and get squeezed, or shrunk toward it.

However, as we get more data on each option, the partial pooled values will start to move away from the center value and move toward their unpooled values.
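One common way to write this weighted average, offered here as a sketch rather than the exact formula behind this post, is

$$\hat{\theta}_j = w_j\,\bar{y}_j + (1 - w_j)\,\bar{y}, \qquad w_j = \frac{n_j}{n_j + k}$$

where $\bar{y}_j$ is the unpooled mean for collection $j$, $\bar{y}$ is the pooled mean, $n_j$ is the number of observations in collection $j$, and $k$ is an assumed constant that sets how strongly we shrink. In the standard normal model, $k$ is the ratio of the within-collection variance to the between-collection variance. With little data, $w_j$ is near 0 and the estimate sits near the pooled mean; as $n_j$ grows, $w_j$ approaches 1 and the estimate moves to the unpooled mean.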

Partial pooling can improve our estimates when we have lots of separate collections we want to estimate and when the collections are in some sense from the same family.

Imagine that instead of two or three coffee shops, we had 50 or even 100 places that sell coffee. We might only be able to sample a handful of cups from each store. Because of our small samples, our unpooled estimates will all be very noisy, with some having very high temps and some very low temps. By using partial pooling we smooth out the noise and pull all of the outliers back toward the grand average.  This tugging-in of our individual results toward the center – called shrinkage – is a tradeoff of bias for less variance (noise) in our results.
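The bias-for-variance trade can be checked with a small simulation. This sketch invents a town of 100 shops and assumes the within-shop and between-shop standard deviations are known (in practice they would have to be estimated); all constants are illustrative, in degrees F.

```python
import random
import statistics

rng = random.Random(7)
N_SHOPS, CUPS_PER_SHOP = 100, 3
TOWN_MEAN, BETWEEN_SD, WITHIN_SD = 130.0, 5.0, 10.0  # assumed values, degrees F

true_means = [rng.gauss(TOWN_MEAN, BETWEEN_SD) for _ in range(N_SHOPS)]
samples = [[rng.gauss(mu, WITHIN_SD) for _ in range(CUPS_PER_SHOP)] for mu in true_means]

unpooled = [statistics.mean(cups) for cups in samples]
pooled = statistics.mean(temp for cups in samples for temp in cups)

# Weight on each shop's own mean; k is the within/between variance ratio.
k = WITHIN_SD**2 / BETWEEN_SD**2
w = CUPS_PER_SHOP / (CUPS_PER_SHOP + k)
partial = [w * m + (1 - w) * pooled for m in unpooled]

def mse(estimates):
    """Mean squared error against the (normally hidden) true shop means."""
    return statistics.mean((e - t) ** 2 for e, t in zip(estimates, true_means))

print(f"weight on unpooled mean: {w:.2f}")
print(f"MSE unpooled: {mse(unpooled):.1f}  MSE partial pooled: {mse(partial):.1f}")
```

Each partial pooled estimate is individually biased toward the town average, yet across all the shops the estimates land closer to the truth than the noisy unpooled means do.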

If you are not convinced, here is another example. Let’s say we wanted to know what the average temperature of a cup of coffee is at the town’s local cafe. Even before we test our first cup of coffee, we already have strong evidence that the temperature will be somewhere around 130 degrees, since the coffee temp of the independent cafe belongs to the family of all coffee shop coffee temperatures. If you then tested a cup of coffee from the cafe, and found it was only 90 F, you might think that while the cafe’s average temperature might be lower than the average coffee temperature, it is probably higher than 90 F.

## Partial Pooling and Optimization

We can use this same idea when we run sequential, empirical Bayesian, or bandit style tests. By using partial pooling, we help stabilize our predictions in the early stages of a campaign. It also allows us to make reasonable predictions for new options added mid-test.

Shrinking the individual means back to the grand mean is also consistent with the null hypothesis from standard testing – we assume that all of the different test options are drawn from the same distribution.  Partial pooling implicitly includes that assumption directly into the estimated values.  However, as we collect more data, each result moves away from the center. Partial pooled results sit along the continuum between pooled and unpooled estimates. What is nice, is that the data determines where we are on the continuum.

Want to learn more about partial pooling and are a Frequentist? Look up random effects models. If you are Bayesian, please see hierarchical modelling. And if you are cool with either, see empirical Bayes and James-Stein estimators.

## Easy Introduction to AB Testing and P-Values

A version of this post was originally published over at Conversion XL

For all of the talk about how awesome (and big, don’t forget big) Big data is, one of the favorite tools in the conversion optimization toolkit, AB Testing, is decidedly small data. Optimization, winners and losers, Lean this that or the other thing, at the end of the day, A/B Testing is really just an application of sampling.

You take a couple of alternative options (e.g. ‘50% off’ v ‘Buy One Get One Free’) and try them out with a portion of your users. You see how well each one did, and then make a decision about which one you think will give you the most return. Sounds simple, and in a way it is, yet there seem to be lots of questions around significance testing. In particular, what the heck the p-value is, and how to interpret it to help make sound business decisions.

These are actually deep questions, and in order to begin to get a handle on them, we will need to have a basic grasp of sampling.

## A few preliminaries

Before we get going, we should quickly go over the basic building blocks of AB Testing. I am sure you know most of this stuff already, but can’t hurt to make sure everyone is on the same page:

The Mean – often informally called the average. This is a measure of the center of the data. It is a useful descriptor, and predictor, of the data, if the data under consideration tends to clump near the mean AND if the data has some symmetry to it.

The Variance – This can be thought of as the average variability of our data around the mean (center) of the data. For example, consider we collect two data sets with five observations each: {3,3,3,3,3} and {1,2,3,4,5}. They both have the same mean (it’s 3) but the first group has no variability, whereas the second group does take different values than its mean. The variance is a way to quantify just how much variability we have in our data. The main takeaway is that the higher the variability, the less precise the mean will be as a predictor of any individual data point.
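The two small data sets above are easy to check with Python's standard `statistics` module. This sketch uses the population variance (`pvariance`), which is literally the average squared distance from the mean; the variable names are just for illustration.

```python
import statistics

flat = [3, 3, 3, 3, 3]
spread = [1, 2, 3, 4, 5]

for name, data in (("flat", flat), ("spread", spread)):
    mean = statistics.mean(data)
    # pvariance: the average squared distance from the mean.
    var = statistics.pvariance(data)
    print(f"{name}: mean={mean}, variance={var}")
```

Both sets share a mean of 3, but only the second has any variance, so the mean alone is a much worse guess for an individual point from the second set.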

The Probability Distribution – this is a function (if you don’t like ‘function’, just think of it as a rule) that assigns a probability to a result or outcome.  For example, the roll of a standard die follows a uniform distribution, since each outcome is assigned an equal probability of occurring (all the numbers have a 1 in 6 chance of coming up).  In our discussion of sampling, we will make heavy use of the normal distribution, which has the familiar bell shape.  Remember that the probability of the entire distribution sums to 1 (or 100%).

The Test Statistic or Yet Another KPI
The test statistic is the value that we use in our statistical tests to compare the results of our two (or more) options, our ‘A’ and ‘B’.   It might make it easier to just think of the test statistic as just another KPI.  If our test KPI is close to zero, then we don’t have much evidence to show that the two options are really that different. However, the further from zero our KPI is, the more evidence we have that the two options are not really performing the same.

Our new KPI combines the difference in the averages of our test options with the variability in our test results. The test statistic looks something like this:

While it might look complicated, don’t get too hung up on the math. All it is saying is take the difference between ‘A’ and ‘B’ – just like you normally would when comparing two objects, but then shrink that difference by how much variability (uncertainty) there is in the data.

So, for example, say I have two cups of coffee, and I want to know which one is hotter and by how much. First, I would measure the temperature of each coffee. Next, I would see which one has the highest temp. Finally, I would subtract the lower temp coffee from the higher to get the difference in temperature. Obvious and super simple.

Now, let’s say you want to ask, “which place in my town has the hotter coffee, McDonald’s or Starbucks?” Well, each place makes lots of cups of coffee, so I am going to have to compare a collection of cups of coffee. Any time we have to measure and compare collections of things, we need to use our test statistics.

The more variability in the temperature of coffee at each restaurant, the more we weigh down the observed difference to account for our uncertainty. So, even if we have a pretty sizable difference on top, if we have lots of variability on the bottom, our test statistic will still be close to zero. As a result of this, the more variability in our data, the greater an observed difference we will need to get a high score on our test KPI.

Remember, high test KPI -> more evidence that any difference isn’t just by chance.
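To see the "difference on top, variability on the bottom" idea in numbers, here is a minimal sketch using one common form of the test statistic (Welch's t): the difference in sample means divided by its standard error. The post's own formula is not reproduced here, so treat this particular form, the function name `t_score`, and the made-up temperature readings as assumptions.

```python
import math
import statistics

def t_score(a, b):
    """Difference in sample means divided by its standard error (Welch's form)."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return diff / se

# Hypothetical coffee temperatures (degrees F); both pairs differ by 10 on average.
steady_a, steady_b = [129, 130, 131, 130], [139, 140, 141, 140]
noisy_a, noisy_b = [110, 150, 120, 140], [120, 160, 130, 150]

print(t_score(steady_a, steady_b))  # large: same gap, little variability
print(t_score(noisy_a, noisy_b))    # small: same gap, lots of variability
```

The observed gap is 10 degrees in both comparisons; only the variability differs, and that alone moves the score from far above 2 to below 1.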

## Always Sample before you Buy

Okay now that we have that out of the way, we can spend a bit of time on sampling in order to shed some light on the mysterious P-Value.

For the sake of illustration, let’s say we are trying to promote a conference that specializes in Web analytics and Conversion optimization. Since our conference will be a success only if we have at least a certain minimum number of attendees, we want to incent users to buy their tickets early. In the past, we have used ‘Analytics200’ as our early bird promotional discount code to reduce the conference price by \$200. However, given that AB Testing is such a hot topic right now, maybe if we use ‘ABTesting200’ as our promo code, we might get even more folks to sign up early. So we plan on running an AB test between our control, ‘Analytics200’, and our alternative, ‘ABTesting200’.

We often talk about AB Testing as one activity or task. However, there are really two main parts of the actual mechanics of testing.

Data Collection – this is the part where we expose users to either ‘Analytics200’ or ‘ABTesting200’. As we will see, there is going to be a tradeoff between more information (less variability) and cost. Why cost? Because we are investing time and foregoing potentially better options, in the hopes that we will find something better than what we are currently doing. We spend resources now in order to improve our estimates of the set of possible actions that we might take in the future. AB Testing, in and of itself, is not optimization. It is an investment in information.

Data Analysis – this is where we select a method, or framework, for drawing conclusions from the data we have collected. For most folks running AB Tests online, it will be the classic null hypothesis significance testing approach. This is the part where we pick a significance level, calculate the p-values, and draw our conclusions.

## The Indirect logic of Significance Testing

Sally and Bob are waiting for Jim to pick them up one night after work. While Bob catches a ride with Jim almost every night, this is Sally’s first time. Bob tells Sally that on average he has to wait about 5 minutes for Jim. After about 15 minutes of waiting, Sally is starting to think that maybe Jim isn’t coming to get them. So she asks Bob, ‘Hey, you said Jim is here in 5 minutes on average, how often do you have to wait 15 minutes?’ Bob replies, ‘Don’t worry, with the traffic, it is not uncommon to have to wait this long or even a bit longer. I’d say based on experience, a wait like this, or worse, probably happens about 15% of the time.’ Sally relaxes a bit, and they chat about the day while they wait for Jim.

Notice that Sally only asked about the frequency of long wait times. Once she heard that her observed wait time wasn’t too uncommon, she felt more comfortable that Jim was going to show up. What is interesting is that what she really wants to know is the probability that Jim is going to stand them up. But this is NOT what she learns. Rather, she just knows, given all the times that Jim has picked up Bob, what the probability is of him being 15 or more minutes late. This indirect, almost contrarian, logic is the essence of the p-value and classical hypothesis testing.

## Back to our Conference

For the sake of argument, let’s say that the ‘Analytics200’ promotion has a true conversion rate of 0.1, or 10%.  In the real world, this true rate is hidden from us – which is why we go and collect samples in the first place –  but in our simulation we know it is 0.1. So each time we send out ‘Analytics200’, approximately 10% sign up.

If we go out and offer 50 prospects our ‘Analytics200’ promotion we would expect, on average, to have 5 conference signups. However, we wouldn’t really be that surprised if we saw a few less or a few more.  But what is a few? Would we be surprised if we saw 4? What about 10, or 25, or zero?   It turns out that the P-Value answers the question, How surprising is this result?
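The "how surprising?" question for a single sample of 50 prospects can be answered exactly with the binomial distribution. A minimal sketch, assuming each prospect signs up independently at the 10% rate; the helper names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k signups from n prospects at conversion rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k, n, p):
    """Probability of k or more signups."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

n, p = 50, 0.1
for k in (4, 10, 25, 0):
    print(f"exactly {k}: {binom_pmf(k, n, p):.4g}   {k} or more: {prob_at_least(k, n, p):.4g}")
```

Seeing 4 signups is entirely ordinary, 10 is uncommon, and 25 is so improbable that we would doubt the 10% rate itself.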

Extending this idea, rather than taking just one sample of 50 conference prospects, we take 100 separate samples of 50 prospects (so a total of 5,000 prospects, but selected in 100 buckets of 50 prospects each).  After running this simulation, I plotted the results of the 100 samples (this plot is called a histogram) below:

Our simulated results ranged from 2% to 20% and the average conversion rate of our 100 samples was 10.1% – which is remarkably close to the true conversion rate of 10%.
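A simulation like the one described takes only a few lines to reproduce. This sketch uses Python's standard `random` module with an arbitrary seed, so its exact minimum, maximum, and mean will differ from the figures quoted above; `sample_rates` is a name made up for the example.

```python
import random
import statistics

def sample_rates(n_samples, sample_size, true_rate, seed=0):
    """Observed conversion rate for each of n_samples independent samples."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < true_rate for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

rates = sample_rates(n_samples=100, sample_size=50, true_rate=0.1)
print(f"min={min(rates):.1%} max={max(rates):.1%} mean={statistics.mean(rates):.1%}")
```

Individual samples bounce around quite a bit, but their average stays close to the true 10%.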

### Amazing Sampling Fact Number 1

The mean (average) of repeated samples will tend toward the mean of the population we are sampling from, and it gets closer as we collect more samples.

### Amazing Sampling Fact Number 2

Our sample conversion rates will be distributed roughly according to a normal distribution – this means most of the samples will be clustered around the true mean, and samples far from our mean will occur very infrequently. In fact, because we know that our samples are distributed roughly normally, we can use the properties of the normal (or Student’s t) distribution to tell us how surprising a given result is.

This is important, because while our sample conversion rate may not be exactly the true conversion rate, it is more likely to be closer to the true rate than not.  In our simulated results, 53% of our samples were between 7 and 13%. This spread in our sample results is known as the sampling error.

Ah, now we are cooking, but what about sample size you may be asking? We already have all of this sampling goodness and we haven’t even talked about the size of each of our individual samples. So let’s talk:

There are two components that will determine how much sampling error we are going to have:

• The natural variability already in our population (different coffee temperatures at each Starbucks or McDonald’s)
• The size of our samples

We have no control over variability of the population, it is what it is.

However, we can control our sample size. By increasing the sample size we reduce the error and hence can have greater confidence that our sample result is going to be close to the true mean.

### Amazing Sampling Fact Number 3

The spread of our samples decreases as we increase the ‘N’ of each sample.  The larger the sample size, the more our samples will be squished together around the true mean.

For example, if we collect another set of simulated samples, but this time increase the sample size to 200 from 50, the results are now less spread out – with a range of 5% to 16.5%, rather than from 2% to 20%. Also, notice that 84% of our samples are between 7% and 13% vs just 53% when our samples only included 50 prospects.
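The effect of sample size on spread is easy to check by simulation. A minimal sketch, assuming a true rate of 10% and 100 samples at each size; the seed is arbitrary, so the printed ranges and shares will not exactly match the numbers above.

```python
import random
import statistics

def observed_rates(n_samples, sample_size, true_rate, rng):
    """Conversion rate observed in each of n_samples samples of sample_size prospects."""
    return [
        sum(rng.random() < true_rate for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

rng = random.Random(3)
spread = {}
for size in (50, 200):
    rates = observed_rates(100, size, 0.1, rng)
    spread[size] = statistics.stdev(rates)
    within = sum(0.07 <= r <= 0.13 for r in rates) / len(rates)
    print(f"n={size}: range {min(rates):.1%} to {max(rates):.1%}, "
          f"sd={spread[size]:.3f}, share within 7%-13%: {within:.0%}")
```

Quadrupling the sample size roughly halves the standard deviation of the sample rates, which is why so many more samples fall inside the 7% to 13% band.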

We can think of the sample size as a sort of control knob that we can turn to increase or decrease the precision of our estimates.  If we were to take an infinite number of our samples we would get the smooth normal curves below. Each centered on the true mean, but with a width (variance) that is determined by the size of each sample.

### Why Data doesn’t always need to be BIG

Economics often takes a beating for not being a real science, and maybe it isn’t ;-). However, it does make at least a few useful statements about the world. One of them is that we should expect, all else equal, that each successive input will have less value than the preceding one. This principle of diminishing marginal returns is at play in our AB Tests.

Reading right to left, as we increase the size of our sample, our sampling error falls. However, it falls at a decreasing rate – which means that we get less and less information from each addition to our sample.   So in this particular case, moving to a sample size of 50 drastically reduces our uncertainty, but moving from 150 to 200, decreases our uncertainty by much less. Stated another way, we face increasing costs for any additional precision of our results. This notion of the marginal value of data is an important one to keep in mind when thinking about your tests. It is why it is more costly and time consuming to establish differences between test options that have very similar conversion rates.  The hardest decisions to make are often the ones that make the least difference.
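The diminishing return shows up directly in the standard error of a conversion rate, $\sqrt{p(1-p)/n}$. This sketch evaluates it at a few sample sizes for an assumed true rate of 10%.

```python
import math

def standard_error(p, n):
    """Standard error of an observed conversion rate with true rate p and sample size n."""
    return math.sqrt(p * (1 - p) / n)

p = 0.1
sizes = [50, 100, 150, 200]
errors = [standard_error(p, n) for n in sizes]
for n, se in zip(sizes, errors):
    print(f"n={n:>3}: standard error = {se:.4f}")

# Reduction in error bought by each additional 50 observations.
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]
print([round(g, 4) for g in gains])
```

Each extra block of 50 observations buys a smaller reduction than the one before it, and halving the error requires four times the data.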

Our test statistic, as noted earlier, accounts for both how much difference we see between our results and how much variability (uncertainty) we have in our data. As the observed difference goes up, our test statistic goes up. However, as the total variance goes up, our test statistic goes down.

Now, without getting into more of the nitty gritty, we can think of our test statistic essentially the same way we did when we drew samples for our means.  So whereas before, we were looking just at one mean, now we are looking at the difference of two means, B and A.  It turns out that our three amazing sampling facts apply to differences of means as well.

Whew, okay, I know that might seem like TMI, but now that we have covered the basics, we can finally tackle p-values.

## Assume There is No Difference

Here is how it works. We collect our data for both the ABTesting200, and Analytics200 promotions. But then we pretend that really we ran an A/A test, rather than an A/B test. So we look at the result as if we just presented everyone with the Analytics200 promotion.  Because of what we now know about sampling, we know that both groups should be centered on the same mean, and have the same variance – remember we are pretending that both samples are really from the same population (the Analytics200 population). Since we are interested in the difference, we expect that on average, that Analytics200-Analytics200 will be ‘0’, since on average they should have the same mean.

So using our three facts of sampling we can construct how the imagined A/A Test will be distributed, and we expect that our A/A test, will on average, show no difference between each sample. However, because of the sampling error, we aren’t that surprised when we see values that are near zero, but not quite zero. Again, how surprised we are by the result is determined by how far away from zero our result is. We will use the fact that our data is normally distributed to tell us exactly how probable seeing a result away from zero is. Something way to the right of zero, like at point 3 or greater will have a low probability of occurring.

## Contrarians and the P-Value, Finally!

The final step is to see where our test statistic falls on this distribution. For many researchers, if it is somewhere between -2 and 2, then that wouldn’t be too surprising to see if we were running an A/A test. However, if we see something outside of -2 and 2, then we start getting into fairly infrequent results. One thing to note: what is ‘surprising’ is determined by you, the person running the test. There is no free lunch; at the end of the day, your judgement is still an integral part of the testing process.

Now let’s place our test statistic (t-score, z-score, etc.) on the A/A Test distribution. We can then see how far away it is from zero, and compare it to the probability of seeing that result if we ran an A/A Test.

Here our test statistic is in the surprising region. The probability of the surprise region is the P-value. Formally, the p-value is the probability of seeing a result at least as far from zero as the one we observed, assuming that the null hypothesis is TRUE. If ‘null hypothesis is true’ is tripping you up, just think instead, ‘assuming we had really run an A/A Test’.

If our test statistic is in the surprise region, we reject the Null (reject that it was really  an A/A test). If the result is within the Not Surprising area, then we Fail to Reject the null.  That’s it.
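Once we have a test statistic, turning it into a p-value is a single lookup on the A/A distribution. A minimal sketch, assuming the statistic is approximately standard normal under the null (a z-score) and that we count surprises on both sides of zero; the function name is illustrative.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a result at least as far from zero as z under the A/A (null) distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

for z in (0.5, 2.0, 3.0):
    print(f"test statistic {z}: p-value = {two_sided_p_value(z):.4f}")
```

A statistic near zero gives a large p-value (nothing surprising), while one beyond about 2 in either direction gives a p-value under 5%, which is where many researchers draw the ‘surprising’ line.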

## Conclusion: 7 Points

Here are a few important points about p-values that you should keep in mind:

• What is ‘Surprising’ is determined by the person running the test. So in a real sense, the conclusion of the test will depend on who is running the test. How often you are surprised is a function of how low a p-value you need to see (or, relatedly, the confidence level in a Neyman-Pearson approach, e.g. 95%) before you will be ‘surprised’.
• The logic behind the use of the p-value is a bit convoluted and contrarian. We need to assume that the null is true, in order to evaluate the evidence that might suggest that we should reject the null. This is kind of weird and an evergreen source of confusion.
• It is not the case that the p-value tells us the probability that B is better than A. Nor is it telling us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false. This is an error that even ‘experts’ often make, so now you can help explain it to them ;-). Remember, the p-value is just the probability of seeing a result this extreme, or more extreme, given that the null hypothesis is true.
• While many folks in the industry will tout classical significance testing as some sort of gold standard, there is actually debate in the scientific community about the value of p-values for drawing testing conclusions. Along with Berger’s paper below, also check out Andrew Gelman’s blog for frequent discussions around the topic. http://andrewgelman.com/2013/02/08/p-values-and-statistical-practice/
• You can always get a lower (significant) p-value. Remember that the standard error was one part variation in the actual population and one part sample size. The population variation is fixed, but there is nothing stopping us, if we are willing to ‘pay’ for it, from collecting more and more data. The question really becomes: is this result useful? Just because a result has a low p-value (or is statistically significant in the Neyman-Pearson approach) doesn’t mean it has any practical value.
• Don’t sweat it, unless you need to. Look, the main thing is to sample stuff first to get an idea if it might work out. Often the hardest decisions for people to make are the ones that make the least difference. That is because it is very hard to pick a ‘winner’ when the options lead to similar results, but since they are so similar it probably means there is very little up or downside to just picking one.  Stop worrying about getting it right or wrong.  Think of your testing program more like a portfolio investment strategy. You are trying to run the bundle of tests, whose expected additional information will give you the highest return.
• The p-value is not a stopping rule. This is another frequent mistake. For all of the goodness we get from sampling, which is what lets us interpret our p-value, to hold, you select your sample size first. Then you run the test. (There are some that advocate the use of Wald’s sequential tests (SPRT, or similar), but these are not robust in the presence of nonexchangeable data, which is often the case in online settings.)

This could be another entire post or two, and it is a nice jumping off point for looking into the multi-armed bandit problem (see Conductrics http://conductrics.com/balancing-earning-with-learning-bandits-and-adaptive-optimization/ ).

*One final note: What makes all of this even more confusing is that there isn’t just one agreed upon approach to testing. For more, check out Berger’s paper for a comparison of the different approaches http://www.stat.duke.edu/~berger/papers/02-01.pdf and Biau et al. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2816758/

## Predictive Targeting: Managing Complexity

Personalization, one to one, predictive targeting, whatever you call it. Serving the optimal digital experience for each customer is often touted as the pinnacle of digital marketing efficacy. But if predictive targeting is so great, why isn’t everyone doing it right now?

The reason is that while targeting can be incredibly valuable, many in the industry haven’t fully grasped that targeting ALWAYS leads to greater organizational complexity, and that greater complexity means greater costs.

Complexity
Informally, complexity describes both the number of individual parts, or elements, a system has, as well as how those parts interact with one another. Transitioning your marketing system (web site, call center, etc.) into one that can perform targeting will increase the number of elements the system needs to keep track of and increase the ways in which these elements interact with one another.

Targeted Marketing: The Elements
First, let me give you a quick definition of targeted marketing.  By targeting, I mean delivering different experiences to customers based on attributes of that customer. There are three main requirements that a marketing system will need to address in order to deliver targeted customer experiences:
1) Customer data. Nothing new here; this is what we all tend to think of when we think about data driven customer experiences. However, we have additional issues to consider with targeting that we don’t need to deal with when only using the data for analytics. We need to:

a) ensure that we have accurate user data; AND
b) ensure that it is available to our marketing systems at decision time.

That means that not only do we need to source the data, but we also need to ensure its quality and set up the processes that are able to deliver the relevant data at the time that the marketing system needs to select an experience for the user.

2) The User Experiences (differentiated content, different flows, offers, etc): Anyone who has done any AB testing will tell you that the hard, costly part isn’t running or analyzing the tests. It is the creation and maintenance of multiple versions of the content. Even if there is no marginal cost to create additional experiences (say we are doing a pricing test, it doesn’t cost more to display \$10.99, than \$20.99), we still need to be able to manage and account for each of these options.

3) Targeting Logic – This is a new concept. In order to link #1 and #2 above we need a rule set, or logic, that links the customer to the best experience. A set of instructions that tells our marketing system that if it sees some particular customer, it should select some particular experience. Most of the time the conversation around personalization and targeting is about how we come up with the targeting logic. This is where all of the talk about predictive analytics and machine learning etc comes in.  But we need to consider that once we have done that work, and come up with our targeting logic, we still need to integrate the targeting logic into our marketing system.

In this way, the targeting logic should be thought of as a new asset class, along with the data and experience content. And like the data and experiences, the targeting logic needs to be managed and maintained – it too is perishable and needs refreshing and/or replacement.
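
As a rough illustration of how the three elements fit together, here is a minimal Python sketch of targeting logic as an ordered rule set that maps customer data to an experience. Everything in it (the attribute names, rule names, experience names, and the first-match-wins policy) is a hypothetical assumption for illustration; it is not how any particular product represents its rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Customer = Dict[str, str]  # element 1: customer data available at decision time


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Customer], bool]
    experience: str  # element 2: one of the experiences we maintain


# Element 3: the targeting logic. Hypothetical rules; first match wins.
RULES: List[Rule] = [
    Rule("repeat-rural",
         lambda c: c.get("tenure") == "repeat" and c.get("area") == "rural",
         "offer_b"),
    Rule("new-visitor", lambda c: c.get("tenure") == "new", "welcome_flow"),
]
DEFAULT_EXPERIENCE = "control"


def select_experience(customer: Customer) -> Tuple[str, str]:
    """Return (rule name, experience) for the given customer data."""
    for rule in RULES:
        if rule.matches(customer):
            return rule.name, rule.experience
    # Missing or unmatched data falls back to the one-size-fits-all experience.
    return "default", DEFAULT_EXPERIENCE


print(select_experience({"tenure": "repeat", "area": "rural"}))
print(select_experience({"tenure": "repeat", "area": "suburban"}))
```

Returning the rule name alongside the experience is a small design choice that keeps the logic inspectable: you can see which rule fired, not just what was served.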

In the pre-targeting marketing system, we don’t really need to deal with either #1 or #3 above to serve customers experiences, since there is really just one experience – all customers get the one experience.

Obscuring Introspection
However, in the targeted marketing system, we not only have these extra components, we also need to realize that these elements all interact, which significantly increases the complexity.  For example, let us say you have a customer who contacts you with a question about a digital experience they had.  In the pre-targeting world, it was fairly easy to determine what the customer’s experience was. With a targeted marketing system, what they experienced is now a function of both their ‘data’ and the targeting logic that were both active AT THE TIME of the experience. So whereas before, introspection was trivial, it is now extraordinarily difficult, if not impossible in certain contexts, to discover what experience state the customer was in.  And this difficulty only grows, because the system complexity is a function of the cardinality (number of options) of both the customer data and experiences: the finer the targeting, the greater the complexity.
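
The cardinality point can be put into numbers. The Python sketch below counts how many distinct customer segments, and how many segment and experience combinations, a targeting setup could produce. The attribute names and counts are made-up assumptions, and it treats every combination of attribute values as possible.

```python
import math


def targeting_states(attribute_cardinalities, n_experiences):
    """Return (number of segments, number of segment/experience pairs)."""
    segments = math.prod(attribute_cardinalities.values())
    return segments, segments * n_experiences


# Two binary attributes and an A/B test: 4 segments, 8 combinations.
small = targeting_states({"tenure": 2, "area": 2}, n_experiences=2)

# Add a 3-level device attribute and a 50-level region attribute.
larger = targeting_states(
    {"tenure": 2, "area": 2, "device": 3, "region": 50}, n_experiences=2
)

print(small)   # (4, 8)
print(larger)  # (600, 1200)
```

Each added attribute multiplies, rather than adds to, the number of states someone may have to reason about when reconstructing what a customer saw.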

This is important to consider when you are thinking about what machine learning approach you use to induce the targeting logic.  While there has been a lot of excitement around Deep Learning, the resulting models are incredibly complex, making it very difficult for a human to quickly comprehend how the model will assign users to experiences.

Data Ethics
This can be a big issue when you need to assess the ethics of the targeting/decision logic.  While the input data may be innocent, it is possible that the output of the system in some way infringes on legal or regulatory constraints. In the near future, it is entirely possible that automated logic will need to undergo ethical/legal review. Human interpretable logic will be more likely to pass review, and will help to instill confidence in, and acceptance of, its host systems.

It is one of the reasons we have spent a large amount of our research here at Conductrics on coming up with algorithms that will produce targeting rules that are both machine as well as human consumable.

Marginal Value of Complexity
None of this is meant to imply that providing targeted and personalized experiences isn’t often well worth it. Rather it is to provide you with a framework for thinking about both the COSTS and the Benefits of targeting. This way you and your organization can ensure success once you do embark on predictive targeting by keeping this formula for the ROI of complexity in mind every step of the way:

In a way, you can think of your marketing system as one big computer program, that attempts to map customers to experiences in a way that you consider optimal. Without targeting, this program is relatively simple: it has fewer lines of code, it requires fewer data inputs to run, and you know pretty well what it is going to spit out as an answer. When you include targeting, your program will need many more lines of code, require a lot more data to run, and it may be very difficult to know what it will spit out as an answer.

So the question you need to spend some time thinking about, before you answer, is whether and when the complex program will be worth its extra cost over the simple one.

And once you do, please feel free to reach out to us to see how we can help make the transition as simple as possible.

## Big Data is Really About the Very Small

Awhile back I put together a fun list of the top 7 data scientists before there was Data Science. I got some great feedback on others that should be on the list (Tukey, Hopper, and even Florence Nightingale).

In hindsight I probably should have also included Edgar Codd. While at IBM, Codd developed the relational model for databases (data bank was the older term). You can see his 1970 paper here

While the current excitement is around Big and unstructured data, most modern databases in commercial use today, along with the query language they use, SQL, are due to Codd’s work. Codd put together 12 or so rules to define when a database system can claim to be a relational database. One of these primary rules is the rule of Guaranteed Access. Essentially, the rule states that each unique data element has a unique address. This allows us, the user, to specify and select each atomic element and grab it if we want. Which, really, is the whole purpose of having a database.
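
Codd's rule is usually stated as: every atomic value is reachable through the combination of table name, primary key value, and column name. As a small illustration, here is a Python sketch using the standard library's `sqlite3` module. The `customers` table, its rows, and the `fetch_datum` helper are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Edgar", "San Jose")],
)

# The full set of valid (table, column) addresses in this toy database.
ADDRESSABLE = {("customers", "name"), ("customers", "city")}


def fetch_datum(conn, table, column, key):
    """Fetch one atomic value by its address: (table, primary key, column)."""
    # Identifiers cannot be bound as SQL parameters, so this sketch only
    # interpolates names that appear in the whitelist above.
    if (table, column) not in ADDRESSABLE:
        raise ValueError("unknown address")
    row = conn.execute(
        f"SELECT {column} FROM {table} WHERE id = ?", (key,)
    ).fetchone()
    return None if row is None else row[0]


city = fetch_datum(conn, "customers", "city", 2)
print(city)  # San Jose
```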

The rule of Guaranteed Access got me thinking though. Maybe the current attempts to define Big Data in terms of quantity, or scale, or the value of data may be the wrong way to go about it. Rather than think about Big Data as a big collection of stuff, it may make more sense to think of Big Data as a capability we have about addressing and accessing data.

BIG DATA as a modified version of the Guaranteed Access Rule

“Big Data is the capacity to address and access any datum or event to any arbitrary scale or level of precision”

By framing it this way, we shed the ambiguous notions about quantity, size, and even value, and instead treat Big Data as a limiting process about our capacity to communicate with, and use, any present event or historical record of past events. In this alternative view, the metric for progression in the field stops being about the very bigness of data, but rather about the very smallness of the scale with which we can address and access data.

## Video Tutorial: Getting Started with Conductrics Web Actions

Conductrics is all about enhancing your website or other app by showing the best content variations for each visitor. There are basically two ways to use the service:

• As a developer, using our API and wrappers
• Conductrics Web Actions, which is our way to use the system without coding.

This video tutorial  focuses on the second option.

You’ll see how Web Actions makes it easy to:

• Create the content variations you want to try out on your pages, such as showing some portions of your page to some visitors and hiding them for other visitors. You can also do things like change headline text, swap images, insert new content, redirect some visitors to alternate landing pages, and so on.
• Set up reward triggers for your conversion events, so Conductrics can learn and report on which variations work best for which visitors.
• Target the variations to certain visitors based on geography, time, or your own custom data by setting up Content Targeting Rules.

If you want to try Web Actions out for yourself, get access by signing up at conductrics.com and check out how it works on your own pages. Thanks for watching!

## AB Testing: When Tests Collide

Normally, when we talk about AB Tests (standard or Bandit style), we tend to focus on things like the different test options, the reporting, the significance levels, etc.  However, once we start implementing tests, especially at scale, it becomes clear that we need a way to manage how we assign users to each test.  There are two main situations where we need to manage users:

1)      Targeted Tests – we have a single test that is appropriate only for some users.

2)      Colliding Tests – we have multiple separate tests running, that could potentially affect each other’s outcomes.

## The Targeted Test

The most obvious reason for managing a test audience, is that some tests may not be appropriate for all users. For example, you might want to include only US visitors for an Offer test that is based on a US only product.  That is pretty simple to do with Conductrics. You just set up a rule that filters visitors from the specified test (or tests) if they do not have the US-Visit feature. The set up in the UI looks like this:

What this is telling Conductrics is that for the Offer test, if the user is not from the US, then do not put them into the test and serve them the default option.  Keep in mind that only US visitors will be eligible for the test, and they will be the only users who show up in the reporting.  If you really just want to report the test results for different types of users, you just run the test normally and include the targeting features you want to report over.
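
The behavior of that filter can be sketched in a few lines of Python. This is not the Conductrics API or rule format; the `us_visit` key, the option names, and the function are hypothetical stand-ins that only mirror the logic described above.

```python
import random

DEFAULT_OPTION = "A"
OPTIONS = ["A", "B"]


def assign_offer_test(visitor, rng=random):
    """Return (option served, whether the visitor counts in the test report)."""
    if not visitor.get("us_visit", False):
        # Filtered out: serve the default and keep them out of the reporting.
        return DEFAULT_OPTION, False
    # Eligible: randomize as in any ordinary A/B test.
    return rng.choice(OPTIONS), True


print(assign_offer_test({"us_visit": False}))
print(assign_offer_test({"us_visit": True}))
```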

## Colliding Tests

Unless you are just doing a few one off tests, you probably have situations where you have multiple tests running at the same time. Depending on the specific situation you will need to make some decisions about how to control for how and when users can be exposed to a particular test.

We can use the same basic approach we used for US visitors, to establish some flow control over how users are assigned to multiple concurrent tests.

For example, perhaps the UX team wants to run a layout test, which will affect look and feel of every page on the site.  The hypothesis is that the new layout will make the user experience more compelling and lead to increased sales conversion on the site.

At the same time, the merchandising team wants to run an offer test on just a particular product section of the site. The merchandising team thinks that their new product offer will really incentivize users to buy and will increase sales in this particular product category.

## Strategy One: Separate Tests

The most common, and easiest, strategy is to just assume that the different tests don’t really impact one another, and just run each test without bothering to account for the other test.  In reality, this is often fine, especially if there is limited overlap.

## Strategy Two: The Multivariate Test.

We could combine both tests into one multivariate test, with two decisions: Layout and Offer.  This could work but, once you start to think about it, may not be the best way to go. For one, the test really only makes sense as a multivariate test if the user comes to the product section. Otherwise, they are never exposed to the product offer component of the test. Also, it assumes that both the UX and Merchandising teams plan to run each test for the same amount of time. What do we do if the UX team was only going to run the layout test for a week, but the merchandising team was planning to run the Offer test for two weeks?

## Strategy Three: Mutually Exclusive Tests

Rather than trying to force what are really two conceptually different tests into one multivariate test, we can instead run two mutually exclusive tests.  There are several ways to set this up in Conductrics.  As an example, here is one simple way to make sure users are assigned to just one of our tests.

Since the Layout test touches every page on the site, let’s deal with that first. A simple approach to keep some visitors free for the offer test is to randomly assign a certain percentage of users first to the layout test.  We can do that by setting up the following filter:

This rule will randomly assign 50% of the site’s visitors into the Layout test. The other 50% will not be assigned to the test (the % can be customized).  We now just need to set up a filter for the Offer test, that excludes visitors that have been placed into the Layout test.

This rule just says, exclude visitors who are in the layout test from the Offer test.  That’s it! Now you will be able to read your results without having to worry if another test is influencing the results.
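
The two filters together can be sketched as follows in Python. This is an illustration of the logic only, not how Conductrics implements it. One deliberate substitution: instead of a random draw, the sketch hashes the visitor id so that the same visitor always lands on the same side of the 50% split. The salt string, function names, and visitor ids are all assumptions.

```python
import hashlib


def bucket(visitor_id, salt):
    """Map a visitor id to a stable number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def in_layout_test(visitor_id, share=0.5):
    # First filter: a fixed share of all visitors enters the Layout test.
    return bucket(visitor_id, "layout-eligibility") < share


def in_offer_test(visitor_id, on_product_section):
    # Second filter: exclude anyone already placed in the Layout test.
    return on_product_section and not in_layout_test(visitor_id)


def assignments(visitor_id, on_product_section):
    return {
        "layout": in_layout_test(visitor_id),
        "offer": in_offer_test(visitor_id, on_product_section),
    }


print(assignments("visitor-42", on_product_section=True))
```

Because the Offer filter is defined in terms of the Layout filter, no visitor can ever be in both tests, whatever the split percentage is set to.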

What is neat is that by combining these assignment rules you can customize your testing experience almost any way you can think of.

## Strategy Four: Multiple Decision Point Agents

These eligibility rules make a lot of sense when we are running what are essentially separate tests. However, if we have a set of related tests that are all working toward the same set of goals, we can instead use a multiple decision point agent.  With multi-point agents, Conductrics will keep track of both user conversions and when the user goes from one test to another.  Multi-point agents are where you can take advantage of Conductrics’ integrated conversion attribution algorithms to solve these more complex joint optimization problems. We will cover the Multi-Point agents separately and in detail in an upcoming post.

Thanks for reading and we look forward to hearing from you in the comments.

# Passive Models

This is an idea for scaling out certain data when transitioning to a highly clustered architecture.

TL;DR Don’t just read, mostly subscribe.

Ideally suited for data that is read often, but rarely written; the higher the read:write ratio, the more you gain from this technique.

This tends to happen to some types of data when growing up into a cluster. Even if you have data that has a 2:1 ratio on a single server (a very small margin in this context, meaning it is read twice for every time it is written), when you scale it up (say, to two servers) you often don’t get a 4:2 ratio. Instead you get 4:1, because one of the two writes ends up being redundant (that is, if you can publish knowledge of the change fast enough that other edges don’t make the same change).

With many workloads, such as configuration data, you are quickly scaling at cN:1 with very large c [number of requests served between configuration changes], meaning that real-world e-commerce systems are doing billions of wasted reads of data that hasn’t changed.  Nearly all modern data stores can do reads like this incredibly fast, but they still cost something, produce no value, and compete for resources with requests that really do need to read information that has changed.  For configuration data on a large-scale site, c can easily be in the millions.

So, this is an attempt to rein in this cN:1 scaling and constrain it to N:1: one read per node per write, so a 32-server cluster would be 32:1 in the worst case, instead of millions to one.

## Pairing a Store with a Hub

defn: Hub – any library or service that provides a publish/subscribe API.

defn: Store – any lib/service that provides a CRUD API.

Clients use the Store’s CRUD as any ORM would, and aggressively cache the responses in memory. When a Client makes a change to data on the Store, it simultaneously publishes alerts through the Hub to all other Clients. Clients use these messages to invalidate their internal caches. The next time that resource is requested, its newly updated version is fetched from the Store.

Since the messages broadcast through the Hub do not cause immediate reads, bursts of writes can coalesce without causing a corresponding spike in reads. The read load experienced after a change is always the same, and is based on the data’s usage pattern and how you spread traffic around your cluster.

To stick with the example of configuration data, let’s suppose the usage pattern is to read the configuration on every request, with a cluster of web servers load balanced by a round-robin rule. When an administrative application changes and commits the configuration, it also invalidates the cached configuration on each web server through the Hub. Each subsequent request, as the round-robin proceeds around the cluster, will fetch an updated configuration directly from the Store. Load balancing rules that re-use servers, such as lowest-load, can have even higher cache efficiency.

From the perspective of the code using the Client, the writes made by others just seem to take a little bit longer to fully commit, and in exchange we never ask the database for anything until we know it has new information.
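
Here is a minimal in-process Python sketch of the Store and Hub pairing described above. It is an illustration under simplifying assumptions rather than a production design: the Hub is a plain in-memory publish/subscribe object instead of a network service, the Store is a dict that counts reads, message delivery is synchronous, and all class and topic names are invented for the example.

```python
from collections import defaultdict


class Store:
    """Stand-in for any CRUD service; counts reads so we can measure load."""

    def __init__(self):
        self._data = {}
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self._data.get(key)

    def write(self, key, value):
        self._data[key] = value


class Hub:
    """Stand-in for any publish/subscribe service."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in list(self._subscribers[topic]):
            handler(message)


class Client:
    def __init__(self, store, hub):
        self._store, self._hub = store, hub
        self._cache = {}
        hub.subscribe("invalidate", self._on_invalidate)

    def _on_invalidate(self, key):
        # Only drop the cached copy; the re-read happens lazily on next use.
        self._cache.pop(key, None)

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._store.read(key)
        return self._cache[key]

    def set(self, key, value):
        self._store.write(key, value)
        self._hub.publish("invalidate", key)


store, hub = Store(), Hub()
nodes = [Client(store, hub) for _ in range(32)]

nodes[0].set("config", {"version": 1})
for _ in range(1000):            # 32,000 requests in total
    for node in nodes:
        node.get("config")
reads_after_first_write = store.reads

for version in (2, 3, 4):        # a burst of three writes
    nodes[0].set("config", {"version": version})
for node in nodes:
    node.get("config")
reads_after_burst = store.reads - reads_after_first_write

print(reads_after_first_write, reads_after_burst)  # 32 32
```

In this toy run, 32,000 requests cost 32 Store reads, and a burst of three writes still costs only one read per node afterward, which is the N:1 behavior the technique is aiming for.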

## Further Work

The Store layer requires aggressive caching, which requires that you constrain the CRUD to things where you can hash and cache effectively. Map/reduce and the like are not allowed; it really is best for an ORM-like scenario, where you have discrete documents, and use summary documents more than complicated queries.

## Improving the Promises API

The Promises API seems to be everywhere these days, and it really is great at solving one of JavaScript’s weaknesses: complex dependencies between asynchronous code.

For those new to promises, their most basic form is a queue of functions waiting for some asynchronous operation to complete.  When the operation is complete, its result is fed to all waiting functions.

TL;DR The core API of a Promise object should be:

.wait(cb)       # cb gets (err, result) later
.finish(result) # cb will get (undefined, result)
.fail(err)      # cb will get (err, undefined)
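
To show how small this core really is, here is a minimal sketch of the three methods above. It is written in Python rather than JavaScript, so `None` plays the role of `undefined`. Beyond the three method names, the details are assumptions made for the sketch: callbacks registered after settling run immediately, and a promise can only be settled once.

```python
class Promise:
    """A queue of callbacks waiting for one operation to finish or fail."""

    def __init__(self):
        self._callbacks = []
        self._settled = False
        self._err = None
        self._result = None

    def wait(self, cb):
        # cb gets (err, result), now if already settled, otherwise later.
        if self._settled:
            cb(self._err, self._result)
        else:
            self._callbacks.append(cb)
        return self

    def _settle(self, err, result):
        if self._settled:          # assumption: later calls are ignored
            return self
        self._settled = True
        self._err, self._result = err, result
        callbacks, self._callbacks = self._callbacks, []
        for cb in callbacks:
            cb(err, result)
        return self

    def finish(self, result=None):
        return self._settle(None, result)   # cb will get (None, result)

    def fail(self, err):
        return self._settle(err, None)      # cb will get (err, None)


seen = []
p = Promise()
p.wait(lambda err, result: seen.append(("first", err, result)))
p.finish(42)
p.wait(lambda err, result: seen.append(("late", err, result)))
print(seen)  # [('first', None, 42), ('late', None, 42)]
```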

The Promises API’s true value comes from lifting some control out of the compiler’s hands, and into the hands of our runtime code using such a simple structure.  Rather than the syntax of the source code being the only description of the relationship between pieces of code (e.g. a callback pyramid), we now have a simple API for storing and manipulating these relationships.

In the widely used Promises/A, the API method .then() establishes such a relationship, but it fails in a number of ways for me.

The word ‘then’ is given a second meaning, already being used in “if this then that”. If not literally in your language (CoffeeScript), then in your internal dialogue when you are reading and writing conditional expressions of all kinds, such as this sentence.

Also, ‘then’ is a very abstract word, becoming any one of three different parts of speech depending on how you use it.  Good API methods should be simple verbs unless there is a really good reason.

I find that people who are new to Promises take a long time to see their value, and this overloading of an already abstract word, as its core method, is part of the problem.

So let’s imagine a better API, for fun, made of simple verbs that tell you exactly what is happening.

Q: What is the core service that the Promise API should provide?

A: To cause some code to wait for other code to either finish or fail.

I suggest that wait is the most accurate verb for the action here, and communicates immediately why I would want to use promises… because I need some code to wait for the promise to finish.

Using ‘then’ values the lyricism of the resulting code over its actual clarity, making it just a bit too clever.

Extensions to the API:

Many libraries add extensions for basic language statements, like assignment, delete, etc., but so far in my opinion this is just adding a function call and not really gaining anything, since these operations are never asynchronous.  In practical usage of promises to solve every day tasks, I would suggest some more pragmatic extensions based on common but difficult promises to make.

“I promise to [asynchronously] touch all the files” is an example of a promise that is currently hard to make: when each touch is asynchronous, you don’t know which file is the last, or when they are all complete. What you need are incremental promises.

promise.progress(current, [maximum]) # emits 'progress' events
promise.finish(delta)                # calls .progress(current+delta)

“I promise to recurse over all directories”, is extra hard because you don’t even know the size of the goal at the start, and must update that knowledge recursively.

# only finish() this promise after promise b has finished
promise.include(promise_b)

This enables you to create promises that are both recursive and incremental, which lets you create a tree of promises to represent any workflow, without leaking knowledge to (or requiring it of) the waiting code.
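
One possible reading of these extensions is sketched below in Python. The post only names `progress`, `finish(delta)`, and `include`, so the rest is assumption: the promise settles once `current` reaches `maximum`, `include` raises the goal by one and counts the included promise as one unit of work, and `on_progress` is a hypothetical stand-in for subscribing to 'progress' events.

```python
class IncrementalPromise:
    def __init__(self, maximum=1):
        self.current = 0
        self.maximum = maximum
        self.error = None
        self.done = False
        self._waiters = []
        self._listeners = []

    def wait(self, cb):
        if self.done:
            cb(self.error, None if self.error else self.current)
        else:
            self._waiters.append(cb)
        return self

    def on_progress(self, listener):
        self._listeners.append(listener)
        return self

    def _settle(self, err):
        if self.done:
            return self
        self.done, self.error = True, err
        waiters, self._waiters = self._waiters, []
        for cb in waiters:
            cb(err, None if err else self.current)
        return self

    def progress(self, current, maximum=None):
        if self.done:
            return self
        self.current = current
        if maximum is not None:
            self.maximum = maximum
        for listener in list(self._listeners):
            listener(self.current, self.maximum)
        if self.current >= self.maximum:   # assumption: goal reached means done
            self._settle(None)
        return self

    def finish(self, delta=1):
        return self.progress(self.current + delta)

    def fail(self, err):
        return self._settle(err)

    def include(self, other):
        # The goal grows by one unit that completes when `other` finishes.
        self.maximum += 1
        other.wait(lambda err, _res: self.fail(err) if err else self.finish(1))
        return self


# "Touch all the files": three independent completions, in any order.
events, outcome = [], []
touched = IncrementalPromise(maximum=3)
touched.on_progress(lambda cur, mx: events.append((cur, mx)))
touched.wait(lambda err, result: outcome.append((err, result)))
for _ in range(3):
    touched.finish(1)

# "Recurse over directories": the parent's goal grows as work is discovered.
parent, child = IncrementalPromise(maximum=1), IncrementalPromise(maximum=2)
parent.include(child)
parent.finish(1)   # parent's own work is done, but the child is still pending
child.finish(1)
child.finish(1)    # child completes, which completes the parent

print(events, outcome, parent.done)
```

The waiting code only ever calls `wait`; it never needs to know how many files or subdirectories were involved.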

I think the current Promises API has sliced the problem-space exactly right, but I think there are some pragmatic design choices one could make to get a better API at the end of the day.

## The World’s Top 7 Data Scientists before there was Data Science

I am often a bit late to the party and only recently saw Tim O’Reilly’s “The World’s 7 Most Powerful Data Scientists”. As data science has become a big deal, there have been several top data science lists floating around.

So for fun, I thought I would put together my own list of the top data scientists before there was data science.  The people listed here helped unearth key principles on how to extract information from data.  While obviously important, I didn’t want to include folks whose contribution was mostly on the development of some particular approach, method, or technology.

To a large degree, the people on this list helped lay the foundation for a lot of what currently goes on as data science.  By studying what these guys* worked on, I think you can deepen the foundation upon which your data science skills rest.  As a disclaimer, there are obviously way more than seven who made major contributions, but I wanted to riff on Tim’s piece, so seven it is.

So without further ado, on to the list:

1. Claude Shannon
I can’t imagine anyone arguing with putting C. Shannon on the list. Claude is often referred to as the father of information theory, which from my vantage point is Data Science, considering that information theory underpins almost all ML algorithms.  Claude Shannon came up with his groundbreaking work while at Bell Labs (as an aside, this is also where Vapnik and Guyon worked when they came out with their ’92 paper on using the Kernel trick for SVMs, although interestingly, they didn’t use the term support vector machine).
For a quick overview of Claude Shannon take a look here
And for his 1948 paper A Mathematical Theory of Communication go here

2. John Tukey
Tukey is a hero to all of the data explorers in the field, the folks who are looking for the relationships and stories that might be found in the data. He literally wrote the book on Exploratory Data Analysis. I guess you can see his work as the jumping off point for the Big Data gang. Oh yeah, he also co-developed a little something called the Fast Fourier Transform (FFT).

3. Andrey Kolmogorov
A real Andrey the Giant, maybe not on the order of an Euler, but this guy had breadth for sure. He gets on the list for coming up with Algorithmic Complexity theory. What’s that? It’s just the use of Shannon’s information theory to describe the complexity of algorithms in computer science. For a CS layman’s read (me), I recommend Gregory Chaitin’s book, Meta Math.  For what it’s worth, I’d argue that a life well lived is one that maximizes its Kolmogorov complexity.

4. Andrey Markov
Our second Andrey on the list, I had to give Markov the nod since we make heavy use of him here at Conductrics. Sequences of events (language, clicks, purchases, etc.) can be modeled as stochastic processes.  Markov described a class of stochastic process that is super useful for simply, but effectively modeling things like language, or attribution.  There are many companies and experts out there going on about attribution analysis, or braying about their simplistic AB testing tools, but if they aren’t at least thinking Markov, they probably don’t really know how to solve these problems.  The reality is, if you want to solve decision problems algorithmically, by optimizing over sequences of events, then you are likely going to invoke the Markov property (conditional independence) via Markov Chains or Decision Processes (MDP). See our post on Data Science for more on this.

5. Thomas Bayes
I think it is fair to say that Data Science tends to favor, or is at least open to, Bayesian methods.  While modern Bayesian statistics is much richer than a mere application of Bayes’ theorem, we can attribute at least some of its development back to Bayes.  To get a hang of Bayes’ theorem, I suggest playing around with the chain rule of probability to derive it yourself.
For having a major branch of statistics named after him and for being a fellow alum of the University of Edinburgh, Bayes is on the list. By the way, if you want to learn more about assumptions and interpretations of Bayesian methods check out our Data Science post for Michael Jordan’s lectures.

6. Solomon Kullback and Richard Leibler
Maybe not as big as some of the other folks on the list, so they have to share a place, but come on, the Kullback-Leibler Divergence (KL-D)?! That has got to be worth a place here. Mentioned in our post on Data Science resources, the KL-D is basically a measure of information gain (or loss). This turns out to be an important measure in almost every single machine learning algorithm you are bound to wind up using. Seriously, take a peek at the derivation of your favorite algorithms and you are likely to see the KL-D in there.

7. Edward Tufte
I used to work at an advertising agency back in the ‘90s, and while normally the ‘creatives’ would ignore us data folks (this was back before data was cool), one could often get a conversation going with some of the more forward thinking by name checking Tufte.  I even went to one of Tufte’s workshops during that time, where he was promoting his second book, Envisioning Information. There was a guest magician that did a little magic show as part of the presentation.  A minor irritation is the guru/follower vibe you can get from some people when they talk about him.  Anyway, don’t let that put you off since Tufte spends quality ink to inform you how to optimize the information contained in your ink.
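
For readers who want the formulas behind three of the entries above, here they are in their standard textbook forms (discrete case, with notation chosen for this note rather than taken from the original papers). The Markov property says the next state depends only on the current one:

$$P(X_{t+1} \mid X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} \mid X_t)$$

Bayes’ theorem falls out of writing the joint probability both ways with the chain rule, then dividing by $P(B)$, assuming $P(B) > 0$:

$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \quad\Longrightarrow\quad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

And the Kullback-Leibler divergence of a distribution $Q$ from a distribution $P$ is:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$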

As I mentioned at the beginning, this list is incomplete. I think a strong argument can be made for Alan Turing, Ada Lovelace, and Ronald Fisher.  I debated putting Gauss in here, but for some reason, he seems just too big to be labeled a data scientist. Please suggest your favorite data scientist before there was data science in the comments below.

*yeah, it’s all men – please call out the women that I have missed.
