The Management Consulting Playbook for AB Testing

Introduction

While I have spent most of my time on the ML side of Consulting projects, and I certainly enjoy that side more, I have
often had to put my dusty old statistician hat back on and measure the performance of the algorithms I have built. Most
of my experience here is in making sure that recommendation engines, once deployed, actually work.
In this article I will go over some of the major themes in AB Testing without getting into the specifics of measuring
whether a recommendation engine works.
I enjoy the "measurement science" of these sorts of problems; it is a constant reminder that old school statistics is
not dead. In practice, it also allows one to make claims based on simulations even when proofs are not immediately
available. And I have attached some useful simulations.

Basic Structure of AB Testing

We start with day one of AB Testing: typically you are in a room with people, and you need to convince them that
your recommendation engine, feature (a new button) or new pricing algorithm actually works. It is time to shift focus
from the predictive side of machine learning to the causal inference side of statistics (bear in mind, towards the end
of this article, I will briefly discuss the causal inference side of ML).

Phase 1: Experimental Context

  • Define the feature that is being analyzed, and ask whether you even need an AB test and whether the test is worth
    it. A great example of not needing a test is when your competition has already shipped the feature and you simply
    need to keep up.

  • Define a metric of interest (in many Consulting use cases this corresponds directly to the fee of the engagement, so
    it is very important).

  • Define some guardrail metrics. These are usually independent of the experiment you are trying to run (revenue, profit,
    total rides, wait time etc.), are the metrics the business truly cares about, and should not be harmed by the
    experiment.

  • Define a null hypothesis (usually an effect size of zero on the metric of interest). What would happen if you did
    not run the experiment? It might not be as easy as it seems. In a recommendation engine context this is usually non-ML
    recommendations or an existing ML recommendation.

  • Define a significance level $\alpha$: the maximum probability of rejecting the null hypothesis given that it is
    true. Usually $\alpha = 0.05$. Do not get too hung up on this value, it is a convention. It is increasingly difficult to
    justify any particular value; humans are notoriously bad at assigning probabilities to risk.

  • Define the alternative hypothesis: the minimum effect size you hope to see. For instance, if you ran an
    experiment such as PrimeTime pricing you would need to define the minimum change in the metric of your choice (will
    rides increase by some absolute number or by some percentage) that you expect to see. This is typically informed by
    prior knowledge. It could also be the minimum effect size you would need to see to make the feature worth it.

  • Define the power level $1-\beta$, usually $0.8$ (this represents the minimum probability of rejecting the
    null hypothesis when the alternative $H_1$ is true). This means there is at least an 80% probability of rejecting
    the null when $H_1$ is true.

  • Pick a test statistic whose distribution is known under both hypotheses. Usually the sample average of the metric of
    interest.

  • Pick the minimum sample size needed to achieve the desired power level $1-\beta$ given all the test parameters
    (see the sketch below).
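
To make that last step concrete, here is a minimal sketch in Python of how the minimum sample size per group falls out of the parameters above, using the standard normal approximation for a two-sample test of means. The function name and the example numbers are my own illustration, not a standard API.

```python
import numpy as np
from scipy.stats import norm

def min_sample_size_per_group(mde, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sample test of means.

    mde   : minimum detectable effect (absolute difference in the metric)
    sigma : standard deviation of the metric (assumed equal in both groups)
    alpha : two-sided significance level
    power : desired power, 1 - beta
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value under the null
    z_beta = norm.ppf(power)            # quantile that delivers the desired power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
    return int(np.ceil(n))

# e.g. detect a 0.5 unit lift on a metric with sd 5, at alpha = 0.05 and power = 0.8
print(min_sample_size_per_group(mde=0.5, sigma=5.0))
```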

Before we move on, it is important to note that all the considerations regarding $\alpha$, $1-\beta$, the minimum effect
size, etc. are highly subjective. Usually an existing statistics/measurement science team will dictate those to you. It
is also very likely you will need a "Risk" team to weigh in as well, so that the overall company risk profile is not
affected (say you are testing a recommendation engine, a new pricing algorithm, and cost cuts all at the same time; the
risk team might have an opinion on how much risk the company can take on at any given time). Some of this subjectivity
is what makes Bayesian approaches more appealing and motivates a simultaneous Bayesian approach to AB Testing.

Phase 2: Experiment Design

With the treatment, hypothesis, and metrics established, the next step is to define the unit of randomization for the
experiment and determine when each unit will participate. The chosen unit of randomization should allow accurate
measurement of the specified metrics, minimize interference and network effects, and account for user experience
considerations. The next couple of sections dive deeper into specific considerations when designing an experiment,
and how to overcome them statistically. In a recommendation engine context this can be quite complex: since both
treatment and control groups share the same pool of products, increased purchases driven by the online recommendation
can cause the stock to run out for in-person users. Control group purchases of competitor products could then simply
reflect that the product was not available, making the treatment look much more effective than it really was.

Unit of Randomization and Interference

Now that you have approval to run your experiment, you need to define the unit of randomization. This can be tricky
because often there are multiple levels at which randomization can be carried out. For example, you can randomize your
app experience by session, or you could randomize it by user. This leads to our first big problem in AB testing: what
is the best unit of randomization, and what are the pitfalls of picking the wrong one?

Example of Interference

Interference is a huge problem in recommendation engines for most retail problems. Let me walk you through an
interesting example we saw for a large US retailer. We were testing whether a certain high margin product should be
recommended to users. The treatment group was shown the product and the control group was not. The metric of interest
was the number of purchases of a basket of high margin products. The treatment group purchased the product at a
noticeably higher rate than the control group, and the experiment was significant at the chosen level. However, after
the experiment we noticed that the difference in sales largely closed up. This was because the treatment group was
buying up the stock of the product and the control group was not; the act of being recommended the product was a kind
of treatment in itself. This is a classic example of interference, and a good reason to use a formal causal inference
framework to measure the effect of the treatment. One way to do this is with DAGs, which I will discuss later. The best
way to run an experiment like this is to randomize by region. However, this is not always possible since regions share
the same stock. But I think you get the idea.

Robust Standard Errors in AB Tests

You can fix interference by clustering at the region level but very often this leads to another problem of its own. The
unit of treatment allocation is now fundamentally bigger than the unit at which you are conducting the analysis. We do
not really recommend products at the store level, we recommend products at the user level. So while we assign treatment
and control at the store level we are analyzing effects at the user level. As a consequence we need to adjust our
standard errors to account for this. This is where robust standard errors come in. In such a case, the standard errors
you naively calculate for the average treatment effect are lower than what they truly are, and this has far-reaching
consequences for power, effect size estimates and the like.

Recall the variance of the OLS estimator,

$$\mathrm{Var}(\hat\beta \mid X) = (X'X)^{-1} X'\,\Omega\, X\,(X'X)^{-1}, \qquad \Omega = \mathrm{Var}(\epsilon \mid X)$$

You can analyze the error covariance matrix $\Omega$ under various assumptions to estimate it.

Under homoscedasticity,

$$\hat\Omega = \hat\sigma^2 I \quad\Rightarrow\quad \widehat{\mathrm{Var}}(\hat\beta \mid X) = \hat\sigma^2 (X'X)^{-1}$$

Under heteroscedasticity (heteroscedasticity robust standard errors),

$$\hat\Omega = \mathrm{diag}\left(\hat\epsilon_1^2, \dots, \hat\epsilon_n^2\right)$$

And finally under clustering,

$$\hat\Omega = \mathrm{blockdiag}\left(\hat\epsilon_g \hat\epsilon_g'\right)_{g=1}^{G}$$

so that all cross terms $\hat\epsilon_i \hat\epsilon_j$ within a cluster $g$ are kept and all cross-cluster terms are set to zero.

The cookbook for estimating $\Omega$ is therefore multiplying your $\hat\epsilon\hat\epsilon'$ matrix element-wise with some kind of
banded (or block) matrix $A$ that represents your assumption,

$$\hat\Omega = A \odot \hat\epsilon\hat\epsilon'$$

Range of Clustered Standard Errors

$$\hat\sigma^2 (X'X)^{-1} \;\le\; \widehat{\mathrm{Var}}_{\text{cluster}}(\hat\beta) \;\le\; \bar{n}_g\,\hat\sigma^2 (X'X)^{-1}$$

Where the left boundary is the case in which no clustering occurs and all errors are independent, and the right boundary
is the case in which clustering is very strong (errors within a cluster of average size $\bar{n}_g$ are perfectly
correlated) but the variance between clusters is zero. It is fair to ask why we need to multiply by a matrix of
assumptions at all; the answer is that the assumption matrix scales the error to tolerable levels, so that it is neither
too large nor too small. By pure coincidence it is possible to have high covariance between any two observations;
whether to include it or not is dictated by your assumption matrix $A$.
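
As a concrete illustration, here is a minimal sketch of estimating the same treatment effect with naive, heteroscedasticity-robust, and cluster-robust standard errors using statsmodels. The simulated data and column names are mine; only the `cov_type` options are the library's.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Treatment is assigned at the region level, outcomes are observed at the
# user level, and users within a region share a common regional shock.
n_regions, users_per_region = 40, 250
region = np.repeat(np.arange(n_regions), users_per_region)
treated = np.repeat(rng.integers(0, 2, n_regions), users_per_region)
region_shock = np.repeat(rng.normal(0, 1.0, n_regions), users_per_region)
y = 0.2 * treated + region_shock + rng.normal(0, 1.0, len(region))

df = pd.DataFrame({"y": y, "treated": treated, "region": region})
model = smf.ols("y ~ treated", data=df)

naive = model.fit()                                # assumes i.i.d. errors
hc = model.fit(cov_type="HC1")                     # heteroscedasticity-robust
cl = model.fit(cov_type="cluster",
               cov_kwds={"groups": df["region"]})  # cluster-robust by region

for name, res in [("naive", naive), ("HC1", hc), ("clustered", cl)]:
    print(f"{name:>9}: se(treated) = {res.bse['treated']:.4f}")
```

The naive and HC1 standard errors come out far too small here because they ignore the within-region correlation; only the clustered version reflects the fact that treatment was really assigned to 40 regions, not 10,000 users.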

Power Analysis

I have found that power analysis is an overlooked part of AB Testing, in Consulting you will probably have to work with
the existing experimentation team to make sure the experiment is powered correctly. There is usually some amount of
haggling and your tests are likely to be underpowered. There is a good argument to be made about overpowering your
tests (such a term does not exist in statistics, who would complain about that), but this usually comes with some risk
to guardrail metrics, thus you are likely to under power your tests when considering a guardrail metric. This is OKAY,
because remember the level is a convention, and the power level is also a convention that by definition err
on the side of NOT rejecting the null. So if you see an effect with an underpowered test you do have some latitude to
make a claim while reduce the significance level of your test.

Power analysis focuses on reducing the probability of accepting the null hypothesis when the alternative is true. To
increase the power of an A/B test and reduce false negatives, three key strategies can be applied:

  • Effect Size: Larger effect sizes are easier to detect. This can be achieved by testing bold, high-impact changes or
    trying new product areas with greater potential for improvement. Larger deviations from the baseline make it easier
    for the experiment to reveal significant effects.

  • Sample Size: Increasing sample size boosts the test’s accuracy and ability to detect smaller effects. With more data,
    the observed metric tends to be closer to its true value, enhancing the likelihood of detecting genuine effects.
    Adding more participants or reducing the number of test groups can improve power, though there’s a balance to strike
    between test size and the number of concurrent tests.

  • Reducing Metric Variability: Less variability in the test metric across the sample makes it easier to spot genuine
    effects. Targeting a more homogeneous sample or employing models that account for population variability helps reduce
    noise, making subtle signals easier to detect.

Finally, experiments are often powered at 80% for a postulated effect size — enough to detect meaningful changes that
justify the new feature’s costs or improvements. Meaningful effect sizes depend on context, domain knowledge, and
historical data on expected impacts, and this understanding helps allocate testing resources efficiently.

Power 2

In an A/B test, the power of a test (the probability of correctly detecting a true effect) is influenced by the effect
size, sample size, significance level, and pooled variance. The formula for power, $1-\beta$, can be approximated as
follows for a two-sample test:

$$\text{Power} = \Phi\!\left(\frac{\delta}{\sigma_p\sqrt{2/n}} - z_{1-\alpha/2}\right)$$

Where,

  • $\delta$ is the Minimum Detectable Effect (MDE), representing the smallest effect size we aim to detect.

  • $z_{1-\alpha/2}$ is the critical z-score for a significance level $\alpha$
    (e.g., 1.96 for a 95% confidence level).

  • $\sigma_p$ is the pooled standard deviation of the metric across groups, representing the combined
    variability.

  • $n$ is the sample size per group.

  • $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution, which gives the
    probability that a value is below a given z-score.
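
A minimal sketch of this approximation in Python (the function name and example numbers are my own; the function simply evaluates the formula above):

```python
import numpy as np
from scipy.stats import norm

def two_sample_power(mde, sigma_pooled, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of means."""
    se = sigma_pooled * np.sqrt(2.0 / n_per_group)  # standard error of the difference in means
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(mde / se - z_crit)

# e.g. power to detect a 0.5 unit lift with pooled sd 5 and 1000 users per group
print(round(two_sample_power(mde=0.5, sigma_pooled=5.0, n_per_group=1000), 3))
```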

Understanding the Role of Pooled Variance

  • Power decreases as the pooled variance ($\sigma_p^2$) increases. Higher variance increases the "noise" in the data,
    making it more challenging to detect the effect (MDE) relative to the variation.

  • When pooled variance is low, the test statistic (the difference between groups) is less likely to be drowned out by
    noise, so the test is more likely to detect even smaller differences. This results in higher power for a given
    sample size and effect size.

Practical Implications

In experimental design:

  • Reducing $\sigma_p$ (e.g., by choosing a more homogeneous sample) improves power without increasing
    sample size.

  • If $\sigma_p$ is high due to natural variability, increasing the sample size compensates by lowering
    the standard error $\sigma_p\sqrt{2/n}$, thereby maintaining power.

Difference in Difference

Randomizing by region to solve interference can create a new issue: regional trends may bias results. If, for example, a
fast-growing region is assigned to the treatment, any observed gains may simply reflect that region’s natural growth
rather than the treatment’s effect.

In recommender system tests aiming to boost sales, retention, or engagement, this issue can be problematic. Assigning a
growing region to control and a mature one to treatment will almost certainly make the treatment group appear more
effective, potentially masking the true impact of the recommendations.

Linear Regression Example of DiD

To understand the impact of a new treatment on a group, let's consider an example where everyone in group $G$ receives a
treatment at time $t^*$. Our goal is to measure how this treatment affects outcomes over time.

First, we'll introduce some notation:

Define the indicator function $\mathbf{1}_S(x)$, which tells us if $x$ belongs to a specific set $S$:

$$\mathbf{1}_S(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}$$

Let $T = \{t : t \ge t^*\}$, which represents the period after treatment. We can use this to set up a few key indicators:

- $\mathbf{1}_T(t) = 1$ if the time is after the treatment, and $0$ otherwise.
- $\mathbf{1}_G(i) = 1$ if individual $i$ is in group $G$, meaning they received the treatment.
- $\mathbf{1}_T(t)\,\mathbf{1}_G(i) = 1$ if both $\mathbf{1}_T(t) = 1$ and $\mathbf{1}_G(i) = 1$, identifying those in the treatment group during
the post-treatment period.

Using these indicators, we can build a simple linear regression model:

$$y_{it} = \beta_0 + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\,\mathbf{1}_G(i) + \epsilon_{it}$$

In this model, the coefficient $\beta_3$ is the term we're most interested in. It represents the
difference-in-differences (DiD) effect:
how much the treatment group's outcome changes after treatment compared to the control group's change in the same
period. In other words, $\beta_3$
gives us a clearer picture of the treatment's direct impact, isolating it from other factors.
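
Here is a minimal sketch of estimating $\beta_3$ as an OLS interaction term with statsmodels; the simulated panel and the true effect size (0.5) are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

n_units, n_periods, t_star = 500, 10, 5
unit = np.repeat(np.arange(n_units), n_periods)
t = np.tile(np.arange(n_periods), n_units)
treated_group = np.repeat(rng.integers(0, 2, n_units), n_periods)  # 1_G(i)
post = (t >= t_star).astype(int)                                   # 1_T(t)

# Parallel trends hold by construction: a common time trend, plus a true DiD effect of 0.5
y = 1.0 + 0.3 * t + 0.8 * treated_group + 0.5 * post * treated_group \
    + rng.normal(0, 1.0, len(unit))

df = pd.DataFrame({"y": y, "post": post, "treated": treated_group, "unit": unit, "t": t})
res = smf.ols("y ~ post + treated + post:treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})           # cluster by unit

print(res.params["post:treated"])  # estimate of beta_3, the DiD effect
```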

For this model to work reliably, we rely on the parallel trends assumption: the control and treatment groups would
have followed similar paths over time if there had been no treatment. Although the initial levels of $y_{it}$ can differ
between groups, they should trend together in the absence of intervention.

You can always test whether your data satisfies the parallel trends assumption by looking at it. In a practical
environment, I have never really tested this assumption, for two big reasons (it is also why I personally think DiD is
not a great method):

  • If you need to test an assumption in your data, you likely have a problem with your data. If it is not obvious
    from some non-statistical argument, plot, etc., you are unlikely to be able to convince a stakeholder that it is a good
    assumption.
  • The data required to test this assumption usually invalidates the need for it. If you have data to test this assumption, you
    likely have enough data to run a more sophisticated model than DiD (like CUPED).

Having said all that, here are some ways you can test the parallel trends assumption:

  • Visual Inspection:

    • Plot the average outcome variable over time for both the treatment and control groups, focusing on the
      pre-treatment period. If the trends appear roughly parallel before the intervention, this provides visual evidence
      supporting the parallel trends assumption.

    • Make sure any divergence between the groups only occurs after the treatment.

  • Placebo Test:

    • Pretend the treatment occurred at a time prior to the actual intervention and re-run the DiD analysis. If you find
      a significant “effect” before the true treatment, this suggests that the parallel trends assumption may not hold.

    • Use a range of pre-treatment cutoff points and check if similar differences are estimated. Consistent non-zero
      results may indicate underlying trend differences unrelated to the actual treatment.

  • Event Study Analysis (Dynamic DiD):

    • Extend the DiD model by including lead and lag indicators for the treatment. For example:
      $$y_{it} = \beta_0 + \sum_{k=-K}^{-1} \gamma_k \mathbf{1}_{T+k}(t)\,\mathbf{1}_G(i) + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\,\mathbf{1}_G(i) + \epsilon_{it}$$
      where $\gamma_k$ captures pre-treatment effects.

    • If pre-treatment coefficients (leads) are close to zero and non-significant, it supports the parallel trends
      assumption. Large or statistically significant leads could indicate violations of the assumption.

  • Formal Statistical Tests:

    • Run a regression on only the pre-treatment period, introducing an interaction term between time and group to test
      for significant differences in trends (a minimal sketch of this pre-trend regression appears after this list):

      $$y_{it} = \alpha_0 + \alpha_1 \mathbf{1}_G(i) + \alpha_2 t + \alpha_3 (\mathbf{1}_G(i) \times t) + \epsilon_{it}$$

    • If the coefficient $\alpha_3$ on the interaction term is close to zero and statistically insignificant, this
      supports the parallel trends assumption. A significant $\alpha_3$ would indicate a pre-treatment trend difference,
      which would challenge the assumption.

  • Covariate Adjustment (Conditional Parallel Trends):

    • If parallel trends don’t hold unconditionally, you might adjust for observable characteristics that vary between
      groups and influence the outcome. This is a more relaxed “conditional parallel trends” assumption, and you could
      check if trends are parallel after including covariates in the model.
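
Here is a minimal sketch of the pre-trend regression from the "Formal Statistical Tests" item above, reusing a panel shaped like the one in the DiD example (columns `y`, `treated`, `t`); the function name and column names are my own.

```python
import statsmodels.formula.api as smf

def pre_trend_test(df, t_star):
    """Regress y on group, time and their interaction using only pre-treatment data."""
    pre = df[df["t"] < t_star]
    res = smf.ols("y ~ treated + t + treated:t", data=pre).fit()
    # alpha_3 is the coefficient on the interaction; it should be close to zero
    # and insignificant if the parallel trends assumption holds pre-treatment.
    return res.params["treated:t"], res.pvalues["treated:t"]

# usage sketch, with df and t_star as in the DiD example:
# coef, pval = pre_trend_test(df, t_star=5)
```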

If you can make all this work for you, great; I never have. In the dynamic world of recommendation engines (especially
always "online" recommendation engines) it is very difficult to find a reasonably good cut-off point for the placebo
test. And the event study analysis is usually not very useful since the treatment is usually ongoing.

Peeking and Early Stopping

Your test is running, and you're getting results: some look good, some look bad. Let's say you decide to stop early and reject the null hypothesis because the data looked good. What could happen? Well, you shouldn't. In short, you're inflating the Type I error rate of the test. A quick simulation can show the difference: with early stopping or peeking, your rejection rate of a true null hypothesis is much higher than the 0.05 you intended. This isn't surprising, since checking the test statistic at many sample sizes gives you many chances to reject the null when it's true.

Early stopping isn't only a matter of self-control, though. It can also help prevent a bad experiment from affecting critical guardrail metrics, letting you limit the damage while still gathering the information you need. Another example is testing expendable items. Think about a magazine of bullets: if you test by firing every bullet, you're guaranteed to know they all work, but now you have no bullets left. So you might rephrase the experiment as: how many bullets do I need to fire to know this magazine works?

In consulting you are going to peek early; you have to live with it. For one reason or another, a bug in production, an eager client, whatever the case, you are going to peek, so you had better prepare accordingly.

Simulated Effect of Peeking on Experiment Outcomes

[Figure: distribution of simulated z-statistics under the null, (a) without peeking and (b) with peeking, each panel annotated with the fraction of simulations that reject the null.]

Under a given null hypothesis, we run 100 simulations of experiments and record the z-statistic for each. We do this once without peeking, letting each experiment run for a fixed number of observations. In the peeking case, we stop whenever the z-statistic crosses the rejection boundary, but only after a minimum number of observations has accumulated.
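
A minimal sketch of that simulation is below; the number of runs, the maximum sample size and the minimum peeking point are illustrative choices, not necessarily the values behind the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_rate(peek, n_sims=100, n_max=2000, n_min=200, z_crit=1.96):
    """Fraction of simulations that reject a true null hypothesis."""
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(0.0, 1.0, n_max)        # data generated under H0: mean = 0, known sd = 1
        n = np.arange(1, n_max + 1)
        z = np.cumsum(x) / np.sqrt(n)          # running z-statistic after each observation
        if peek:
            # stop (and reject) the first time |z| crosses the boundary after n_min observations
            rejections += np.any(np.abs(z[n_min - 1:]) > z_crit)
        else:
            rejections += abs(z[-1]) > z_crit  # look only once, at the end
    return rejections / n_sims

print("without peeking:", rejection_rate(peek=False))  # close to 0.05
print("with peeking:   ", rejection_rate(peek=True))   # substantially higher
```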

Sequential Testing for Peeking

The Sequential Probability Ratio Test (SPRT) compares the likelihood ratio at the $n$-th observation, given by:

$$\Lambda_n = \frac{L_1(x_1, \dots, x_n)}{L_0(x_1, \dots, x_n)} = \prod_{i=1}^{n} \frac{f(x_i \mid H_1)}{f(x_i \mid H_0)}$$

where $L_0$ and $L_1$ are the likelihood functions under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively.

The test compares the likelihood ratio to two thresholds, $A$ and $B$ (with $B < A$), and the decision rule is:

$$\begin{cases} \Lambda_n \ge A & \text{accept } H_1 \text{ and stop} \\ \Lambda_n \le B & \text{accept } H_0 \text{ and stop} \\ B < \Lambda_n < A & \text{continue sampling} \end{cases}$$

The thresholds $A$ and $B$ are determined by the desired error probabilities. For a significance level $\alpha$ (probability of a Type I error) and power $1-\beta$ (probability of detecting a true effect when $H_1$ is true), the thresholds are approximately:

$$A \approx \frac{1-\beta}{\alpha}, \qquad B \approx \frac{\beta}{1-\alpha}$$

Normal Distribution

This test is in practice a lot easier to carry out for certain distributions like the normal distribution. Assume an unknown mean and known variance $\sigma^2$, and test $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$.

The sequential rule becomes the recurrent sum (with $S_0 = 0$)

$$S_n = S_{n-1} + \log\frac{f(x_n \mid \mu_1)}{f(x_n \mid \mu_0)} = S_{n-1} + \frac{\mu_1 - \mu_0}{\sigma^2}\left(x_n - \frac{\mu_0 + \mu_1}{2}\right)$$

With the stopping rule

  • $S_n \ge \log A$: Accept $H_1$

  • $S_n \le \log B$: Accept $H_0$

  • $\log B < S_n < \log A$: continue
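
A minimal sketch of this normal-mean SPRT; the function and its arguments are my own framing of the recursion above.

```python
import numpy as np

def sprt_normal_mean(x_stream, mu0, mu1, sigma, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: mean = mu0 vs H1: mean = mu1, with known sigma."""
    log_A = np.log((1 - beta) / alpha)   # upper threshold: accept H1
    log_B = np.log(beta / (1 - alpha))   # lower threshold: accept H0
    s = 0.0
    for n, x in enumerate(x_stream, start=1):
        s += (mu1 - mu0) / sigma**2 * (x - (mu0 + mu1) / 2.0)
        if s >= log_A:
            return "accept H1", n
        if s <= log_B:
            return "accept H0", n
    return "undecided", n

rng = np.random.default_rng(7)
# Data actually generated under H1 (mean 0.2): the test should usually stop early and accept H1.
print(sprt_normal_mean(rng.normal(0.2, 1.0, 5000), mu0=0.0, mu1=0.2, sigma=1.0))
```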

There is another elegant method outlined in Evan Miller's blog post, which I will not derive here but simply state for brevity (it is also used at Etsy, so there is certainly some benefit to it). It is a very good read and I highly recommend it.

  • At the beginning of the experiment, choose a sample size $N$.
  • Assign subjects randomly to the treatment and control, with 50% probability each.
  • Track the number of incoming successes from the treatment group. Call this number $T$.
  • Track the number of incoming successes from the control group. Call this number $C$.
  • If $T - C$ reaches $2\sqrt{N}$, stop the test. Declare the treatment to be the winner.
  • If $T + C$ reaches $N$, stop the test. Declare no winner.
  • If neither of the above conditions is met, continue the test.
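
A minimal sketch of that stopping rule in code (my own paraphrase of the steps above; `events` is assumed to be the stream of incoming successes labeled by group):

```python
import math

def simple_sequential_test(events, n_target):
    """events: iterable of 'T' or 'C', one entry per incoming success."""
    boundary = 2 * math.sqrt(n_target)
    t = c = 0
    for group in events:
        t += group == "T"
        c += group == "C"
        if t - c >= boundary:
            return "treatment wins", t, c
        if t + c >= n_target:
            return "no winner", t, c
    return "keep collecting data", t, c
```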

Using these techniques you can "peek" at the test data as it comes in and decide to stop as your requirements dictate. This is very useful, as the following simulation using this more complex criterion shows. Note that you want to verify two things:

  • Does early stopping under the null hypothesis accept the null in approximately a $1-\alpha$ fraction of simulations once the stopping criterion is reached, and does it do so fast?

  • Does early stopping under the alternative reject the null hypothesis in a $1-\beta$ fraction of simulations, and does it do so fast?

The answer to these two questions is not always symmetrical, and it seems that we need more samples to reject the null (case 2) than to accept it (case 1). Which is as it should be! But in both cases, as the simulations below show, you need significantly fewer samples than before.

CUPED and Other Similar Techniques

Recall our diff-in-diff equation,

$$y_{it} = \beta_0 + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\,\mathbf{1}_G(i) + \epsilon_{it}$$

Diff-in-diff is nothing but CUPED with the pre-treatment outcome as the control variate and the adjustment coefficient $\theta$ fixed at 1. I state this without proof; I was not able to find a clear one anywhere.

Consider also the auto-regression with control variates, i.e. the regression that adds the lagged pre-treatment outcome $y_{i,t-1}$ as a covariate alongside the treatment terms. This is also NOT equivalent to CUPED, nor is it a special case. Again, I was not able to find a good proof anywhere.

Multiple Hypotheses

In most of the introduction, we set the scene by considering only one hypothesis. However, in real life you may want to test multiple hypotheses at the same time.

  • You may be testing multiple hypotheses even if you did not realize it, such as over time. In the example of early stopping you are actually checking multiple hypotheses. One at every time point.

  • You truly want to test multiple features of your product at the same time and want to run one test to see if the results got better.

Regression Model Setup

We consider a regression model with three treatments, $T_1$, $T_2$, and $T_3$, to study their effects on a continuous outcome variable, $y$. The model is specified as:

$$y_i = \beta_0 + \beta_1 T_{1i} + \beta_2 T_{2i} + \beta_3 T_{3i} + \epsilon_i$$

where:

  • $y_i$ is the outcome variable,

  • $T_{1i}$, $T_{2i}$, and $T_{3i}$ are binary treatment indicators (1 if the treatment is applied, 0 otherwise),

  • $\beta_0$ is the intercept,

  • $\beta_1$, $\beta_2$, and $\beta_3$ are the coefficients representing the effects of treatments $T_1$, $T_2$, and $T_3$, respectively,

  • $\epsilon_i$ is the error term, assumed to be normally distributed with mean 0 and variance $\sigma^2$.

Hypotheses Setup

We aim to test whether each treatment has a significant effect on the outcome variable $y$. This involves testing the null hypothesis that each treatment coefficient is zero.

The null hypotheses are formulated as follows:

$$H_0^{(1)}: \beta_1 = 0, \qquad H_0^{(2)}: \beta_2 = 0, \qquad H_0^{(3)}: \beta_3 = 0$$

Each null hypothesis represents the assumption that a particular treatment ($T_1$, $T_2$, or $T_3$) has no effect on the outcome variable $y$, implying that the treatment coefficient for that treatment is zero.

Multiple Hypothesis Testing

Since we are testing three hypotheses simultaneously, we need to control for the potential increase in false positives. We can use a multiple hypothesis testing correction method, such as the Bonferroni correction or the Benjamini-Hochberg procedure.

Bonferroni Correction

With the Bonferroni correction, we adjust the significance level for each hypothesis test by dividing it by the number of tests $m$. If we want an overall significance level of $\alpha = 0.05$, then each individual hypothesis would be tested at:

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m} = \frac{0.05}{3} \approx 0.0167$$

Benjamini-Hochberg Procedure

Alternatively, we could apply the Benjamini-Hochberg procedure to control the False Discovery Rate (FDR). The procedure involves sorting the p-values from smallest to largest and comparing each p-value $p_{(i)}$ with the threshold:

$$p_{(i)} \le \frac{i}{m}\,\alpha$$

where $i$ is the rank of the p-value and $m$ is the total number of tests. We declare all hypotheses up to the largest rank meeting this criterion as significant. This framework allows us to assess the individual effects of $T_1$, $T_2$, and $T_3$ while properly accounting for multiple hypothesis testing.
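
Both corrections are available in statsmodels; here is a minimal sketch with made-up p-values for the three treatment coefficients.

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.003, 0.021, 0.047]  # hypothetical p-values for beta_1, beta_2, beta_3

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni reject:", reject_bonf, "adjusted p:", p_bonf.round(3))
print("BH (FDR) reject:  ", reject_bh, "adjusted p:", p_bh.round(3))
```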

Variance Reduction: CUPED

When analyzing the effectiveness of a recommender system, sometimes your results are skewed by high variance in the metric, i.e. $\mathrm{Var}(y)$ is large. One easy way to fix this is by using the usual suite of outlier removal techniques. However, outlier removal is a difficult thing to define statistically, and very often you may be losing "whales", customers who are truly large consumers of a product.
One easy way to reduce the variance would be to normalize the metric by its overall mean, $\bar{y}$. An even better way is to normalize the metric by that user's own pre-experiment mean, $\bar{y}_i^{\text{pre}}$. This is the idea behind CUPED.

Consider the regression form of the treatment equation,

$$y_i = \alpha + \tau D_i + \epsilon_i$$

where $D_i$ is the treatment indicator and $\tau$ is the treatment effect.

Assume you have data about the metric from before the experiment, with values $x_i = y_i^{\text{pre}}$, where the superscript denotes the individual's outcome before the experiment was even run, $t < t^*$.

CUPED removes the part of $y_i$ that is predictable from $x_i$,

$$\tilde{y}_i = y_i - \theta\,(x_i - \bar{x}), \qquad \theta = \frac{\mathrm{Cov}(y, x)}{\mathrm{Var}(x)}$$

This is like running a regression of $y_i$ on $x_i$ and keeping the residuals.

Now, use those residuals in the treatment equation above,

$$\tilde{y}_i = \alpha + \tau D_i + \epsilon_i$$

And then estimate the treatment effect $\tau$.
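
A minimal sketch of the CUPED adjustment; the simulated data and the true effect size (1.0) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 10_000
x = rng.gamma(2.0, 10.0, n)                            # pre-experiment spend per user
d = rng.integers(0, 2, n)                              # treatment assignment
y = 5.0 + 0.8 * x + 1.0 * d + rng.normal(0, 5.0, n)    # experiment-period spend, true effect = 1.0

theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())                   # CUPED-adjusted metric

def ate_and_se(outcome, d):
    diff = outcome[d == 1].mean() - outcome[d == 0].mean()
    se = np.sqrt(outcome[d == 1].var(ddof=1) / (d == 1).sum()
                 + outcome[d == 0].var(ddof=1) / (d == 0).sum())
    return diff, se

print("raw  :", ate_and_se(y, d))
print("cuped:", ate_and_se(y_cuped, d))                # similar estimate, much smaller standard error
```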

The statistical theory behind CUPED is fairly simple and setting up the regression equation is not difficult. However, in my experience, choosing the right window for pre-treatment covariates is extremely difficult, choose the right window and you reduce your variance by a lot. The right window depends a lot on your business.
Some key considerations,

  • Sustained purchasing behavior is a key requirement. If the pre-treatment metric $x_i$ is not a good predictor of $y_i$ over the chosen pre-treatment window, then the variance of the adjusted metric $\tilde{y}_i$ will remain high, defeating the purpose.
  • Longer windows come with computational costs.
  • In practice, because companies are testing things all the time you could have noise left over from a previous experiment that you need to randomize/ control for.

Simulating CUPED

One way you can guess a good pre-treatment window is by simulating the treatment effect for various levels of MDE (the change you expect to see in the metric) and plotting the probability of rejecting the null hypothesis when the alternative is true, i.e. power.

MDE vs Power for 2 Different Metrics

So you read off your hypothesized MDE and desired power, and every window whose curve passes to the left of that point is a good window.
As an example, if your MDE is small and you want high power, your only option may be the 16 week window.
Analogously, if your MDE is large enough at the power you want, the conventional method (with no CUPED) is fine.
Finally, for an MDE and power somewhere in between, a 1 week window is fine.

Finally, you can check that you have made the right choice by plotting the variance reduction factor against the pre-period (in weeks) and seeing whether it is high.

CUPED is a very powerful technique, but if I could give one word of advice to anyone trying to do it, it would be this: get the pre-treatment window right. This has more to do with business intelligence than with statistics. In this specific example longer windows gave higher variance reduction, but I have seen cases where a “sweet spot” exists.

Variance Reduction: CUPAC

As it turns out, we can control variance by other means using the same principle as CUPED. The idea is to use a control variate that is not a function of the treatment. Recall the adjusted-metric equation we used for CUPED, $\tilde{y}_i = y_i - \theta\,(x_i - \bar{x})$. Generally speaking, this is often posed as finding some covariate that is uncorrelated with the treatment but correlated with the outcome $y$.

You could use any covariate that is uncorrelated with the treatment but correlated with $y$. An interesting thing to try is to fit a highly non-linear machine learning model (such as a random forest or XGBoost) that predicts $y$ from a set of observable variables $Z$; call the fitted prediction $g(Z)$. Then use $g(Z)$ as your control variate.

Notice a few things here:
  • $g(Z)$ is not a function of the treatment but is a function of the observables $Z$.
  • $g(Z)$ does not (necessarily) need any pre-treatment data to be calculated, so it is okay if no pre-treatment data exists!
  • If pre-treatment data does exist, you can use it to fit $g$ and then predict at experiment time as well, which can only enhance the fit and thereby reduce variance even more.

If you really think about it, any process for creating pre-treatment covariates inevitably involves finding some variable highly correlated with the outcome and uncorrelated with the treatment and controlling for it. In CUPAC we just dump all of that into one ML model and let the model figure out the best way to control for variance using all the variables we threw into it.
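
A minimal sketch of CUPAC under these assumptions: an ML model $g$ is fit on pre-experiment data only (so it cannot absorb the treatment effect), and its prediction is used exactly like the CUPED covariate. The model choice (gradient boosting) and all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cupac_adjust(y, X, y_pre, X_pre):
    """Fit g on pre-experiment data, then use g(X) as the control variate for y."""
    g = GradientBoostingRegressor().fit(X_pre, y_pre)  # g never sees the treatment
    pred = g.predict(X)
    theta = np.cov(y, pred, ddof=1)[0, 1] / np.var(pred, ddof=1)
    return y - theta * (pred - pred.mean())            # CUPED-style adjustment with g(X)

# usage sketch: y, X come from the experiment period; y_pre, X_pre from before the experiment
# y_adj = cupac_adjust(y, X, y_pre, X_pre)
# ate = y_adj[treated == 1].mean() - y_adj[treated == 0].mean()
```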

I highly recommend CUPAC over CUPED; it is a more general technique and can be used in a wider variety of situations.
If you really want to, you can throw the CUPED covariate (the pre-treatment metric itself) into the mix as well!

A Key Insight: Recommendation Engines and CUPAC/ CUPED

Take a step back and think about what $g(Z)$ is really saying in the context of a recommender system: given some observables $Z$, can I predict my outcome metric? Let us say the outcome metric is some $f(s)$, where $s$ is sales.

What is a recommender system? It takes some $Z$ and predicts $s$.

This basically means that a pretty good function to control for variance is a recommender system itself! Now you can see why CUPAC is so powerful: it is a way to control for variance using a recommender system itself, and you have all the pieces ready for you. HOWEVER! You cannot use the recommender system you are currently testing as your $g$, because that would mean $g(Z)$ is correlated with the treatment and would violate the assumption of uncorrelatedness. Usually, the existing recommender system (the pre-treatment one) can be used for this purpose. The final adjusted variable then has a nice interpretation: it is not the difference between what people truly did and the recommended value, but rather the difference between the two recommender systems! Any model is a variance reduction model; it is just a question of how much variance it reduces. Since the existing recommender system is usually good enough, it is likely to reduce a lot of variance. If it is terrible (which is why they hired you in the first place) then this approach is unlikely to work. But in my experience, existing recommendations in industry are usually pretty good; it is a question of finding those last few drops of performance.

Conclusion

The above is pretty much all you can expect to find in a typical AB testing engagement: the core experimental design choices, the common pitfalls (interference, peeking, underpowered tests), and the variance reduction techniques that make the most of the data you have.