The Management Consulting Playbook for AB Testing

Introduction

While I have spent most of my time on the ML side of Consulting projects, and I certainly enjoy that, I have very often
had to put my dusty old statistician hat on and measure the performance of some of the algorithms I have built. Most of
my experience in this sense is in making sure that recommendation engines, once deployed, actually work.
In this article I will go over some of the major themes in AB Testing without getting into the specifics of measuring
whether a recommendation engine works.
I enjoy the “measurement science” of these sorts of problems; it is a constant reminder that old school statistics is
not dead. In practice, it also allows one to make claims based on simulations, even when proofs are not immediately
clear. I have attached some useful simulations throughout.

Basic Structure of AB Testing

We start with the day of the AB test itself: typically you are in a room with people, and you need to convince them that
your recommendation engine, feature (a new button, say) or new pricing algorithm actually works. It is time to shift focus
from the predictive side of machine learning to the causal inference side of statistics (bear in mind, towards the end
of this article, I will briefly discuss the causal inference side of ML).

Phase 1: Experimental Context

  • Define the feature that is being analyzed and ask whether you even need AB testing: is the test worth it? A great
    example of not needing a test is when your competition has already shipped the feature and you simply need to keep up.

  • Define a metric of interest (in many Consulting use cases this corresponds directly to the fee of the engagement, so
    it is very important).

  • Define some guardrail metrics; these are usually independent of the experiment you are trying to run (revenue, profit,
    total rides, wait time, etc.). These are usually the metrics that the business cares about and that should not be harmed
    by the experiment.

  • Define a null hypothesis (usually an effect size of zero on the metric of interest). Ask what would happen if you did
    not run the experiment; it might not be as easy as it seems. In a recommendation engine context this is usually non-ML
    recommendations or an existing ML recommendation.

  • Define a significance level $\alpha$: this is the maximum probability of rejecting the null hypothesis given that it is
    true. Usually $\alpha = 0.05$. Do not get too hung up on this value, it is a convention. It is increasingly difficult to
    justify any particular value; humans are notoriously bad at assigning probabilities to risk.

  • Define the alternative hypothesis: this is the minimum effect size you hope to see (the minimum detectable effect, MDE).
    For instance, if you ran an experiment such as PrimeTime pricing you would need to define the minimum change in the
    metric of your choice (will rides increase by some absolute amount or by some percentage) you expect to see. This is
    typically informed by prior knowledge. It could also be the minimum effect size that would make the feature worth it.

  • Define the power level $1 - \beta$; usually this is $0.8$ (it represents the minimum probability of rejecting the
    null hypothesis when the alternative $H_1$ is true). This means at the very least there is an 80% probability of
    rejecting the null when $H_1$ is true.

  • Pick a test statistic whose distribution is known under both hypotheses. Usually the sample average of the metric of
    interest.

  • Pick the minimum sample size needed to achieve the desired power level of $1 - \beta$ given all the test parameters
    (a minimal calculation sketch follows this list).
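
Here is a minimal sketch of that sample-size calculation, assuming a two-sided two-sample z-test on a roughly normal
metric; the `mde` and `sigma` values plugged in at the end are purely illustrative.

```python
# Minimal sketch of the Phase 1 sample-size calculation, assuming a two-sided
# two-sample z-test on a (roughly normal) metric. The metric standard deviation
# `sigma` and the minimum detectable effect `mde` are illustrative placeholders.
from scipy.stats import norm

def min_sample_size_per_group(mde, sigma, alpha=0.05, power=0.80):
    """Smallest n per arm so a two-sample test detects `mde` with the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # quantile corresponding to the desired power
    return 2 * ((z_alpha + z_power) * sigma / mde) ** 2

# Example: detect a 0.5-unit lift on a metric with standard deviation 4.
print(round(min_sample_size_per_group(mde=0.5, sigma=4.0)))  # roughly 1005 users per group
```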

Before we move on, it is important to note that all the considerations regarding $\alpha$, $\beta$, the MDE, etc. are
highly subjective. Usually an existing statistics/measurement science team will dictate those to you. It is also very likely
you will need a “Risk” team to have an opinion as well so that the overall company risk profile is not affected (say you
are testing out a recommendation engine, a new pricing algorithm, and you are doing cost cuts all at the same time, the
risk team might have an opinion on how much risk the company can take on at any given time). Some of this subjectivity
is what makes Bayesian approaches more appealing and motivates a simultaneous Bayesian approach to AB Testing.

Phase 2: Experiment Design

With the treatment, hypothesis, and metrics established, the next step is to define the unit of randomization for the
experiment and determine when each unit will participate. The chosen unit of randomization should allow accurate
measurement of the specified metrics, minimize interference and network effects, and account for user experience
considerations. The next couple of sections will dive deeper into certain considerations when designing an experiment,
and how to statistically overcome them. In a recommendation engine context this can be quite complex: both treatment
and control groups share the same pool of products, so increased purchases from the online recommendation can cause the
stock to run out for in-person users. Control group purchases of competitor products could then simply reflect the
product being unavailable, making the treatment look more effective than it really was.

Unit of Randomization and Interference

Now that you have approval to run your experiment, you need to define the unit of randomization. This can be tricky
because there are often multiple levels at which randomization can be carried out: for example, you can randomize your
app experience by session, or you can randomize it by user. This leads to our first big problem in AB testing: what is
the best unit of randomization, and what are the pitfalls of picking the wrong one?

Example of Interference

Interference is a huge problem in recommendation engines for most retail problems. Let me walk you through an
interesting example we saw for a large US retailer. We were testing the effect of recommending a certain high-margin
product to users. The treatment group was shown the product and the control group was not. The metric of interest
was the number of purchases of a basket of high-margin products. The treatment group purchased the product at a
meaningfully higher rate than the control group, and the experiment was significant at the chosen level. However,
after the experiment we noticed that the difference in sales largely closed up. This was because the treatment group
was buying up the stock of the product and the control group was not; the act of being recommended the product was a
kind of treatment in itself. This is a classic example of interference, and a good reason to use a formal causal
inference framework to measure the effect of the treatment. One way to do this is with DAGs, which I will discuss
later. The best way to run an experiment like this is to randomize by region. However, this is not always possible
since regions share the same stock. But I think you get the idea.

Robust Standard Errors in AB Tests

You can fix interference by clustering at the region level but very often this leads to another problem of its own. The
unit of treatment allocation is now fundamentally bigger than the unit at which you are conducting the analysis. We do
not really recommend products at the store level, we recommend products at the user level. So while we assign treatment
and control at the store level we are analyzing effects at the user level. As a consequence we need to adjust our
standard errors to account for this. This is where robust standard errors come in. In such a case, the standard errors
you naively calculate for the average treatment effect are lower than what they truly are. This has far-reaching effects
for power, minimum detectable effect size and the like.

Recall the variance of the OLS estimator,

$$\operatorname{Var}(\hat{\beta} \mid X) = (X'X)^{-1} X' \Omega X (X'X)^{-1}, \qquad \Omega = \operatorname{Var}(\epsilon \mid X).$$

You can analyze the error variance matrix $\Omega$ under various assumptions to estimate it.

Under homoscedasticity,

$$\Omega = \sigma^2 I \quad \Rightarrow \quad \operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}.$$

Under heteroscedasticity (heteroscedasticity-robust standard errors),

$$\hat{\Omega} = \operatorname{diag}(\hat{\epsilon}_1^2, \ldots, \hat{\epsilon}_n^2).$$

And finally under clustering, $\hat{\Omega}$ is block diagonal: within a cluster (a region, in our example) every pair of
errors is allowed to be correlated, while errors across clusters are assumed independent,

$$\hat{\Omega} = \operatorname{blockdiag}\left(\hat{\epsilon}_g \hat{\epsilon}_g'\right), \quad g = 1, \ldots, G.$$

The cookbook for estimating $\operatorname{Var}(\hat{\beta})$ is therefore multiplying your residual outer-product matrix
$\hat{\epsilon}\hat{\epsilon}'$ elementwise with some kind of banded matrix $A$ that represents your assumption,

$$\hat{\Omega} = \left(\hat{\epsilon}\hat{\epsilon}'\right) \odot A.$$

Range of Clustered Standard Errors

The choice of $A$ ranges from $A = I$, the left boundary, where no clustering occurs and all errors are independent, to
block matrices of ones, the right boundary, where the clustering within groups is very strong but the covariance between
clusters is zero. It is fair to ask why we need to multiply by a matrix of assumptions at all; the answer is that the
assumptions scale the estimated error to tolerable levels, so that it is neither too large nor too small. By pure
coincidence, it is possible to have high sample covariance between any two observations; whether to include it or not is
dictated by your assumption matrix $A$.
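
To make the clustering discussion concrete, here is a hedged sketch of cluster-robust standard errors using statsmodels,
on simulated data where treatment is assigned by store but analyzed by user; the column names, effect size and noise
levels are made up for illustration.

```python
# A minimal sketch of cluster-robust standard errors, assuming a store-level
# randomization analyzed at the user level. Column names (y, treated, store_id)
# and the simulated data are illustrative, not from the article.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_stores, users_per_store = 40, 50
store_id = np.repeat(np.arange(n_stores), users_per_store)
treated = np.repeat(rng.integers(0, 2, n_stores), users_per_store)        # assigned per store
store_shock = np.repeat(rng.normal(0, 1.0, n_stores), users_per_store)    # within-store correlation
y = 0.2 * treated + store_shock + rng.normal(0, 1.0, n_stores * users_per_store)
df = pd.DataFrame({"y": y, "treated": treated, "store_id": store_id})

naive = smf.ols("y ~ treated", data=df).fit()                  # assumes iid errors
clustered = smf.ols("y ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["store_id"]})   # cluster by unit of assignment

print(naive.bse["treated"], clustered.bse["treated"])  # the clustered SE is typically much larger
```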

Power Analysis

I have found that power analysis is an overlooked part of AB Testing; in Consulting you will probably have to work with
the existing experimentation team to make sure the experiment is powered correctly. There is usually some amount of
haggling, and your tests are likely to end up underpowered. There is a good argument to be made for overpowering your
tests (such a term does not really exist in statistics; who would complain about that), but this usually comes with some
risk to guardrail metrics, so you are likely to underpower your tests when a guardrail metric is at stake. This is OKAY,
because remember the $\alpha$ level is a convention, and the power level is also a convention, and both by definition err
on the side of NOT rejecting the null. So if you see an effect with an underpowered test you do have some latitude to
make a claim while tightening the significance level of your test.

Power analysis focuses on reducing the probability of accepting the null hypothesis when the alternative is true. To
increase the power of an A/B test and reduce false negatives, three key strategies can be applied:

  • Effect Size: Larger effect sizes are easier to detect. This can be achieved by testing bold, high-impact changes or
    trying new product areas with greater potential for improvement. Larger deviations from the baseline make it easier
    for the experiment to reveal significant effects.

  • Sample Size: Increasing sample size boosts the test’s accuracy and ability to detect smaller effects. With more data,
    the observed metric tends to be closer to its true value, enhancing the likelihood of detecting genuine effects.
    Adding more participants or reducing the number of test groups can improve power, though there’s a balance to strike
    between test size and the number of concurrent tests.

  • Reducing Metric Variability: Less variability in the test metric across the sample makes it easier to spot genuine
    effects. Targeting a more homogeneous sample or employing models that account for population variability helps reduce
    noise, making subtle signals easier to detect.

Finally, experiments are often powered at 80% for a postulated effect size — enough to detect meaningful changes that
justify the new feature’s costs or improvements. Meaningful effect sizes depend on context, domain knowledge, and
historical data on expected impacts, and this understanding helps allocate testing resources efficiently.

Power 2

In an A/B test, the power of a test (the probability of correctly detecting a true effect) is influenced by the effect
size, sample size, significance level, and pooled variance. The power, $1 - \beta$, can be approximated as follows for a
two-sample test:

$$\text{Power} \approx \Phi\left( \frac{\delta}{\sigma_p \sqrt{2/n}} - z_{1-\alpha/2} \right)$$

Where,

  • $\delta$ is the Minimum Detectable Effect (MDE), representing the smallest effect size we aim to detect.

  • $z_{1-\alpha/2}$ is the critical z-score for a significance level $\alpha$
    (e.g., 1.96 for a 95% confidence level).

  • $\sigma_p$ is the pooled standard deviation of the metric across groups, representing the combined
    variability.

  • $n$ is the sample size per group.

  • $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution, which gives the
    probability that a value is below a given z-score.
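
A small numerical sketch of this approximation, using the same symbols as the bullets above; the inputs are illustrative.

```python
# Sketch of the two-sample power approximation above, assuming a two-sided test
# with equal group sizes; variable names mirror the symbols in the bullets.
from scipy.stats import norm

def power_two_sample(mde, sigma_pooled, n_per_group, alpha=0.05):
    """Approximate power of a two-sample z-test for a difference in means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_effect = mde / (sigma_pooled * (2 / n_per_group) ** 0.5)
    return norm.cdf(z_effect - z_alpha)

print(power_two_sample(mde=0.5, sigma_pooled=4.0, n_per_group=1005))  # close to 0.80
```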

Understanding the Role of Pooled Variance

  • Power decreases as the pooled variance ($\sigma_p^2$) increases. Higher variance increases the "noise" in the data,
    making it more challenging to detect the effect (MDE) relative to the variation.

  • When pooled variance is low, the test statistic (difference between groups) is less likely to be drowned out by
    noise, so the test is more likely to detect even smaller differences. This results in higher power for a given
    sample size and effect size.

Practical Implications

In experimental design:

  • Reducing $\sigma_p$ (e.g., by choosing a more homogeneous sample) improves power without increasing
    sample size.

  • If $\sigma_p$ is high due to natural variability, increasing the sample size compensates by lowering
    the standard error $\sigma_p \sqrt{2/n}$, thereby maintaining power.

Difference in Difference

Randomizing by region to solve interference can create a new issue: regional trends may bias results. If, for example, a
fast-growing region is assigned to the treatment, any observed gains may simply reflect that region’s natural growth
rather than the treatment’s effect.

In recommender system tests aiming to boost sales, retention, or engagement, this issue can be problematic. Assigning a
growing region to control and a mature one to treatment will almost certainly make the treatment group appear more
effective, potentially masking the true impact of the recommendations.

Linear Regression Example of DiD

To understand the impact of a new treatment on a group, let’s consider an example where everyone in group $G$ receives a
treatment at time $t^*$. Our goal is to measure how this treatment affects outcomes over time.

First, we’ll introduce some notation:

Define the indicator $\mathbf{1}_S(x)$, which tells us if $x$ belongs to a specific set $S$:

$$\mathbf{1}_S(x) = \begin{cases} 1 & \text{if } x \in S \\ 0 & \text{otherwise} \end{cases}$$

Let $T = \{\, t : t > t^* \,\}$, which represents the period after treatment. We can use this to set up a few key indicators:

- $\mathbf{1}_T(t) = 1$ if the time $t$ is after the treatment, and $0$ otherwise.
- $\mathbf{1}_G(i) = 1$ if an individual $i$ is in group $G$, meaning they received the treatment.
- $\mathbf{1}_T(t)\,\mathbf{1}_G(i) = 1$ if both $\mathbf{1}_T(t) = 1$ and $\mathbf{1}_G(i) = 1$, identifying those in the treatment group during
the post-treatment period.

Using these indicators, we can build a simple linear regression model:

$$y_{it} = \beta_0 + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\, \mathbf{1}_G(i) + \epsilon_{it}$$

In this model, the coefficient $\beta_3$ is the term we’re most interested in. It represents the
difference-in-differences (DiD) effect:
how much the treatment group’s outcome changes after treatment compared to the control group’s change in the same
period. In other words, $\beta_3$
gives us a clearer picture of the treatment’s direct impact, isolating it from other factors.

For this model to work reliably, we rely on the parallel trends assumption: the control and treatment groups would
have followed similar paths over time if there had been no treatment. Although the initial levels of $y$ can differ
between groups, they should trend together in the absence of intervention.
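
Below is a minimal sketch of this DiD regression on simulated panel data; the group sizes, common trend and the true
effect of 0.5 are assumptions made purely for illustration.

```python
# Minimal sketch of the DiD regression above on simulated panel data; the
# effect size (0.5), group sizes and noise levels are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_units, n_periods, t_star = 200, 10, 5
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
group = (unit < n_units // 2).astype(int)             # 1_G(i): first half is treated
post = (time >= t_star).astype(int)                   # 1_T(t): post-treatment period
y = (1.0 + 0.3 * group                                # different levels are allowed
     + 0.1 * time                                     # common (parallel) trend
     + 0.5 * post * group                             # true DiD effect beta_3 = 0.5
     + rng.normal(0, 1, n_units * n_periods))
df = pd.DataFrame({"y": y, "post": post, "group": group})

did = smf.ols("y ~ post * group", data=df).fit()
print(did.params["post:group"])  # estimate of beta_3, should be close to 0.5
```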

You can always test whether your data satisfies the parallel trends assumption by looking at it. In a practical
environment, I have never really tested this assumption, for two big reasons (it is also why I personally think DiD is
not a great method):

  • If you need to test an assumption in your data, you are likely to have a problem with your data. If it is not obvious
    from some non-statistical argument or plot etc you are unlikely to be able to convince a stakeholder that it is a good
    assumption.
  • The data required to test this assumption, usually invalidates its need. If you have data to test this assumption, you
    likely have enough data to run a more sophisticated model than DiD (like CUPED).

Having said all that, here are some ways you can test the parallel trends assumption:

  • Visual Inspection:

    • Plot the average outcome variable over time for both the treatment and control groups, focusing on the
      pre-treatment period. If the trends appear roughly parallel before the intervention, this provides visual evidence
      supporting the parallel trends assumption.

    • Make sure any divergence between the groups only occurs after the treatment.

  • Placebo Test:

    • Pretend the treatment occurred at a time prior to the actual intervention and re-run the DiD analysis. If you find
      a significant “effect” before the true treatment, this suggests that the parallel trends assumption may not hold.

    • Use a range of pre-treatment cutoff points and check if similar differences are estimated. Consistent non-zero
      results may indicate underlying trend differences unrelated to the actual treatment.

  • Event Study Analysis (Dynamic DiD):

    • Extend the DiD model by including lead and lag indicators for the treatment. For example:
      $$y_{it} = \beta_0 + \sum_{k=-K}^{-1} \gamma_k \mathbf{1}_{T+k}(t)\, \mathbf{1}_G(i) + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\, \mathbf{1}_G(i) + \epsilon_{it}$$
      where $\gamma_k$ captures pre-treatment effects.

    • If pre-treatment coefficients (leads) are close to zero and non-significant, it supports the parallel trends
      assumption. Large or statistically significant leads could indicate violations of the assumption.

  • Formal Statistical Tests:

    • Run a regression on only the pre-treatment period, introducing an interaction term between time and group to test
      for significant differences in trends (a small sketch of this check follows this list):

      $y_{it} = \alpha_0 + \alpha_1 \mathbf{1}_G(i) + \alpha_2 t + \alpha_3 (\mathbf{1}_G(i) \times t) + \epsilon_{it}$

    • If the coefficient $\alpha_3$ on the interaction term is close to zero and statistically insignificant, this
      supports the parallel trends assumption. A significant $\alpha_3$ would indicate a pre-treatment trend difference,
      which would challenge the assumption.

  • Covariate Adjustment (Conditional Parallel Trends):

    • If parallel trends don’t hold unconditionally, you might adjust for observable characteristics that vary between
      groups and influence the outcome. This is a more relaxed “conditional parallel trends” assumption, and you could
      check if trends are parallel after including covariates in the model.
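
Here is the small sketch of the formal pre-trend check mentioned above, on a simulated pre-treatment panel that satisfies
parallel trends by construction; all names and parameters are illustrative.

```python
# A hedged sketch of the formal pre-trend check: using only pre-treatment
# periods, regress the outcome on group, time and their interaction. The
# simulated panel (parallel trends by construction) is purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_units, t_star = 200, 5
unit = np.repeat(np.arange(n_units), t_star)          # pre-treatment periods only
time = np.tile(np.arange(t_star), n_units)
group = (unit < n_units // 2).astype(int)
y = 1.0 + 0.3 * group + 0.1 * time + rng.normal(0, 1, n_units * t_star)
pre = pd.DataFrame({"y": y, "group": group, "time": time})

pretrend = smf.ols("y ~ group * time", data=pre).fit()
print(pretrend.params["group:time"], pretrend.pvalues["group:time"])
# An interaction near zero (and insignificant) is consistent with parallel pre-trends.
```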

If you can make all this work for you, great; I never have. In the dynamic world of recommendation engines (especially
always-“online” recommendation engines) it is very difficult to find a reasonably good cut-off point for the placebo
test. And the event study analysis is usually not very useful since the treatment is usually ongoing.

Peeking and Early Stopping

Your test is running, and you’re getting results; some look good, some look bad. Let’s say you decide to stop early and reject the null hypothesis because the data looked good. What could happen? Well, you shouldn’t. In short, you’re changing the operating characteristics (size and power) of the test. A quick simulation can show the difference: with early stopping or peeking, your rejection rate of the null hypothesis is much higher than the 0.05 you intended. This isn’t surprising: every peek is another chance for the test statistic to cross the threshold by chance, so repeated checking raises the probability of rejecting the null even when it’s true.

The appeal of early stopping isn’t just a lack of self-control. Stopping early can also prevent a bad experiment from affecting critical guardrail metrics, letting you limit the impact while still gathering the information you need. Another example is testing expendable items. Think about a magazine of bullets: if you test by firing each bullet, you’re guaranteed they all work, but now you have no bullets left. So you might rephrase the experiment as: how many bullets do I need to fire to know this magazine works?

In consulting you are going to peek early; you have to live with it. For one reason or another, a bug in production, an eager client, whatever the case, you are going to peek, so you had better prepare accordingly.

Simulated Effect of Peeking on Experiment Outcomes

Figure: simulated z-statistic paths, (a) without peeking and (b) with peeking, each annotated with the fraction of simulations that reject the null.

Under a given null hypothesis, we run 100 simulations of experiments and record the z-statistic for each. We do this once without peeking, letting each experiment run to its full planned number of observations. In the peeking case, we stop whenever the z-statistic crosses the rejection boundary, but only after some minimum number of observations.
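
A hedged sketch of that simulation: under the null, compare a single end-of-experiment test with a rule that peeks at every observation after some minimum. The sample size, minimum-observation rule and 1.96 boundary are assumptions chosen for illustration.

```python
# Sketch of the peeking simulation described above: under the null, compare the
# rejection rate of a fixed-horizon z-test with a "peek at every step" rule.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_sims, n_obs, min_obs, z_crit = 100, 1000, 100, norm.ppf(0.975)

fixed_rejections, peeking_rejections = 0, 0
for _ in range(n_sims):
    x = rng.normal(0, 1, n_obs)                      # null: true mean is zero
    running_z = np.cumsum(x) / np.sqrt(np.arange(1, n_obs + 1))
    if abs(running_z[-1]) > z_crit:                  # look only once, at the end
        fixed_rejections += 1
    if np.any(np.abs(running_z[min_obs:]) > z_crit): # peek at every observation
        peeking_rejections += 1

print(fixed_rejections / n_sims, peeking_rejections / n_sims)
# The fixed-horizon rate is near 0.05; the peeking rate is much higher.
```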

Sequential Testing for Peeking

The Sequential Probability Ratio Test (SPRT) compares the likelihood ratio at the $n$-th observation, given by:

$$\Lambda_n = \prod_{i=1}^{n} \frac{f_1(x_i)}{f_0(x_i)}$$

where $f_0$ and $f_1$ are the likelihood functions under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively.

The test compares the likelihood ratio to two thresholds, $A$ and $B$ (with $B < A$), and the decision rule is:

$$\begin{cases} \Lambda_n \ge A: & \text{accept } H_1 \\ \Lambda_n \le B: & \text{accept } H_0 \\ B < \Lambda_n < A: & \text{continue sampling} \end{cases}$$

The thresholds $A$ and $B$ are determined based on the desired error probabilities. For a significance level $\alpha$ (probability of a Type I error) and power $1 - \beta$ (probability of detecting a true effect when $H_1$ is true), the thresholds are given by:

$$A \approx \frac{1 - \beta}{\alpha}, \qquad B \approx \frac{\beta}{1 - \alpha}$$

Normal Distribution

This test is in practice a lot easier to carry out for certain distributions like the normal distribution. Assume an unknown mean, with $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$, and known variance $\sigma^2$.

The sequential rule becomes the recurrent sum (with $S_0 = 0$),

$$S_n = S_{n-1} + \frac{\mu_1 - \mu_0}{\sigma^2}\left(x_n - \frac{\mu_0 + \mu_1}{2}\right)$$

With the stopping rule

  • $S_n \le \log B$: Accept $H_0$

  • $S_n \ge \log A$: Accept $H_1$

  • $\log B < S_n < \log A$: continue
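
A minimal sketch of this normal-mean SPRT, following the recurrence and thresholds above; the choices of $\mu_0$, $\mu_1$, $\sigma$ and the simulated data streams are illustrative.

```python
# A minimal SPRT sketch for a normal mean with known variance, following the
# recurrence above; mu0, mu1, sigma and the simulated streams are illustrative.
import numpy as np

def sprt_normal(xs, mu0, mu1, sigma, alpha=0.05, beta=0.2):
    """Return ('H0' | 'H1' | 'continue', number of observations used)."""
    log_A, log_B = np.log((1 - beta) / alpha), np.log(beta / (1 - alpha))
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += (mu1 - mu0) / sigma**2 * (x - (mu0 + mu1) / 2)
        if s >= log_A:
            return "H1", n
        if s <= log_B:
            return "H0", n
    return "continue", len(xs)

rng = np.random.default_rng(4)
print(sprt_normal(rng.normal(0.0, 1.0, 2000), mu0=0.0, mu1=0.2, sigma=1.0))  # usually accepts H0 early
print(sprt_normal(rng.normal(0.2, 1.0, 2000), mu0=0.0, mu1=0.2, sigma=1.0))  # usually accepts H1 early
```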

There is another elegant method outlined in Evan Miller’s blog post, which I will not go into here but just state it for brevity (it is also used at Etsy, so there is certainly some benefit to it). It is a very good read and I highly recommend it.

  • At the beginning of the experiment, choose a sample size $N$.
  • Assign subjects randomly to the treatment and control, with 50% probability each.
  • Track the number of incoming successes from the treatment group. Call this number $T$.
  • Track the number of incoming successes from the control group. Call this number $C$.
  • If $T - C$ reaches $2\sqrt{N}$, stop the test. Declare the treatment to be the winner.
  • If $T + C$ reaches $N$, stop the test. Declare no winner.
  • If neither of the above conditions is met, continue the test.
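
A hedged sketch of this sequential rule as I have stated it in the list above; the planned $N$ and the conversion rates fed into it are made up.

```python
# Hedged sketch of the sequential rule stated above: stop when T - C reaches
# 2*sqrt(N) (treatment wins) or when T + C reaches N (no winner). The planned N
# and the conversion rates are invented inputs for illustration.
import numpy as np

def simple_sequential_test(treatment_successes, control_successes, n_planned):
    """Feed 0/1 successes in arrival order; returns 'treatment wins', 'no winner' or 'continue'."""
    boundary = 2 * np.sqrt(n_planned)
    t, c = 0, 0
    for t_hit, c_hit in zip(treatment_successes, control_successes):
        t, c = t + t_hit, c + c_hit
        if t - c >= boundary:
            return "treatment wins"
        if t + c >= n_planned:
            return "no winner"
    return "continue"

rng = np.random.default_rng(5)
print(simple_sequential_test(rng.binomial(1, 0.12, 20000), rng.binomial(1, 0.10, 20000), 1000))
```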

Using these techniques you can “peek” at the test data as it comes in and decide to stop as per your requirement. This is very useful, as the following simulation using this more complex criterion shows. Note that what you want to verify is two things:

  • Does early stopping under the null hypothesis accept the null in approximately a $1 - \alpha$ fraction of simulations once the stopping criterion is reached, and does it do so quickly?

  • Does early stopping under the alternative reject the null hypothesis in approximately a $1 - \beta$ fraction of simulations, and does it do so quickly?

The answer to these two questions is not always symmetrical; it seems we need more samples to reject the null (case 2) than to accept it (case 1), which is as it should be. But in both cases, as the simulations below show, you need significantly fewer samples than before.

CUPED and Other Similar Techniques

Recall our diff-in-diff equation,

$$y_{it} = \beta_0 + \beta_1 \mathbf{1}_T(t) + \beta_2 \mathbf{1}_G(i) + \beta_3 \mathbf{1}_T(t)\, \mathbf{1}_G(i) + \epsilon_{it}$$

Diff in Diff is nothing but CUPED with the adjustment coefficient $\theta$ (defined in the CUPED section below) fixed at 1 rather than estimated from the data. I state this without proof; I was not able to find a clear one anywhere.

Consider the auto-regression with control variates regression equation, in which the outcome is regressed on the treatment indicator and its own lagged value, $y_{it} = \beta_0 + \beta_1 D_i + \rho\, y_{i,t-1} + \epsilon_{it}$. This is also NOT equivalent to CUPED, nor is it a special case. Again, I was not able to find a good proof anywhere.

Multiple Hypotheses

In most of the introduction, we set the scene by considering only one hypothesis. However, in real life you may want to test multiple hypotheses at the same time.

  • You may be testing multiple hypotheses even if you did not realize it, such as over time. In the example of early stopping you are actually checking multiple hypotheses. One at every time point.

  • You truly want to test multiple features of your product at the same time and want to run one test to see if the results got better.

Regression Model Setup

We consider a regression model with three treatments, $T_1$, $T_2$, and $T_3$, to study their effects on a continuous outcome variable, $y$. The model is specified as:

$$y_i = \beta_0 + \beta_1 T_{1i} + \beta_2 T_{2i} + \beta_3 T_{3i} + \epsilon_i$$

where:

  • $y_i$ is the outcome variable,

  • $T_{1i}$, $T_{2i}$, and $T_{3i}$ are binary treatment indicators (1 if the treatment is applied, 0 otherwise),

  • $\beta_0$ is the intercept,

  • $\beta_1$, $\beta_2$, and $\beta_3$ are the coefficients representing the effects of treatments $T_1$, $T_2$, and $T_3$, respectively,

  • $\epsilon_i$ is the error term, assumed to be normally distributed with mean 0 and variance $\sigma^2$.

Hypotheses Setup

We aim to test whether each treatment has a significant effect on the outcome variable $y$. This involves testing the null hypothesis that each treatment coefficient is zero.

The null hypotheses are formulated as follows:

$$H_{0,1}: \beta_1 = 0, \qquad H_{0,2}: \beta_2 = 0, \qquad H_{0,3}: \beta_3 = 0$$

Each null hypothesis represents the assumption that a particular treatment (either $T_1$, $T_2$, or $T_3$) has no effect on the outcome variable $y$, implying that the treatment coefficient for that treatment is zero.

Multiple Hypothesis Testing

Since we are testing three hypotheses simultaneously, we need to control for the potential increase in false positives. We can use a multiple hypothesis testing correction method, such as the Bonferroni correction or the Benjamini-Hochberg procedure.

Bonferroni Correction

With the Bonferroni correction, we adjust the significance level for each hypothesis test by dividing it by the number of tests $m$. If we want an overall significance level of $\alpha = 0.05$, then each individual hypothesis would be tested at:

$$\alpha_{\text{per test}} = \frac{\alpha}{m} = \frac{0.05}{3} \approx 0.0167$$

Benjamini-Hochberg Procedure

Alternatively, we could apply the Benjamini-Hochberg procedure to control the False Discovery Rate (FDR). The procedure involves sorting the p-values from smallest to largest and comparing the $k$-th smallest p-value with the threshold $\frac{k}{m}\alpha$, where $k$ is the rank of the p-value and $m$ is the total number of tests. We declare all hypotheses up to the largest rank whose p-value meets this criterion as significant. This framework allows us to assess the individual effects of $T_1$, $T_2$, and $T_3$ while properly accounting for multiple hypothesis testing.
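
Both corrections are available in statsmodels; here is a short sketch, with three invented p-values standing in for the tests of $\beta_1$, $\beta_2$ and $\beta_3$.

```python
# Sketch of both corrections using statsmodels; the three p-values are invented
# stand-ins for the tests of beta_1, beta_2 and beta_3.
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.030, 0.045]
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(reject_bonf)  # Bonferroni: each p-value compared against alpha / 3
print(reject_bh)    # Benjamini-Hochberg: sorted p-values compared against (k/3) * alpha
```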

Variance Reduction: CUPED

When analyzing the effectiveness of a recommender system, sometimes your results are skewed by high variance in the metric, i.e. $\operatorname{Var}(y)$ is large. One easy way to fix this is by using the usual outlier removal suite of techniques. However, outlier removal is a difficult thing to statistically define, and very often you may be losing “whales”: customers who are truly large consumers of a product.
One easy way to reduce the variance would be to normalize the metric by its mean, i.e. work with $y_i - \bar{y}$. An even better way would be to normalize the metric by that user’s own pre-experiment mean, i.e. $y_i - \bar{y}_i^{\text{pre}}$. This is the idea behind CUPED.

Consider the regression form of the treatment equation,

$$y_i = \beta_0 + \delta D_i + \epsilon_i$$

where $D_i$ is the treatment indicator. Assume you have data about the metric from before, with values $x_i = y_i^{\text{pre}}$, where the subscript $i$ denotes that individual’s outcome before the experiment was even run.

CUPED adjusts the metric using this pre-experiment covariate,

$$\tilde{y}_i = y_i - \theta\,(x_i - \bar{x}), \qquad \theta = \frac{\operatorname{Cov}(y, x)}{\operatorname{Var}(x)}$$

This is like running a regression of $y$ on $x$ and keeping the residuals.

Now, use those residuals $\tilde{y}_i$ in the treatment equation above,

$$\tilde{y}_i = \beta_0 + \delta D_i + \tilde{\epsilon}_i$$

And then estimate the treatment effect $\delta$.
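
A minimal CUPED sketch following these equations; the simulated pre- and post-period metrics and the assumed lift are illustrative.

```python
# Minimal CUPED sketch following the equations above; the simulated pre- and
# post-period metrics and the true lift (0.1) are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 10_000
x = rng.gamma(2.0, 5.0, n)                       # pre-experiment metric (e.g. past spend)
d = rng.integers(0, 2, n)                        # treatment assignment
y = 0.8 * x + 0.1 * d + rng.normal(0, 2.0, n)    # post metric, strongly driven by x

theta = np.cov(y, x)[0, 1] / np.var(x)
y_cuped = y - theta * (x - x.mean())

naive = sm.OLS(y, sm.add_constant(d)).fit()
cuped = sm.OLS(y_cuped, sm.add_constant(d)).fit()
print(naive.bse[1], cuped.bse[1])                # the CUPED standard error is much smaller
```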

The statistical theory behind CUPED is fairly simple and setting up the regression equation is not difficult. However, in my experience, choosing the right window for pre-treatment covariates is extremely difficult; choose the right window and you reduce your variance by a lot. The right window depends a lot on your business.
Some key considerations,

  • Sustained purchasing behavior is a key requirement. If the pre-period metric $x$ is not a good predictor of $y$ over the experiment window, then the variance of $\tilde{y}$ will still be high, defeating the purpose.
  • Longer windows come with computational costs.
  • In practice, because companies are testing things all the time you could have noise left over from a previous experiment that you need to randomize/ control for.

Simulating CUPED

One way you can guess a good pre-treatment window is by simulating the treatment effect for various levels of MDE (the change you expect to see in $y$) and plotting the probability of rejecting the null hypothesis when the alternative is true, i.e. the power.

MDE vs Power for 2 Different Metrics

So you read off your hypothesized MDE and power, and every window whose curve lies to the left of that point is a good window.
As an example, for a small MDE at high power the only option in this simulation may be the 16 week window.
Analogously, for a large enough MDE at a modest power, the conventional method (with no CUPED) is fine, since it already attains that MDE at that power.
Finally, for an MDE and power somewhere in between, a 1 week window is fine.

Finally, you can check that you have made the right choice by plotting the variance reduction factor against the pre-period (weeks) and see if the variance reduction factor is high.

CUPED is a very powerful technique, but if I could give one word of advice to anyone trying to do it, it would be this: get the pre-treatment window right. This has more to do with business intelligence than with statistics. In this specific example longer windows gave higher variance reduction, but I have seen cases where a “sweet spot” exists.

Variance Reduction: CUPAC

As it turns out we can control variance by other means, using the same principle as CUPED. The idea is to use a control variate that is not a function of the treatment. Recall the adjustment we ran for CUPED, $\tilde{y}_i = y_i - \theta\,(x_i - \bar{x})$. Generally speaking, this is often posed as finding some covariate $x$ that is uncorrelated with the treatment but correlated with $y$.

You could use any $x$ that is uncorrelated with the treatment but correlated with $y$. An interesting thing to try would be to fit a highly non-linear machine learning model to $y$ (such as a random forest or XGBoost) using a set of observable variables $Z$; call it $g(Z)$. Then use $g(Z)$ as your $x$.

Notice here a few things:
- $g(Z)$ is not a function of the treatment but is a function of $Z$.
- $g(Z)$ does not (necessarily) need any data from the pre-treatment period to be calculated, so it is okay if no pre-treatment data exists!
- If pre-treatment data does exist, then you can use it to fit $g$ and then use it to predict at experiment time as well, so it can only enhance the performance of your fit and thereby reduce variance even more.

If you really think about it, any process to create pre-treatment covariates inevitably involves finding something highly correlated with the outcome and uncorrelated with the treatment and controlling for it. In CUPAC we just dump all of that into one ML model and let the model figure out the best way to control for variance using all the variables we threw into it.

I highly recommend CUPAC over CUPED; it is a more general technique and can be used in a wider variety of situations.
If you really want to, you can throw the pre-treatment metric $y^{\text{pre}}$ into the mix as well!
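
Here is a hedged CUPAC sketch: fit a gradient-boosted model $g(Z)$ to the outcome and use its prediction as the control variate; the features, noise levels and effect size are all invented for illustration.

```python
# Hedged CUPAC sketch: fit an ML model g(Z) to the outcome using observables Z,
# then use its prediction as the CUPED-style covariate. All inputs are invented.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 10_000
Z = rng.normal(0, 1, (n, 5))                              # observables, unrelated to assignment
d = rng.integers(0, 2, n)                                 # randomized treatment
y = np.sin(Z[:, 0]) + Z[:, 1] ** 2 + 0.1 * d + rng.normal(0, 1, n)

g = GradientBoostingRegressor().fit(Z, y)                 # ideally fit on pre-treatment data
x = g.predict(Z)                                          # control variate g(Z)

theta = np.cov(y, x)[0, 1] / np.var(x)
y_cupac = y - theta * (x - x.mean())
naive = sm.OLS(y, sm.add_constant(d)).fit()
cupac = sm.OLS(y_cupac, sm.add_constant(d)).fit()
print(naive.bse[1], cupac.bse[1])                         # variance-reduced standard error
```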

A Key Insight: Recommendation Engines and CUPAC/ CUPED

Take a step back and think about what $g(Z)$ is really saying in the context of a recommender system: given some features $Z$, can I predict my outcome metric? Let us say the outcome metric is some $y$, where $y$ is sales.

What is a recommender system? It takes some $Z$ and predicts $y$.

This basically means that a pretty good function to control for variance is a recommender system itself! Now you can see why CUPAC is so powerful: it is a way to control for variance using a recommender system itself. You have all the pieces ready for you. HOWEVER! You cannot use the recommender system you are currently testing as your $g(Z)$; that would mean $g(Z)$ is correlated with the treatment, which would violate the assumption of uncorrelatedness. Usually, the existing recommender system (the pre-treatment one) can be used for this purpose. The final adjusted variable then has a nice interpretation: it is not the difference between what people truly did and the recommended value, but rather the difference between the two recommender systems! Any model is a variance reduction model; it is just a question of how much variance it reduces. Since the existing recommender system is usually good enough, it is likely to reduce a lot of variance. If it is terrible (which is why they hired you in the first place) then this approach is unlikely to work. But in my experience, existing recommendations in industry are always pretty good; it is a question of finding those last few drops of performance improvement.

Conclusion

The above are pretty much all the major themes you can expect to find in a typical AB testing engagement.

No, You Cannot RCT Your Way to Policy

The Bayesian Policy Maker

The Big 3 of RCTs in Economics: Abhijit Banerjee, Esther Duflo and Michael Kremer. Prior to their work in Kenya and India, RCTs were relatively unheard of for policy evaluations in development economics.

Ah, they say, so here is what you do, you see, it’s very simple. You gather data, you gather all the facts, and then you do the statistics, you see, and then you make your decision. You see, a modern policymaker shouldn’t bother with the inconveniences of ideology and emotions et cetera; that stuff is for amateurs, you see.

Also, you attach a token picture of poor people being poor in a 3rd world country on a website for RCTs (the cover image above is taken from one such website, not sure why it is relevant to their study) and you are well on your way to success!

Rarefied air of RCTs

Angus Deaton and Nancy Cartwright are outspoken critics of RCTs. Much of this article is a summary of the key statistical issues with RCTs, from their seminal paper, _Understanding and misunderstanding randomized controlled trials_

In the hallowed halls of economics, evidence-based policy has long been the order of the day, with statisticians and economists working hand in glove to unravel the mysteries of various policies. This delightful dance of data was often accompanied by the sage nods of experts. Enter the Randomized Controlled Trial (RCT), purportedly requiring nary an assumption nor a whisper of prior knowledge.

Ah, but herein lies the rub! Some social scientists, in their wide-eyed admiration, have crowned RCTs as the veritable holy grail of evidence, declaring that any nugget of knowledge gleaned from an RCT is the unvarnished truth, thus tossing aside the cumbersome baggage of expert opinion. Combine this with the dazzling allure of Bayesian epistemology, and we have a recipe for an unearned swagger in the land of causal inference.

This article aims to lift the veil, to show that the RCT, for all its bravado, is not above the same constraints and foibles that bedevil other studies. The RCT is not a knight in shining armor, but a gallant figure subject to the same trials and tribulations as its more scholarly counterparts.

RCTs: A History

Map of current RCTs in the world.
RCTs have their roots in clinical and epidemiological studies. This is perhaps the first impediment to their use in economics and the social sciences. Social issues are often more complex and have more than one causal pathway, as opposed to drugs, which usually have one causal pathway and a very easily verifiable target (a bacterium or a virus). The second impediment is that while medicine and the social sciences often use overlapping terms, they often use quite different language when referring to RCTs. Thus what is known in medicine is often not known, and almost never salient, when considering an RCT in economics or the social sciences. We consider two issues:

  • Average treatment effects and why they are not the truth

  • How to use an RCT’s results once we have them

Bias and Precision

Any given statistical study usually reports both numbers. Low precision, low bias and high precision, high bias are both considered "good" studies. As you can see, however, low bias does not mean that any **one** arrow is close to the target; it simply means that the errors cancel out such that their midpoint is very close to the target.
Let us clarify two important terms: bias and precision. To a non-technical audience, the term "unbiased" often carries an unusually high status, perhaps because it is commonly associated with impartiality in political opinions. However, in statistics, being "unbiased" simply means that on average, the results are correct. It does not imply accuracy in every instance. Each individual RCT might produce highly erroneous results in either direction, but these errors tend to cancel out when averaged. Consequently, the fact that an RCT is unbiased provides limited value.

The second term, precision, refers to how far any one study's result is likely to fall from the average over repeated studies; for an unbiased study, this is how close a single trial is likely to land to the true value. RCTs are notoriously imprecise, as illustrated by studies that have documented large errors in both directions, such as those involving hormone replacement therapy (HRT). This lack of precision is well-known, and economists often seek to enhance precision by incorporating "biased" and "subjective" expert opinions.

Measure Theory and the ATE

Okay, maybe not everything, but it certainly helps to see exactly what parts of our statistical theory are "magical".
All misunderstandings about probability come from the confidence of individuals who have never had the wind knocked out of them by measure theory. And so in this section that is what we will do.

Fundamentally, the treatment effect equation for individual $i$ is given by

$$Y_i = \beta_i T_i + S_i$$

where the boolean variable $T_i$ is $1$ or $0$ accordingly as the $i$th individual is in treatment or control, $\beta_i$ is the individual treatment effect, and $S_i$ is the sum of all the other (observed and unobserved) causes of the outcome. Ideally we would like to measure $\beta_i$. That is, we would like to observe the same individual in treatment and control and measure the difference in outcome in the two cases. In the absence of this we can only observe $\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$, i.e. the difference in population means between the treated and un-treated populations. It is a remarkable theorem from statistical theory that the difference in these means is an unbiased estimator of the average treatment effect. This is remarkable because it requires very few if any assumptions. Recall that unbiased-ness buys us relatively little for a study done once, as the effect we observe in any one study could be completely random.

Below is a sketch of why this is the case,

$$\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}} = \bar{\beta}_{\text{treated}} + \left(\bar{S}_{\text{treated}} - \bar{S}_{\text{control}}\right), \qquad \mathbb{E}\left[\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}\right] = \mathbb{E}[\beta_i] + 0,$$

where this last equation follows from randomization and the linear nature of the expectation operator.

This is a vital weakness of the RCT: you can only ever get at the mean effect. You cannot get a meaningful measure of any other statistic. As economists we are very often concerned with the median, and the analogous argument fails there, because the median of a sum is not the sum of the medians; one can immediately see that $\operatorname{median}(\beta_i T_i + S_i)$ is a lot hairier to separate linearly than the expectation was. This is another critical weakness of an RCT: you can only tell what the treatment effect is in expectation. Though not useless, this is far from ideal when a statistician would rather know the entire distribution of outcomes, which is generally available with other forms of studies.

Randomization

This book does not mention or do justice to several key issues in randomization. This book is so enthusiastic about randomization one could mistake it for propaganda.
Randomization is often looked at as this perfect tool that answers all questions related to differences between the treatment and control group, but as we will see this is often not the case in practice. Recall,

$$\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}} = \bar{\beta}_{\text{treated}} + \left(\bar{S}_{\text{treated}} - \bar{S}_{\text{control}}\right)$$

Usually, the second term on the right is equated to $0$. But there is considerable sleight of hand involved here. While un-biasedness guarantees that the second term is $0$ in expectation, in any one trial we have no idea about the size of this term. This is referred to in the clinical trials literature as random confounding or realized (as in one realization of a trial) confounding.

Randomization and Balance

Exactly what randomization does is lost in popular parlance. There is often a misconception that randomization (in the sense of a laboratory clinical experiment) does as much as a perfect control. This is not the case; randomization is often far worse than a good control. This fact is often lost in the popular literature and can be captured by this quote from a World Bank manual, attributed to Gertler et al. 2016: "We can be confident that our estimated impact constitutes the true impact of the program, since we have eliminated all observed and unobserved factors that might plausibly explain the difference in outcomes." This statement confuses the fact that the second term is zero in expectation over many hypothetical trials (which this study did not do) with it being zero in any one trial. Popular economics literature is littered with such statements. It is this lack of nuance (and grace) that is the cause of RCTs being as widely misunderstood as they are.

Post Mortem

The economics itself often gets lost in RCTs. We often end up with a clever RCT, but one that does not do any 'real' economics. Maybe we should have an RCT evaluating our field.

I have many misgivings about Bayesian epistemology in general; I think it is a very narrow way of viewing the world. It is also pseudo-mathematical for a variety of reasons (see Pollock for a discussion of why it is not a tractable mathematical theory, but rather a subjective philosophy that appropriates real mathematics). For the reasons mentioned here, I think that RCTs do not often contribute enough signal to update probabilities about hypotheses in a meaningful way. In addition, it has become fashionable to quote a counter-intuitive or counter-theoretical result from an RCT to suggest that theory needs to change, rather than requiring more studies or more information about causal pathways to improve the body of scientific evidence in either direction. Finally, the idea that RCTs require little to no theory is patently false, since a good RCT requires a good control, which often requires a good theory, which in turn is subject to all the shortcomings of the human experience, such as political bias and subjectivity.

No causality in, no causality out.

Words to live by indeed.

References

John Pollock, Problems for Bayesian Epistemology (https://johnpollock.us/ftp/PAPERS/Problems%20for%20Bayesian%20Epistemology.pdf)
Angus Deaton and Nancy Cartwright, Understanding and misunderstanding randomized controlled trials (https://pubmed.ncbi.nlm.nih.gov/29331519/)