1 of 16

Ronny Kohavi, Alex Deng, Lukas Vermeer

A/B Testing Intuition Busters

Common Misunderstandings in Online Controlled Experiments

@RonnyK

© 2022 Ron Kohavi

2 of 16

Motivation

  • “6 hours of debugging can save you 5 minutes of reading documentation” -- Tweet by Jakob Cosoroabă
  • Two weeks of deep-diving into experiment results can save you an hour of reading this paper
  • Make sure you understand p-values and statistical power
  • Be skeptical of many published results


3 of 16

Tough Paper to Write

  • I’ve been involved in A/B tests at Amazon (Weblab), Microsoft (led ExP, the experimentation platform), and Airbnb (ERF). I co-authored a book on experimentation that is usually in the top 10 in Data Mining on Amazon, and I teach a quarterly Zoom class on A/B testing. Alex was at Microsoft ExP and Airbnb. Lukas was director of experimentation at Booking.com and is now at Vista
  • We called out mistakes by multiple vendors and book authors, and we shared (really, shredded) an example from GuessTheTest where several mistakes were made (with the owner’s permission)
  • The meta-reviewer wrote

I would very much encourage the authors to reread it and tone it down in parts… If not done so, the paper would in fact embarrass KDD for years…


4 of 16

P-Values

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof.

-- Greenland et al. (2016)


  • Vendors (e.g., Optimizely) try to hide the complexity by calling 1-(p-value) “confidence,” but it is misleading, as confidence is NOT the probability that the result is a true positive. Documentation is often wrong.
  • Book authors frequently get it wrong (see paper for examples)


5 of 16

What We Want vs. What We Get
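In symbols, a minimal sketch using the standard definitions: what we want is the probability the treatment works given the observed data; what the p-value gives is the probability of data at least this extreme, assuming the treatment does nothing:

\[
\underbrace{P(H_1 \mid \text{data})}_{\text{what we want}}
\;\ne\;
\underbrace{P(\text{data at least this extreme} \mid H_0)}_{\text{what we get: the p-value}}
\]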


 


6 of 16

 


 

Company / Source                    Success Rate    FPR (False Positive Risk)
Microsoft                           33%             5.9%
Avinash Kaushik                     20%             11.1%
Bing                                15%             15.0%
Booking.com, Google Ads, Netflix    10%             22.0%
Airbnb Search                       8%              26.4%
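A sketch of where the FPR column comes from, assuming alpha = 0.05 two-sided (0.025 in the relevant tail), 80% power, and treating the success rate as the prior probability that an idea truly works:

    # False Positive Risk: of the stat-sig results, what fraction are false positives?
    def fpr(success_rate, alpha_tail=0.025, power=0.80):
        pi = success_rate                   # prior P(idea truly works)
        false_pos = alpha_tail * (1 - pi)   # null ideas that cross the threshold
        true_pos = power * pi               # real effects that get detected
        return false_pos / (false_pos + true_pos)

    for source, rate in [("Microsoft", 0.33), ("Avinash Kaushik", 0.20), ("Bing", 0.15),
                         ("Booking.com, Google Ads, Netflix", 0.10), ("Airbnb Search", 0.08)]:
        print(f"{source}: FPR = {fpr(rate):.1%}")   # matches the FPR column above (up to rounding)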


7 of 16

Key Point: Surprising Results Require Strong Evidence – Lower P-values

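A sketch of the reasoning in odds form, assuming a test at level \(\alpha\) with power \(1-\beta\) and prior probability \(\pi\) that the idea truly works:

\[
\underbrace{\frac{P(H_1 \mid \text{stat-sig})}{P(H_0 \mid \text{stat-sig})}}_{\text{posterior odds}}
=
\underbrace{\frac{\pi}{1-\pi}}_{\text{prior odds}}
\times
\underbrace{\frac{1-\beta}{\alpha}}_{\text{likelihood ratio}}
\]

A surprising result means low prior odds, so the likelihood ratio must be larger, i.e., the p-value threshold must be lower, for the posterior odds to favor a real effect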


8 of 16

Statistical Power

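Power is the probability of detecting a true effect of size \(\delta\) when one exists. A standard sample-size sketch for comparing two means with variance \(\sigma^2\) (the form the next slide plugs into):

\[
n \;\approx\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}
\;\approx\; \frac{16\,\sigma^2}{\delta^2}
\quad \text{per variant, for } \alpha = 0.05 \text{ and } 80\% \text{ power}
\]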


9 of 16

Example of Power Calculation

  • For a website, you are interested in users purchasing something
  • The conversion rate (user to purchaser) is 3.7%
  • You are interested in ideas that improve the conversion rate by (relative) 10% or more
    • A lower percentage (e.g., 5%) would require more users, so we go aggressive
    • That said, very few ideas generate a 10% improvement to a key metric like this. At Bing, perhaps 1 in 10,000 experiments (but your startup isn’t optimized, so maybe)
  • Plug this into the power formula, and you need >41,642 users in each variant (see the sketch below)
  • GuessTheTest shared this example on 16 Dec 2021, but with…

80 users in each variant, and the result was stat-sig, showing a 337% improvement
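A sketch of both calculations with scipy, under the normal approximation (two-sided alpha = 0.05, 80% power); the slide’s >41,642 is the 16σ²/δ² rule of thumb:

    from scipy.stats import norm

    p = 0.037                  # baseline conversion rate
    delta = 0.10 * p           # minimum relative effect of interest: 10%
    sigma2 = p * (1 - p)       # Bernoulli variance

    # Users needed per variant (rule of thumb; exact z-values give ~40,900)
    print(f"needed per variant: {16 * sigma2 / delta**2:,.0f}")   # >41,642

    # Power actually achieved with only 80 users per variant
    se = (2 * sigma2 / 80) ** 0.5
    power = 1 - norm.cdf(norm.ppf(0.975) - delta / se)
    print(f"power with 80 users/variant: {power:.0%}")            # ~3%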


Do NOT trust experiments with low power


10 of 16

Winner’s Curse

  • A stat-sig result from a low-powered experiment has a high probability of greatly exaggerating the actual effect size (Gelman and Carlin 2014)
  • The power to detect a 10% relative delta in the prior example was 3%
  • With such low power, the False Positive Risk is at least 63%, so a stat-sig result is more likely to be wrong than right (see the sketch below)
  • To trust the result, a p-value threshold of 0.002 should be used. The actual p-value was 0.013 (the paper has the detailed computations)
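The arithmetic behind the 63%, as a sketch, assuming the optimistic 1-in-3 prior success rate from the earlier table:

    alpha_tail, power, pi = 0.025, 0.03, 1 / 3   # pi: prior P(real effect), an assumption
    fpr = alpha_tail * (1 - pi) / (alpha_tail * (1 - pi) + power * pi)
    print(f"FPR = {fpr:.1%}")   # ~63%: a stat-sig result is more likely false than true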


Do NOT trust experiments with low power


11 of 16

Post-hoc Power Calculations are Noisy and Misleading

  • Pre-experiment power allows one to say that the treatment effect is unlikely to be large if we got a non-statistically-significant result (e.g., with 80% probability, it was under 10%)
  • After an experiment is run, some compute the “post-hoc” power based on the observed delta, a terrible idea
  • This is like giving odds on a horse race after seeing the outcome (Greenland)
  • The post-hoc power is simply a one-to-one mapping from the p-value, given alpha (see the sketch below)
    • Anything stat-sig maps to power ≥ 50%
  • Post-hoc power leads to a paradox of power reversal (Hoenig and Heisey)
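A sketch of that mapping for a two-sided z-test: the “observed power” is fully determined by the p-value and alpha, so it adds no information:

    from scipy.stats import norm

    def posthoc_power(p_value, alpha=0.05):
        z_obs = norm.ppf(1 - p_value / 2)      # |z| implied by the two-sided p-value
        z_crit = norm.ppf(1 - alpha / 2)
        return 1 - norm.cdf(z_crit - z_obs)    # ignoring the negligible opposite tail

    for p in [0.20, 0.05, 0.013, 0.001]:
        print(f"p = {p}: post-hoc power = {posthoc_power(p):.0%}")
    # p exactly at alpha always maps to 50%; any stat-sig p maps to power > 50%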


12 of 16

Minimize Data Processing Options

  • Flexibility in data processing increases the false positive rate due to multiple hypothesis testing
  • Examples:
    • Allowing users to stop experiments when stat-sig (as opposed to a fixed horizon)
    • Outlier removal (yes/no, or worse, some %)
    • Selecting segments (e.g., stat-sig for males, or in country X)
  • With additional degrees of freedom, the p-value threshold needs to be adjusted (see the sketch below)
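A sketch of the inflation, assuming for simplicity that the analysis variations are independent tests at alpha = 0.05:

    alpha = 0.05
    for k in [1, 3, 5, 10]:
        # chance that at least one of k analyses is stat-sig under the null
        print(f"{k} analyses: {1 - (1 - alpha) ** k:.0%}")   # 5%, 14%, 23%, 40%
    # hence adjustments such as Bonferroni: test each analysis at alpha / k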


Statistician: you have already calculated the p-value?

Surgeon: yes, I used multinomial logistic regression.

Statistician: Really? How did you come up with that?

Surgeon: I tried each analysis on the statistical software dropdown menus, and that was the one that gave the smallest p-value

-- Andrew Vickers (2009)


13 of 16

Beware of Unequal Variants

  • In theory, a shared control can be larger than the treatments, and you gain statistical power when comparing multiple treatments to the control
  • In practice, we shared cautionary notes:
    • Triggering gets very complicated, with the control having to compute whether the user would have triggered in each of the multiple treatments
    • Cookie churn causes SRMs (Sample Ratio Mismatches) if the variants are of different sizes
    • Shared resources, such as LRU caches, can give performance advantages to a larger variant

  • The paper shares another reason: convergence to a normal distribution is faster when variants are of equal size. Unequal variants caused material over-estimation of the type-I error on one tail and under-estimation on the other (see the simulation sketch below)
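A simulation sketch of the last point, with assumed sizes and a lognormal (skewed) metric: under the null (an A/A test), the 50/50 split rejects close to 2.5% on each tail, while a 90/10 split rejects too often on one tail and too rarely on the other:

    import numpy as np

    rng = np.random.default_rng(42)
    runs, n_total, z_crit = 10_000, 2_000, 1.96   # illustrative sizes, an assumption

    def tail_rejection_rates(n_c, n_t):
        hi = lo = 0
        for _ in range(runs):
            c = rng.lognormal(0, 2, n_c)   # heavily skewed metric, no true effect
            t = rng.lognormal(0, 2, n_t)
            se = np.sqrt(c.var(ddof=1) / n_c + t.var(ddof=1) / n_t)
            z = (t.mean() - c.mean()) / se
            hi += z > z_crit
            lo += z < -z_crit
        return hi / runs, lo / runs

    print("50/50:", tail_rejection_rates(n_total // 2, n_total // 2))
    print("90/10:", tail_rejection_rates(n_total * 9 // 10, n_total // 10))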


14 of 16

Summary (1 of 2)

We shared five intuition busters and made recommendations on how to address the issues

  1. Surprising results require strong evidence: lower p-values. Share the FPR and apply Twyman’s law to surprising results. Do replication runs and combine them for a lower p-value
  2. Experiments with low statistical power are NOT trustworthy. Avoid running underpowered experiments. Do not share “interesting” results from such experiments
  3. Post-hoc power calculations are noisy and misleading. Do not show them. If you see others using them, explain how misleading they are


15 of 16

Summary (2 of 2)

  4. Minimize data processing options in experimentation platforms. Standard processing by default. Optional processing should be specified pre-experiment (or in a replication run)
  5. Beware of unequal variants. Make sure you check off all the concerns before you use unequal variants


16 of 16


Q&A

To learn more about A/B tests and controlled experiments, I teach a 10-hour Zoom class (next one Aug 22). See https://bit.ly/ABClassRKLI

Paper at https://bit.ly/ABTestingIntuitionBusters

These slides at https://bit.ly/ABTestingIntuitionBustersTalk
