1 of 16

Ronny Kohavi, Alex Deng, Lukas Vermeer

A/B Testing Intuition Busters

Common Misunderstandings in Online Controlled Experiments

@RonnyK

© 2022 Ron Kohavi

2 of 16

Motivation

  • “6 hours of debugging can save you 5 minutes of reading documentation” -- Tweet by Jakob Cosoroabă
  • Two weeks of deep-diving into experiment results can save you an hour of reading this paper
  • Make sure you understand p-values and statistical power
  • Be skeptical of many published results


3 of 16

Tough Paper to Write

  • I’ve been involved in A/B tests at Amazon (Weblab), Microsoft (led ExP, the experimentation platform), and Airbnb (ERF). I co-authored a book on experimentation that is usually in the top 10 in Data Mining on Amazon, and I teach a quarterly Zoom class on A/B testing. Alex was at Microsoft ExP and Airbnb. Lukas was director of experimentation at Booking.com and is now at Vista
  • We called out mistakes by multiple vendors and book authors, and we shared (really, shredded) an example from GuessTheTest where several mistakes were made (with the owner’s permission)
  • The meta-reviewer wrote

I would very much encourage the authors to reread it and tone it down in parts… If not done so, the paper would in fact embarrass KDD for years…


4 of 16

P-Values

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof.

-- Greenland et al. (2016)


  • Vendors (e.g., Optimizely) try to hide the complexity by calling 1-(p-value) “confidence,” but it is misleading, as confidence is NOT the probability that the result is a true positive. Documentation is often wrong.
  • Book authors frequently get it wrong (see paper for examples)


5 of 16

What We Want vs. What We Get
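In symbols, a minimal sketch using the standard definitions: what we want is the probability the treatment works given the observed data; what the p-value gives is the probability of data at least this extreme, assuming the treatment does nothing:

\[
\underbrace{P(H_1 \mid \text{data})}_{\text{what we want}}
\;\ne\;
\underbrace{P(\text{data at least this extreme} \mid H_0)}_{\text{what we get: the p-value}}
\]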


 


6 of 16

 


 

Company / Source                    Success Rate    FPR (False Positive Risk)
Microsoft                           33%             5.9%
Avinash Kaushik                     20%             11.1%
Bing                                15%             15.0%
Booking.com, Google Ads, Netflix    10%             22.0%
Airbnb Search                       8%              26.4%
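A sketch of where the FPR column comes from, assuming alpha = 0.05 two-sided (0.025 in the relevant tail), 80% power, and treating the success rate as the prior probability that an idea truly works:

    # False Positive Risk: of the stat-sig results, what fraction are false positives?
    def fpr(success_rate, alpha_tail=0.025, power=0.80):
        pi = success_rate                   # prior P(idea truly works)
        false_pos = alpha_tail * (1 - pi)   # null ideas that cross the threshold
        true_pos = power * pi               # real effects that get detected
        return false_pos / (false_pos + true_pos)

    for source, rate in [("Microsoft", 0.33), ("Avinash Kaushik", 0.20), ("Bing", 0.15),
                         ("Booking.com, Google Ads, Netflix", 0.10), ("Airbnb Search", 0.08)]:
        print(f"{source}: FPR = {fpr(rate):.1%}")   # matches the FPR column above (up to rounding)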


7 of 16

Key Point: Surprising Results Require Strong Evidence – Lower P-values

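A sketch of the reasoning in odds form, assuming a test at level \(\alpha\) with power \(1-\beta\) and prior probability \(\pi\) that the idea truly works:

\[
\underbrace{\frac{P(H_1 \mid \text{stat-sig})}{P(H_0 \mid \text{stat-sig})}}_{\text{posterior odds}}
=
\underbrace{\frac{\pi}{1-\pi}}_{\text{prior odds}}
\times
\underbrace{\frac{1-\beta}{\alpha}}_{\text{likelihood ratio}}
\]

A surprising result means low prior odds, so the likelihood ratio must be larger, i.e., the p-value threshold must be lower, for the posterior odds to favor a real effect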


8 of 16

Statistical Power

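Power is the probability of detecting a true effect of size \(\delta\) when one exists. A standard sample-size sketch for comparing two means with variance \(\sigma^2\) (the form the next slide plugs into):

\[
n \;\approx\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}
\;\approx\; \frac{16\,\sigma^2}{\delta^2}
\quad \text{per variant, for } \alpha = 0.05 \text{ and } 80\% \text{ power}
\]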


9 of 16

Example of Power Calculation

  • For a website, you are interested in users purchasing something
  • The conversion rate (user to purchaser) is 3.7%
  • You are interested in ideas that improve the conversion rate by (relative) 10% or more
    • A lower percentage (e.g., 5%) would require more users, so we go aggressive
    • That said, very few ideas generate a 10% improvement to a key metric like this. At Bing, perhaps 1 in 10,000 experiments (but your startup isn’t optimized, so maybe)
  • Plug this into the power formula, and you need >41,642 users in each variant (see the sketch below)
  • GuessTheTest shared this example on 16 Dec 2021, but with…

80 users in each variant, and the result was stat-sig, showing a 337% improvement
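A sketch of both calculations with scipy, under the normal approximation (two-sided alpha = 0.05, 80% power); the slide’s >41,642 is the 16σ²/δ² rule of thumb:

    from scipy.stats import norm

    p = 0.037                  # baseline conversion rate
    delta = 0.10 * p           # minimum relative effect of interest: 10%
    sigma2 = p * (1 - p)       # Bernoulli variance

    # Users needed per variant (rule of thumb; exact z-values give ~40,900)
    print(f"needed per variant: {16 * sigma2 / delta**2:,.0f}")   # >41,642

    # Power actually achieved with only 80 users per variant
    se = (2 * sigma2 / 80) ** 0.5
    power = 1 - norm.cdf(norm.ppf(0.975) - delta / se)
    print(f"power with 80 users/variant: {power:.0%}")            # ~3%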


Do NOT trust experiments with low power


10 of 16

Winner’s Curse

  • A stat-sig result from a low-powered experiment has a high probability of greatly exaggerating the actual effect size (Gelman and Carlin 2014)
  • The power to detect a 10% relative delta in the prior example was 3%
  • With such low power, the False Positive Risk is at least 63%, so a stat-sig result is more likely to be wrong than right (see the sketch below)
  • To trust the result, a p-value threshold of 0.002 should be used. The actual p-value was 0.013 (the paper has the detailed computations)
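The arithmetic behind the 63%, as a sketch, assuming the optimistic 1-in-3 prior success rate from the earlier table:

    alpha_tail, power, pi = 0.025, 0.03, 1 / 3   # pi: prior P(real effect), an assumption
    fpr = alpha_tail * (1 - pi) / (alpha_tail * (1 - pi) + power * pi)
    print(f"FPR = {fpr:.1%}")   # ~63%: a stat-sig result is more likely false than true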


Do NOT trust experiments with low power


11 of 16

Post-hoc Power Calculations are Noisy and Misleading

  • Pre-experiment power allows one to say that the treatment effect is unlikely to be large if we got a non-statistically-significant result (e.g., with 80% probability, it was under 10%)
  • After an experiment is run, some compute the “post-hoc” power based on the observed delta, a terrible idea
  • This is like giving odds on a horse race after seeing the outcome (Greenland)
  • The post-hoc power is simply a one-to-one mapping from the p-value, given alpha (see the sketch below)
    • Anything stat-sig maps to power ≥ 50%
  • Post-hoc power leads to a paradox of power reversal (Hoenig and Heisey)
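A sketch of that mapping for a two-sided z-test: the “observed power” is fully determined by the p-value and alpha, so it adds no information:

    from scipy.stats import norm

    def posthoc_power(p_value, alpha=0.05):
        z_obs = norm.ppf(1 - p_value / 2)      # |z| implied by the two-sided p-value
        z_crit = norm.ppf(1 - alpha / 2)
        return 1 - norm.cdf(z_crit - z_obs)    # ignoring the negligible opposite tail

    for p in [0.20, 0.05, 0.013, 0.001]:
        print(f"p = {p}: post-hoc power = {posthoc_power(p):.0%}")
    # p exactly at alpha always maps to 50%; any stat-sig p maps to power > 50%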


12 of 16

Minimize Data Processing Options

  • Flexibility in data processing increases the false positive rate due to multiple hypothesis testing
  • Examples:
    • Allowing users to stop experiments when stat-sig (as opposed to a fixed horizon)
    • Outlier removal (yes/no, or worse, some %)
    • Selecting segments (e.g., stat-sig for males, or in country X)
  • With additional degrees of freedom, the p-value threshold needs to be adjusted (see the sketch below)
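A sketch of the inflation, assuming for simplicity that the analysis variations are independent tests at alpha = 0.05:

    alpha = 0.05
    for k in [1, 3, 5, 10]:
        # chance that at least one of k analyses is stat-sig under the null
        print(f"{k} analyses: {1 - (1 - alpha) ** k:.0%}")   # 5%, 14%, 23%, 40%
    # hence adjustments such as Bonferroni: test each analysis at alpha / k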


Statistician: you have already calculated the p-value?

Surgeon: yes, I used multinomial logistic regression.

Statistician: Really? How did you come up with that?

Surgeon: I tried each analysis on the statistical software dropdown menus, and that was the one that gave the smallest p-value

-- Andrew Vickers (2009)


13 of 16

Beware of Unequal Variants

  • In theory, a shared control can be larger than the treatments, and you gain statistical power when comparing multiple treatments to the control
  • In practice, we shared cautionary notes:
    • Triggering gets very complicated, with the control having to compute whether the user would have triggered in each of the multiple treatments
    • Cookie churn causes SRMs (Sample Ratio Mismatches) if the variants are of different sizes
    • Shared resources, such as LRU caches, can give performance advantages to a larger variant

  • The paper shares another reason: convergence to a normal distribution is faster when variants are of equal size. Unequal variants caused material over-estimation of the type-I error on one tail and under-estimation on the other (see the simulation sketch below)
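A simulation sketch of the last point, with assumed sizes and a lognormal (skewed) metric: under the null (an A/A test), the 50/50 split rejects close to 2.5% on each tail, while a 90/10 split rejects too often on one tail and too rarely on the other:

    import numpy as np

    rng = np.random.default_rng(42)
    runs, n_total, z_crit = 10_000, 2_000, 1.96   # illustrative sizes, an assumption

    def tail_rejection_rates(n_c, n_t):
        hi = lo = 0
        for _ in range(runs):
            c = rng.lognormal(0, 2, n_c)   # heavily skewed metric, no true effect
            t = rng.lognormal(0, 2, n_t)
            se = np.sqrt(c.var(ddof=1) / n_c + t.var(ddof=1) / n_t)
            z = (t.mean() - c.mean()) / se
            hi += z > z_crit
            lo += z < -z_crit
        return hi / runs, lo / runs

    print("50/50:", tail_rejection_rates(n_total // 2, n_total // 2))
    print("90/10:", tail_rejection_rates(n_total * 9 // 10, n_total // 10))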


14 of 16

Summary (1 of 2)

We shared five intuition busters and made recommendations on how to address the issues

  1. Surprising results require strong evidence: lower p-values. Share the FPR and apply Twyman’s law to surprising results. Do replication runs and combine them for a lower p-value
  2. Experiments with low statistical power are NOT trustworthy. Avoid running underpowered experiments. Do not share “interesting” results from such experiments
  3. Post-hoc power calculations are noisy and misleading. Do not show them. If you see others using them, explain how misleading they are


15 of 16

Summary (2 of 2)

  4. Minimize data processing options in experimentation platforms. Standard processing by default. Optional processing should be specified pre-experiment (or in a replication run)
  5. Beware of unequal variants. Make sure you check off all the concerns before you use unequal variants


16 of 16


Q&A

To learn more about A/B tests and controlled experiments, I teach a 10-hour Zoom class (next one Aug 22). See https://bit.ly/ABClassRKLI

Paper at https://bit.ly/ABTestingIntuitionBusters

These slides at https://bit.ly/ABTestingIntuitionBustersTalk
