1 of 31

Lecture 14

A/B Testing

Summer 2023

2 of 31

Meme Monday

3 of 31

Announcements

  • The regular midterm is 10am-12pm on Friday 7/14. See pinned Ed posts
    • Scope: Everything through Chapter 12 (including A/B testing)
  • See the midterm prep posts on Ed for practice questions
  • If you did not fill out the Google form but need accommodations, email data8@berkeley.edu ASAP. No guarantees can be made.
  • Deadlines
    • HW 6 due tomorrow (EC tonight)
    • Lab 6 due today

4 of 31

Announcements

  • Assignment Grades Released
    • Lab 3, Lab 4, and HW 3
    • Regrades due 7/12 EOD
      • Written work regrades via Gradescope
      • Autograder regrades via Lab GSI
        • If you are a self-service student, reach out to Ethan (our grading lead)
    • Solutions are shared with you on bDrive

5 of 31

Weekly Goals

  • Last week
    • Assessing the consistency of the data and a model
    • Comparing two categorical distributions
    • Terminology and general method
  • Today
    • Probability of error
    • A/B tests: decisions based on comparing two random samples
  • Ahead
    • Using A/B tests to establish causality
    • Examples/Midterm Review

6 of 31

Review

P-Values

7 of 31

(In)consistency Based on Tail Area

  • Start at the observed value of the test statistic
  • Look in the direction that favors the alternative hypothesis
    • If that tail is small, the data are not consistent with the null
    • If not, the data are consistent with the null

Testing whether or not Mendel’s model is good:

  • Large values of the distance favor the alternative
  • So start at the observed distance and look to the right

8 of 31

Conventions About Inconsistency

  • “Inconsistent”: The test statistic is in the tail of the empirical distribution under the null hypothesis

  • “In the tail,” first convention:
    • The area in the tail is less than 5%
    • The result is “statistically significant”

  • “In the tail,” second convention:
    • The area in the tail is less than 1%
    • The result is “highly statistically significant”

9 of 31

Definition of the p-value

The p-value is the chance,

  • under the null hypothesis,
  • that the test statistic
  • is equal to the value that was observed in the data
  • or is even further in the direction of the alternative.

Fair, or biased towards tails?

  • The gold area approximates the p-value
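
The definition above can be turned into a short computation. The following is a minimal sketch, assuming the statistic has already been simulated many times under the null; the function name `empirical_p_value`, its `direction` parameter, and the numbers are all made up for illustration.

```python
def empirical_p_value(simulated_stats, observed_stat, direction="right"):
    """Proportion of simulated statistics at least as extreme as the observed one.

    direction="right": large values favor the alternative.
    direction="left": small values favor the alternative.
    """
    if direction == "right":
        extreme = [s for s in simulated_stats if s >= observed_stat]
    elif direction == "left":
        extreme = [s for s in simulated_stats if s <= observed_stat]
    else:
        raise ValueError("direction must be 'right' or 'left'")
    return len(extreme) / len(simulated_stats)


# Made-up simulated statistics, for illustration only.
simulated = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.01, 0.04, 0.02, 0.06]

# Three values (0.05, 0.08, 0.06) are at or beyond the observed 0.05.
p_right = empirical_p_value(simulated, 0.05)
print(p_right)  # 0.3
```

The comparison uses `>=` (or `<=`) rather than a strict inequality because the definition includes the case where the statistic is equal to the observed value.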

10 of 31

The Cutoff as an Error Probability

11 of 31

Can the Conclusion be Wrong?

Yes.

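  • Null is true
    • Data consistent with the null: correct conclusion
    • Data point to the alternative: error
  • Alternative is true
    • Data consistent with the null: error
    • Data point to the alternative: correct conclusion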

12 of 31

An Error Probability

  • The cutoff for the p-value is an error probability.

  • If:
    • your cutoff is 5%
    • and the null hypothesis happens to be true

  • then there is about a 5% chance that your test will reject the null hypothesis.
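
One way to see this claim is to simulate it. The sketch below is an illustration under made-up assumptions, not an example from the lecture: the null model is "the data are uniform random numbers between 0 and 1", the statistic is the distance between the sample mean and 0.5, and the helper names are hypothetical. It repeatedly generates data for which the null is true and counts how often a 5% cutoff rejects.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable


def statistic(sample):
    """Distance between the sample mean and 0.5, the mean under the null."""
    return abs(sum(sample) / len(sample) - 0.5)


def draw_under_null(n=20):
    """One sample of size n from the null model: uniform numbers on [0, 1)."""
    return [random.random() for _ in range(n)]


# Empirical distribution of the statistic under the null.
null_stats = [statistic(draw_under_null()) for _ in range(1000)]


def p_value(observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    return sum(s >= observed for s in null_stats) / len(null_stats)


# Generate fresh data sets for which the null is true, and test each one.
trials = 1000
rejections = sum(p_value(statistic(draw_under_null())) < 0.05 for _ in range(trials))
rejection_rate = rejections / trials
print(rejection_rate)  # should land near 0.05; the exact value depends on the seed
```

Every rejection counted here is a wrong conclusion, because the data were generated with the null being true.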

13 of 31

Decision Rule and Error Probability

If you test Mendel’s model using a 5% cutoff for the p-value,

for which values of the statistic will you reject the model?

[Figure: histogram of the statistic. The 5% tail is marked “Reject the model”; the remaining area is marked “Consistent with the model”.]
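
The question on this slide can also be answered numerically. The following is a rough sketch, assuming large values of the statistic favor the alternative; `rejection_threshold` is a hypothetical helper and the simulated values are made up so the arithmetic is easy to check by hand.

```python
def rejection_threshold(simulated_stats, cutoff=0.05):
    """Smallest simulated value whose right-tail proportion is below `cutoff`."""
    ordered = sorted(simulated_stats)
    n = len(ordered)
    for value in ordered:
        # Proportion of simulated statistics at or beyond this value.
        tail = sum(s >= value for s in ordered) / n
        if tail < cutoff:
            return value
    return None  # no simulated value has a tail that small


# Made-up "simulated statistics": the whole numbers 1 through 100.
simulated = list(range(1, 101))
threshold = rejection_threshold(simulated)
print(threshold)  # 97: only 4 of the 100 values are at or above it
```

An observed statistic at or above the returned value has an empirical p-value below the cutoff, so the model would be rejected.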

14 of 31

Origin of the Conventions

15 of 31

Sir Ronald Fisher, 1890-1962

“We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.”

16 of 31

Fisher’s Personal Preference

“It is convenient to take this point [5%] as a limit in judging whether a deviation is to be considered significant or not.”

–– Statistical Methods for Research Workers, 1925

“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 percent point), or one in a hundred (the 1 percent point). Personally, the author prefers to set a low standard of significance at the 5 percent point …” –– 1926

17 of 31

A/B Testing

18 of 31

Comparing Two Samples

  • Compare values of sampled individuals in Group A with values of sampled individuals in Group B.

  • Question: Do the two sets of values come from the same underlying distribution?

  • Answering this question by performing a statistical test is called A/B testing.

(Demo)

19 of 31

The Groups and the Question

  • Random sample of mothers of newborns. Compare:
    • (A) Birth weights of babies of mothers who didn’t smoke during pregnancy
    • (B) Birth weights of babies of mothers who smoked during pregnancy

  • Question: Could the difference be due to chance alone?

20 of 31

Hypotheses

  • Null:
    • In the population, the distributions of the birth weights of the babies in the two groups are the same. (They are different in the sample just due to chance.)
  • Alternative:
    • In the population, the babies of the mothers who smoked weigh less, on average, than the babies of the non-smokers.

21 of 31

Test Statistic

  • Group A: non-smokers
  • Group B: smokers

  • Statistic: Difference between average weights

Group B average - Group A average

  • Negative values of this statistic favor the alternative
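
As a minimal sketch of this statistic in plain Python (the lecture demo itself is not reproduced here), the function below computes the Group B average minus the Group A average. The function name and the five data values are made up for illustration.

```python
def difference_of_means(weights, labels, group_a="Non-smoker", group_b="Smoker"):
    """Group B average minus Group A average."""
    a = [w for w, lab in zip(weights, labels) if lab == group_a]
    b = [w for w, lab in zip(weights, labels) if lab == group_b]
    return sum(b) / len(b) - sum(a) / len(a)


# Made-up birth weights in ounces, for illustration only.
weights = [120, 130, 125, 110, 116]
labels = ["Non-smoker", "Non-smoker", "Non-smoker", "Smoker", "Smoker"]

observed = difference_of_means(weights, labels)
print(observed)  # -12.0 (smoker average 113.0 minus non-smoker average 125.0)
```

The result is negative because the smokers' babies in this made-up sample weigh less on average, which is the direction that favors the alternative.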

(Demo)

22 of 31

The Data

...
Non-smoker    120 oz
Non-smoker    113 oz
Smoker        128 oz
Non-smoker    117 oz
Smoker        108 oz
...

23 of 31

Detour: Sampling


24 of 31

Shuffling Labels Under the Null

...
Non-smoker    120 oz
Non-smoker    113 oz
Smoker        128 oz
Smoker        117 oz
Smoker        108 oz
...

25 of 31

Shuffling Rows

26 of 31

Random Permutation

  • tbl.sample(n)
    • Table of n rows picked randomly with replacement
  • tbl.sample()
    • Table with same number of rows as original tbl, picked randomly with replacement
  • tbl.sample(n, with_replacement = False)
    • Table of n rows picked randomly without replacement
  • tbl.sample(with_replacement = False)
    • All rows of tbl, in random order
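
For readers without the course's Table library at hand, the four cases above have rough analogues in Python's standard `random` module. This is only an analogy, assuming a plain list stands in for the rows of `tbl`; it is not how `tbl.sample` is implemented.

```python
import random

rows = ["row0", "row1", "row2", "row3", "row4"]  # stand-in for the rows of tbl
n = 3

# n rows picked randomly with replacement, like tbl.sample(n)
with_repl_n = random.choices(rows, k=n)

# As many rows as the original, with replacement, like tbl.sample()
with_repl_all = random.choices(rows, k=len(rows))

# n rows picked randomly without replacement,
# like tbl.sample(n, with_replacement = False)
without_repl_n = random.sample(rows, n)

# All rows in random order, like tbl.sample(with_replacement = False)
permutation = random.sample(rows, len(rows))

print(with_repl_n, with_repl_all, without_repl_n, permutation, sep="\n")
```

Only the last case is guaranteed to contain every original row exactly once, which is why it is the one used for shuffling.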

(Demo)

27 of 31

Simulating Under the Null

  • If the null is true, all rearrangements of labels are equally likely
  • Plan:
    • Shuffle all group labels
    • Assign each shuffled label to a birth weight
    • Find the difference between the averages of the two shuffled groups
    • Repeat
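
The plan above can be sketched in plain Python. This is an illustration under stated assumptions rather than the lecture demo: the data are made up, the helper names are hypothetical, and lists stand in for table columns.

```python
import random


def difference_of_means(weights, labels):
    """Smoker average minus non-smoker average."""
    smokers = [w for w, lab in zip(weights, labels) if lab == "Smoker"]
    others = [w for w, lab in zip(weights, labels) if lab == "Non-smoker"]
    return sum(smokers) / len(smokers) - sum(others) / len(others)


def simulate_differences(weights, labels, repetitions, rng):
    """Differences of means obtained by shuffling the labels `repetitions` times."""
    differences = []
    for _ in range(repetitions):
        # Shuffle all group labels; position i is then paired with weights[i].
        shuffled = rng.sample(labels, len(labels))
        differences.append(difference_of_means(weights, shuffled))
    return differences


# Made-up birth weights in ounces, for illustration only.
weights = [120, 130, 125, 122, 128, 110, 116, 112, 119, 105]
labels = ["Non-smoker"] * 5 + ["Smoker"] * 5

rng = random.Random(0)  # arbitrary seed so the sketch is repeatable
observed = difference_of_means(weights, labels)
simulated = simulate_differences(weights, labels, 2000, rng)

# Negative values favor the alternative, so look to the left of the observed value.
p_value = sum(d <= observed for d in simulated) / len(simulated)
print(observed, p_value)
```

Shuffling the labels while leaving the weights in place keeps the group sizes fixed, so each repetition compares two groups of the same sizes as in the original sample.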

(Demo)

28 of 31

When to use A/B testing?

What can we conclude from the A/B test we just conducted?


29 of 31

Hypothesis Testing Review

  • 1 Sample: One Category (e.g. percent of flowers that are purple)
    • Test Statistic: observed_proportion, abs(observed_proportion - null_proportion)
    • How to Simulate: sample_proportions(n, null_dist)
  • 1 Sample: More Than 2 Categories (e.g. office hours attendance)
    • Test Statistic: tvd(observed_dist, null_dist)
    • How to Simulate: sample_proportions(n, null_dist)
  • 1 Sample: Numerical Data (e.g. scores in a lab section)
    • Test Statistic: observed_mean, abs(observed_mean - null_mean)
    • How to Simulate: population_data.sample(n, with_replacement=False)
  • 2 Samples: Underlying Values (e.g. birth weights of smokers vs. non-smokers)
    • Test Statistic: group_a_mean - group_b_mean, group_b_mean - group_a_mean, abs(group_a_mean - group_b_mean)
    • How to Simulate: observed_data.sample(with_replacement=False)
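
The `tvd` statistic named in the review is the total variation distance. The slides do not show its definition, so the following is a minimal sketch that assumes the usual one (half the sum of the absolute differences between two distributions); the example proportions are made up.

```python
def tvd(dist_a, dist_b):
    """Total variation distance between two distributions.

    Each argument is a list of proportions over the same categories,
    in the same order, and each list sums to 1.
    """
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b)) / 2


# Made-up distributions over three categories, for illustration only.
observed_dist = [0.5, 0.3, 0.2]
null_dist = [0.4, 0.4, 0.2]

distance = tvd(observed_dist, null_dist)
print(distance)  # approximately 0.1
```

Large values of this distance favor the alternative, so the p-value is computed from the right-hand tail.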

30 of 31

Where is A/B testing used in other applications?

31 of 31

A/B Testing Applications

  • A/B testing is widely used across many fields
  • Example use cases:
    • Medical Trials
    • User Interface/Experience Design
    • Quality Assurance
  • A/B tests are often used to test for causality, but a causal relationship can be established only under the right circumstances
    • More on that tomorrow