Lecture 14

A/B Testing

Summer 2023

Announcements

The regular midterm is 10am-12pm on Friday 7/14. See the pinned Ed posts.

Scope: Everything through Chapter 12 (including A/B testing)

Midterm prep posts on Ed for practice questions
If you did not fill out the Google form but need accommodations, email data8@berkeley.edu ASAP. No guarantees can be made.
Deadlines

HW 6 due tomorrow (EC tonight)
Lab 6 due today

Announcements

Assignment Grades Released

Lab 3, Lab 4, and HW 3
Regrades due 7/12 EOD

Written work regrades via Gradescope
Autograder regrades via Lab GSI

If you are a self-service student, reach out to Ethan (our grading lead)

Solutions are shared with you on bDrive

Weekly Goals

Last week

Assessing the consistency of the data and a model
Comparing two categorical distributions
Terminology and general method

Today

Probability of error
A/B tests: decisions based on comparing two random samples

Ahead

Using A/B tests to establish causality
Examples/Midterm Review

Review

P-Values

(In)consistency Based on Tail Area

Start at the observed value of the test statistic
Look in the direction that favors the alternative hypothesis

If that tail is small, the data are not consistent with the null
If not, the data are consistent with the null

Testing whether or not Mendel’s model is good:

Large values of the distance favor the alternative
So start at the observed distance and look to the right

Conventions About Inconsistency

“Inconsistent”: The test statistic is in the tail of the empirical distribution under the null hypothesis

“In the tail,” first convention:

The area in the tail is less than 5%
The result is “statistically significant”

“In the tail,” second convention:

The area in the tail is less than 1%
The result is “highly statistically significant”

Definition of the p-value

The p-value is the chance,

under the null hypothesis,
that the test statistic
is equal to the value that was observed in the data
or is even further in the direction of the alternative.

Fair, or biased towards tails?

The gold area approximates the p-value
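As a minimal sketch of the definition above, the p-value can be approximated by the proportion of simulated statistics that are at least as extreme as the observed one. The function name `empirical_p_value` and the simulated values are hypothetical, chosen only for illustration.

```python
def empirical_p_value(simulated_stats, observed_stat, direction="right"):
    """Proportion of simulated statistics equal to the observed value
    or even further in the direction of the alternative."""
    if direction == "right":
        extreme = [s for s in simulated_stats if s >= observed_stat]
    elif direction == "left":
        extreme = [s for s in simulated_stats if s <= observed_stat]
    else:
        raise ValueError("direction must be 'right' or 'left'")
    return len(extreme) / len(simulated_stats)

# Hypothetical statistics simulated under the null (not real data)
simulated = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.2, 0.6, 0.8]

# Large values favor the alternative, so look to the right of the observed value
p = empirical_p_value(simulated, 0.7, direction="right")
print(p)  # 3 of the 10 simulated values are >= 0.7, so this prints 0.3
```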

The Cutoff as an Error Probability

Can the Conclusion be Wrong?

Yes.

	Data consistent with the null	Data point to the alternative
Null is true	✅	❌
Alternative is true	❌	✅

An Error Probability

The cutoff for the p-value is an error probability.

If:

your cutoff is 5%
and the null hypothesis happens to be true

then there is about a 5% chance that your test will reject the null hypothesis.
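This claim can be checked by simulation. The sketch below assumes a made-up null model (30 draws from Uniform(0, 1)) and a made-up statistic (distance of the sample mean from 0.5). It generates many datasets for which the null is true and counts how often a test with a 5% cutoff rejects anyway. All names and parameters here are illustrative assumptions.

```python
import random

random.seed(0)

def statistic(sample):
    """Distance between the sample mean and 0.5, the mean under the assumed null."""
    return abs(sum(sample) / len(sample) - 0.5)

def draw_under_null(n=30):
    """One sample of n values from Uniform(0, 1), the assumed null model."""
    return [random.random() for _ in range(n)]

# Empirical distribution of the statistic under the null
null_stats = [statistic(draw_under_null()) for _ in range(2000)]

def p_value(observed):
    """Right-tail proportion: large distances favor the alternative."""
    return sum(s >= observed for s in null_stats) / len(null_stats)

# Run many tests on data for which the null really is true
trials = 2000
rejections = sum(p_value(statistic(draw_under_null())) < 0.05 for _ in range(trials))
rejection_rate = rejections / trials
print(rejection_rate)  # expected to land near 0.05
```

Changing the cutoff in the comparison to 0.01 should move the rejection rate to roughly 1%, which is the sense in which the cutoff is an error probability.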

Decision Rule and Error Probability

If you test Mendel’s model using a 5% cutoff for the p-value,

for which values of the statistic will you reject the model?

[Figure labels: "Reject the model" and "Consistent with the model"]
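One way to answer the question above is to find the smallest simulated value of the statistic whose right-tail area falls below the cutoff. Any observed statistic at or beyond that value leads to rejecting the model. This is a sketch under the assumption that large values favor the alternative. `rejection_cutoff` and the simulated values are hypothetical.

```python
def rejection_cutoff(simulated_stats, cutoff=0.05):
    """Smallest simulated value whose right-tail proportion is below `cutoff`.

    Returns None if no simulated value has a small enough tail.
    """
    ordered = sorted(simulated_stats)
    n = len(ordered)
    for value in ordered:
        tail = sum(s >= value for s in ordered) / n
        if tail < cutoff:
            return value
    return None

# Hypothetical simulated distances under the null: 1, 2, ..., 100
simulated_distances = list(range(1, 101))
threshold = rejection_cutoff(simulated_distances, cutoff=0.05)
print(threshold)  # 97: only 4% of the simulated values are >= 97
```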

Origin of the Conventions

Sir Ronald Fisher, 1890-1962

“We have the duty of formulating, of summarizing, and of communicating our conclusions, in intelligible form, in recognition of the right of other free minds to utilize them in making their own decisions.”

Fisher’s Personal Preference

“It is convenient to take this point [5%] as a limit in judging whether a deviation is to be considered significant or not.”

–– Statistical Methods for Research Workers, 1925

“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 percent point), or one in a hundred (the 1 percent point). Personally, the author prefers to set a low standard of significance at the 5 percent point …” –– 1926

A/B Testing

Comparing Two Samples

Compare values of sampled individuals in Group A with values of sampled individuals in Group B.

Question: Do the two sets of values come from the same underlying distribution?

Answering this question by performing a statistical test is called A/B testing.

(Demo)

The Groups and the Question

Random sample of mothers of newborns. Compare:

(A) Birth weights of babies of mothers who didn’t smoke
(B) Birth weights of babies of mothers who smoked during pregnancy

Question: Could the difference be due to chance alone?

Hypotheses

Null:

In the population, the distributions of the birth weights of the babies in the two groups are the same. (They are different in the sample just due to chance.)

Alternative:

In the population, the babies of the mothers who smoked weigh less, on average, than the babies of the non-smokers.

Test Statistic

Group A: non-smokers
Group B: smokers

Statistic: Difference between average weights

Group B average - Group A average

Negative values of this statistic favor the alternative
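A minimal sketch of this statistic, using only the standard library rather than the course's Table methods. The weights and labels below are hypothetical values for illustration, not the lecture's dataset, and `difference_of_means` is an illustrative name.

```python
from statistics import mean

def difference_of_means(weights, labels, group_a="Non-smoker", group_b="Smoker"):
    """Group B average minus Group A average."""
    a = [w for w, label in zip(weights, labels) if label == group_a]
    b = [w for w, label in zip(weights, labels) if label == group_b]
    return mean(b) - mean(a)

# Hypothetical birth weights in ounces, with their group labels
weights = [120, 113, 128, 117, 124, 109]
labels = ["Non-smoker", "Smoker", "Non-smoker", "Smoker", "Non-smoker", "Smoker"]

observed = difference_of_means(weights, labels)
print(observed)  # smokers average 113, non-smokers average 124, so this is -11
```

A negative value here means the smokers' babies in the sample weigh less on average, which is the direction of the alternative.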

(Demo)

The Data

[Figure: sample rows, each pairing a label (Non-smoker or Smoker) with a birth weight: 120 oz, 113 oz, 128 oz, 117 oz, 108 oz, ...]

Detour: Sampling

Null:

In the population, the distributions of the birth weights of the babies in the two groups are the same. (They are different in the sample just due to chance.)

Alternative:

In the population, the babies of the mothers who smoked weigh less, on average, than the babies of the non-smokers.

Shuffling Labels Under the Null

[Figure: the birth weights (120 oz, 113 oz, 128 oz, 117 oz, 108 oz, ...) with the Smoker/Non-smoker labels shuffled among the rows]

Shuffling Rows

Random Permutation

tbl.sample(n)

Table of n rows picked randomly with replacement

tbl.sample()

Table with same number of rows as original tbl, picked randomly with replacement

tbl.sample(n, with_replacement=False)

Table of n rows picked randomly without replacement

tbl.sample(with_replacement=False)

All rows of tbl, in random order
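The four sampling modes above can be mimicked on a plain list with Python's standard `random` module. This is only a rough analogue for building intuition, not the Table implementation. The `rows` list is a hypothetical stand-in for the rows of `tbl`.

```python
import random

random.seed(1)

# Hypothetical stand-in for the rows of a table
rows = ["r0", "r1", "r2", "r3", "r4"]

# Like tbl.sample(3): 3 rows with replacement (repeats are possible)
with_repl_n = random.choices(rows, k=3)

# Like tbl.sample(): as many rows as the original, with replacement
with_repl_all = random.choices(rows, k=len(rows))

# Like tbl.sample(3, with_replacement=False): 3 distinct rows
without_repl_n = random.sample(rows, k=3)

# Like tbl.sample(with_replacement=False): every row once, in random order
permutation = random.sample(rows, k=len(rows))

print(with_repl_n, with_repl_all, without_repl_n, permutation)
```

The last mode is the one A/B testing relies on: it is a random permutation, so every original row appears exactly once.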

(Demo)

Simulating Under the Null

If the null is true, all rearrangements of labels are equally likely
Plan:

Shuffle all group labels
Assign each shuffled label to a birth weight
Find the difference between the averages of the two shuffled groups
Repeat
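The plan above can be sketched end to end with the standard library. The data are hypothetical (six babies per group, weights in ounces) and the function names are illustrative. The in-class demo uses Table methods instead.

```python
import random
from statistics import mean

random.seed(42)

def diff_of_means(weights, labels):
    """Smoker average minus non-smoker average."""
    smokers = [w for w, label in zip(weights, labels) if label == "Smoker"]
    non_smokers = [w for w, label in zip(weights, labels) if label == "Non-smoker"]
    return mean(smokers) - mean(non_smokers)

def simulate_one(weights, labels):
    """Shuffle all labels, reassign them to the weights, recompute the statistic."""
    shuffled = random.sample(labels, k=len(labels))  # random permutation of labels
    return diff_of_means(weights, shuffled)

# Hypothetical data, not the lecture's dataset
weights = [120, 128, 125, 131, 118, 127, 113, 117, 108, 121, 110, 115]
labels = ["Non-smoker"] * 6 + ["Smoker"] * 6

observed = diff_of_means(weights, labels)

# Repeat the shuffle many times to build the distribution under the null
simulated = [simulate_one(weights, labels) for _ in range(5000)]

# Negative values favor the alternative, so look to the left of the observed value
p_value = sum(s <= observed for s in simulated) / len(simulated)
print(observed, p_value)
```

Shuffling the labels while leaving the weights in place is what makes every rearrangement of labels equally likely, which is exactly the null hypothesis.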

(Demo)

When to use A/B testing?

What can we conclude from the A/B test we just conducted?

Hypothesis Testing Review

1 Sample: One Category (e.g. percent of flowers that are purple)

Test Statistic: observed_proportion, abs(observed_proportion - null_proportion)
How to Simulate: sample_proportions(n, null_dist)

1 Sample: More Than 2 Categories (e.g. office hours attendance)

Test Statistic: tvd(observed_dist, null_dist)
How to Simulate: sample_proportions(n, null_dist)

1 Sample: Numerical Data (e.g. scores in a lab section)

Test Statistic: observed_mean, abs(observed_mean - null_mean)
How to Simulate: population_data.sample(n, with_replacement=False)

2 Samples: Underlying Values (e.g. birth weights of smokers vs. non-smokers)

Test Statistic: group_a_mean - group_b_mean, group_b_mean - group_a_mean, abs(group_a_mean - group_b_mean)

How to Simulate: observed_data.sample(with_replacement=False)
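For the categorical cases in this review, the two helpers can be approximated with the standard library. This is a sketch: `tvd` follows the usual total variation distance definition, and `simulate_proportions` is a hypothetical stand-in for `sample_proportions(n, null_dist)`, not the library's own code. The distributions are made up.

```python
import random

random.seed(7)

def tvd(dist1, dist2):
    """Total variation distance: half the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(dist1, dist2)) / 2

def simulate_proportions(n, null_dist):
    """Draw n individuals from the categories in null_dist and
    return the proportion that fell in each category."""
    categories = range(len(null_dist))
    draws = random.choices(categories, weights=null_dist, k=n)
    return [draws.count(c) / n for c in categories]

# Hypothetical null distribution over three categories
null_dist = [0.5, 0.3, 0.2]

# One simulated value of the test statistic under the null
simulated_dist = simulate_proportions(200, null_dist)
one_simulated_tvd = tvd(simulated_dist, null_dist)
print(simulated_dist, one_simulated_tvd)
```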

Where is A/B testing used in other applications?

A/B Testing Applications

A/B testing is widely used across many fields
Example use cases:

Medical Trials
User Interface/Experience Design
Quality Assurance

A/B tests are often used to test causality, but a causal relationship can be established only under the right circumstances

More on that tomorrow