1 of 25

Lecture 19

P-values; A/B Testing

DATA 8

Fall 2018

Slides created by John DeNero (denero@berkeley.edu) and Ani Adhikari (adhikari@berkeley.edu)

2 of 25

Announcements

3 of 25

Statistical Significance

4 of 25

The GSI’s Defense

GSI’s position (Null Hypothesis):

If we had picked my section at random from the whole class, we could have got an average like this one.

Alternative:

No, the average score is too low. Randomness is not the only reason for the low scores.

5 of 25

Quantifying Conclusions

How big a coincidence would you have to accept, to believe the null hypothesis?

(Figure: empirical distribution for evaluating the GSI's defense hypothesis. The marked tail area is how big a coincidence you would have to accept.)

7 of 25

Conventions About Inconsistency

“Inconsistent”: The observed test statistic is in the tail of the empirical distribution under the null hypothesis

“In the tail,” first convention:

The area in the tail is less than 5%
The result is “statistically significant”

“In the tail,” second convention:

The area in the tail is less than 1%
The result is “highly statistically significant”

(Demo)

8 of 25

Definition of the P-value

Formal name: observed significance level

The P-value is the chance,

under the null hypothesis,
that the test statistic
is equal to the value that was observed in the data
or is even further in the direction of the alternative.
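As a minimal sketch of this definition, the P-value can be approximated from simulated statistics as the fraction that are at least as far in the direction of the alternative as the observed value. The function name `empirical_p_value` and the toy list below are assumptions for illustration, and the sketch assumes that large values of the statistic favor the alternative.

```python
def empirical_p_value(simulated_stats, observed_stat):
    """Fraction of simulated statistics at least as large as the observed one.

    Assumes large values of the statistic point toward the alternative.
    """
    count = sum(1 for s in simulated_stats if s >= observed_stat)
    return count / len(simulated_stats)


# Hypothetical statistics simulated under the null, for illustration only.
simulated = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
p = empirical_p_value(simulated, 4)
print(p)  # 0.2, since 2 of the 10 simulated values are >= 4
```

If small values favored the alternative instead, the comparison would flip to `<=`.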

9 of 25

Quantifying Conclusions

P(the test statistic would be equal to or more extreme than the observed test statistic under the null hypothesis)

(Figure: empirical distribution for evaluating Mendel's pea flower hypothesis. The marked tail area is, approximately, the P-value.)

10 of 25

An Error Probability

11 of 25

Can the Conclusion be Wrong?

Yes.

	Null is true	Alternative is true
Test rejects the null	❌	✅
Test doesn’t reject the null	✅	❌

12 of 25

An Error Probability

The cutoff for the P-value is an error probability.

If your cutoff is 5%
and the null hypothesis happens to be true

then there is about a 5% chance that your test will reject the null hypothesis.
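The claim above can be checked by simulation. This is a rough sketch under assumed settings, not the lecture's demo: the null model (means of 30 standard normal draws), the function names, and the repetition counts are all chosen for illustration. Every experiment is generated with the null true, so each rejection is an error.

```python
import bisect
import random


def sample_mean(rng, n=30):
    """Test statistic: the mean of n draws from the assumed null model."""
    return sum(rng.gauss(0, 1) for _ in range(n)) / n


def rejection_rate(cutoff=0.05, null_sims=2000, experiments=1000, seed=0):
    rng = random.Random(seed)
    # Empirical distribution of the statistic under the null, sorted for lookup.
    null_stats = sorted(sample_mean(rng) for _ in range(null_sims))
    rejections = 0
    for _ in range(experiments):
        observed = sample_mean(rng)  # the null really is true here
        # P-value: fraction of null statistics >= the observed statistic.
        at_least = null_sims - bisect.bisect_left(null_stats, observed)
        p_value = at_least / null_sims
        if p_value < cutoff:
            rejections += 1
    return rejections / experiments


rate = rejection_rate()
print(rate)  # expected to be close to 0.05; the exact value depends on the seed
```

Changing `cutoff` to 0.01 should bring the rejection rate down to roughly 1%.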

(Demo)

13 of 25

Origin of the Conventions

14 of 25

Sir Ronald Fisher, 1890-1962

15 of 25

Sir Ronald Fisher, 1925

“It is convenient to take this point [5%] as a limit in judging whether a deviation is to be considered significant or not.”

–– Statistical Methods for Research Workers

16 of 25

Sir Ronald Fisher, 1926

“If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 percent point), or one in a hundred (the 1 percent point). Personally, the author prefers to set a low standard of significance at the 5 percent point …”

17 of 25

Review / P-values

18 of 25

A/B Testing

19 of 25

Comparing Two Samples

Compare values of sampled individuals in Group A with values of sampled individuals in Group B.

Question: Do the two sets of values come from the same underlying distribution?

Answering this question by performing a statistical test is called A/B testing.

(Demo)

20 of 25

The Groups and the Question

Random sample of mothers of newborns. Compare:

(A) Birth weights of babies of mothers who smoked during pregnancy
(B) Birth weights of babies of mothers who didn’t smoke

Question: Could the difference be due to chance alone?

21 of 25

Hypotheses

Null:

In the population, the distributions of the birth weights of the babies in the two groups are the same. (They are different in the sample just due to chance.)

Alternative:

In the population, the babies of the mothers who smoked weighed less, on average, than the babies of the non-smokers.

22 of 25

Test Statistic

Group A: smokers
Group B: non-smokers

Statistic: Difference between average weights

Group B average - Group A average

Large values of this statistic favor the alternative
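A minimal sketch of this statistic follows. The function name and the weights are hypothetical, chosen only to show the direction of the subtraction.

```python
from statistics import mean


def difference_of_means(weights_a, weights_b):
    """Group B average minus Group A average."""
    return mean(weights_b) - mean(weights_a)


# Hypothetical birth weights in ounces, for illustration only.
smokers = [110, 114]       # Group A
non_smokers = [120, 128]   # Group B
diff = difference_of_means(smokers, non_smokers)
print(diff)  # 12
```

Subtracting in this order means a positive difference corresponds to lighter babies in Group A, which is the direction of the alternative.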

23 of 25

Simulating Under the Null

(Figure: a column of birth weights, including 120 oz, 113 oz, 128 oz, 108 oz, and 136 oz, each labeled Smoker or Non-smoker.)

25 of 25

Simulating Under the Null

If the null is true, all rearrangements of the birth weights among the two groups are equally likely.

Plan:

Shuffle all the birth weights
Assign some to “Group A” and the rest to “Group B”, maintaining the two sample sizes
Find the difference between the averages of the two shuffled groups
Repeat
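The plan above can be sketched in plain Python as follows. The function name `permutation_test`, the number of repetitions, and the birth weights are assumptions for illustration, not the data or code from the demo.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, repetitions=5000, seed=0):
    """Return the observed (B average - A average) and its empirical P-value."""
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)  # keep the two sample sizes fixed
    simulated = []
    for _ in range(repetitions):
        rng.shuffle(pooled)                    # shuffle all the birth weights
        shuffled_a = pooled[:n_a]              # first n_a go to "Group A"
        shuffled_b = pooled[n_a:]              # the rest go to "Group B"
        simulated.append(mean(shuffled_b) - mean(shuffled_a))
    # Large values favor the alternative, so count statistics >= observed.
    p_value = sum(1 for s in simulated if s >= observed) / repetitions
    return observed, p_value


# Hypothetical birth weights in ounces, for illustration only.
smokers = [108, 113, 110, 105, 116]        # Group A
non_smokers = [120, 128, 136, 118, 125]    # Group B
observed, p_value = permutation_test(smokers, non_smokers)
print(observed, p_value)
```

With these made-up values every non-smoker weight exceeds every smoker weight, so very few shuffles match the observed difference and the empirical P-value comes out small.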

(Demo)