1 of 27

Hypothesis Testing

Advanced High School Statistics

Slides developed by Mine Çetinkaya-Rundel of OpenIntro, modified by Leah Dorazio for use with AHSS.

The slides may be copied, edited, and/or shared via the CC BY-SA license

Some images may be included under fair use guidelines (educational purposes)

2 of 27

Remember when...

p̂_males = 21 / 24 = 0.88

p̂_females = 14 / 24 = 0.58

Possible explanations:

Promotion and gender are independent, no gender discrimination, observed difference in proportions is simply due to chance. → null (nothing is going on)
Promotion and gender are dependent, there is gender discrimination, observed difference in proportions is not due to chance. → alternative (something is going on)

3 of 27

Result

Since it was quite unlikely to obtain results like the actual data or something more extreme in the simulations (the male promotion rate exceeding the female promotion rate by roughly 30 percentage points or more), we decided to reject the null hypothesis in favor of the alternative.
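The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the original result: it assumes 48 personnel files (24 labeled male, 24 labeled female) with 35 promotions in total, which is what the counts 21/24 and 14/24 imply. The function name, the number of simulations, and the seed are hypothetical choices.

```python
import random

def simulate_promotion_gap(n_sims=10_000, seed=0):
    """Estimate how often chance alone gives a gap at least as large as observed."""
    rng = random.Random(seed)
    # 35 promoted (1) and 13 not promoted (0), matching 21 + 14 promotions out of 48.
    outcomes = [1] * 35 + [0] * 13
    observed_gap = 21 - 14  # gap in promotion counts between the two groups of 24
    extreme = 0
    for _ in range(n_sims):
        # Under the null hypothesis, promotion is independent of gender,
        # so any assignment of the 35 promotions to the 48 files is equally likely.
        rng.shuffle(outcomes)
        male_promoted = sum(outcomes[:24])
        female_promoted = sum(outcomes[24:])
        if male_promoted - female_promoted >= observed_gap:
            extreme += 1
    return extreme / n_sims

promotion_p_value = simulate_promotion_gap()
print(f"share of simulations at least as extreme: {promotion_p_value:.3f}")
```

A small share here means the observed gap would be unusual if promotion and gender were independent.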

4 of 27

Recap: hypothesis testing framework

We start with a null hypothesis (H₀) that represents the status quo.

We also have an alternative hypothesis (H_A) that represents our research question, i.e. what we're testing for.

We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or traditional methods based on the central limit theorem (coming up next...).

If the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis. If they do, then we reject the null hypothesis in favor of the alternative.

We'll formally introduce the hypothesis testing framework using an example on testing a claim about a population mean.

5 of 27

Testing hypotheses using confidence intervals

Earlier we calculated a 95% confidence interval for the average number of exclusive relationships college students have been in to be (2.7, 3.7). Based on this confidence interval, do these data support the hypothesis that college students on average have been in more than 3 exclusive relationships?

The associated hypotheses are:

H₀: µ = 3: College students have been in 3 exclusive relationships, on average

H_A: µ > 3: College students have been in more than 3 exclusive relationships, on average

Since the null value is included in the interval, we do not reject the null hypothesis in favor of the alternative.
This is a quick-and-dirty approach for hypothesis testing. However, it doesn't tell us the likelihood of certain outcomes under the null hypothesis, i.e. the p-value, based on which we can make a decision on the hypotheses.
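As a rough sketch, the interval check amounts to asking whether the null value lies inside the confidence interval. The function name below is hypothetical; the numbers are the ones from this slide.

```python
def null_value_in_interval(null_value, interval):
    """Return True when the null value lies inside the confidence interval."""
    lower, upper = interval
    return lower <= null_value <= upper

# 95% confidence interval for the mean number of exclusive relationships
relationships_ci = (2.7, 3.7)
null_inside = null_value_in_interval(3, relationships_ci)

# A null value inside the interval means we do not reject H0.
print("do not reject H0" if null_inside else "reject H0")
```

Note that this check gives only a yes/no answer; it does not produce a p-value.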

6 of 27

Number of college applications

A similar survey asked how many colleges students applied to, and 206 students responded to this question. This sample yielded an average of 9.7 college applications with a standard deviation of 7. The College Board website states that counselors recommend students apply to roughly 8 colleges. Do these data provide convincing evidence that the average number of colleges all Duke students apply to is higher than recommended?

http://www.collegeboard.com/student/apply/the-application/151680.html

7 of 27

Number of college applications - conditions

Which of the following is not a condition that needs to be met to proceed with this hypothesis test?

Students in the sample should be independent of each other with respect to how many colleges they applied to.
Sampling should have been done randomly.
The sample size should be less than 10% of the population of all Duke students.
There should be at least 10 successes and 10 failures in the sample.
The distribution of the number of colleges students apply to should not be extremely skewed.

8 of 27

Number of college applications - conditions

Which of the following is not a condition that needs to be met to proceed with this hypothesis test?

Students in the sample should be independent of each other with respect to how many colleges they applied to.
Sampling should have been done randomly.
The sample size should be less than 10% of the population of all Duke students.
There should be at least 10 successes and 10 failures in the sample. (Answer: this success-failure condition applies to proportions, not to a test about a mean.)
The distribution of the number of colleges students apply to should not be extremely skewed.

9 of 27

p-values

We then use this test statistic to calculate the p-value, the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis were true.
If the p-value is low (lower than the significance level, α, which is usually 5%) we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence reject H₀.
If the p-value is high (higher than α) we say that it is likely to observe the data even if the null hypothesis were true, and hence do not reject H₀.
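The decision rule in these two bullets can be written as a small helper. This is only a sketch: the function name is hypothetical, α defaults to the usual 5%, and the second p-value in the usage lines (0.30) is a made-up value for illustration.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"

print(decide(0.0003))  # a low p-value
print(decide(0.30))    # hypothetical high p-value
```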

10 of 27

Number of college applications - p-value

P(x̄ > 9.7 | µ = 8) = P(Z > 3.4) = 0.0003

p-value: probability of observing data at least as favorable to H_A as our current data set (a sample mean greater than 9.7), if in fact H₀ were true (the true population mean was 8).
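The calculation behind this p-value can be reproduced as follows, assuming the sampling distribution of the mean is approximately normal. The helper name is hypothetical. The slide's Z = 3.4 matches a standard error rounded to 0.5; without rounding, Z comes out at about 3.49 and the p-value is slightly smaller, which leads to the same conclusion.

```python
import math

def upper_tail_test(sample_mean, null_mean, sample_sd, n):
    """Z statistic and upper-tail p-value, P(Z > z), for a test about a mean."""
    standard_error = sample_sd / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Upper tail of the standard normal via the complementary error function
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return standard_error, z, p_value

apps_se, apps_z, apps_p = upper_tail_test(sample_mean=9.7, null_mean=8, sample_sd=7, n=206)
print(f"SE = {apps_se:.3f}, Z = {apps_z:.2f}, p-value = {apps_p:.4f}")
```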

12 of 27

Number of college applications - Making a decision

p-value = 0.0003

If the true average of the number of colleges Duke students applied to is 8, there is only a 0.03% chance of observing a random sample of 206 Duke students who on average apply to 9.7 or more schools.
This probability is too low for us to believe that a sample mean of 9.7 or more schools happened simply by chance.

Since the p-value is low (lower than 5%) we reject H₀.

The data provide convincing evidence that Duke students apply to more than 8 schools on average.

The difference between the null value of 8 schools and the observed sample mean of 9.7 schools is unlikely to be due to chance or sampling variability alone.

13 of 27

Practice

A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night. A sample of 169 college students taking an introductory statistics class yielded an average of 6.88 hours, with a standard deviation of 0.94 hours. Assuming that this is a random sample representative of all college students (bit of a leap of faith?), a hypothesis test was conducted to evaluate if college students on average sleep less than 7 hours per night. The p-value for this hypothesis test is 0.0485. Which of the following is correct?

Fail to reject H₀, the data provide convincing evidence that college students sleep less than 7 hours on average.
Reject H₀, the data provide convincing evidence that college students sleep less than 7 hours on average.
Reject H₀, the data prove that college students sleep more than 7 hours on average.
Fail to reject H₀, the data do not provide convincing evidence that college students sleep less than 7 hours on average.
Reject H₀, the data provide convincing evidence that college students in this sample sleep less than 7 hours on average.

14 of 27

Practice

Fail to reject H₀, the data provide convincing evidence that college students sleep less than 7 hours on average.
Reject H₀, the data provide convincing evidence that college students sleep less than 7 hours on average. (Correct: the p-value of 0.0485 is below α = 0.05.)
Reject H₀, the data prove that college students sleep more than 7 hours on average.
Fail to reject H₀, the data do not provide convincing evidence that college students sleep less than 7 hours on average.
Reject H₀, the data provide convincing evidence that college students in this sample sleep less than 7 hours on average.
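The p-value quoted in this practice problem can be checked from the summary statistics (n = 169, sample mean 6.88, standard deviation 0.94, null value 7). The sketch below assumes a normal sampling distribution and a lower-tail alternative; the variable names are illustrative.

```python
import math

n = 169
sample_mean = 6.88
sample_sd = 0.94
null_mean = 7

standard_error = sample_sd / math.sqrt(n)
sleep_z = (sample_mean - null_mean) / standard_error

# Lower-tail area P(Z < z), because H_A is mu < 7
sleep_p_value = 0.5 * math.erfc(-sleep_z / math.sqrt(2))

alpha = 0.05
sleep_decision = "reject H0" if sleep_p_value < alpha else "do not reject H0"
print(f"Z = {sleep_z:.2f}, p-value = {sleep_p_value:.4f}, decision: {sleep_decision}")
```

The conclusion applies to the population of college students, not just to the students in the sample.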

16 of 27

Two-sided hypothesis testing with p-values

If the research question was “Do the data provide convincing evidence that the average amount of sleep college students get per night is different from the national average?”, the alternative hypothesis would be different.

H₀: µ = 7

H_A: µ ≠ 7

Hence the p-value would change as well: both tails now count as evidence against H₀, so it doubles to 2 × 0.0485 = 0.097.
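For the two-sided alternative, the p-value counts both tails. A minimal sketch using the sleep data from the practice problem (n = 169, sample mean 6.88, standard deviation 0.94), again assuming a normal sampling distribution:

```python
import math

n = 169
sample_mean = 6.88
sample_sd = 0.94
null_mean = 7

standard_error = sample_sd / math.sqrt(n)
z = (sample_mean - null_mean) / standard_error

# erfc(|z| / sqrt(2)) equals the two-sided normal tail area P(|Z| > |z|)
two_sided_p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"two-sided p-value = {two_sided_p_value:.3f}")
```

With α = 0.05 this two-sided p-value is no longer below the significance level, so the decision differs from the one-sided test.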

17 of 27

Decision errors

Hypothesis tests are not flawless.

In the court system innocent people are sometimes wrongly convicted, and the guilty sometimes walk free.
Similarly, we can make a wrong decision in statistical hypothesis tests as well.
The difference is that we have the tools necessary to quantify how often we make errors in statistics.

20 of 27

Decision errors (cont.)

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a decision about which might be true, but our choice might be incorrect.

We (almost) never know if H₀ or H_A is true, but we need to consider all possibilities.

A Type 1 Error is rejecting the null hypothesis when H₀ is true.

A Type 2 Error is failing to reject the null hypothesis when H_A is true.
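The four combinations of truth and decision can be laid out as a small lookup. The function below is a hypothetical helper that restates the two definitions; in practice the truth is unknown, so it is useful only for reasoning about scenarios.

```python
def classify_outcome(h0_is_true, rejected_h0):
    """Label a test outcome given the (normally unknown) truth and our decision."""
    if h0_is_true and rejected_h0:
        return "Type 1 error"
    if not h0_is_true and not rejected_h0:
        return "Type 2 error"
    return "correct decision"

for h0_is_true in (True, False):
    for rejected_h0 in (True, False):
        label = classify_outcome(h0_is_true, rejected_h0)
        print(f"H0 true: {h0_is_true}, rejected H0: {rejected_h0} -> {label}")
```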

21 of 27

Hypothesis Test as a trial

If we again think of a hypothesis test as a criminal trial then it makes sense to frame the verdict in terms of the null and alternative hypotheses:

H₀: Defendant is innocent

H_A: Defendant is guilty

Which type of error is being committed in the following circumstances?

Declaring the defendant innocent when they are actually guilty

Declaring the defendant guilty when they are actually innocent

23 of 27

Hypothesis Test as a trial

If we again think of a hypothesis test as a criminal trial then it makes sense to frame the verdict in terms of the null and alternative hypotheses:

H₀: Defendant is innocent

H_A: Defendant is guilty

Which type of error is being committed in the following circumstances?

Declaring the defendant innocent when they are actually guilty

Type 2 error

Declaring the defendant guilty when they are actually innocent

Type 1 error

Which error do you think is the worse error to make?

“better that ten guilty persons escape than that one innocent suffer” - William Blackstone

24 of 27

Type 1 error rate

As a general rule we reject H₀ when the p-value is less than 0.05, i.e. we use a significance level of 0.05, α = 0.05.

This means that, for those cases where H₀ is actually true, we do not want to incorrectly reject it more than 5% of those times.

In other words, when using a 5% significance level there is about a 5% chance of making a Type 1 error if the null hypothesis is true.
P(Type 1 error | H₀ true) = α

This is why we prefer small values of α -- increasing α increases the Type 1 error rate.
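One way to see that α is the Type 1 error rate is to simulate many tests in a world where H₀ is true and count how often it is rejected. The setup below is entirely hypothetical: a standard normal population (mean 0, standard deviation 1, treated as known), samples of size 30, and an upper-tailed z-test of H₀: µ = 0.

```python
import math
import random
from statistics import NormalDist

def type1_error_rate(alpha=0.05, n=30, n_tests=5000, seed=1):
    """Share of tests that reject H0 when H0 (mu = 0) is actually true."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha)  # cutoff for an upper-tailed test
    rejections = 0
    for _ in range(n_tests):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (1 / math.sqrt(n))  # (mean - 0) / (sigma / sqrt(n))
        if z > critical_z:
            rejections += 1
    return rejections / n_tests

simulated_rate = type1_error_rate()
print(f"simulated Type 1 error rate: {simulated_rate:.3f}")
```

The simulated rate should land close to the chosen α, and rerunning with a larger α should raise it.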

25 of 27

Choosing a significance level

Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05. However, it is often helpful to adjust the significance level based on the application.

We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.

If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01). Under this scenario we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence favoring H_A before we would reject H₀.

If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject H₀ when the null is actually false.
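The trade-off between the two error types can be illustrated numerically. The sketch below computes the Type 2 error rate of an upper-tailed z-test at several significance levels. It reuses the college application numbers (null mean 8, standard deviation 7, n = 206) and assumes, purely for illustration, that the true mean is 9; that value is hypothetical and does not come from the slides.

```python
import math
from statistics import NormalDist

def type2_error_rate(alpha, null_mean, true_mean, sd, n):
    """P(fail to reject H0) for an upper-tailed z-test when the mean is true_mean."""
    standard_normal = NormalDist()
    standard_error = sd / math.sqrt(n)
    critical_z = standard_normal.inv_cdf(1 - alpha)
    # How far the true mean sits above the null value, in standard errors
    shift = (true_mean - null_mean) / standard_error
    return standard_normal.cdf(critical_z - shift)

betas = {
    alpha: type2_error_rate(alpha, null_mean=8, true_mean=9, sd=7, n=206)
    for alpha in (0.01, 0.05, 0.10)
}
for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f} -> Type 2 error rate = {beta:.3f}")
```

Raising α lowers the Type 2 error rate, which is why a larger significance level suits situations where missing a real effect is the costlier mistake.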

26 of 27

Recap: Hypothesis testing framework

1. Set the hypotheses.
For a single proportion this will look like:
H₀: p = null value
H_A: p < or > or ≠ null value

2. Check assumptions and conditions

3. Calculate a test statistic and a p-value

4. Make a decision, and interpret it in context

If p-value < α, reject H₀, there is sufficient evidence for [H_A]
If p-value > α, do not reject H₀, there is not sufficient evidence for [H_A]
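The four steps can be strung together for a single proportion. The following is a sketch under stated assumptions, not a definitive implementation: the function name is hypothetical, the success-failure condition is checked using the null value, the normal approximation is used for the p-value, and the data in the usage line (60 successes out of 100, null value 0.5) are made up.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, null_p, alternative="greater", alpha=0.05):
    """Steps 2 to 4 of the framework for H0: p = null_p."""
    # Step 2: check the success-failure condition under H0
    if n * null_p < 10 or n * (1 - null_p) < 10:
        raise ValueError("success-failure condition not met")
    # Step 3: test statistic and p-value
    p_hat = successes / n
    standard_error = math.sqrt(null_p * (1 - null_p) / n)
    z = (p_hat - null_p) / standard_error
    normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - normal.cdf(z)
    elif alternative == "less":
        p_value = normal.cdf(z)
    else:  # two-sided
        p_value = 2 * (1 - normal.cdf(abs(z)))
    # Step 4: decision
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    return z, p_value, decision

# Step 1 (hypothetical): H0: p = 0.5 versus H_A: p > 0.5
prop_z, prop_p, prop_decision = one_proportion_z_test(60, 100, 0.5)
print(f"Z = {prop_z:.2f}, p-value = {prop_p:.4f}, decision: {prop_decision}")
```

The decision still has to be interpreted in the context of the research question, which code cannot do for you.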

27 of 27

Explore more free resources at openintro.org/ahss, including:

AHSS Textbook
Videos - content videos, worked examples, TI-84 and Casio tutorials
Slides
Data Sets
Desmos Activities
Interactive Tableau graphs
Statistical Software Labs
Discussion Forums (free support for students and teachers)

Teacher-only content is also available for Verified Teachers, including:

Exercise solutions
Sample exams
Ability to request a free desk copy for a course
Statistics Teachers email group

Questions? Contact us.
