1 of 27

Lecture 27

Sample Means

DATA 8

Fall 2023

2 of 27

Announcements

  • Homework 8 due tonight at 11pm
  • Lab 8 due Friday at 11pm
  • Project 2 will be out on Friday

3 of 27

Weekly Goals

  • Monday
    • The bell shaped curve and its relation to large random samples
  • Today
    • Central limit theorem
    • The variability in a random sample average
  • Friday
    • Choosing the size of a random sample

4 of 27

Central Limit Theorem

5 of 27

Sample Averages

  • The Central Limit Theorem describes how the normal distribution (a bell-shaped curve) is connected to random sample averages.
  • We care about sample averages because they estimate population averages.

6 of 27

Central Limit Theorem

If the sample is

  • large, and
  • drawn at random with replacement,

Then, regardless of the distribution of the population,

the probability distribution of the sample average

is roughly normal

7 of 27

Distribution of the Sample Average

8 of 27

Why is There a Distribution?

  • You have only one random sample, and it has only one average.

  • But the sample could have come out differently.

  • And then the sample average might have been different.

  • So there are many possible sample averages.

9 of 27

Distribution of the Sample Average

  • Imagine all possible random samples of the same size as yours. There are lots of them.

  • Each of these samples has an average.

  • The distribution of the sample average is the distribution of the averages of all the possible samples.

(Demo)
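
The idea of "all possible samples" can be approximated by simulation: draw many random samples and record each one's average. The sketch below is only an illustration. The population is made up (exponential values with mean 10, not the lecture's data), the helper names `simulate_sample_averages` and `skewness` are hypothetical, and it uses only Python's standard library.

```python
import random
import statistics

def simulate_sample_averages(population, sample_size, repetitions, rng):
    """Draw `repetitions` random samples with replacement and return their averages."""
    averages = []
    for _ in range(repetitions):
        sample = rng.choices(population, k=sample_size)  # sampling with replacement
        averages.append(statistics.fmean(sample))
    return averages

def skewness(values):
    """Rough measure of asymmetry: close to 0 for a symmetric, bell-shaped distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 50,000 values with a long right tail (not bell shaped).
population = [rng.expovariate(1 / 10) for _ in range(50_000)]

averages = simulate_sample_averages(population, sample_size=400, repetitions=2_000, rng=rng)

print("population skewness:    ", round(skewness(population), 2))
print("sample-average skewness:", round(skewness(averages), 2))
```

In this setup the population should be strongly skewed while the simulated sample averages should be close to symmetric, even though every average was computed from that skewed population.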

10 of 27

Specifying the Distribution

Suppose the random sample is large.

  • We have seen that the distribution of the sample average is roughly bell shaped.

  • Important questions remain:
    • Where is the center of that bell curve?
    • How wide is that bell curve?

11 of 27

Center of the Distribution

12 of 27

The Population Average

The distribution of the sample average is roughly a bell curve centered at the population average.

(Demo)

13 of 27

Variability of the Sample Average

14 of 27

Why Is This Important?

  • Along with the center, the spread helps identify exactly which normal curve is the distribution of the sample average.
  • The variability of the sample average helps us measure how accurate the sample average is as an estimate of the population average.
  • If we want a specified level of accuracy, understanding the variability of the sample average helps us work out how large our sample has to be.

(Demo)

15 of 27

Discussion Question

The gold histogram shows the distribution of __________ values, each of which is _________________________.

  1. 900
  2. 10,000
  3. a randomly sampled flight delay
  4. an average of flight delays

16 of 27

The Two Histograms

  • The gold histogram shows the distribution of 10,000 values, each of which is an average of 900 randomly sampled flight delays.
  • The blue histogram shows the distribution of 10,000 values, each of which is an average of 400 randomly sampled flight delays.
  • Both are roughly bell shaped.
  • The larger the sample size, the narrower the bell.

(Demo)

17 of 27

Variability of the Sample Average

  • The distribution of all possible sample averages of a given size is called the distribution of the sample average.
  • We approximate it by an empirical distribution.
  • By the CLT, it’s roughly normal:
    • Center = the population average
    • SD = (population SD) / √sample size
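
Both claims (the center and the SD) can be checked empirically for the two sample sizes used in the histograms, 400 and 900. This is a sketch under assumptions: the population is a made-up skewed list standing in for the flight delays, `center_and_sd_of_averages` is a hypothetical helper, and 2,000 repetitions are used instead of 10,000 to keep the run short.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical skewed population standing in for the flight delays (not the real data).
population = [rng.expovariate(1 / 15) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

def center_and_sd_of_averages(sample_size, repetitions=2_000):
    """Simulate sample averages and return (mean of the averages, SD of the averages)."""
    averages = [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(repetitions)
    ]
    return statistics.fmean(averages), statistics.pstdev(averages)

results = {n: center_and_sd_of_averages(n) for n in (400, 900)}

for n, (center, spread) in results.items():
    predicted = pop_sd / math.sqrt(n)  # (population SD) / sqrt(sample size)
    print(f"n={n}: center {center:.2f} (population mean {pop_mean:.2f}), "
          f"SD {spread:.3f} (formula {predicted:.3f})")
```

The empirical SD for samples of 900 should come out smaller than for samples of 400, by a factor of roughly √(900/400) = 1.5.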

18 of 27

Discussion Question

A city has 500,000 households. The annual incomes of these households have an average of $65,000 and an SD of $45,000. The distribution of the incomes [pick one and explain]:

  1. is roughly normal because the number of households is large.
  2. is not close to normal.
  3. may be close to normal, or not; we can’t tell from the information given.

19 of 27

Central Limit Theorem

If the sample is large and drawn at random with replacement,

Then, regardless of the distribution of the population,

the probability distribution of the sample average

  • is roughly normal
  • mean = population mean
  • SD = (population SD) / √sample size

20 of 27

Discussion Question

A population has average 70 and SD 10. One of the histograms below is the empirical distribution of the averages of 10,000 random samples of size 100 drawn from the population. Which one?

(A)

(B)

(C)

21 of 27

Three Different SDs

Population of flight delays

  • Population mean:
  • Population SD: 27 minutes

Random sample of 100 flights

  • Sample mean: estimate of population mean
  • Sample SD: estimate of population SD

SD of sample average: 27/sqrt(100) = 2.7

  • If the sample instead had 10,000 flights, the SD of the sample average would be 27/sqrt(10,000) = 0.27
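
The three SDs are easy to mix up, so it can help to compute each one side by side. The sketch below assumes a made-up population (exponential delays with mean 27, chosen so that its SD is also about 27 minutes); it is not the real flight data, and the variable names are illustrative only.

```python
import math
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the population of flight delays (not the real data).
population = [rng.expovariate(1 / 27) for _ in range(50_000)]

# 1. Population SD: spread of individual delays in the whole population.
population_sd = statistics.pstdev(population)

# 2. Sample SD: spread of the delays within one random sample of 100 flights.
sample = rng.choices(population, k=100)
sample_sd = statistics.pstdev(sample)

# 3. SD of the sample average: spread of the averages across all possible samples.
sd_of_sample_average = population_sd / math.sqrt(len(sample))

print("population SD:       ", round(population_sd, 1))
print("sample SD:           ", round(sample_sd, 1))
print("SD of sample average:", round(sd_of_sample_average, 2))

# The slide's arithmetic, using its stated population SD of 27 minutes.
print(27 / math.sqrt(100))     # sample of 100 flights: 2.7
print(27 / math.sqrt(10_000))  # sample of 10,000 flights: 0.27
```

The first two numbers describe individual delays and should be similar in size. The third describes averages and is smaller by a factor of the square root of the sample size.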

22 of 27

Confidence Intervals

23 of 27

Graph of the Distribution

24 of 27

The Key to 95% Confidence

  • For about 95% of all samples, the sample average and population average are within 2 SDs of each other.

  • SD = SD of sample average = (population SD) / √sample size

1 SD above the mean

2 SDs above the mean
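
The "about 95%" figure can be checked by counting how often a sample average lands within 2 SDs of the population average. This is a minimal sketch with an invented skewed population and an arbitrary sample size of 400; neither comes from the lecture.

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical skewed population (an assumption for illustration only).
population = [rng.expovariate(1 / 20) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

sample_size = 400
sd_of_sample_average = pop_sd / math.sqrt(sample_size)

repetitions = 2_000
within_two_sds = 0
for _ in range(repetitions):
    sample_average = statistics.fmean(rng.choices(population, k=sample_size))
    # Count the sample if its average is within 2 SDs of the population average.
    if abs(sample_average - pop_mean) <= 2 * sd_of_sample_average:
        within_two_sds += 1

proportion = within_two_sds / repetitions
print(f"proportion within 2 SDs: {proportion:.3f}")
```

The printed proportion should be close to 0.95, with small run-to-run variation if the seed is changed.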

25 of 27

Constructing the Interval

For about 95% of all samples,

  • If you stand at the population average and look two SDs on both sides, you will find the sample average.

  • Distance is symmetric.

  • So if you stand at the sample average and look two SDs on both sides, you will capture the population average.
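
The reasoning above can be sketched for a single sample. In practice the population SD is unknown, so this sketch assumes the sample SD is used in its place as an estimate. The function name `approximate_95_interval` and the synthetic data are hypothetical.

```python
import math
import random
import statistics

def approximate_95_interval(sample):
    """Return (left, right): the sample average plus or minus 2 SDs of the sample average.

    The population SD is estimated by the sample SD (an assumption of this sketch).
    """
    sample_average = statistics.fmean(sample)
    estimated_sd_of_average = statistics.pstdev(sample) / math.sqrt(len(sample))
    return (sample_average - 2 * estimated_sd_of_average,
            sample_average + 2 * estimated_sd_of_average)

rng = random.Random(4)  # fixed seed so the run is repeatable
population = [rng.expovariate(1 / 20) for _ in range(20_000)]  # hypothetical data
sample = rng.choices(population, k=400)  # one large random sample, with replacement

left, right = approximate_95_interval(sample)
print(f"approximate 95% confidence interval: [{left:.2f}, {right:.2f}]")
```

The interval is centered at the sample average, which is the only average available when there is just one sample.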

26 of 27

The Interval

27 of 27

Width of the Interval

Total width of a 95% confidence interval for the population average

= 4 * SD of the sample average

= 4 * (population SD) / √sample size
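
The width formula can be turned around to ask how large a sample is needed for a given width. The sketch below reuses the population SD of 27 minutes from the flight delay example. The function names and the target width of 5 minutes are illustrative assumptions.

```python
import math

def interval_width(population_sd, sample_size):
    """Total width of the approximate 95% interval: 4 * SD of the sample average."""
    return 4 * population_sd / math.sqrt(sample_size)

def sample_size_for_width(population_sd, target_width):
    """Smallest sample size whose interval is no wider than `target_width`.

    Solves 4 * population_sd / sqrt(n) <= target_width for n.
    """
    return math.ceil((4 * population_sd / target_width) ** 2)

print(interval_width(27, 100))       # 4 * 27 / 10
print(interval_width(27, 400))       # quadrupling the sample size halves the width
print(sample_size_for_width(27, 5))  # sample size needed for a width of 5 minutes
```

Because the sample size sits under a square root, halving the width requires four times as many sampled flights.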