1 of 33

Lecture 26

Center and Spread

DATA 8

Spring 2020

2 of 33

Weekly Goals

Today

Describing distributions: “center” and “spread”
How big are most of the values?

Wednesday

The bell shaped curve and its relation to large random samples

Friday

The variability in a random sample average
Choosing the size of a random sample

3 of 33

Confidence Intervals For Testing

4 of 33

Using a CI for Testing

What if we want to do a hypothesis test, but we can’t simulate under the null?

Null hypothesis: Population average = x
Alternative hypothesis: Population average ≠ x
Cutoff for P-value: p%
Method:

Construct a (100-p)% confidence interval for the population average
If x is not in the interval, reject the null
If x is in the interval, can’t reject the null

5 of 33

Center and Spread

6 of 33

Questions

How can we quantify natural concepts like “center” and “variability”?

Why do many of the empirical distributions that we generate come out bell shaped?

How is sample size related to the accuracy of an estimate?

8 of 33

The Average (or Mean)

Data: 2, 3, 3, 9
Average = (2+3+3+9)/4 = 4.25

Need not be a value in the collection
Need not be an integer even if the data are integers
Somewhere between min and max, but not necessarily halfway in between
Same units as the data
Smoothing operator: collect all the contributions in one big pot, then split evenly

(Demo)
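
The properties listed on this slide can be checked on the slide's own data. This is only a small sketch in plain Python; the variable names are illustrative.

```python
data = [2, 3, 3, 9]

# Collect all the contributions in one big pot, then split evenly.
average = sum(data) / len(data)
print(average)  # 4.25

# Need not be a value in the collection, nor an integer.
print(average in data)  # False

# Somewhere between min and max, but not necessarily halfway in between.
midpoint = (min(data) + max(data)) / 2
print(min(data) <= average <= max(data))  # True
print(midpoint)  # 5.5

# Splitting evenly preserves the total.
print(sum([average] * len(data)) == sum(data))  # True
```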

9 of 33

Discussion Question

Are the medians of these two distributions the same or different? Are the means the same or different? If you say “different,” then say which one is bigger.

10 of 33

Comparing Mean and Median

Mean: Balance point of the histogram
Median: Half-way point of data; half the area of histogram is on either side of median
If the distribution is symmetric about a value, then that value is both the average and the median.
If the histogram is skewed, then the mean is pulled away from the median in the direction of the tail.
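
A small illustration of the last two points, using two hypothetical data sets that are not from the lecture: one symmetric, one with a long right tail.

```python
import statistics

# Hypothetical data: symmetric about 3.
symmetric = [1, 2, 3, 4, 5]
# Hypothetical data: same median, but with a long right-hand tail.
skewed_right = [1, 2, 3, 4, 50]

print(statistics.fmean(symmetric), statistics.median(symmetric))        # 3.0 3
print(statistics.fmean(skewed_right), statistics.median(skewed_right))  # 12.0 3
```

Changing the largest value from 5 to 50 leaves the median at 3 but pulls the mean toward the tail.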

11 of 33

Standard Deviation

12 of 33

Defining Variability

Plan A: “biggest value - smallest value”

Doesn’t tell us much about the shape of the distribution

Plan B:

Measure variability around the mean
Need to figure out a way to quantify this

(Demo)

13 of 33

How Far from the Average?

Standard deviation (SD) measures roughly how far the data are from their average

SD = root mean square of deviations from average

Computed right to left: (1) average, (2) deviations from average, (3) square, (4) mean, (5) root

SD has the same units as the data
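
The definition can be followed one step at a time. The sketch below reuses the data 2, 3, 3, 9 from the slide on the average and compares the result with `statistics.pstdev`, the standard library's population SD; the variable names are illustrative.

```python
import math
import statistics

data = [2, 3, 3, 9]

# Read "root mean square of deviations from average" from right to left.
average = sum(data) / len(data)                    # step 1: average
deviations = [value - average for value in data]   # step 2: deviations from average
squares = [d ** 2 for d in deviations]             # step 3: square
mean_square = sum(squares) / len(squares)          # step 4: mean
sd = math.sqrt(mean_square)                        # step 5: root

print(deviations)   # [-2.25, -1.25, -1.25, 4.75]
print(mean_square)  # 7.6875
print(sd)           # about 2.77
print(math.isclose(sd, statistics.pstdev(data)))  # True
```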

14 of 33

Why Use the SD?

There are two main reasons.

The first reason:

No matter what the shape of the distribution,

the bulk of the data are in the range “average ± a few SDs”

The second reason:

Coming up next time.

15 of 33

Chebyshev's Inequality

16 of 33

How Big are Most of the Values?

No matter what the shape of the distribution,

the bulk of the data are in the range “average ± a few SDs”

Chebyshev’s Inequality

No matter what the shape of the distribution,

the proportion of values in the range “average ± z SDs” is

at least 1 - 1/z²

17 of 33

Chebyshev’s Bounds

Range	Proportion
average ± 2 SDs	at least 1 - 1/4 (75%)
average ± 3 SDs	at least 1 - 1/9 (88.888…%)
average ± 4 SDs	at least 1 - 1/16 (93.75%)
average ± 5 SDs	at least 1 - 1/25 (96%)

No matter what the distribution looks like

(Demo)
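
One way to see the bounds at work is to check them against data that is far from bell shaped. The sketch below assumes a made-up, strongly right-skewed data set (draws from an exponential distribution); the helper name `proportion_within` is hypothetical.

```python
import random
import statistics

def proportion_within(values, z):
    """Proportion of values in the range average ± z SDs."""
    average = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - average) <= z * sd]
    return len(inside) / len(values)

# Hypothetical skewed data, generated only for illustration.
rng = random.Random(8)
skewed_values = [rng.expovariate(1.0) for _ in range(10_000)]

for z in (2, 3, 4, 5):
    bound = 1 - 1 / z ** 2
    observed = proportion_within(skewed_values, z)
    print(f"average ± {z} SDs: at least {bound:.4f}, observed {observed:.4f}")
```

The observed proportions are typically well above the bounds: Chebyshev gives a guarantee that holds for every distribution, not an estimate for any particular one.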

18 of 33

Standard Units

19 of 33

Standard Units

How many SDs above average?
z = (value - average)/SD

Negative z: value below average
Positive z: value above average
z = 0: value equal to average

When values are in standard units: average = 0, SD = 1
Chebyshev: At least 96% of the values of z are between -5 and 5

(Demo)
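
A minimal sketch of the conversion, again on the data 2, 3, 3, 9. The function name `standard_units` is an illustrative choice, and the SD used is the population SD (root mean square of deviations).

```python
import statistics

def standard_units(values):
    """Convert each value to z = (value - average) / SD."""
    average = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(value - average) / sd for value in values]

data = [2, 3, 3, 9]
z = standard_units(data)

print([round(value, 2) for value in z])
# Values below the average (4.25) get negative z; the value 9 gets positive z.
print(round(statistics.fmean(z), 10))   # average of the z values is 0 (up to rounding)
print(round(statistics.pstdev(z), 10))  # SD of the z values is 1 (up to rounding)
```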

20 of 33

Discussion Question

Find whole numbers that are close to:

the average age

the SD of the ages

(Demo)

21 of 33

The SD and the Histogram

Usually, it's not easy to estimate the SD by looking at a histogram.

But if the histogram has a bell shape, then you can.

22 of 33

The SD and Bell-Shaped Curves

If a histogram is bell-shaped, then

the average is at the center

the SD is the distance between the average and the points of inflection on either side

(Demo)
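
As a rough numerical check of the inflection-point claim, the sketch below assumes the standard normal curve (average 0, SD 1) as the bell shape and looks for the places where the curve switches between curving downward and curving upward. The helper names and the grid are illustrative choices.

```python
import math

def standard_normal_curve(z):
    """Height of the standard normal curve (average 0, SD 1) at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def second_difference(f, z, h=1e-3):
    """Numerical estimate of the curvature (second derivative) of f at z."""
    return (f(z + h) - 2 * f(z) + f(z - h)) / h ** 2

# Scan from -3 to 3 in steps of 0.01 and record where the curvature changes sign.
grid = [i / 100 for i in range(-300, 301)]
inflection_points = []
for a, b in zip(grid, grid[1:]):
    curvature_a = second_difference(standard_normal_curve, a)
    curvature_b = second_difference(standard_normal_curve, b)
    if curvature_a * curvature_b < 0:
        inflection_points.append((a + b) / 2)

print(inflection_points)  # two points, close to -1 and +1
```

Since the SD of this curve is 1, the points of inflection sit about one SD on either side of the average.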

23 of 33

Point of Inflection

24 of 33

The Normal Distribution

25 of 33

The Standard Normal Curve

A beautiful formula that we won’t use at all:

φ(z) = (1/√(2π)) e^(-z²/2), -∞ < z < ∞

27 of 33

Normal Proportions

28 of 33

How Big are Most of the Values?

No matter what the shape of the distribution,

the bulk of the data are in the range “average ± a few SDs”

If a histogram is bell-shaped, then

Almost all of the data are in the range

“average ± 3 SDs”
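
To put numbers on "almost all", the sketch below compares the proportion under the standard normal curve with Chebyshev's bound for the same range. It assumes Python 3.8 or later for `statistics.NormalDist`; the helper name is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def normal_proportion_within(z):
    """Area under the standard normal curve between -z and z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (1, 2, 3):
    chebyshev_bound = 1 - 1 / z ** 2
    normal = normal_proportion_within(z)
    print(f"average ± {z} SDs: at least {chebyshev_bound:.2%} (any shape), "
          f"about {normal:.2%} (bell shaped)")
```

For a bell-shaped histogram, roughly 68% of the data are within 1 SD of the average, roughly 95% within 2 SDs, and roughly 99.73% within 3 SDs.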

29 of 33

Bounds and Normal Approximations

30 of 33

A “Central” Area

31 of 33

Central Limit Theorem

32 of 33

Sample Averages

The Central Limit Theorem describes how the normal distribution (a bell-shaped curve) is connected to random sample averages.
We care about sample averages because they estimate population averages.

33 of 33

Central Limit Theorem

If the sample is

large, and
drawn at random with replacement,

then, regardless of the distribution of the population, the probability distribution of the sample sum (or the sample average) is roughly normal

(Demo)
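
A small simulation along these lines, assuming a made-up, strongly skewed population and a sample size of 100 (both are illustrative choices, not values from the lecture). If the sample averages are roughly normal, about 95% of them should fall within 2 SDs of their center.

```python
import random
import statistics

rng = random.Random(26)

# Hypothetical population: skewed to the right, nothing like a bell.
population = [rng.expovariate(1.0) for _ in range(10_000)]

def sample_average(population, sample_size, rng):
    """Average of one large random sample drawn with replacement."""
    return statistics.fmean(rng.choices(population, k=sample_size))

sample_averages = [sample_average(population, 100, rng) for _ in range(2_000)]

center = statistics.fmean(sample_averages)
spread = statistics.pstdev(sample_averages)
within_2_sds = sum(
    abs(a - center) <= 2 * spread for a in sample_averages
) / len(sample_averages)

print(f"population average:         {statistics.fmean(population):.3f}")
print(f"center of sample averages:  {center:.3f}")
print(f"proportion within 2 SDs:    {within_2_sds:.3f}")
```

The center of the sample averages lands close to the population average, which is why sample averages are useful as estimates.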
