1 of 33

Lecture 26

Center and Spread

DATA 8

Spring 2020

2 of 33

Weekly Goals

  • Today
    • Describing distributions: “center” and “spread”
    • How big are most of the values?
  • Wednesday
    • The bell shaped curve and its relation to large random samples
  • Friday
    • The variability in a random sample average
    • Choosing the size of a random sample

3 of 33

Confidence Intervals For Testing

4 of 33

Using a CI for Testing

What if we want to do a hypothesis test, but we can’t simulate under the null?

  • Null hypothesis: Population average = x
  • Alternative hypothesis: Population average ≠ x
  • Cutoff for P-value: p%
  • Method:
    • Construct a (100-p)% confidence interval for the population average
    • If x is not in the interval, reject the null
    • If x is in the interval, can’t reject the null
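
The method above can be sketched in a few lines. This is a minimal illustration, not the course's own code: it assumes the interval is built with the bootstrap percentile method, and the names `bootstrap_ci_mean` and `ci_test` and the sample values are made up for the example.

```python
import random
import statistics


def bootstrap_ci_mean(sample, p=5, repetitions=2000, seed=0):
    """Approximate (100 - p)% interval for the population average.

    Assumed method: resample the sample with replacement, average each
    resample, and keep the middle (100 - p)% of those averages.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    resampled_means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(repetitions)
    )
    # Cut p/2 percent of the resampled averages off each end.
    cut = int(repetitions * p / 200)
    return resampled_means[cut], resampled_means[repetitions - cut - 1]


def ci_test(sample, x, p=5):
    """Apply the rule from the slide: reject the null if x is outside the interval."""
    left, right = bootstrap_ci_mean(sample, p)
    if x < left or x > right:
        return "reject the null"
    return "can't reject the null"


# Made-up sample, for illustration only.
sample = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14, 13, 15, 11, 12, 14]
print(bootstrap_ci_mean(sample))
print(ci_test(sample, x=20))
print(ci_test(sample, x=12.7))
```

With p = 5, the interval is a 95% interval, so the test uses a 5% cutoff.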

5 of 33

Center and Spread

6 of 33

Questions

  • How can we quantify natural concepts like “center” and “variability”?

  • Why do many of the empirical distributions that we generate come out bell shaped?

  • How is sample size related to the accuracy of an estimate?

7 of 33

Average

8 of 33

The Average (or Mean)

Data: 2, 3, 3, 9

Average = (2+3+3+9)/4 = 4.25

  • Need not be a value in the collection
  • Need not be an integer even if the data are integers
  • Somewhere between min and max, but not necessarily halfway in between
  • Same units as the data
  • Smoothing operator: collect all the contributions in one big pot, then split evenly

(Demo)
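
The properties listed on this slide can be checked on the slide's own data. A small sketch using only the Python standard library; the variable names are illustrative.

```python
data = [2, 3, 3, 9]

# The average: add everything up, then split evenly.
average = sum(data) / len(data)
print(average)

# Need not be a value in the collection, nor an integer.
print(average in data)

# Between min and max, but not necessarily halfway in between.
halfway = (min(data) + max(data)) / 2
print(min(data) < average < max(data), average == halfway)

# "One big pot": giving everyone the average preserves the total.
equal_shares = [average] * len(data)
print(sum(equal_shares) == sum(data))
```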

9 of 33

Discussion Question

Are the medians of these two distributions the same or different? Are the means the same or different? If you say “different,” then say which one is bigger.

10 of 33

Comparing Mean and Median

  • Mean: Balance point of the histogram
  • Median: Half-way point of data; half the area of histogram is on either side of median
  • If the distribution is symmetric about a value, then that value is both the average and the median.
  • If the histogram is skewed, then the mean is pulled away from the median in the direction of the tail.
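
A small made-up example of the last two bullets: in a symmetric collection the mean and median agree, and a long right tail pulls the mean above the median. The data values are invented for illustration.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]      # symmetric about 3
right_skewed = [1, 2, 3, 4, 50]  # same median, long right tail

for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    mean = statistics.fmean(values)
    median = statistics.median(values)
    print(name, "mean =", mean, "median =", median)
```

Only the largest value changed between the two collections, so the median stays at 3 while the mean moves toward the tail.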

11 of 33

Standard Deviation

12 of 33

Defining Variability

Plan A: “biggest value - smallest value”

  • Doesn’t tell us much about the shape of the distribution

Plan B:

  • Measure variability around the mean
  • Need to figure out a way to quantify this

(Demo)

13 of 33

How Far from the Average?

  • Standard deviation (SD) measures roughly how far the data are from their average

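  • SD = root mean square of deviations from average

    Read the phrase from right to left to get the five steps: (1) average, (2) deviations from average, (3) square, (4) mean, (5) root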

  • SD has the same units as the data
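
The definition "root mean square of deviations from average" can be computed one word at a time. A minimal sketch on the data 2, 3, 3, 9 from the earlier slide; `statistics.pstdev` from the Python standard library is used only as a cross-check.

```python
import math
import statistics

data = [2, 3, 3, 9]

average = sum(data) / len(data)                 # average
deviations = [v - average for v in data]        # deviations from average
squared = [d ** 2 for d in deviations]          # square
mean_square = sum(squared) / len(squared)       # mean
sd = math.sqrt(mean_square)                     # root

print(deviations)
print(sd)

# The deviations always sum to zero, which is why they are squared
# before averaging.
print(sum(deviations))

# Cross-check against the population SD from the standard library.
print(statistics.pstdev(data))
```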

14 of 33

Why Use the SD?

There are two main reasons.

  • The first reason:

No matter what the shape of the distribution, the bulk of the data are in the range “average ± a few SDs”

  • The second reason:

Coming up next time.

15 of 33

Chebyshev's Inequality

16 of 33

How Big are Most of the Values?

No matter what the shape of the distribution, the bulk of the data are in the range “average ± a few SDs”

Chebyshev’s Inequality

No matter what the shape of the distribution, the proportion of values in the range “average ± z SDs” is at least 1 - 1/z²

17 of 33

Chebyshev’s Bounds

Range               Proportion
average ± 2 SDs     at least 1 - 1/4 (75%)
average ± 3 SDs     at least 1 - 1/9 (88.888…%)
average ± 4 SDs     at least 1 - 1/16 (93.75%)
average ± 5 SDs     at least 1 - 1/25 (96%)

No matter what the distribution looks like

(Demo)
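
Chebyshev's bounds can be checked empirically on any collection of numbers. The sketch below uses a deliberately skewed, randomly generated data set (an assumption made for illustration, not data from the lecture) and compares the observed proportions with the bounds.

```python
import random
import statistics


def proportion_within(values, z):
    """Proportion of values in the range average ± z SDs."""
    avg = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(abs(v - avg) <= z * sd for v in values)
    return inside / len(values)


# Made-up, heavily right-skewed data: far from bell shaped.
rng = random.Random(26)
skewed = [rng.expovariate(1.0) ** 2 for _ in range(5000)]

for z in (2, 3, 4, 5):
    bound = 1 - 1 / z ** 2
    observed = proportion_within(skewed, z)
    print(f"average ± {z} SDs: bound {bound:.4f}, observed {observed:.4f}")
```

The observed proportions are typically well above the bounds. Chebyshev gives a guarantee that holds for every distribution, so it is usually conservative for any particular one.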

18 of 33

Standard Units

19 of 33

Standard Units

  • How many SDs above average?
  • z = (value - average)/SD
    • Negative z: value below average
    • Positive z: value above average
    • z = 0: value equal to average
  • When values are in standard units: average = 0, SD = 1
  • Chebyshev: At least 96% of the values of z are between -5 and 5

(Demo)
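
A minimal sketch of the conversion to standard units, using only the Python standard library. The function name `standard_units` is illustrative, and the data are the values 2, 3, 3, 9 from the earlier slide.

```python
import statistics


def standard_units(values):
    """Convert each value to z = (value - average) / SD."""
    avg = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - avg) / sd for v in values]


data = [2, 3, 3, 9]
z = standard_units(data)
print(z)

# In standard units the average is 0 and the SD is 1
# (up to floating-point rounding).
print(statistics.fmean(z), statistics.pstdev(z))
```

Values below the average (2 and 3) get negative z, and the value above it (9) gets positive z.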

20 of 33

Discussion Question

Find whole numbers that are close to:

  1. the average age

  2. the SD of the ages

(Demo)

21 of 33

The SD and the Histogram

  • Usually, it's not easy to estimate the SD by looking at a histogram.

  • But if the histogram has a bell shape, then you can.

22 of 33

The SD and Bell-Shaped Curves

If a histogram is bell-shaped, then

  • the average is at the center

  • the SD is the distance between the average and the points of inflection on either side

(Demo)

23 of 33

Point of Inflection

24 of 33

The Normal Distribution

25 of 33

The Standard Normal Curve

A beautiful formula that we won’t use at all:

\[ \phi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} z^2}, \qquad -\infty < z < \infty \]

26 of 33

Bell Curve

27 of 33

Normal Proportions

28 of 33

How Big are Most of the Values?

No matter what the shape of the distribution, the bulk of the data are in the range “average ± a few SDs”

If a histogram is bell-shaped, then

  • Almost all of the data are in the range “average ± 3 SDs”
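
To see how much sharper the bell-shaped case is than Chebyshev's bounds, the proportions under the standard normal curve can be computed directly. This sketch uses `statistics.NormalDist` from the Python standard library as a stand-in for whatever tool the demo uses; the helper name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def normal_proportion_within(z):
    """Area under the standard normal curve between -z and z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


for z in (1, 2, 3):
    chebyshev = 1 - 1 / z ** 2  # holds for every distribution
    normal = normal_proportion_within(z)
    print(f"average ± {z} SDs: Chebyshev at least {chebyshev:.4f}, "
          f"normal about {normal:.4f}")
```

For a normal distribution, roughly 68% of the values are within 1 SD of the average, about 95% within 2 SDs, and about 99.7% within 3 SDs.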

29 of 33

Bounds and Normal Approximations

30 of 33

A “Central” Area

31 of 33

Central Limit Theorem

32 of 33

Sample Averages

  • The Central Limit Theorem describes how the normal distribution (a bell-shaped curve) is connected to random sample averages.
  • We care about sample averages because they estimate population averages.

33 of 33

Central Limit Theorem

If the sample is

  • large, and
  • drawn at random with replacement,

then, regardless of the distribution of the population, the probability distribution of the sample sum (or the sample average) is roughly normal

(Demo)
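
The theorem can be illustrated by simulation. The sketch below assumes a made-up, strongly skewed population and draws many large random samples with replacement. If the sample averages are roughly normal, about 95% of them should fall within 2 SDs of their own average.

```python
import random
import statistics

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# Made-up population: mostly small values with a few large ones (skewed).
population = [1] * 70 + [2] * 20 + [10] * 10

sample_size = 400    # "large"
repetitions = 2000   # number of simulated samples

# rng.choices draws at random with replacement.
sample_averages = [
    statistics.fmean(rng.choices(population, k=sample_size))
    for _ in range(repetitions)
]

center = statistics.fmean(sample_averages)
spread = statistics.pstdev(sample_averages)
within_2 = sum(
    abs(a - center) <= 2 * spread for a in sample_averages
) / repetitions

print("population average:", statistics.fmean(population))
print("average of sample averages:", center)
print("proportion within 2 SDs:", within_2)
```

The population here is far from bell shaped, yet the sample averages cluster around the population average in a roughly normal pattern.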