1 of 22

Lecture 24

Confidence Intervals

DATA 8

Fall 2020

2 of 22

Announcements

Online tutoring section sign-ups open tonight at 6pm

Tutoring sections commence next week

Vitamins 23 and 24 due Friday 10:59am
Lab this week

3 of 22

Weekly Goals

Today

Estimation
Bootstrap
Confidence intervals

Friday

Interpreting confidence intervals

5 of 22

Computing Percentiles

The Xth percentile is the first value on the sorted list that is at least as large as X% of the elements.

Example: s = [1, 7, 3, 9, 5]

s_sorted = [1, 3, 5, 7, 9]

For a percentile that does not exactly correspond to an element, take the next greater element instead

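The 80th percentile is the 4th element in sorted order, since (80/100) * 5 = 4

percentile(80, s) is 7

In percentile(80, s), the first argument (80) is the percentile and the second argument (s) is the data set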
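
The rule above can be sketched in plain Python. The helper name percentile_sketch is hypothetical and is not the datascience function itself; it assumes p is greater than 0 and at most 100, and it rounds the position up when p% of the list length is not a whole number.

```python
import math

def percentile_sketch(p, values):
    """Smallest value that is at least as large as p% of the elements.

    Hypothetical helper for illustration; assumes 0 < p <= 100.
    """
    ordered = sorted(values)
    # 1-based position in the sorted list; round up when it is not whole.
    k = math.ceil(p * len(ordered) / 100)
    return ordered[max(k, 1) - 1]

s = [1, 7, 3, 9, 5]
print(percentile_sketch(80, s))  # (80/100) * 5 = 4, so the 4th ordered element: 7
```
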

6 of 22

The percentile Function

The pth percentile is the smallest value in a set that is at least as large as p% of the elements in the set
Function in the datascience module:

percentile(p, values)

p is between 0 and 100

Returns the pth percentile of the array

(Demo)

7 of 22

Discussion Question

Which are True, when s = [1, 7, 3, 9, 5]?

percentile(10, s) == 0

percentile(39, s) == percentile(40, s)

percentile(40, s) == percentile(41, s)

percentile(50, s) == 5

(Demo)

9 of 22

Inference: Estimation

How do we calculate the value of an unknown parameter?

If you have a census (that is, the whole population):

Just calculate the parameter and you’re done

If you don’t have a census:

Take a random sample from the population
Use a statistic as an estimate of the parameter

(Demo)
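
A minimal sketch of the two cases, using only the Python standard library. The population here is synthetic (made-up normal values) and the median is chosen as the parameter purely for illustration; neither comes from the lecture data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a population; in practice this is usually unavailable.
population = [random.gauss(50, 10) for _ in range(100_000)]

# With a census: just calculate the parameter.
parameter = statistics.median(population)

# Without a census: take a random sample and use a statistic as the estimate.
sample = random.sample(population, 500)
estimate = statistics.median(sample)

print(f"parameter: {parameter:.2f}, estimate: {estimate:.2f}")
```
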

10 of 22

Variability of the Estimate

One sample ➜ One estimate

But the random sample could have come out differently

And so the estimate could have been different

Big question:

How different would it be if we did it again?

(Demo)

11 of 22

Quantifying Uncertainty

The estimate is usually not exactly right:

Estimate = Parameter + Error

How accurate is the estimate, usually?

How big is a typical error?

When we have a census, we can do this by simulation

(Demo)
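
When the whole population is available, the size of a typical error can be explored by repeating the sampling many times. The sketch below assumes a synthetic population, a sample size of 400, and 2,000 repetitions; all three are illustrative choices, and one_estimate is a hypothetical helper.

```python
import random
import statistics

random.seed(1)

# Synthetic census (made-up values with a long right tail).
population = [random.expovariate(1 / 40) for _ in range(50_000)]
parameter = statistics.median(population)

def one_estimate(sample_size=400):
    """Draw one random sample from the population and return its median."""
    return statistics.median(random.sample(population, sample_size))

# Estimate = Parameter + Error, so Error = Estimate - Parameter.
errors = [one_estimate() - parameter for _ in range(2_000)]

# One way to summarize a "typical" error: the median absolute error.
typical_error = statistics.median(abs(e) for e in errors)
print(f"typical absolute error: {typical_error:.2f}")
```
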

12 of 22

Where to Get Another Sample?

We want to understand errors of our estimate
Given the population, we could simulate

...but we only have the sample!

To get many values of the estimate, we needed many random samples
Can’t go back and sample again from the population:

No time, no money

Stuck?

13 of 22

The Bootstrap

14 of 22

The Bootstrap

A technique for simulating repeated random sampling

All that we have is the original sample

… which is large and random
Therefore, it probably resembles the population

So we sample at random from the original sample!

15 of 22

Why the Bootstrap Works

population

sample

resamples

All of these look pretty similar, most likely.

16 of 22

Why We Need the Bootstrap

population

sample

resamples

What we wish we could get

What we really get

17 of 22

Real World vs. Bootstrap World

Real world:

True probability distribution (population)

→ Random sample 1

→ Estimate 1

→ Random sample 2

→ Estimate 2

…
→ Random sample 10000

→ Estimate 10000

Bootstrap world:

Empirical distribution of original sample (“population”)

→ Bootstrap sample 1

→ Estimate 1

→ Bootstrap sample 2

→ Estimate 2

...
→ Bootstrap sample 1000

→ Estimate 1000

Hope: these two scenarios are analogous

18 of 22

The Bootstrap Principle

The bootstrap principle:

Bootstrap-world sampling ≈ Real-world sampling

Not always true!

… but reasonable if sample is large enough

We hope that:

Variability of bootstrap estimate
Distribution of bootstrap errors

...are similar to what they are in the real world
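
The hope stated above can be checked on made-up data where the population is known. This sketch compares the spread of real-world estimates with the spread of bootstrap estimates; the population, the sample size, and the choice of the mean as the statistic are all assumptions for illustration.

```python
import random
import statistics

random.seed(2)

# Synthetic population; in a real study this is not available.
population = [random.gauss(100, 15) for _ in range(50_000)]
n = 500
original_sample = random.sample(population, n)

# Real world: fresh random samples from the population (usually impossible).
real_estimates = [
    statistics.fmean(random.sample(population, n)) for _ in range(1_000)
]

# Bootstrap world: resamples drawn with replacement from the one original sample.
boot_estimates = [
    statistics.fmean(random.choices(original_sample, k=n)) for _ in range(1_000)
]

real_sd = statistics.stdev(real_estimates)
boot_sd = statistics.stdev(boot_estimates)
print(f"SD of real-world estimates: {real_sd:.3f}")
print(f"SD of bootstrap estimates:  {boot_sd:.3f}")
```

With a sample this large the two spreads should come out close, though not identical.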

19 of 22

Key to Resampling

From the original sample,

draw at random
with replacement
as many values as the original sample contained

The size of the new sample has to be the same as the original one, so that the two estimates are comparable

(Demo)
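
The three requirements above map onto a single standard-library call, random.choices, which draws with replacement. The helper name bootstrap_resample and the sample values are made up for illustration.

```python
import random

random.seed(3)

original_sample = [12, 15, 9, 22, 17, 15, 30, 11, 14, 19]  # made-up values

def bootstrap_resample(sample):
    """Draw len(sample) values at random, with replacement, from sample."""
    return random.choices(sample, k=len(sample))

resample = bootstrap_resample(original_sample)
print(len(resample) == len(original_sample))  # same size as the original sample
print(sorted(resample))  # some values typically repeat, others are left out
```
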

20 of 22

Confidence Intervals

21 of 22

95% Confidence Interval

Interval of estimates of a parameter
Based on random sampling
95% is called the confidence level

Could be any percent between 0 and 100
Higher level means wider intervals

The confidence is in the process that gives the interval:

It generates a “good” interval about 95% of the time.

(Demo)
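
One common way to build such an interval, assumed here rather than taken from the slide, is the bootstrap percentile method: resample many times, compute the estimate each time, and keep the middle 95% of the estimates. The sample is synthetic, and bootstrap_medians and middle_interval are hypothetical helpers.

```python
import math
import random
import statistics

random.seed(4)

# Made-up original sample; in practice this is the one random sample in hand.
original_sample = [random.gauss(70, 12) for _ in range(300)]

def bootstrap_medians(sample, repetitions=1_000):
    """Medians of resamples, each the same size as the original sample."""
    return [
        statistics.median(random.choices(sample, k=len(sample)))
        for _ in range(repetitions)
    ]

def middle_interval(estimates, level=95):
    """Endpoints that enclose the middle level% of the estimates."""
    ordered = sorted(estimates)
    tail = (100 - level) / 2                           # 2.5 when level is 95
    lo = math.ceil(tail * len(ordered) / 100)          # 1-based rank, left end
    hi = math.ceil((100 - tail) * len(ordered) / 100)  # 1-based rank, right end
    return ordered[max(lo, 1) - 1], ordered[hi - 1]

estimates = bootstrap_medians(original_sample)
left, right = middle_interval(estimates)
print(f"approximate 95% confidence interval: [{left:.2f}, {right:.2f}]")
```

Calling middle_interval(estimates, 99) on the same estimates gives an interval at least as wide, matching the point that a higher level means wider intervals.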
