1 of 21

Lecture 23

Confidence Intervals

DATA 8

Spring 2022

2 of 21

Announcements

Congrats on finishing the midterm!

Scores will be out soon (no later than Friday)

No homework due this week!
Lab 7 due Friday at 11pm
We’re over halfway through!

You’ve got this!

4 of 21

Computing Percentiles

The pth percentile is first value on the sorted list that is at least as large as p% of the elements.

Example: s = [1, 7, 3, 9, 5]

s_sorted = [1, 3, 5, 7, 9]

If p% does not exactly correspond to an element (e.g. 85th percentile), take the next greater element (9).

The 80th percentile is ordered element 4: (80/100) * 5

percentile(80, s) is 7

Percentile

Data array

5 of 21

The percentile Function

The pth percentile of a set of numbers is the smallest value in the set that is at least as large as p% of the elements in the set
Function in the datascience module:

percentile(p, values_array)

p is between 0 and 100

Evaluates to the pth percentile of the array

(Demo)

6 of 21

Discussion Question

Which are True, when s = [1, 5, 7, 3, 9]?

percentile(10, s) == 0

percentile(39, s) == percentile(40, s)

percentile(40, s) == percentile(41, s)

percentile(50, s) == 5

(Demo)

8 of 21

Inference: Estimation

How can we figure out the value of an unknown parameter?

If you have a census (that is, the whole population):

Just calculate the parameter and you’re done

If you don’t have a census:

Take a random sample from the population
Use a statistic as an estimate of the parameter

(Demo)

9 of 21

Variability of the Estimate

One sample ➜ One estimate

But the random sample could have come out differently

And so the estimate could have been different

Big question:

How different would it be if we did it again?

(Demo)

10 of 21

Quantifying Uncertainty

The estimate is usually not exactly right

How accurate is the estimate, usually?

If we already have a census, we can check this by comparing the estimate and the parameter

(Demo)

11 of 21

Where to Get Another Sample?

We want to understand variability of our estimate
Given the population, we could simulate

...but we only have the sample!

To get many values of the estimate, we needed many random samples
Can’t go back and sample again from the population:

No time, no money

Stuck?

12 of 21

The Bootstrap

13 of 21

The Bootstrap

A technique for simulating repeated random sampling

All that we have is the original sample

… which is large and random
Therefore, it probably resembles the population

So we sample at random from the original sample!

14 of 21

The Problem

population

sample

What we wish we could see

What we get to see

All we have is the random sample
We know it could have come out differently
We need to know how different, to quantify the variability in estimates based on the sample
So we need to create another sample … or two … or more

15 of 21

Why the Bootstrap Works

population

sample

resamples

All of these look pretty similar, most likely.

16 of 21

Why We Need the Bootstrap

population

sample

resamples

We can’t see the parameter

But we can see the sample ...

and generate lots of resamples

17 of 21

The Bootstrap Principle

The bootstrap principle:

Re-sampling from the original random sample

≈ Sampling from the population

with high probability

Useful method for estimating many parameters if the original random sample is large enough

But doesn’t work well for estimating some parameters

18 of 21

Key to Resampling

From the original sample,

draw at random
with replacement
as many values as the original sample contained

The size of the new sample has to be the same as the original one, so that the two estimates are comparable

19 of 21

The Bootstrap Process

One Random Sample

True but unknown distribution

(population)

→ Random sample

(the original sample)

Bootstrap:

Empirical distribution of original sample (“population”)

→ Bootstrap sample 1

→ Estimate 1

→ Bootstrap sample 2

→ Estimate 2

...
→ Bootstrap sample 1000

→ Estimate 1000

(Demo)

20 of 21

Confidence Intervals

21 of 21

95% Confidence Interval

Interval of estimates of a parameter
Based on random sampling
95% is called the confidence level

Could be any percent between 0 and 100
Higher level means wider intervals

The confidence is in the process that creates the interval:

It generates a “good” interval about 95% of the time.

(Demo)