1 of 31

The Bootstrap

CSCI 104: Data Science and Computing for All

Williams College, Fall 2024

2 of 31

  • Lab 7
    • Hypothesis testing -- all good?
  • Prelab 8
  • Last call for Project group preferences!
    • Fill it out today ASAP.

Announcements

3 of 31

  • Understand the challenges in estimating a population parameter.
  • Understand and implement bootstrapping.

Learning Objectives

4 of 31

Statistical Inference

Estimation

Association

Hypothesis Testing

A

B

Observed data happened due to chance?

Estimate unknown population parameter from sample?

Quantify confidence in estimates?

5 of 31

Statistical Inference

Unknown Parameter

Average beak length for Fortis Finches

Parameter: A fixed number associated with a population

Statistic: A number computed from a (random) sample

Statistical Inference: Estimate the value of the parameter with statistics

6 of 31

Review: Random Sampling

Each member of population has same probability of being picked.
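To make the parameter/statistic distinction concrete, here is a minimal sketch. The population is made up (simulated beak lengths, not real finch data), and the names `population`, `sample`, `parameter`, and `statistic` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical population: beak lengths (mm) for 1,000 finches (made-up values).
population = rng.normal(loc=10.5, scale=0.8, size=1000)

# Random sample without replacement: every finch has the same chance of being picked.
sample = rng.choice(population, size=50, replace=False)

parameter = population.mean()  # fixed number; unknown in a real study
statistic = sample.mean()      # computed from the sample; estimates the parameter
print(f"parameter = {parameter:.2f}, statistic = {statistic:.2f}")
```

In practice we only ever see `sample`; the simulation shows `population` so the two numbers can be compared.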

7 of 31

Statistical Inference

Statistic: Average beak length

One Random Sample of

Fortis Finches

Parameter: A fixed number associated with a population

Statistic: A number computed from a (random) sample

Statistical Inference: Estimate the value of the parameter with statistics

8 of 31

Statistical Inference

Statistic: Average beak length

One Random Sample of

Fortis Finches

Parameter: A fixed number associated with a population

Statistic: A number computed from a (random) sample

Statistical Inference: Estimate the value of the parameter with statistics

Varies with sample!

9 of 31

Empirical Distribution of a Statistic

  1. Observe the statistic from repetitions of a (sampling) experiment or simulation
  2. Create an empirical distribution of statistics (i.e., a histogram) to show variability
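A sketch of how an empirical distribution of a statistic could be built by simulation. The population is again made up, and the sample size (50) and number of repetitions (2,000) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=10.5, scale=0.8, size=1000)  # hypothetical beak lengths (mm)

# Repeat the sampling experiment and record the statistic each time.
sample_means = np.array([
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(2000)
])

# Summarize the empirical distribution as histogram counts per bin.
counts, bin_edges = np.histogram(sample_means, bins=10)
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:5.2f} to {right:5.2f}: {count}")
```

The spread of the counts across bins is the sample-to-sample variability of the statistic.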

10 of 31

Statistical Inference

Statistic: Average beak length

One Random Sample of

Fortis Finches

Parameter: A fixed number associated with a population

Statistic: A number computed from a (random) sample

Statistical Inference: Estimate the value of the parameter with statistics

How do we quantify our degree of confidence in our estimates and the sampling process?

Varies with sample!

11 of 31

Preview:

Confidence intervals

Larger interval ➜ Less confidence

Smaller interval ➜ More confidence

12 of 31

What is the median salary for jobs involving data?

Do an online survey!

Real-world survey of thousands of people asking them to report their yearly salary

13 of 31

Mean vs. Median

Symmetric Distribution

Mean and median are the same

Skewed Distribution

Mean is pulled toward tail

More Skewed Distribution

Mean is pulled further toward tail

Mean: "Balancing point" of histogram

Median: "Halfway point" of data
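A small illustration of the mean being pulled toward the tail, using two made-up lists of numbers (not the salary data).

```python
import numpy as np

symmetric = np.array([1, 2, 3, 4, 5])
skewed = np.array([1, 2, 3, 4, 50])  # one large value creates a right tail

print(np.mean(symmetric), np.median(symmetric))  # 3.0 3.0
print(np.mean(skewed), np.median(skewed))        # 12.0 3.0
```

Changing the largest value moves the mean but leaves the median where it was, which is one reason the median is a common summary for salaries.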

14 of 31

1. Salary Data

15 of 31

  1. Population vs. sample. Will the median salary in the population be the same as the median salary in the sample?
  2. Distribution of sample. Given this distribution of survey responses, do we expect to have more $90k salaries or $145k salaries in our next survey?
  3. Estimate for population. Given #1 and #2, brainstorm how we should report an estimate of the median salary in the population.

$92,000

💡Think-Pair-Share

16 of 31

Interval of Estimates

Estimated population median

$92,000 ± $4,000

Hedge and give a range of estimates for the population parameter

Next few lectures: What do such intervals mean? How do we make them?

17 of 31

Which estimate do you prefer?

Survey #1 Estimate: $92k ± $4k

Survey #2 Estimate: $100k ± $13k

Smaller interval ➜ more confidence

18 of 31

Quantifying estimation error

Error size

We expect less error:

  • as the sample size approaches population size.
  • when every value in our sample is similar (less variability).

Sample Estimate = Population Parameter + Error

Unknown parameter we're trying to estimate

Deviation from parameter due to sampling process
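A simulation sketch of the first claim, that error tends to shrink as the sample grows. Everything here is an assumption for illustration: the population is a made-up right-skewed set of "salaries", and `typical_error` is a hypothetical helper, not something from the course library.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical right-skewed "salary" population (made-up, not the survey data).
population = rng.lognormal(mean=11.4, sigma=0.4, size=20_000)
parameter = np.median(population)  # known here only because we simulated the population

def typical_error(sample_size, trials=500):
    """Average absolute error of the sample median over many random samples."""
    errors = [
        abs(np.median(rng.choice(population, size=sample_size, replace=False)) - parameter)
        for _ in range(trials)
    ]
    return np.mean(errors)

for n in (25, 100, 400, 1600):
    print(n, round(typical_error(n)))
```

The same experiment with a smaller `sigma` (less variability in the population) would be one way to explore the second claim.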

19 of 31

Quantifying estimation error

Sample Estimate = Population Parameter + Error

Unknown parameter we're trying to estimate

Deviation from parameter due to sampling process

The Big Question

How do we determine the distribution of errors we may see for estimates based on random samples?

20 of 31

Quantifying estimation error

Computers (simulation)

Rooted in algorithms

  • Approximate solutions
  • Often convincing
  • Non-trivial problems potentially captured clearly with code

Math (analytical)

Rooted in rules (axioms)

  • Exact solutions
  • Straightforward for simple problems
  • Very few analytical properties for statistics like the median (unlike the mean, which has theorems based on bell-curves)

😱

21 of 31

Game plan: Quantifying estimation error

Today

Bootstrapping: How do we do it? Why does it work?

Next Time

Create confidence interval after bootstrapping.

22 of 31

Estimating the Error of Sample A (First Attempt)

Population Median?

Sample A

Estimate:

$92,000

506 random

survey respondents

???

Sample B

Estimate:

$90,000

Sample C

Estimate:

$91,500

Sample D

Estimate:

$93,750

Other real-world samples

...

Distribution of Samples' Medians

Sample A Estimate

Key Insight

Empirical distribution of samples' medians shows estimation error due to sampling

We can’t actually do this!

Why not?

23 of 31

Estimating the Error of Sample A (Second Attempt)

Population Median?

Sample A

Estimate:

$92,000

506 random

survey respondents

???

...

The Bootstrap

“Lift ourselves up by our bootstraps”

Resample B

Estimate:

$85,000

Resample C

Estimate:

$89,000

Resample D

Estimate:

$92,000

Distribution of Resamples' Medians

Sample A Estimate

Key Insight

Empirical distribution of resamples' medians also shows estimation error due to sampling

24 of 31

Bootstrapping algorithm

Repeat many times:

- Simulate one sample

- Record the sample Statistic

Analyze sample statistics for all trials

Example statistic: np.median(simulated_resample)

(Re)sample from the original sample randomly with replacement. Use same size as original sample.

simulated_resample = np.random.choice(sample, len(sample))
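The algorithm above can be sketched end to end as follows. The `sample` here is a stand-in of 506 made-up salaries (not the real survey responses), `bootstrap_medians` is a hypothetical helper name, and 5,000 resamples is an arbitrary choice. The sketch uses a seeded generator for reproducibility; `rng.choice(sample, size=len(sample), replace=True)` plays the same role as `np.random.choice(sample, len(sample))`, which also samples with replacement by default.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Stand-in for the original sample: 506 made-up salaries.
sample = rng.lognormal(mean=11.4, sigma=0.4, size=506)

def bootstrap_medians(sample, num_resamples=5000):
    """Return the median of each bootstrap resample of `sample`."""
    medians = np.empty(num_resamples)
    for i in range(num_resamples):
        # Resample with replacement, same size as the original sample.
        simulated_resample = rng.choice(sample, size=len(sample), replace=True)
        # Record the statistic for this trial.
        medians[i] = np.median(simulated_resample)
    return medians

resample_medians = bootstrap_medians(sample)
print("sample median:", round(np.median(sample)))
print("spread (SD) of resample medians:", round(resample_medians.std()))
```

Sampling with replacement matters: without it, every resample of the same size would just be a reshuffle of the original sample and every median would be identical.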

25 of 31

2. Bootstrapping

26 of 31

Why the Bootstrap Works

Population

Many Resamples

What we wish to get

Sample

What we can get

27 of 31

Why the Bootstrap Works

Population

Many Resamples

Sample

Key Insight

Sampling from real world

Bootstrap resampling from one real-world sample

(1) By Law of Averages, sample distribution resembles population distribution

(2) By Law of Averages, resample distributions resemble sample distribution and population distribution

28 of 31

Real-world and bootstrap distributions have same variability and distribution of errors

Many Samples from Population

Distribution of samples’ medians

Population

Distribution of resamples’ medians

Many Resamples from One Sample

Distribution of resamples' medians

Sample

Sample Estimate = Population Parameter + Error
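One way to check this claim is a simulation where, unlike in real life, the population is available. The sketch below assumes a made-up salary population and compares the spread of medians from many fresh samples with the spread of medians from resamples of a single sample. The sample size of 506 mirrors the survey; the other numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Hypothetical population of salaries (made-up values, not the survey data).
population = rng.lognormal(mean=11.4, sigma=0.4, size=20_000)
n, trials = 506, 1000

# What we wish we could do: draw many fresh samples from the population.
sample_medians = np.array([
    np.median(rng.choice(population, size=n, replace=False))
    for _ in range(trials)
])

# What we can do: draw one sample, then resample from it with replacement.
sample = rng.choice(population, size=n, replace=False)
resample_medians = np.array([
    np.median(rng.choice(sample, size=n, replace=True))
    for _ in range(trials)
])

print("spread of samples' medians:  ", round(sample_medians.std()))
print("spread of resamples' medians:", round(resample_medians.std()))
```

The two spreads should be of similar size. The centers can differ, because the resamples' medians cluster around the one sample's median and not around the population median.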

29 of 31

Bootstrapping is only possible with computers!

"Bootstrap Methods: Another Look at the Jackknife" published in 1979.

~45 years is relatively recent in the history of math/statistics!

30 of 31

Game plan: Quantifying estimation error

Today

Bootstrapping: How do we do it? Why does it work?

Next Time

Create confidence interval after bootstrapping.

31 of 31

  • Understand the challenges in estimating a population parameter.
  • Understand and implement bootstrapping.

Learning Objectives