The Bootstrap
CSCI 104: Data Science and Computing for All
Williams College, Fall 2024
Announcements
Learning Objectives
Statistical Inference
Estimation
Association
Hypothesis Testing
A
B
Did the observed data happen due to chance?
How do we estimate an unknown population parameter from a sample?
How do we quantify confidence in our estimates?
Statistical Inference
Unknown Parameter: Average beak length for Fortis Finches
Parameter: A fixed number associated with a population
Statistic: A number computed from a (random) sample
Statistical Inference: Estimate the value of the parameter with statistics
Review: Random Sampling
Each member of the population has the same probability of being picked.
Statistical Inference
Statistic: Average beak length
One Random Sample of Fortis Finches
Parameter: A fixed number associated with a population
Statistic: A number computed from a (random) sample
Statistical Inference: Estimate the value of the parameter with statistics
Varies with sample!
Empirical Distribution of a Statistic
How do we quantify our degree of confidence in our estimates and the sampling process?
Preview: Confidence intervals
Larger interval ➜ Less confidence in the estimate
Smaller interval ➜ More confidence in the estimate
What is the median salary for jobs involving data?
Do an online survey!
A real-world survey asked thousands of people to report their yearly salary.
Mean vs. Median
Symmetric Distribution
Mean and median are the same
Skewed Distribution
Mean is pulled toward tail
More Skewed Distribution
Mean is pulled further toward tail
Mean: "Balancing point" of histogram
Median: "Halfway point" of data
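A small sketch of how the mean and median respond to skew. The three data sets are made-up numbers chosen for illustration, not the salary data from the lecture.

```python
import numpy as np

# Hypothetical data sets: same middle value, increasingly long right tail.
symmetric = np.array([1, 2, 3, 4, 5])
skewed = np.array([1, 2, 3, 4, 50])
more_skewed = np.array([1, 2, 3, 4, 500])

for name, data in [("symmetric", symmetric),
                   ("skewed", skewed),
                   ("more skewed", more_skewed)]:
    # The mean is the balancing point; the median is the halfway point.
    print(f"{name:12s} mean = {np.mean(data):7.1f}   median = {np.median(data):4.1f}")
```

The median stays put in all three cases because the middle value never changes, while the mean moves toward the tail as the largest value grows.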
1. Salary Data
$92,000
💡Think-Pair-Share
Interval of Estimates
Estimated population median
$92,000 ± $4,000
Hedge and give a range of estimates for the population parameter
Next few lectures: What do such intervals mean? How do we make them?
Which estimate do you prefer?
Survey #1 Estimate: $92k ± $4k
Survey #2 Estimate: $100k ± $13k
Smaller interval ➜ more confidence in the estimate
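To make the comparison concrete, the two "estimate ± margin" statements can be turned into interval endpoints and widths. This is only arithmetic on the numbers shown on the slide; the variable names are illustrative.

```python
# Each survey is written as (estimate, margin), in dollars.
surveys = {
    "Survey #1": (92_000, 4_000),
    "Survey #2": (100_000, 13_000),
}

intervals = {}
for name, (estimate, margin) in surveys.items():
    low, high = estimate - margin, estimate + margin
    intervals[name] = (low, high, high - low)
    print(f"{name}: from ${low:,} to ${high:,} (width ${high - low:,})")
```

Survey #1 gives the narrower range, and that range lies entirely inside the range from Survey #2.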
Quantifying estimation error
Error size
We expect less error:
Sample Estimate = Population Parameter + Error
Unknown parameter we're trying to estimate
Deviation from parameter due to sampling process
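A minimal sketch of the relationship above, assuming we could see the whole population. The population here is synthetic (made-up, salary-like numbers), which is the only reason the parameter and the error can be computed; in a real survey both are unknown.

```python
import numpy as np

rng = np.random.default_rng(104)

# Hypothetical population of yearly salaries (an assumption, not real data).
population = rng.lognormal(mean=11.4, sigma=0.5, size=100_000)
parameter = np.median(population)      # fixed number tied to the population

# One random sample of 506 people, drawn without replacement.
sample = rng.choice(population, size=506, replace=False)
estimate = np.median(sample)           # statistic computed from the sample

error = estimate - parameter           # deviation due to the sampling process
print(f"parameter = {parameter:,.0f}")
print(f"estimate  = {estimate:,.0f}")
print(f"error     = {error:,.0f}")
```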
Quantifying estimation error
The Big Question
How do we determine the distribution of errors we may see for estimates based on random samples?
Quantifying estimation error
Computers (simulation)
Rooted in algorithms
Math (analytical)
Rooted in rules (axioms)
😱
Game plan: Quantifying estimation error
Today
Bootstrapping: How do we do it? Why does it work?
Next Time
Create a confidence interval after bootstrapping.
Estimating the Error of Sample A (First Attempt)
Population Median?
Sample A
Estimate: $92,000
506 random survey respondents
???
Sample B
Estimate: $90,000
Sample C
Estimate: $91,500
Sample D
Estimate: $93,750
Other real-world samples
...
Distribution of Samples' Medians
Sample A Estimate
Key Insight
Empirical distribution of samples' medians shows estimation error due to sampling
We can't actually do this!
Why not?
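The first attempt can be simulated if we pretend the whole population is available. The sketch below draws many fresh samples from a synthetic population (an assumption for illustration) and records each sample's median. With a real survey this step is exactly what we cannot do, because it would mean running the survey again and again.

```python
import numpy as np

rng = np.random.default_rng(104)

# Hypothetical population of yearly salaries (made up for this sketch).
population = rng.lognormal(mean=11.4, sigma=0.5, size=100_000)
population_median = np.median(population)

num_samples = 1_000    # how many separate "surveys" we pretend to run
sample_size = 506      # same size as Sample A on the slide

sample_medians = np.empty(num_samples)
for i in range(num_samples):
    # A brand-new random sample from the population each time.
    sample = rng.choice(population, size=sample_size, replace=False)
    sample_medians[i] = np.median(sample)

# Each sample's error is its median minus the population median.
errors = sample_medians - population_median
print(f"typical error size (std of medians): {np.std(sample_medians):,.0f}")
```

A histogram of `sample_medians` would be the "Distribution of Samples' Medians" picture.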
Estimating the Error of Sample A (Second Attempt)
Population Median?
Sample A
Estimate: $92,000
506 random survey respondents
???
...
The Bootstrap
“Lift ourselves up by our bootstraps”
Resample B
Estimate: $85,000
Resample C
Estimate: $89,000
Resample D
Estimate: $92,000
Distribution of Resamples' Medians
Sample A Estimate
Key Insight
Empirical distribution of resamples' medians also shows estimation error due to sampling
Bootstrapping algorithm
Repeat many times:
- Simulate one resample: (re)sample from the original sample randomly with replacement, using the same size as the original sample.
  simulated_resample = np.random.choice(sample, len(sample))
- Record the resample's statistic.
  Example statistic: np.median(simulated_resample)
Analyze the recorded statistics from all trials.
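The steps above can be put together as a short loop. This is a sketch under assumptions: `sample` is a synthetic stand-in for the 506 survey responses (the real data is not reproduced here), and the seed and number of trials are arbitrary choices.

```python
import numpy as np

np.random.seed(104)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the original sample of 506 salaries.
sample = np.random.lognormal(mean=11.4, sigma=0.5, size=506)

num_trials = 5_000
resample_medians = np.empty(num_trials)

for i in range(num_trials):
    # np.random.choice samples WITH replacement by default,
    # and we keep the resample the same size as the original sample.
    simulated_resample = np.random.choice(sample, len(sample))
    # Record the statistic for this resample.
    resample_medians[i] = np.median(simulated_resample)

# Analyze the recorded statistics from all trials.
print(f"original sample median:     {np.median(sample):,.0f}")
print(f"spread of resample medians: {np.std(resample_medians):,.0f}")
```

Sampling with replacement matters here: resampling 506 values without replacement would return the original sample in a different order, and every resample median would be identical.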
2. Bootstrapping
Why the Bootstrap Works
Population
Many Resamples
What we wish to get
Sample
What we can get
Why the Bootstrap Works
Population
Many Resamples
Sample
Key Insight
Sampling from the real world ≈ bootstrap resampling from one real-world sample
(1) By the Law of Averages, the distribution of a large random sample resembles the population distribution
(2) By the Law of Averages, the resample distributions resemble the sample distribution, and therefore the population distribution
Real-world and bootstrap distributions have approximately the same variability and distribution of errors
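One way to check this claim is to compare the two processes side by side on a synthetic population, where both are possible. Everything below is an illustrative assumption: the population is made up, and `medians_of_draws` is a hypothetical helper written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(104)

# Hypothetical population (made up); in practice we never see this.
population = rng.lognormal(mean=11.4, sigma=0.5, size=100_000)
sample_size, num_trials = 506, 2_000

def medians_of_draws(source, replace):
    """Draw num_trials samples of sample_size from source; return their medians."""
    return np.array([
        np.median(rng.choice(source, size=sample_size, replace=replace))
        for _ in range(num_trials)
    ])

# What we wish to get: many fresh samples from the population.
samples_medians = medians_of_draws(population, replace=False)

# What we can get: many resamples (with replacement) from ONE sample.
one_sample = rng.choice(population, size=sample_size, replace=False)
resamples_medians = medians_of_draws(one_sample, replace=True)

print(f"spread of samples' medians:   {np.std(samples_medians):,.0f}")
print(f"spread of resamples' medians: {np.std(resamples_medians):,.0f}")
```

The two spreads should come out similar in size but not identical. The resamples' medians are centered near the one sample's median rather than the population median, so it is the variability that carries over.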
Many Samples from Population
Population
Distribution of samples' medians
Many Resamples from One Sample
Sample
Distribution of resamples' medians
Sample Estimate = Population Parameter + Error
Bootstrapping is only possible with computers!
"Bootstrap Methods: Another Look at the Jackknife" published in 1979.
~45 years is relatively recent in the history of math/statistics!
Game plan: Quantifying estimation error
Today
Bootstrapping: How do we do it? Why does it work?
Next Time
Create a confidence interval after bootstrapping.
Learning Objectives