Estimators, Bias, and Variance
Exploring the different sources of error in the predictions that our models make.
Data 100/Data 200, Spring 2025 @ UC Berkeley
Narges Norouzi and Josh Grossman
Content credit: Acknowledgments
LECTURE 18
Announcements
You don't have to come to instructor office hours with an agenda. We're happy to talk about anything on your mind, and we are excited to talk to you!
Reminder about coffee chats with Josh. Slots are beginning to fill up as we near the end of the term.
Folks joining from home: I'm switching from the laser pointer to the tablet pointer, and in the future we can separately post videos of physical demonstrations like OLS.
Probability is challenging to learn! Not a prereq of Data 100. If it feels tough, that's expected 🙂
Why Probability?
[Roadmap: Model Selection Basics (Cross-Validation) → Regularization → Probability I: Random Variables → Estimators (today) → Probability II: Bias and Variance → Inference/Multicollinearity]
[Data science lifecycle: Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions]
Today's lecture is conceptually challenging!
Be kind to yourself, especially after spring break! 💙
Notation of Parameter Estimation
Last time: Coins!
P(Heads) = 0.5, P(Tails) = 0.5
Let Xi be a random variable (r.v.) representing the ith outcome of a series of coin flips.
If heads, Xi = 1. If tails, Xi = 0
P(Xi=1) = P(Xi=0) = 0.5
Xi ~ Bernoulli(p=0.5), where the Xi's are independent and identically distributed (i.i.d.)
If an r.v. X follows a Bernoulli distribution, P(X=1) = p and P(X=0) = 1-p
New terminology: Data-generating Process (DGP)
Data-generating process (DGP): Xi ~ Bernoulli(p = 0.5), i.i.d.
E(Xi) [Expected value]:
What is the long run average of the Xi's ?
E(Xi) = p = 0.5
Var(Xi) [Variance]:
How spread out are the Xi's around their average?
Var(Xi) = p(1-p) = 0.25
X1 = 1,   X2 = 0,   X3 = 0,   . . . ,   X∞ = 1
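A quick way to sanity-check these numbers is to simulate the DGP. This is a minimal sketch, not code from the lecture notebooks; the seed and `n_flips` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate many i.i.d. Bernoulli(p = 0.5) outcomes from the coin-flip DGP.
p = 0.5
n_flips = 100_000
flips = rng.binomial(n=1, p=p, size=n_flips)  # each entry is X_i: 1 = heads, 0 = tails

# The long-run average and spread should be close to
# E(X_i) = p = 0.5 and Var(X_i) = p(1 - p) = 0.25.
print(flips.mean())  # ~0.5
print(flips.var())   # ~0.25
```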
An equivalent way to think about the coin flip DGP
Randomly sampling with replacement from a warehouse with an infinite number of random coin flip outcomes (i.e., a population of coin flips):
X1 = 1,   X2 = 0,   X3 = 0,   . . . ,   X∞ = 1
E(Xi) [Expected value]:
What is the average value of the Xi's ?
E(Xi) = p = 0.5
Var(Xi) [Variance]:
How spread out are the Xi's around their average?
Var(Xi) = p(1-p) = 0.25
The structure of the population is usually unknown
Randomly sampling with replacement from a warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population of heights):
X1 = 69 in,   X2 = 71 in,   X3 = 64 in,   . . . ,   X∞ = 60 in
E(Xi) [Expected value]:
What is the average value of the Xi's ?
We don't know!
Var(Xi) [Variance]:
How spread out are the Xi's around their average?
We don't know!
Possible population distribution of heights
Some possible distributions of the 32,000 heights of Berkeley undergrads:
We do not know the true distribution of heights. But, we may want to estimate its properties.
For example, we might want to estimate the true average height of Berkeley undergrads. [ Perhaps we are designing doors in a new building. ]
A method we know: Randomly sample 100 undergrads and calculate the sample mean.
A familiar approach: Estimate the true mean with a sample mean
Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):
i.i.d. random sample of 100 heights: X1, X2, . . ., X100
Sample mean: X̄ = 68.1 inches
Our "best guess" for the population mean is 68.1 inches.
Harder Q: How do we know if 68.1 inches is a "good" estimate?
Today, we address this question!
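In code, the estimator is just the mean of the observed sample. A minimal sketch, using a simulated stand-in for the population (the real 32,000 heights are unknown to us, so the numbers below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the unobserved population of 32,000 heights (inches).
population = rng.normal(loc=68, scale=3, size=32_000)

# Our one observed "universe": an i.i.d. sample of n = 100 heights, drawn with replacement.
sample = rng.choice(population, size=100, replace=True)

sample_mean = sample.mean()  # our estimate of the true (unknown) population mean
print(sample_mean)
```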
Thinking about a sample we could have observed
Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):
Our universe (Observed sample):
i.i.d. random sample of 100 heights: X1, X2, . . ., X100
X̄ = 68.1 inches
Sample mean is 68.1 inches.
A parallel universe (An unobserved sample):
i.i.d. random sample of 100 heights: X1, X2, . . ., X100
X̄ = 69.2 inches
Sample mean is 69.2 inches.
There are many possible samples we could have observed!
Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):
There are (effectively) infinite possible samples of size 100 we could have drawn! But, we observe just one sample.
X̄ = 68.1 inches
X̄ = 69.2 inches
X̄ = 67.9 inches
X̄ = 68.5 inches
This is also a distribution!
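Extending the earlier sketch: drawing many samples (many "parallel universes") makes this distribution of sample means visible. Again a hedged illustration with a simulated stand-in population, not the real warehouse:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the unobserved population of heights (inches).
population = rng.normal(loc=68, scale=3, size=32_000)

# "Parallel universes": many i.i.d. samples of size 100, one sample mean per universe.
sample_means = np.array([
    rng.choice(population, size=100, replace=True).mean()
    for _ in range(10_000)
])

# The sample means have their own distribution (roughly normal, by the CLT),
# centered near the population mean with SD close to sigma / sqrt(100).
print(sample_means.mean(), sample_means.std())
print(population.mean(), population.std() / np.sqrt(100))
```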
The CLT is a story of repeated sampling (i.e., parallel universes)
Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):
[Histogram of X̄ for multiple samples of size 100]
Central Limit Theorem (CLT) (Data 8)
For i.i.d. samples of Xi's of size n (X1, . . ., Xn),
where n is "big enough",
the distribution of X̄, the sample mean of the Xi's,
is roughly normal.
The same CLT story, but represented as a DGP
Data-generating process (DGP): Xi ~ Unknown, where E(Xi) = 𝜇 and SD(Xi) = 𝜎
[Histogram of X̄ for multiple samples of size 100, centered at 𝜇]
Central Limit Theorem (CLT) (Data 8)
For i.i.d. samples of Xi's of size n (X1, . . ., Xn),
where n is "big enough",
and Xi ~ Unknown, where E(Xi) = 𝜇 and SD(Xi) = 𝜎,
the distribution of X̄, the sample mean of the Xi's,
is roughly normal with mean 𝜇 and SD 𝜎/√n.
Central Limit Theorem (Data 8 + today's terminology)
For an i.i.d. sample of Xi's of size n,
where n is "big enough",
and Xi ~ Unknown, where E(Xi) = 𝜇 and SD(Xi) = 𝜎,
the distribution of X̄, the sample mean of the Xi's,
is roughly normal with mean 𝜇 and SD 𝜎/√n.
Sample mean of X: X̄ = (1/n) ∑ Xi
Expectation (let's prove it!): E(X̄) = 𝜇
Variance/Standard Deviation (let's prove it!): Var(X̄) = 𝜎²/n, so SD(X̄) = 𝜎/√n   [ i.i.d. → Cov(Xi, Xj) = 0 ]
[ Proof that the distribution is roughly normal is out of scope. ]
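A compact version of the in-scope calculation (a sketch of the standard argument using linearity of expectation and independence, not the slide's exact algebra):

```latex
\begin{aligned}
\mathbb{E}[\bar{X}]
  &= \mathbb{E}\Big[\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i\Big]
   = \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbb{E}[X_i]
   = \tfrac{1}{n}\, n\mu = \mu \\[4pt]
\operatorname{Var}(\bar{X})
  &= \tfrac{1}{n^2}\operatorname{Var}\Big(\textstyle\sum_{i=1}^{n} X_i\Big)
   = \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\operatorname{Var}(X_i)
   \qquad (\text{i.i.d.} \Rightarrow \operatorname{Cov}(X_i, X_j) = 0) \\
  &= \tfrac{1}{n^2}\, n\sigma^2 = \tfrac{\sigma^2}{n}
   \quad\Longrightarrow\quad
   \operatorname{SD}(\bar{X}) = \tfrac{\sigma}{\sqrt{n}}
\end{aligned}
```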
The Central Limit Theorem (CLT) in Data 100
Data-generating process (DGP): Xi ~ Unknown, where E(Xi) = 𝜇 and SD(Xi) = 𝜎
[Histogram of X̄ for multiple samples of size 100, centered at 𝜇]
Understanding the "parallel universe" setup of the CLT is critical to the rest of this lecture.
Next lecture, we'll learn how to construct parallel universes. Today, take them for granted 🙂.
DGP Definitions
Which of the following is true about a data-generating process (DGP)? Select all that apply.
✅ A DGP is a model for how data are randomly drawn from a true distribution or population.
✅ We typically do not observe the true structure of a DGP.
✅ We typically use an observed sample of data to estimate properties of a DGP.
❌ After our analysis is complete, we often confirm whether estimated DGP properties are equal to the true DGP properties.
We rarely observe the DGP! Our analysis often assumes the data is generated with a certain structure, and we estimate components of that assumed structure.
Like before, "All models are wrong, but some are useful."
Properties of the estimator X̄
There are infinite possible samples of size n we could have drawn! But, we observe just one sample.
Data-generating process (DGP): Xi ~ Unknown, where E(Xi) = 𝜇 and SD(Xi) = 𝜎
What is the behavior of X̄ across parallel sampling universes?
Bias of X̄: On average, how close are the X̄'s to 𝜇?
Variance of X̄: How spread out are the X̄'s from each other?
MSE of X̄: What's the expected squared difference between X̄ and 𝜇?
Generalizing our setup to 𝜽, an arbitrary property of the DGP
Data-generating process (DGP):
𝜽 is a property of the unknown distribution.
[ 𝜇, 𝞼2, median are some example 𝜽's ]
𝜽̂ is an estimator of 𝜽 calculated with a sample of Xi's of size n. For example, X̄ is an estimator of 𝜇.
What is the behavior of 𝜽̂ across parallel sampling universes?
Bias of 𝜽̂: On average, how close are the 𝜽̂'s to 𝜽?
Variance of 𝜽̂: How spread out are the 𝜽̂'s from each other?
MSE of 𝜽̂: What's the expected squared difference between 𝜽̂ and 𝜽?
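In symbols (a standard formulation consistent with these definitions, not copied from the slides):

```latex
\begin{aligned}
\operatorname{Bias}(\hat{\theta}) &= \mathbb{E}[\hat{\theta}] - \theta \\
\operatorname{Var}(\hat{\theta})  &= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] \\
\operatorname{MSE}(\hat{\theta})  &= \mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2
\end{aligned}
```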
What is a good estimator?
Archery Analogy: [target figure]
For UC Berkeley heights:
𝜇 — Population parameter; also called the true parameter, DGP property, or estimand
X̄ — Sample statistic; also called the estimator; an estimate calculated with data
What is a good estimator?
To evaluate the quality of an estimator 𝜽̂, we can think about its behavior across parallel sampling universes:
On average, how close is the estimator to 𝜽?
How variable is the estimator across different random samples?
What's the average squared difference between the estimator and 𝜽?
If the bias of an estimator is zero, then it is said to be an unbiased estimator.
What is a good estimator?
Slido: Which target demonstrates high variance and low bias?
[Archery analogy: four targets labeled A, B, C, D]
What is a good estimator?
Archery Analogy: the four targets illustrate the combinations
Low Bias / Low Variance,   High Bias / Low Variance,   Low Bias / High Variance,   High Bias / High Variance.
Bias: On average, how close is the estimator to 𝜽?
Variance: How variable is the estimator across different random samples?
A new data-generating process
Data-generating process (DGP): For a fixed set of features Xi, Yi = f(Xi) + ϵi
[Animated figure: the true curve f(x) and the noisy observations]
Black points are the f(Xi)'s.
Black lines are the random ϵi's.
Blue points are what we observe.
Goal of modeling: How well can we reconstruct f with just the blue points?
A new data-generating process
X is 120 evenly spaced points from -1 to 5. X is fixed/given/constant!
f(X) = 1 + X + X²,   Var(ϵ) = 9
We assume f is fixed but unknown to us.
Data-generating process (DGP): For a fixed set of features Xi, Yi = f(Xi) + ϵi
[Animated figure: f(x) and the observed values f(x) + 𝝐]
Our model tries to reconstruct the underlying function, but it cannot address noise
Suppose we fit the model f̂(X) = 𝜽0 + 𝜽1X + 𝜽2X².
On average, our model predicts the same as f.
Model is unbiased, but not perfect. Random noise!
[Animated figure: quadratic fits across random datasets, compared to f(x)]
If our model has low complexity, it will likely be biased and low variance
This time, we fit the model f̂(X) = 𝜽0 + 𝜽1X.
Model is systematically incorrect, on average.
Model is biased! But, it looks similar across datasets. So, model has low variance.
[Animated figure: linear fits across random datasets, compared to f(x)]
If our model has high complexity, it will likely have low bias and high variance
We fit a 20th degree polynomial.
Model is correct on average. Unbiased!
But, big changes to f̂ across datasets. High variance!
[Animated figure: degree-20 polynomial fits across random datasets, compared to f(x)]
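The three animations above can be reproduced numerically. A hedged sketch (not the lecture's plotting code): simulate the DGP f(x) = 1 + x + x² with Var(ϵ) = 9, refit polynomials of degree 1, 2, and 20 on many random datasets, and compare the average fit and its spread at one illustrative test point (x0 = 3 below is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(7)

# DGP from the slides: fixed X, f(x) = 1 + x + x^2, noise with variance 9 (SD = 3).
x = np.linspace(-1, 5, 120)
f = 1 + x + x**2
x0 = 3.0                      # single test point at which to evaluate predictions
f_x0 = 1 + x0 + x0**2

for degree in [1, 2, 20]:
    preds = []
    for _ in range(500):                      # 500 "parallel universe" training sets
        y = f + rng.normal(scale=3, size=x.size)
        coefs = np.polyfit(x, y, deg=degree)  # least-squares polynomial fit
        preds.append(np.polyval(coefs, x0))   # (numpy may warn about conditioning at degree 20)
    preds = np.array(preds)
    bias = preds.mean() - f_x0
    variance = preds.var()
    print(f"degree {degree:2d}: bias ≈ {bias:6.3f}, variance ≈ {variance:6.3f}")
```

Degree 1 shows a clear bias, degree 2 is roughly unbiased with small variance, and degree 20 is roughly unbiased with noticeably larger variance, matching the slides.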
Bias and variance with polynomials
Suppose Yi = f(Xi) + ϵi, where ϵi is an i.i.d. r.v. with E(ϵ) = 0 and Var(ϵ) = 1.
As it turns out, f(x) = 1 + x
Suppose we fit an OLS model to data randomly generated by the DGP above. Which of the following OLS specifications will have the lowest bias?
Which of the following OLS specifications will have the lowest variance?
2-minute stretch break!
Lecture 18, Data 100 Spring 2025
Prediction model represented as a DGP
[Scatter plot of Y vs. X, with the true curve f]
Blue points: Random sample of (Xi, Yi)'s
Prediction model: How well can we reconstruct f with a sample of (Xi, Yi)'s?
Data-generating process (DGP): For a fixed set of features Xi, Yi = f(Xi) + ϵi
Estimated model parameters are random
The parameters of our fitted model depend on our training data. If the data are random, the fitted model is random, too!
Evaluating the quality of a model across parallel universes
Just like an estimator, we can evaluate a model's quality by considering its behavior across different training datasets (i.e., parallel sampling universes):
For a fixed/given/constant set of features X:
Model bias: How close is our fitted model to f, on average?
Model variance: How much does the fitted model's prediction vary across samples?
Model risk (MSE): What's the average squared error between our model's prediction and the actual outcome, across samples?
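In symbols, for a fixed x (a standard formulation consistent with these definitions, where f̂ denotes the model fit on a random training sample):

```latex
\begin{aligned}
\text{model bias at } x     &= \mathbb{E}\big[\hat{f}(x)\big] - f(x) \\
\text{model variance at } x &= \operatorname{Var}\big(\hat{f}(x)\big) \\
\text{model risk at } x     &= \mathbb{E}\big[\big(Y - \hat{f}(x)\big)^2\big]
\end{aligned}
```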
Prediction model setup and notation
Goal: What is the model risk for a single observation x? x is given, so it is not random.
1a. The true DGP (i.e., population model) has the form Y = f(x) + ϵ.
1b. We assume the function f is fixed but unknown. In other words, f is not random.
1c. ϵ is random noise generated i.i.d. from a distribution with mean 0 and variance 𝞼².
1d. Y is the observed outcome. Y depends on ϵ, so Y is random.
2a. We have a random sample of training data.
2b. We fit our own model to this random training data. So, is random, too.
2c. We get a prediction by plugging x into f̂. In other words, Ŷ = f̂(x). So, Ŷ is random.
3. To calculate model risk, we compute E[(Y − Ŷ)²].
Decomposition of model risk for a single observation
[The algebraic steps are worked out across several slides; a sketch of the argument appears below the final result.]
Probability rules used along the way: f and x are fixed (not random), and E(XY) = E(X)E(Y) if X and Y are independent.
The Grand Finale: Bias-Variance Decomposition
Model Risk = Irreducible error + Model Variance + (Model Bias)²
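A compact sketch of where this comes from, using the setup above (Y = f(x) + ϵ, Ŷ = f̂(x), with ϵ independent of f̂ and E(ϵ) = 0). This follows the standard argument rather than reproducing the slide algebra verbatim:

```latex
\begin{aligned}
\mathbb{E}\big[(Y - \hat{Y})^2\big]
 &= \mathbb{E}\big[\big(f(x) + \epsilon - \hat{f}(x)\big)^2\big] \\
 &= \mathbb{E}[\epsilon^2]
   + 2\,\mathbb{E}\big[\epsilon\,\big(f(x) - \hat{f}(x)\big)\big]
   + \mathbb{E}\big[\big(f(x) - \hat{f}(x)\big)^2\big] \\
 &= \underbrace{\sigma^2}_{\text{irreducible error}}
   \;+\; 0
   \;+\; \underbrace{\operatorname{Var}\big(\hat{f}(x)\big)}_{\text{model variance}}
   \;+\; \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{(\text{model bias})^2}
\end{aligned}
```

The cross term vanishes because ϵ is independent of f̂(x) and E(ϵ) = 0; the last step also uses E(Z²) = Var(Z) + (E(Z))² with Z = f(x) − f̂(x).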
The Bias-Variance Tradeoff has been with us all along!
Practical tips for addressing bias, variance, and irreducible error
High variance corresponds to overfitting.
High bias corresponds to underfitting.
Irreducible error comes from the random noise ϵ in the DGP; it cannot be reduced by changing the model.