
Estimators, Bias, and Variance

Exploring the different sources of error in the predictions that our models make.

Data 100/Data 200, Spring 2025 @ UC Berkeley

Narges Norouzi and Josh Grossman

Content credit: Acknowledgments


LECTURE 18


Announcements

You don't have to come to instructor office hours with an agenda. We're happy to talk about anything on your mind, and we are excited to talk to you!

Reminder about coffee chats with Josh. Slots are beginning to fill up as we near the end of the term.

Folks joining from home: I'm switching from the laser pointer to the tablet pointer, and in the future we can separately post videos of physical demonstrations like OLS.

Probability is challenging to learn! It is not a prerequisite of Data 100. If it feels tough, that's expected 🙂


Would you prefer that Josh's office hours:


Why Probability?

[Course roadmap: the data science lifecycle (Question & Problem Formulation → Data Acquisition → Exploratory Data Analysis → Prediction and Inference → Reports, Decisions, and Solutions), with this unit's topics: Model Selection Basics (Cross-Validation), Regularization, Probability I (Random Variables), Estimators (today), Probability II (Bias and Variance), and Inference/Multicollinearity.]


Today's lecture is conceptually challenging!


Be kind to yourself, especially after spring break! 💙

Notation of Parameter Estimation


Last time: Coins!

P(Heads) = 0.5, P(Tails) = 0.5

Let Xi be a random variable (r.v.) representing the ith outcome of a series of coin flips.

If heads, Xi = 1. If tails, Xi = 0

P(Xi=1) = P(Xi=0) = 0.5

Xi ~ Bernoulli(p=0.5), where the Xi's are independent and identically distributed (i.i.d.)

If an r.v. X follows a Bernoulli distribution, P(X=1) = p and P(X=0) = 1-p
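As a sanity check, the coin-flip DGP above can be simulated in a few lines of NumPy (an illustrative sketch, not code from the lecture notebooks):

```python
import numpy as np

# Simulate the coin-flip DGP: i.i.d. draws X_i ~ Bernoulli(p = 0.5).
rng = np.random.default_rng(42)
p = 0.5
flips = rng.binomial(n=1, p=p, size=100_000)  # X_i = 1 if heads, 0 if tails

# The long-run average approaches E(X_i) = p = 0.5,
# and the spread approaches Var(X_i) = p(1 - p) = 0.25.
print(flips.mean(), flips.var())
```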


New terminology: Data-generating Process (DGP)

8

Data-generating process (DGP):

E(Xi) [Expected value]:

What is the long-run average of the Xi's?

E(Xi) = p = 0.5

Var(Xi) [Variance]:

How spread out are the Xi's around their average?

Var(Xi) = p(1-p) = 0.25

[Diagram: repeated draws from the DGP, e.g. X1 = 1, X2 = 0, X3 = 0, . . .]


An equivalent way to think about the coin flip DGP


Randomly sampling with replacement from a warehouse with an infinite number of random coin flip outcomes (i.e., a population of coin flips):

X1 = 1

X2 = 0

X3 = 0

E(Xi) [Expected value]:

What is the average value of the Xi's?

E(Xi) = p = 0.5

Var(Xi) [Variance]:

How spread out are the Xi's around their average?

Var(Xi) = p(1-p) = 0.25



The structure of the population is usually unknown


Randomly sampling with replacement from a warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population of heights):


X1 = 69 in

X2 = 71 in

X3 = 64 in

E(Xi) [Expected value]:

What is the average value of the Xi's?

We don't know!

Var(Xi) [Variance]:

How spread out are the Xi's around their average?

We don't know!



Possible population distribution of heights

Some possible distributions of the 32,000 heights of Berkeley undergrads:


We do not know the true distribution of heights. But, we may want to estimate its properties.

For example, we might want to estimate the true average height of Berkeley undergrads. [Perhaps we are designing doors in a new building.]

A method we know: Randomly sample 100 undergrads and calculate the sample mean.


A familiar approach: Estimate the true mean with a sample mean


Possible distributions of the raw data

i.i.d. random sample of 100 heights: X1, X2, . . ., X100

Our "best guess" for the population mean is 68.1 inches.

Harder Q: How do we know if 68.1 inches is a "good" estimate?

Today, we address this question!
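One way to make this concrete is a small simulation. The 32,000-height "population" below is invented (normal with mean 68 in and SD 4 in, an assumption purely for illustration), since the real population is exactly what we never observe:

```python
import numpy as np

# Hypothetical population of 32,000 heights (we would never see this in practice).
rng = np.random.default_rng(0)
population = rng.normal(loc=68, scale=4, size=32_000)

# Draw one i.i.d. sample of 100 heights and compute the sample mean:
sample = rng.choice(population, size=100, replace=True)
print(sample.mean())  # our "best guess" at the unknown population mean
```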


X̄ = 68.1 inches

Sample mean is 68.1 inches.

Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):


Thinking about a sample we could have observed


Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):


Our universe (Observed sample):

i.i.d. random sample of 100 heights: X1, X2, . . ., X100

X̄ = 68.1 inches

Sample mean is 68.1 inches.

A parallel universe (An unobserved sample):

i.i.d. random sample of 100 heights: X1, X2, . . ., X100

X̄ = 69.2 inches

Sample mean is 69.2 inches.

Possible distributions of the raw data


There are many possible samples we could have observed!


Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):

There are (effectively) infinite possible samples of size 100 we could have drawn! But, we observe just one sample.


X̄ = 68.1 inches

X̄ = 69.2 inches

X̄ = 67.9 inches

X̄ = 68.5 inches

This is also a distribution!

Possible distributions of the raw data


The CLT is a story of repeated sampling (i.e., parallel universes)


Warehouse with all 32,000 heights of Berkeley undergrads on slips of paper (a population):


Distribution of X̄ for multiple samples of size 100

Central Limit Theorem (CLT) (Data 8)

For i.i.d. samples of Xi's of size n (X1, . . ., Xn), where n is "big enough", the distribution of X̄, the sample mean of the Xi's, is roughly normal.

Possible distributions of the raw data


The same CLT story, but represented as a DGP


For i.i.d. samples of Xi's of size n (X1, . . ., Xn), where n is "big enough", and Xi ~ Unknown with E(Xi) = 𝜇 and SD(Xi) = 𝜎, the distribution of X̄, the sample mean of the Xi's, is roughly normal.

Data-generating process (DGP):

Distribution of X̄ for multiple samples of size 100

Possible distributions of the raw data

Central Limit Theorem (CLT) (Data 8)


The same CLT story, but represented as a DGP


Data-generating process (DGP):


For i.i.d. samples of Xi's of size n (X1, . . ., Xn), where n is "big enough", and Xi ~ Unknown with E(Xi) = 𝜇 and SD(Xi) = 𝜎, the distribution of X̄, the sample mean of the Xi's, is roughly normal with mean 𝜇 and SD 𝜎/√n.

Distribution of X̄ for multiple samples of size 100

Possible distributions of the raw data

Central Limit Theorem (CLT) (Data 8)


Central Limit Theorem (Data 8 + today's terminology)

For an i.i.d. sample of Xi's of size n, where n is "big enough", and Xi ~ Unknown with E(Xi) = 𝜇 and SD(Xi) = 𝜎, the distribution of X̄, the sample mean of the Xi's, is roughly normal with mean 𝜇 and SD 𝜎/√n.


X̄ = (1/n) ∑ Xi  (sample mean of X)

(Let's prove it!)

Proof out of scope


Central Limit Theorem (Data 8 + today's terminology)

For an i.i.d. sample of Xi's of size n, where n is "big enough", and Xi ~ Unknown with E(Xi) = 𝜇 and SD(Xi) = 𝜎, the distribution of X̄, the sample mean of the Xi's, is roughly normal with mean 𝜇 and SD 𝜎/√n.


Expectation: E(X̄) = E((1/n) ∑ Xi) = (1/n) ∑ E(Xi) = 𝜇

Variance/Standard Deviation: Var(X̄) = (1/n²) ∑ Var(Xi) = 𝜎²/n, so SD(X̄) = 𝜎/√n

(i.i.d. → Cov(Xi, Xj) = 0 for i ≠ j, so the variances add with no covariance terms.)

X̄ = (1/n) ∑ Xi  (sample mean of X)
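The CLT claim, along with E(X̄) = 𝜇 and SD(X̄) = 𝜎/√n, can be checked empirically by simulating many "parallel universes". The exponential population below is an arbitrary non-normal choice, purely for illustration:

```python
import numpy as np

# Draw 10,000 "parallel universe" samples of size n = 100 from a skewed
# population: Exponential with mu = 1 and sigma = 1 (decidedly non-normal).
rng = np.random.default_rng(1)
n, universes = 100, 10_000
sample_means = rng.exponential(scale=1.0, size=(universes, n)).mean(axis=1)

# CLT: the sample means are roughly normal,
# centered at mu = 1 with SD about sigma / sqrt(n) = 0.1.
print(sample_means.mean(), sample_means.std())
```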


The Central Limit Theorem (CLT) in Data 100


Distribution of X̄ for multiple samples of size 100

Data-generating process (DGP):


Understanding the "parallel universe" setup of the CLT is critical to the rest of this lecture.

Next lecture, we'll learn how to construct parallel universes. Today, take them for granted 🙂.

Possible distributions of the raw data

Central Limit Theorem (CLT) (Data 8)


Which of the following is true about a data-generating process (DGP)? Select all that apply.


DGP Definitions

Which of the following is true about a data-generating process (DGP)? Select all that apply.

✅ A DGP is a model for how data are randomly drawn from a true distribution or population.

✅ We typically do not observe the true structure of a DGP.

✅ We typically use an observed sample of data to estimate properties of a DGP.

❌ After our analysis is complete, we often confirm whether estimated DGP properties are equal to the true DGP properties.

We rarely observe the DGP! Our analysis often assumes the data are generated with a certain structure, and we estimate components of that assumed structure.

Like before, "All models are wrong, but some are useful."


Properties of the estimator


There are infinite possible samples of size n we could have drawn! But, we observe just one sample.

Data-generating process (DGP):

Bias of X̄: On average, how close are the X̄'s to 𝜇?

Variance of X̄: How spread out are the X̄'s from each other?

MSE of X̄: What's the expected squared difference between X̄ and 𝜇?

What is the behavior of X̄ across parallel sampling universes?


Generalizing our setup to 𝜽, an arbitrary property of the DGP


Data-generating process (DGP):

𝜽 is a property of the unknown distribution.

[ 𝜇, 𝞼², median are some example 𝜽's ]

Bias of 𝜽̂: On average, how close are the 𝜽̂'s to 𝜽?

Variance of 𝜽̂: How spread out are the 𝜽̂'s from each other?

MSE of 𝜽̂: What's the expected squared difference between 𝜽̂ and 𝜽?

𝜽̂ is an estimator of 𝜽 calculated with a sample of Xi's of size n. For example, X̄ is an estimator of 𝜇.

What is the behavior of 𝜽̂ across parallel sampling universes?
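These three quantities can be approximated by simulation. In the sketch below, 𝜽 = 𝜇 and the estimator is the sample mean; the normal population (mean 68, SD 4) is an assumption for illustration only. Note the exact identity MSE = Variance + Bias²:

```python
import numpy as np

# Approximate Bias, Variance, and MSE of theta_hat = the sample mean
# by simulating many parallel sampling universes.
rng = np.random.default_rng(2)
mu, sigma, n, universes = 68.0, 4.0, 100, 20_000

estimates = rng.normal(mu, sigma, size=(universes, n)).mean(axis=1)

bias = estimates.mean() - mu              # ~0: the sample mean is unbiased
variance = estimates.var()                # ~ sigma**2 / n = 0.16
mse = np.mean((estimates - mu) ** 2)      # = variance + bias**2 (exactly)
print(bias, variance, mse)
```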


What is a good estimator?


Archery Analogy:

  • Center of the target is the true 𝜽
  • Each arrow corresponds to a separate parameter estimate obtained from a different random sample.

For UC Berkeley heights:

  • Center of the target is 𝜇, the true average height of Berkeley undergrads
  • Each arrow corresponds to a sample mean computed from a different random sample.

Terminology: the population parameter (also called the true parameter, DGP property, or estimand) is the target 𝜽; the sample statistic (estimator) 𝜽̂ is what we compute from data to estimate it.


What is a good estimator?


To evaluate the quality of an estimator 𝜽̂, we can think about its behavior across parallel sampling universes:

On average, how close is the estimator to 𝜽?

How variable is the estimator across different random samples?

What's the average squared difference between the estimator and 𝜽?


If the bias of an estimator is zero, then it is said to be an unbiased estimator.


What is a good estimator?


Slido: Which target demonstrates high variance and low bias?

A

B

C

D


Which target demonstrates high variance and low bias?


What is a good estimator?


The 2×2 grid of targets, by bias and variance: low bias & low variance (ideal); high bias & low variance; low bias & high variance; high bias & high variance.


On average, how close is the estimator to 𝜽?

How variable is the estimator across different random samples?


A new data-generating process


Data-generating process (DGP): For a fixed set of features Xi, Yi = f(Xi) + ϵi.

Black points are the f(Xi)'s

Black lines are the random ϵi's

Blue points are what we observe.


Goal of modeling: How well can we reconstruct f with just the blue points?


A new data-generating process


X is 120 evenly spaced points from -1 to 5. X is fixed/given/constant!

f(X) = 1 + X + X²;  Var(ϵ) = 9

We assume f is fixed but unknown to us.
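Generating one dataset from this DGP takes only a few lines (the constants follow this slide; the NumPy code itself is an illustrative sketch):

```python
import numpy as np

# The slide's DGP: X is 120 evenly spaced points on [-1, 5] (fixed!),
# f(X) = 1 + X + X^2, and eps has mean 0 with Var(eps) = 9 (SD = 3).
rng = np.random.default_rng(5)
X = np.linspace(-1, 5, 120)              # fixed/given/constant features
f_X = 1 + X + X**2                       # the true function (unknown in practice)
Y = f_X + rng.normal(0, 3, size=X.size)  # the blue points we actually observe
print(Y[:5])
```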



Our model tries to reconstruct the underlying function, but it cannot address noise


Suppose we fit the model f̂(X) = 𝜽0 + 𝜽1X + 𝜽2X².

On average, our model predicts the same as f.

Model is unbiased, but not perfect. Random noise!



If our model has low complexity, it will likely be biased and low variance


This time, we fit the model f̂(X) = 𝜽0 + 𝜽1X.

Model is systematically incorrect, on average.

Model is biased! But, it looks similar across datasets. So, model has low variance.



If our model has high complexity, it will likely have low bias and high variance


We fit a 20th degree polynomial.

Model is correct on average. Unbiased!

But, big changes to f̂ across datasets. High variance!



Suppose we fit an OLS model to data randomly generated by the given DGP. Which of the given OLS specifications will have the lowest bias?


Bias and variance with polynomials

Suppose Yi = f(Xi) + ϵi, where ϵi is an i.i.d. r.v. with E(ϵ) = 0 and Var(ϵ) = 1.

As it turns out, f(x) = 1 + x

Suppose we fit an OLS model to data randomly generated by the DGP above. Which of the following OLS specifications will have the lowest bias?

  1. 𝜽0 → Biased! Insufficient complexity to model f(x) = 1 + x.
  2. 𝜽0 + 𝜽1X → Unbiased.
  3. 𝜽0 + 𝜽1X + 𝜽2X² → Also unbiased! On average, the fitted 𝜽2 will be 0.


Suppose we fit an OLS model to data randomly generated by the given DGP. Which of the given OLS specifications will have the lowest variance?


Bias and variance with polynomials

Suppose Yi = f(Xi) + ϵi, where ϵi is an i.i.d. r.v. with E(ϵ) = 0 and Var(ϵ) = 1.

As it turns out, f(x) = 1 + x

Suppose we fit an OLS model to data randomly generated by the DGP above. Which of the following OLS specifications will have the lowest bias?

  • 𝜽0 → Biased! Insufficient complexity to model f(x) = 1 + x.
  • 𝜽0 + 𝜽1X → Unbiased.
  • 𝜽0 + 𝜽1X + 𝜽2X² → Also unbiased! On average, the fitted 𝜽2 will be 0.


Which of the following OLS specifications will have the lowest variance?

  1. 𝜽0 → Lowest variance. Least model complexity to vary from sample to sample.
  2. 𝜽0 + 𝜽1X → Higher variance, but the ideal model since it is unbiased!
  3. 𝜽0 + 𝜽1X + 𝜽2X² → Even higher variance.
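A simulation makes this bias/variance ordering visible. The DGP follows the slide (f(x) = 1 + x, Var(ϵ) = 1); the grid of x values and the test point are illustrative choices:

```python
import numpy as np

# Fit polynomials of degree 0, 1, 2 to many datasets drawn from
# Y = 1 + X + eps, Var(eps) = 1, and compare predictions at x_test = 5.
rng = np.random.default_rng(3)
x = np.linspace(-1, 5, 30)   # fixed features (illustrative grid)
x_test = 5.0                 # evaluate bias/variance at this point
universes = 2_000

preds = {deg: [] for deg in (0, 1, 2)}
for _ in range(universes):
    y = 1 + x + rng.normal(0, 1, size=x.size)   # one sampled dataset
    for deg in preds:
        preds[deg].append(np.polyval(np.polyfit(x, y, deg), x_test))

for deg, p in preds.items():
    p = np.array(p)
    # Degree 0 is badly biased at x_test; degrees 1 and 2 are unbiased.
    # Variance grows with degree: var(deg 0) < var(deg 1) < var(deg 2).
    print(deg, p.mean() - (1 + x_test), p.var())
```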


2-minute stretch break!

Lecture 18, Data 100 Spring 2025


Prediction model represented as a DGP


Blue points: Random sample of (Xi,Yi)'s

Prediction model: How well can we reconstruct f with a sample of (Xi,Yi)'s?

Data-generating process (DGP): For a fixed set of features Xi, Yi = f(Xi) + ϵi.


Estimated model parameters are random


The parameters of our fitted model depend on our training data. If the data are random, the fitted model is random, too!



Evaluating the quality of a model across parallel universes


Just like an estimator, we can evaluate a model's quality by considering its behavior across different training datasets (i.e., parallel sampling universes):

Model bias: How close is our fitted model to f, on average?

Model variance: How much does our fitted model vary across random samples?

Model risk (MSE): What's the typical squared error between our model's predictions and the actual outcomes?



Evaluating the quality of a model across parallel universes


For a fixed/given/constant set of features X:

Model bias: How close is our fitted model to f, on average?

Model variance: How much does the fitted model's prediction vary across samples?

Model risk (MSE): What's the average squared error between our model's prediction and the actual outcome, across samples?


Prediction model setup and notation


Goal: What is the model risk for a single observation at a fixed x? x is given, so it is not random.

1a. The true DGP (i.e., population model) has the form Y = f(x) + ϵ.

1b. We assume the function f is fixed but unknown. In other words, f is not random.

1c. ϵ is random noise generated i.i.d. from a distribution with mean 0 and variance 𝞼².

1d. Y is the observed outcome. Y depends on ϵ, so Y is random.

2a. We have a random sample of training data.

2b. We fit our own model f̂ to this random training data. So, f̂ is random, too.

2c. We get a prediction Ŷ by plugging x into f̂. In other words, Ŷ = f̂(x). So, Ŷ is random.

3. To calculate model risk, we compute E[(Y − Ŷ)²].


Decomposition of model risk for a single observation

Probability rules used:

  • f and x are fixed, so f(x) passes through expectations unchanged.
  • E(ϵ) = 0 and Var(ϵ) = 𝜎².
  • E(XY) = E(X)E(Y) if X and Y are independent!

Model risk = E[(Y − Ŷ)²]
= E[(f(x) + ϵ − Ŷ)²]
= E[ϵ²] + E[(f(x) − Ŷ)²]    (the cross term vanishes: ϵ is independent of Ŷ and E(ϵ) = 0)
= 𝜎² + Var(Ŷ) + (f(x) − E[Ŷ])²
= Irreducible error + Model Variance + (Model Bias)²

The Grand Finale: Bias-Variance Decomposition

Interpretation:

  • Irreducible error / observational variance / noise cannot be addressed by modeling.
  • Bias-Variance Tradeoff:
    • To decrease model bias, we increase model complexity. As a result, the model will have higher model variance.
    • To decrease model variance, we decrease model complexity. The model may underfit the sample data and may have higher model bias.


Model Risk = Irreducible error + Model Variance + (Model Bias)²
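The decomposition can be verified numerically. This sketch reuses the earlier quadratic DGP (f(x) = 1 + x + x², Var(ϵ) = 9) with an underfit linear model; the specific choices (seed, grid, test point) are illustrative:

```python
import numpy as np

# Verify: Model Risk ≈ sigma^2 + Model Variance + (Model Bias)^2 at x0 = 3.
rng = np.random.default_rng(4)
x = np.linspace(-1, 5, 120)
x0, sigma = 3.0, 3.0
f = lambda t: 1 + t + t**2
universes = 5_000

# In each parallel universe, fit an underfit (linear) model and predict at x0.
preds = np.array([
    np.polyval(np.polyfit(x, f(x) + rng.normal(0, sigma, x.size), 1), x0)
    for _ in range(universes)
])

model_var = preds.var()
model_bias = preds.mean() - f(x0)
y_new = f(x0) + rng.normal(0, sigma, universes)   # fresh outcomes at x0
risk = np.mean((y_new - preds) ** 2)

print(risk, sigma**2 + model_var + model_bias**2)  # approximately equal
```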


The Bias-Variance Tradeoff has been with us all along!


Practical tips for addressing bias, variance, and irreducible error

High variance corresponds to overfitting.

  • Your model may be too complex.
  • You can reduce the # of parameters, or regularize.

High bias corresponds to underfitting.

  • Your model may be too simple to capture complexities in the data.
  • You may have overregularized → Regularization biases us towards a constant model in exchange for reduced variance!

Irreducible error

  • For a fixed dataset, nothing you can do. That's why it's irreducible.

