Lecture 10: Lost in the Clouds. Intro to Inference.
Andrea Massari
Q? pollev.com/jdgg
Functions of R.V.
Normal distribution
The clouds (Νεφέλαι)
Is this die fair?
Future Data
If only…
Past Data
Future Data
The great randomness gap
Models
I.e., the future cannot be known deterministically, even if I know the past.
Past Data
Future Data
Models
Attendance check!
Word of the day: clouds
https://shorturl.at/0hQEy
Is this die fair?
“Is the die fair?” = “Are p1=⅙, p2=⅙, p3=⅙, p4=⅙, p5=⅙, p6=⅙?”
= “Is X = die roll ~ Categorical(⅙,⅙,⅙,⅙,⅙,⅙)?”
IT’S A CLAIM ABOUT THE DISTRIBUTION, NOT THE DATA
We’re lost in the clouds…
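One way to see why fairness is a claim about the distribution rather than the data is a small R sketch. It assumes a die that is fair by construction (simulated with `sample()`), an arbitrary seed, and 60 rolls; none of these numbers come from the lecture.

```r
# Simulate rolls of a die that is fair by construction (an assumption of this sketch).
set.seed(131)
n_rolls <- 60
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Observed proportion of each face in this one sample.
observed <- table(factor(rolls, levels = 1:6)) / n_rolls
print(round(observed, 3))

# The model says every face has probability 1/6, but the observed
# proportions generally differ from 1/6 because of sampling variability.
print(round(observed - 1 / 6, 3))
```

Even though the distribution is exactly Categorical(⅙, …, ⅙) here, the sample proportions will typically not all equal ⅙, which is why the data alone cannot settle the question.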
Teaser?
Past Data
Models
The future is forgotten
Inference
Data (= Sample): directly accessible/measurable
Models: directly inaccessible through data (you can’t measure/observe this stuff)
Random variables
Distributions of random variables: parameters, probabilities, distribution family
Typically… the distribution family is assumed and the parameters are unknown
E.g. …
Functional form = Normal distribution(s): just assumed … What if this is wrong? → misspecification
Parameters = μ, σ: unknown, fixed, to be estimated … How? → estimation
🎰 Welcome to the STAT 131A Casino!
The casino has two slot machines.
I'm giving each of you 50 free plays.
The first machine pays out with independent draws from a normal distribution with mean $10 and variance 5 (in squared dollars).
[ If the draw is negative, assume you get $0. ]
The second machine pays out from a normal distribution with mean $11 and variance 5.
How do you make the most money?
[ Exploit the $11 slot machine. ]
After resetting the settings on each machine, I give each of you another 50 free plays.
Now, each machine draws from a normal distribution with an unknown mean and variance 5.
[ I know the true means, but you do not. ]
How do you make the most money?
[ Discuss with neighbor ]
🗺️ Exploration vs. exploitation
To identify the higher-paying machine, we allocate some pulls to each machine to explore their payoffs.
At some point, we exploit the machine we think pays better.
This is an instance of the multi-armed bandit problem, which shows up frequently in industry and academic research.
[ Efficient experimentation / A/B testing ]
[ Ads, medicine, and even text message reminders for court! ]
[ See Data 102 ]
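The explore-then-exploit idea can be sketched in R as follows. This is only one simple strategy (often called explore-then-commit), not necessarily the best one. The true means, the seed, and the choice of 10 exploration pulls per machine are assumptions made for the simulation; in the casino, the player would not know the means.

```r
# Explore-then-commit: a simple (not optimal) strategy for two slot machines.
set.seed(131)
true_means <- c(10, 11)   # hidden from the player; assumed here for simulation
payout_sd <- sqrt(5)      # variance 5, as in the casino example
n_plays <- 50
n_explore <- 10           # pulls per machine while exploring (arbitrary choice)

# One or more pulls of a machine; negative draws pay $0.
pull <- function(machine, n) {
  pmax(rnorm(n, mean = true_means[machine], sd = payout_sd), 0)
}

# Explore: try each machine the same number of times.
explore_1 <- pull(1, n_explore)
explore_2 <- pull(2, n_explore)

# Exploit: commit the remaining plays to the machine with the higher sample mean.
best <- if (mean(explore_1) >= mean(explore_2)) 1 else 2
exploit <- pull(best, n_plays - 2 * n_explore)

total <- sum(explore_1) + sum(explore_2) + sum(exploit)
cat("Committed to machine", best, "with total winnings", round(total, 2), "\n")
```

More exploration pulls make it more likely that `best` is the truly better machine, but leave fewer plays to exploit it. That trade-off is the heart of the problem.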
🔎 Statistical inference
Observed data (e.g., individual pulls from a slot machine)
[ Sample ]
↓
Properties of the data-generating distribution (e.g., true average payoff per pull)
[ Population ]
[ True data-generating process ]
Given a sample of data X1, …, Xn from a distribution F, what can you say about F?
🏢 Parametric inference
Suppose ages at UC Berkeley are normally distributed with parameters μ and σ.
We observe X1, …, Xn ~ N(μ, σ²)
How do you estimate μ and σ?
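A minimal sketch of one natural answer, assuming we are willing to use the sample mean and sample standard deviation as estimates. The ages below are simulated, and the "true" values of μ and σ are made up for illustration; they are not real UC Berkeley data.

```r
# Simulated "ages"; the true parameter values are invented for this sketch.
set.seed(131)
true_mu <- 21
true_sigma <- 3
ages <- rnorm(200, mean = true_mu, sd = true_sigma)

mu_hat <- mean(ages)    # sample mean as an estimate of mu
sigma_hat <- sd(ages)   # sample standard deviation (n - 1 denominator) as an estimate of sigma

print(c(mu_hat = mu_hat, sigma_hat = sigma_hat))
```

In a simulation we can compare the estimates to `true_mu` and `true_sigma`; with real data those values stay hidden.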
☮️ Non-parametric inference
We do not assume that the distribution of ages F has a specific functional form.
We observe X1, …, Xn
How do you estimate properties of F?
[ For example, its mean or median ]
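As a rough illustration, sample summaries can estimate properties of F without naming a distribution family. The data below are simulated from a deliberately skewed, made-up distribution (an assumption of this sketch), but the estimation code itself never uses that fact.

```r
set.seed(131)
# Made-up skewed "ages": 18 plus an exponential amount (clearly not normal).
ages <- 18 + rexp(200, rate = 1 / 4)

# Plug-in estimates of properties of F; no functional form is assumed here.
mean_hat <- mean(ages)
median_hat <- median(ages)
q90_hat <- unname(quantile(ages, 0.9))

print(c(mean = mean_hat, median = median_hat, q90 = q90_hat))
```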
👉 Point estimation
For some quantity of interest [ Estimand ], find the single, best guess [ 🎩 Estimator ]
Estimand: What is the slot machine's true average payoff?
Estimator: Find the average payoff of 50 pulls.
(Point) Estimate: What was the observed average of 50 real pulls?
The estimand is fixed, but unobserved.
[ It is not a random variable! ]
The estimator is a random variable: it depends on the [ randomly ] observed data.
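The distinction can be sketched in R by replaying the "average of 50 pulls" estimator many times. The true mean of 10 and variance of 5 mirror the first casino machine, while the seed and the 1,000 repetitions are arbitrary choices. For simplicity, this sketch ignores the rule that negative draws pay $0.

```r
set.seed(10)
true_mean <- 10   # the estimand: fixed, and known here only because we simulate

# Each replication is one "universe": 50 pulls, then the average payoff.
estimates <- replicate(1000, mean(rnorm(50, mean = true_mean, sd = sqrt(5))))

# The estimand never changes, but the estimate varies from sample to sample.
print(range(estimates))
print(sd(estimates))
```

Each entry of `estimates` is one point estimate; the spread across entries reflects the randomness of the estimator.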
🍬 Point estimation with M&Ms
[ 📝 Worksheet ]
M&Ms come in 3 primary colors (🔴 red, 🔵 blue, 🟡 yellow) and 3 non-primary colors (🟠 orange, 🟢 green, 🟤 brown).
At the M&Ms factory, suppose there is a setting for the proportion of M&Ms printed in primary colors.
[ As candy consumers, we do not know this setting. ]
In this scenario, what is the estimand, what could be the estimator, and what's a reasonable estimate?
[ Discuss with neighbor ]
Estimand: the real proportion of primary-colored M&Ms
[ i.e., the fixed but unobserved setting on the machine ]
Estimator: the proportion of primary-colored M&Ms as calculated from a random sample of M&Ms
Point estimate: anything from 0% to 100%
Let's do some estimating!
[ Cue music ]
shorturl.at/qWq4j
🍬 Goal of M&Ms inference
Goal: Can we use a single bag of M&Ms to say anything meaningful about the estimand?
Crucially, we never observe the real value of the estimand.
[ Otherwise, we wouldn't need inference! ]
🍬 Classwide Pr(primary-colored)
For demonstration only, we will treat the proportion of primary-colored M&Ms across all bags as the estimand.
[ But remember that this is usually hidden from the analyst! ]
Classwide Pr(primary-colored) = 547 / 1256 ≈ 0.436
You will use the classwide statistic on HW2.
🥜 Inference in a nutshell
Under a set of assumptions,
can we say anything about what could have happened,
[ i.e., in parallel universes ]
using only our knowledge of what actually happened?
[ i.e., with just one sample? ]
😢 But, we only observe one universe!
🗳️ The proportion of Maricopa County voters who responded "Yes" when asked by phone whether they will vote for Harris.
📱 The average age of customers surveyed by a product manager at a startup looking to pivot into a new space.
🪳 The total number of cockroaches observed by inspectors in a randomly selected group of public housing units.
[ Unfortunately, a real example. ]
🐣 Step 0: Posit a data-generating model
Xi = 1 if primary colored
Xi = 0 if not primary colored
p : the desired proportion of primary-colored M&Ms
Is this model correct?
[ Discuss with neighbors ]
shorturl.at/Hycut
"All models are wrong, but some are useful."
George Box, 1976
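To make the posited model concrete, here is a hedged R sketch that simulates one bag under it. It reads the model as independent 0/1 draws with a common probability p, which is an interpretation of the slide rather than something it states outright. The values of `p` and `n` are hypothetical.

```r
set.seed(131)
p <- 0.4   # hypothetical factory setting; unknown to candy consumers
n <- 55    # hypothetical number of M&Ms in one bag

# X_i = 1 if the i-th M&M is primary colored, 0 otherwise.
x <- rbinom(n, size = 1, prob = p)

print(x)
print(table(x))
```

Whether real bags behave like independent draws (for example, whether colors cluster during packaging) is exactly the kind of question "Is this model correct?" invites.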
👉 Point estimation
Primary coloring of M&Ms (i.e., coin flip outcomes)
p : the desired proportion of primary-colored M&Ms
Write a formula to estimate p from X1, X2, ..., Xn
[ 📝 Worksheet ]
Take the average!
[ But isn't this a proportion? ]
Average of 0s and 1s → Same as proportion that are 1s
[ Indicator: 1 if Xi = 1, otherwise 0 ]
x <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
sum(x == TRUE) / length(x)  # 0.6
sum(x) / length(x)          # 0.6, since TRUE counts as 1 and FALSE as 0
mean(x)                     # 0.6, the same value again
In general, you should never write "==TRUE" or "==FALSE" in your code!
We have provided a best guess of the estimand.
How certain are you that this guess is "good"?
[ Next few lectures ]
🏦 Racial Disparities in Tax Audits
Elzayn et al. (2023)
Tax returns → Algorithm → Audited?
The IRS conducts a tax audit when it suspects that a taxpayer has underreported their income.
Key question: Do auditing practices disparately impact Black taxpayers?
A first step: Are there any racial disparities in audit rates?
🧮 How do you impute race?
Elzayn et al. (2023)
Problem: The IRS does not observe race.
How can you estimate the probability that a taxpayer is Black based on their first and last name and their census block group?
As a first step, how would you express this statement with probability notation? [ Concept check ]
"The probability that a taxpayer named Felix Thompson who resides in block group ABC is Black"
Option 1: Pr(F = Felix, L = Thompson, G = ABC, Black = 1)
Option 2: Pr(Black = 1 | F = Felix, L = Thompson, G = ABC)
Option 3: Pr(F = Felix, L = Thompson, G = ABC | Black = 1)
shorturl.at/rt5m6
[ Answer: Option 2. We condition on what we observe: the name and the block group. ]
Next, recall Bayes' rule:
Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
For Pr(Black = 1 | F = Felix, L = Thompson, G = ABC), what is A and what is B? Write out the resulting equality.
Pr(Black = 1 | F = Felix, L = Thompson, G = ABC) = Pr(F = Felix, L = Thompson, G = ABC | Black = 1) Pr(Black = 1) / Pr(F = Felix, L = Thompson, G = ABC)
[ A is "Black = 1"; B is "F = Felix, L = Thompson, G = ABC" ]
Pr(Black=1) is the proportion of taxpayers who are Black.
Pr(F=Felix) is the proportion of taxpayers who are named Felix.
Pr(F=Felix|Black=1) is the proportion of Black taxpayers who are named Felix.
And so on.
Pr(Black=1) is the probability that a randomly selected taxpayer is Black. ✅
[ Estimate with U.S. Census data on race ]
Pr(F=Felix, L=Thompson, G=ABC) is the probability that a randomly selected taxpayer is named "Felix Thompson" and lives in block group ABC. ✅
[ Estimate with taxpayer data containing names + locations ]
[ But you will see in HW2 that we can ignore this term ]
Pr(F=Felix, L=Thompson, G=ABC | Black = 1) is the probability that a randomly selected Black taxpayer is named "Felix Thompson" and lives in block group ABC. ❌
[ We don't know the race of individual taxpayers ]
If first name, last name, and location were independent:
Pr(F = Felix, L = Thompson, G = ABC) = Pr(F = Felix) Pr(L = Thompson) Pr(G = ABC)
What would it mean for these three events to be independent? Is this a reasonable assumption?
If independent, knowing someone's last name would tell you nothing about where they live.
[ Definitely not true! ]
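A toy R check of what independence would require, using a made-up population of eight taxpayers (all names and block groups here are hypothetical). Under independence, the joint proportion would equal the product of the marginal proportions.

```r
# Made-up toy population of 8 taxpayers (hypothetical names and block groups).
people <- data.frame(
  last  = c("Thompson", "Thompson", "Thompson", "Nguyen",
            "Nguyen", "Nguyen", "Nguyen", "Thompson"),
  block = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "XYZ", "ABC", "XYZ")
)

# Joint proportion: last name Thompson AND block group ABC.
p_joint <- mean(people$last == "Thompson" & people$block == "ABC")

# Marginal proportions.
p_last <- mean(people$last == "Thompson")
p_block <- mean(people$block == "ABC")

# Independence would require these two numbers to match.
print(c(joint = p_joint, product = p_last * p_block))
```

In this toy population the joint proportion is 0.375 while the product is 0.25, so last name and block group are not independent: knowing the last name changes the chance of living in block group ABC.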
If first name, last name, and location were independent across all first names, last names, and census block groups:
Pr(F, L, G) = Pr(F) Pr(L) Pr(G)
If first name, last name, and location were independent, conditional on race:
Pr(F, L, G | R) = Pr(F | R) Pr(L | R) Pr(G | R)
This is called conditional independence.
* Independence does not imply conditional independence, and vice versa. But this is beyond the scope of 131A.
Suppose we assume conditional independence.
[ Strong assumption! ]
Pr(F, L, G | R) = Pr(F | R) Pr(L | R) Pr(G | R)
Pr(F | R) → Mortgage applications with self-reported race
Pr(L | R) → Census data on last names + race
Pr(G | R) → Census data on race by census block group
Bayesian Improved First Name Surname Geocoding (BIFSG)
[ Impute race with first name + last name + location ]
In HW2, you will implement the Naive Bayes algorithm to classify emails as spam or not spam.
[ Naive Bayes relies on the same kind of conditional independence assumption ]
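Putting Bayes' rule and conditional independence together, a BIFSG-style calculation might look like the R sketch below. Every probability in it is invented for illustration, and the three race categories are a simplification; this is not the authors' implementation or real data.

```r
# All numbers below are invented for illustration; they are not real estimates.
prior   <- c(Black = 0.12,   White = 0.60,   Other = 0.28)    # Pr(R)
p_first <- c(Black = 0.002,  White = 0.004,  Other = 0.001)   # Pr(F = Felix | R)
p_last  <- c(Black = 0.010,  White = 0.005,  Other = 0.002)   # Pr(L = Thompson | R)
p_block <- c(Black = 0.0008, White = 0.0002, Other = 0.0003)  # Pr(G = ABC | R)

# Numerator of Bayes' rule under conditional independence:
# Pr(F | R) Pr(L | R) Pr(G | R) Pr(R)
unnormalized <- p_first * p_last * p_block * prior

# The denominator Pr(F, L, G) is the same for every race, so dividing by the
# sum of the numerators recovers it without estimating it directly.
posterior <- unnormalized / sum(unnormalized)

print(round(posterior, 3))
```

Because the categories are exhaustive in this sketch, the posterior probabilities sum to 1. This is one way to see why the term Pr(F, L, G) does not need to be estimated separately.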
🏦 Disparities across all taxpayers
Elzayn et al. (2023)
🚨 Estimated 3–5x higher audit rate for Black taxpayers.
More on this very concerning finding later in the course.