1 of 109

Lecture 10: Lost in the Clouds. Intro to Inference.

Andrea Massari

Q? pollev.com/jdgg

2 of 109

Functions of R.V.

Distributions of functions of R.V.
How to calculate expectations
Variance


3 of 109

Normal distribution

Functional form
CDF
Mean, median, min, max, variance


4 of 109

The clouds (Νεφέλαι)


5 of 109

Is this die fair?

Why care? Why do you ask?
What do we mean by it? How do we translate it to math or something concrete?


6 of 109

Future Data

If only…

7 of 109

Future Data

The great randomness gap

I.e. the future cannot be known deterministically.

8 of 109

Past Data

Future Data

The great randomness gap

I.e. the future cannot be known deterministically, even if I know the past

10 of 109

Past Data

Future Data

The great randomness gap

Models

I.e. the future cannot be known deterministically, even if I know the past

13 of 109

Past Data

Future Data

Models


15 of 109

Attendance check!

Andrea Massari

Word of the day: clouds

https://shorturl.at/0hQEy

19 of 109

Is this die fair?

Why care? Why do you ask?
What do we mean by it? How do we translate it to math or something concrete?

"Is the die fair?" = "Are p₁ = p₂ = p₃ = p₄ = p₅ = p₆ = ⅙?"

= "Is X = die roll ~ Categorical(⅙, ⅙, ⅙, ⅙, ⅙, ⅙)?"

IT’S A CLAIM ABOUT THE DISTRIBUTION, NOT THE DATA

We’re lost in the clouds…
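To see why fairness is a claim about the distribution rather than the data, here is a minimal R sketch. It assumes a die that really is fair; the seed, the 60 rolls, and the variable names are illustrative choices, not from the lecture. Even so, the observed proportions will generally not all equal ⅙.

```r
# Illustrative sketch: the seed, sample size, and names are assumptions.
set.seed(131)

n_rolls <- 60
fair_probs <- rep(1 / 6, 6)  # the distribution: a fair die

# The data: one sample of rolls drawn from that distribution
rolls <- sample(1:6, size = n_rolls, replace = TRUE, prob = fair_probs)

# Observed proportion of each face in this particular sample
observed_props <- as.numeric(table(factor(rolls, levels = 1:6))) / n_rolls

print(round(observed_props, 3))
print(round(observed_props - fair_probs, 3))  # sampling noise, not unfairness
```

The die is fair by construction, yet the sample alone does not show six equal proportions. Deciding how much deviation is "too much" is the job of inference.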

20 of 109

Teaser?

Is the jury underrepresenting African Americans?
Are Latinos disproportionately impacted by police arrests?


21 of 109

Past Data

Models

The future is forgotten

22 of 109

Past Data

Models

Inference


23 of 109

Data (= Sample): directly accessible/measurable
Random variables

Models: directly inaccessible through data (you can’t measure/observe this stuff)
Distributions of random variables: parameters, probabilities, distribution family

24 of 109

Data (= Sample): directly accessible/measurable
Random variables

Models, typically…
Assumed: distribution family
Unknown: parameters

29 of 109

E.g. …

Functional form = Normal Distribution(s)
Just assumed … What if this is wrong??? → misspecification

Parameters = μ, σ
Unknown, fixed, to be estimated … How? → estimation

30 of 109

🎰 Welcome to the STAT 131A Casino!

The casino has two slot machines.

I'm giving each of you 50 free plays.


33 of 109

🎰 Welcome to the STAT 131A Casino!

The first machine pays out with independent draws from a normal distribution with mean $10 and variance 5 (in squared dollars).
[ If the draw is negative, assume you get $0. ]

The second machine pays out from a normal distribution with mean $11 and variance 5 (in squared dollars).

How do you make the most money?
[ Exploit the $11 slot machine. ]
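The casino with known means can be simulated in a few lines of R. This is only a sketch: the helper name `pull`, the seed, and the variable names are assumptions made for illustration. Note that `rnorm` takes a standard deviation, so the variance of 5 enters as `sqrt(5)`.

```r
# Illustrative sketch: the helper name, seed, and variable names are assumptions.
set.seed(131)

# One pull = a normal draw, with negative draws paid out as $0
pull <- function(n, mu, variance = 5) {
  pmax(rnorm(n, mean = mu, sd = sqrt(variance)), 0)
}

n_plays <- 50
total_machine_1 <- sum(pull(n_plays, mu = 10))
total_machine_2 <- sum(pull(n_plays, mu = 11))

# Expected totals, ignoring the (very rare) truncation at $0
expected_1 <- n_plays * 10  # 500
expected_2 <- n_plays * 11  # 550
```

In expectation the second machine pays about $50 more over 50 plays, which is why exploiting it is the best strategy when the means are known.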

36 of 109

🎰 Welcome to the STAT 131A Casino!

After resetting the settings on each machine, I give each of you another 50 free plays.

Now, each machine draws from a normal distribution with an unknown mean and variance 5 (in squared dollars).
[ I know the true means, but you do not. ]

How do you make the most money?
[ Discuss with neighbor ]

38 of 109

🗺️ Exploration vs. exploitation

To identify the higher paying machine, we allocate some pulls to each machine to explore their payoffs.

At some point, we exploit the machine we think pays better.

This is an instance of the multi-armed bandit problem, which shows up frequently in industry and academic research.
[ Efficient experimentation / A/B testing ]
[ Ads, medicine, and even text message reminders for court! ]
[ See Data 102 ]
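One simple way to trade off exploration and exploitation is to explore both machines for a fixed number of pulls and then commit to the one with the higher observed average (sometimes called explore-then-commit). The lecture does not prescribe a strategy, so the R sketch below is just one possibility. The true means, the 10 exploration pulls per machine, and all names are hypothetical.

```r
# Illustrative sketch: true means, exploration budget, and names are assumptions.
set.seed(131)

pull <- function(n, mu, variance = 5) {
  pmax(rnorm(n, mean = mu, sd = sqrt(variance)), 0)
}

true_means <- c(10, 11)  # hidden from the player in the real game

n_plays <- 50
n_explore <- 10  # pulls spent on EACH machine while exploring (arbitrary choice)

# Explore: sample both machines
explore_1 <- pull(n_explore, true_means[1])
explore_2 <- pull(n_explore, true_means[2])

# Commit: pick the machine with the higher observed average payoff
best <- if (mean(explore_2) > mean(explore_1)) 2 else 1

# Exploit: spend the remaining plays on the chosen machine
n_exploit <- n_plays - 2 * n_explore
exploit <- pull(n_exploit, true_means[best])

total_winnings <- sum(explore_1) + sum(explore_2) + sum(exploit)
```

More exploration makes it likelier that `best` is the truly better machine, but leaves fewer plays to exploit it. That tension is the heart of the bandit problem.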

42 of 109

🔎 Statistical inference

Observed data (e.g., individual pulls from a slot machine)
[ Sample ]

↓

Properties of the data-generating distribution (e.g., true average payoff per pull)
[ Population ]
[ True data-generating process ]

43 of 109

🔎 Statistical inference

Given a sample of data X₁, …, X_n from a distribution F, what can you say about F?

44 of 109

🏢 Parametric inference

Suppose ages at UC Berkeley are normally distributed with parameters μ and σ.

We observe X₁, …, X_n ~ N(μ, σ²)

How do you estimate μ and σ?
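The slide poses estimation as an open question. One common answer, assumed here rather than taken from the lecture, is to use the sample mean for μ and the sample standard deviation for σ. The sketch below simulates ages from made-up "true" parameters so the estimates can be compared with them.

```r
# Illustrative sketch: the true parameters and sample size are made up.
set.seed(131)

true_mu <- 21
true_sigma <- 3
ages <- rnorm(200, mean = true_mu, sd = true_sigma)

mu_hat <- mean(ages)   # estimate of mu
sigma_hat <- sd(ages)  # estimate of sigma (uses the n - 1 denominator)

c(mu_hat = mu_hat, sigma_hat = sigma_hat)
```

In a real analysis only `ages` is observed; `true_mu` and `true_sigma` exist here solely to generate the data.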

45 of 109

☮️ Non-parametric inference

We do not assume that the distribution of ages F has a specific functional form.

We observe X₁, …, X_n

How do you estimate properties of F?
[ For example, its mean or median ]
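Without a functional form, a natural approach is to estimate each property of F by the same property of the sample. The R sketch below does this for the mean, the median, and F itself (through the empirical CDF). The ten ages are hypothetical values chosen for illustration.

```r
# Hypothetical sample of ages; in practice this would be the observed data.
ages <- c(18, 19, 19, 20, 20, 21, 22, 24, 29, 41)

mean_hat <- mean(ages)      # estimate of the mean of F: 23.3
median_hat <- median(ages)  # estimate of the median of F: 20.5

# Empirical CDF: estimated Pr(age <= t) is the fraction of observations <= t
F_hat <- ecdf(ages)
F_hat(21)  # 0.6, since 6 of the 10 ages are at most 21
```

No normality assumption is used anywhere. The single large value (41) pulls the mean well above the median, which a normal model would not anticipate.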

50 of 109

👉 Point estimation

For some quantity of interest [ Estimand ], find the single best guess [ 🎩 Estimator ]

Estimand: What is the slot machine's true average payoff?
Estimator: Find the average payoff of 50 pulls.
(Point) Estimate: What was the observed average of 50 real pulls?

52 of 109

👉 Point estimation

The estimand is fixed, but unobserved.
[ It is not a random variable! ]

The estimator is a random variable: it depends on the [ randomly ] observed data.
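A small simulation makes the distinction concrete. In the R sketch below the estimand is a fixed number (a hypothetical true mean of 10), while the estimator, applied to many fresh samples of 50 pulls, returns a different estimate each time. For simplicity the sketch ignores the rule that negative draws pay $0. The number of replications and all names are assumptions.

```r
# Illustrative sketch: the true mean, replication count, and names are assumptions.
set.seed(131)

true_mean <- 10  # the estimand: fixed, but unknown to the player
n_pulls <- 50

# The estimator: a rule that maps a sample to a number
estimator <- function(payoffs) mean(payoffs)

# Each replication is a fresh random sample, so each produces its own estimate
estimates <- replicate(
  1000,
  estimator(rnorm(n_pulls, mean = true_mean, sd = sqrt(5)))
)

range(estimates)  # the estimates vary from sample to sample
sd(estimates)     # roughly sqrt(5 / 50)
```

In practice we see only one of these 1000 estimates, which is exactly the "one universe" problem discussed later in the lecture.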

57 of 109

🍬 Point estimation with M&Ms
[ 📝 Worksheet ]

M&Ms come in 3 primary colors (🔴 red, 🔵 blue, 🟡 yellow), and 3 non-primary colors (🟠 orange, 🟢 green, 🟤 brown).

At the M&Ms factory, suppose there is a setting for the proportion of M&Ms printed in primary colors.
[ As candy consumers, we do not know this setting. ]

In this scenario, what is the estimand, what could be the estimator, and what's a reasonable estimate?
[ Discuss with neighbor ]

61 of 109

🍬 Point estimation with M&Ms
[ 📝 Worksheet ]

Estimand: the real proportion of primary-colored M&Ms
[ i.e., the fixed but unobserved setting on the machine ]

Estimator: the proportion of primary-colored M&Ms as calculated from a random sample of M&Ms

Point estimate: anything from 0% to 100%

Let's do some estimating!
[ Cue music ]

shorturl.at/qWq4j

64 of 109

🍬 Goal of M&Ms inference

Goal: Can we use a single bag of M&Ms to say anything meaningful about the estimand?

Estimand: the real proportion of primary-colored M&Ms
[ i.e., the fixed but unobserved setting on the machine ]

Crucially, we never observe the real value of the estimand.
[ Otherwise, we wouldn't need inference! ]

65 of 109

🍬 Classwide Pr(primary-colored)

For demonstration only, we will assume the proportion of primary-colored M&Ms across all bags is the estimand.
[ But remember that this is usually hidden from the analyst! ]

Classwide Pr(primary-colored) = 547 / 1256 ≈ 0.436

You will use the classwide statistic on HW2.

68 of 109

🥜 Inference in a nutshell

Under a set of assumptions,

can we say anything about what could have happened,
[ i.e., in parallel universes ]

using only our knowledge of what actually happened?
[ i.e., with just one sample? ]

72 of 109

😢 But, we only observe one universe!

🗳️ The proportion of Maricopa County voters who responded "Yes" when asked by phone whether they will vote for Harris.

📱 The average age of customers surveyed by a product manager at a startup looking to pivot into a new space.

🪳 The total number of cockroaches observed by inspectors in a randomly selected group of public housing units.
[ Unfortunately, a real example. ]

74 of 109

🐣 Step 0: Posit a data-generating model

X_i = 1 if primary-colored

X_i = 0 if not primary-colored

p : the desired proportion of primary-colored M&Ms

Is this model correct?
[ Discuss with neighbors ]

shorturl.at/Hycut
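The model statement itself did not survive on this slide. Given the later description of the X_i as "coin flip outcomes", a natural reading is that X₁, …, X_n are independent Bernoulli(p) draws. Under that assumption, one bag of M&Ms can be simulated as below. The value of p and the bag size are made up for illustration.

```r
# Illustrative sketch: p, the bag size, and the independence assumption are hypothetical.
set.seed(131)

p <- 0.45       # hypothetical factory setting (the estimand)
bag_size <- 55  # hypothetical number of M&Ms in one bag

# X_i = 1 if primary-colored, 0 otherwise: independent coin flips with Pr(1) = p
x <- rbinom(bag_size, size = 1, prob = p)

p_hat <- mean(x)  # proportion of primary-colored M&Ms in this simulated bag
```

Whether real bags behave like independent draws with a common p (for instance, if colors are mixed unevenly at the factory) is exactly what "Is this model correct?" asks.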

75 of 109

"All models are wrong, but some are useful."

George Box, 1976

77 of 109

👉 Point estimation
Primary coloring of M&Ms (i.e., coin flip outcomes)

p : the desired proportion of primary-colored M&Ms

Write a formula to estimate p from X₁, X₂, ..., X_n
[ 📝 Worksheet ]

Take the average!
[ But isn't this a proportion? ]

79 of 109

Average of 0s and 1s → Same as proportion that are 1s

x <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
sum(x == TRUE) / length(x)  # 0.6, but "== TRUE" is redundant
sum(x) / length(x)          # 0.6, since TRUE counts as 1 and FALSE as 0
mean(x)                     # 0.6, the simplest form

In general, you should never write "==TRUE" or "==FALSE" in your code!

1 if X_i = 1, otherwise 0
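In symbols, the same identity can be written as follows. The notation p̂ for the estimator and the indicator 1{X_i = 1} (equal to 1 if X_i = 1, otherwise 0) are assumed here, since the slide's own formula did not survive.

```latex
\hat{p}
  \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = 1\}
  \;=\; \frac{1}{n} \sum_{i=1}^{n} X_i
```

The first expression is the proportion of 1s and the second is the average. They agree because each X_i is itself either 0 or 1.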

80 of 109

👉 Point estimation
Primary coloring of M&Ms (i.e., coin flip outcomes)

We have provided a best guess of the estimand.

How certain are you that this guess is "good"?
[ Next few lectures ]

82 of 109

🏦 Racial Disparities in Tax Audits
Elzayn et al. (2023)

Tax returns → Algorithm → Audited?

84 of 109

🏦 Racial Disparities in Tax Audits
Elzayn et al. (2023)

The IRS conducts a tax audit when it suspects that a taxpayer underreports their income.

Key question: Do auditing practices disparately impact Black taxpayers?

A first step: Are there any racial disparities in audit rates?

85 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

Problem: The IRS does not observe race.

How can you estimate the probability that a taxpayer is Black based on their first and last name and census block group?

86 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

As a first step, how would you express this statement with probability notation?
[ Concept check ]

"The probability that a taxpayer named Felix Thompson who resides in block group ABC is Black"

Option 1: Pr(F = Felix, L = Thompson, G = ABC, Black = 1)
Option 2: Pr(Black = 1 | F = Felix, L = Thompson, G = ABC)
Option 3: Pr(F = Felix, L = Thompson, G = ABC | Black = 1)

shorturl.at/rt5m6

88 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

Next, recall Bayes' rule: Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)

For Pr(Black = 1 | F = Felix, L = Thompson, G = ABC), what is A and what is B? Write out the resulting equality.

89 of 109

Pr(Black=1) is the proportion of taxpayers who are Black.

Pr(F=Felix) is the proportion of taxpayers who are named Felix.

Pr(F=Felix|Black=1) is the proportion of Black taxpayers who are named Felix.

And so on.

97 of 109

Pr(Black=1) is the probability that a randomly selected taxpayer is Black. ✅
[ Estimate with U.S. Census data on race ]

Pr(F=Felix, L=Thompson, G=ABC) is the probability that a randomly selected taxpayer is named "Felix Thompson" and lives in block group ABC. ✅
[ Estimate with taxpayer data containing names + locations ]
[ But, you will see in HW2 that we can ignore this term ]

Pr(F=Felix, L=Thompson, G=ABC | Black = 1) is the probability that a randomly selected Black taxpayer is named "Felix Thompson" and lives in block group ABC. ❌
[ We don't know the race of individual taxpayers ]
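Taking A to be the event {Black = 1} and B to be the event {F = Felix, L = Thompson, G = ABC}, Bayes' rule combines the three terms described above as follows. This is written out here for reference; the slide's own rendering of the equation did not survive.

```latex
\Pr(\text{Black}=1 \mid F=\text{Felix},\, L=\text{Thompson},\, G=\text{ABC})
  \;=\;
  \frac{\Pr(F=\text{Felix},\, L=\text{Thompson},\, G=\text{ABC} \mid \text{Black}=1)\;\Pr(\text{Black}=1)}
       {\Pr(F=\text{Felix},\, L=\text{Thompson},\, G=\text{ABC})}
```

The prior Pr(Black = 1) and the denominator can be estimated from available data. The likelihood in the numerator is the term that cannot be estimated directly.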

101 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

If first name, last name, and location were independent:

Pr(F = Felix, L = Thompson, G = ABC) = Pr(F = Felix) Pr(L = Thompson) Pr(G = ABC)

What would it mean for these three events to be independent? Is this a reasonable assumption?

If independent, knowing someone's last name would tell you nothing about where they live.
[ Definitely not true! ]

102 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

If first name, last name, and location were independent across all first names, last names, and census block groups:

Pr(F, L, G) = Pr(F) Pr(L) Pr(G)

103 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

If first name, last name, and location were independent, conditional on race:

Pr(F, L, G | R) = Pr(F | R) Pr(L | R) Pr(G | R)

This is called conditional independence.

* Independence does not imply conditional independence, and vice versa. But this is beyond the scope of 131A.

107 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

Suppose we assume conditional independence.
[ Strong assumption! ]

Pr(F, L, G | R) = Pr(F | R) Pr(L | R) Pr(G | R)

Pr(F | R) → Mortgage applications with self-reported race

Pr(L | R) → Census data on last names + race

Pr(G | R) → Census data on race by census block group
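Under the conditional independence assumption above, the posterior probability can be computed by multiplying the three conditional probabilities by the prior and normalizing. The R sketch below shows the arithmetic only. Every probability in it is a made-up placeholder, and race is reduced to two groups for simplicity. It is not the authors' implementation.

```r
# All probabilities below are made-up placeholders, not real estimates.
prior <- c(Black = 0.13, NotBlack = 0.87)          # Pr(R)

p_first <- c(Black = 0.0010, NotBlack = 0.0008)    # Pr(F = Felix | R)
p_last  <- c(Black = 0.0060, NotBlack = 0.0020)    # Pr(L = Thompson | R)
p_geo   <- c(Black = 0.00002, NotBlack = 0.00001)  # Pr(G = ABC | R)

# Conditional independence: Pr(F, L, G | R) = Pr(F | R) Pr(L | R) Pr(G | R)
likelihood <- p_first * p_last * p_geo

# Bayes' rule: the posterior is proportional to likelihood times prior.
# Dividing by the sum plays the role of the denominator Pr(F, L, G).
unnormalized <- likelihood * prior
posterior <- unnormalized / sum(unnormalized)

posterior["Black"]
```

Because the normalizing sum is computed from the numerators, Pr(F, L, G) never has to be estimated separately.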

108 of 109

🧮 How do you impute race?
Elzayn et al. (2023)

Bayesian Improved First Name Surname Geocoding (BIFSG)
[ Impute race with first name + last name + location ]

In HW2, you will implement the Naive Bayes algorithm to classify emails as spam or not spam.

109 of 109

🏦 Disparities across all taxpayers
Elzayn et al. (2023)

🚨 Estimated 3–5x higher audit rate for Black taxpayers.

More on this very concerning finding later in the course.