Econometrics: A Brief Overview
Agenda
Running Example
For today, we are all labor economists interested in education policy
Our goal is to estimate the causal relationship between class size and student achievement*
To illustrate different methods, we will play with some of the details of the hypothetical experiment, but the following will stay the same
* This is a common framing based on the Tennessee STAR experiment
Part 1: Background
Distributions
Data Generating Process
Populations and Samples
We would like to know the average test score in the population...
But it is too expensive to track all students, so we take a sample
How should we estimate the population mean from the sample?
What makes one estimator better than another?
The sample mean is...
Law of Large Numbers
As the sample size increases, the sample mean converges to the population mean
In other words, the sample mean is a consistent estimator of the population mean
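A quick way to see the law of large numbers is to simulate it. The sketch below uses made-up population values (mean 75, matching the null hypothesis on a later slide, and a standard deviation of 10); numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sd = 75.0, 10.0  # illustrative population parameters

# The sample mean wanders less and less as the sample grows
for n in [10, 100, 10_000]:
    sample = rng.normal(true_mean, sd, size=n)
    print(f"n={n:>6}: sample mean = {sample.mean():.2f}")
```

Rerunning with a different seed changes each draw but not the pattern: the n=10,000 mean sits much closer to 75 than the n=10 mean, which is all consistency claims.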
Null Hypothesis Significance Testing (NHST)
Null Hypothesis -- The hypothesis to be tested, e.g. the average test score in the population is 75
Test Statistic -- A statistic (any function of the sample) used to assess a hypothesis.
p-value -- The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true
Power -- The probability that the test correctly rejects the null hypothesis when the alternative is true
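The four definitions above fit together in a few lines of simulation. The null mean of 75 comes from the slide; the true mean of 77, the known sd of 10, and the sample size of 100 are assumptions made for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
mu0, sd, n = 75.0, 10.0, 100            # null value from the slide; sd, n assumed
sample = rng.normal(77.0, sd, size=n)   # the null is false in this simulation

# Test statistic: a z statistic (sd treated as known for simplicity)
z = (sample.mean() - mu0) / (sd / math.sqrt(n))
# p-value: probability of a statistic at least this extreme under the null
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Power: how often the test rejects at the 5% level when the alternative is true
reps = 2_000
rejects = 0
for _ in range(reps):
    s = rng.normal(77.0, sd, size=n)
    z_rep = (s.mean() - mu0) / (sd / math.sqrt(n))
    rejects += abs(z_rep) > 1.96
power = rejects / reps
```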
NHST (continued)
References
Causal Inference
Correlation does not equal causation… so how can we ever make a causal claim?
What if we could observe the same student in both a small class and a large class?
Potential Outcomes model of causality (the Rubin causal model; Rubin, 1974)
Obviously we cannot observe an entity in both conditions...
But what if the two groups are different?
Identification Strategy -- The combination of subject matter expertise, data generating process, and statistical methods used to justify a causal claim
What makes a study invalid?
Avoiding these pitfalls is key to having a valid identification strategy
Internal Validity -- A study is internally valid if its estimates are unbiased and its inferences sound; anything that can bias estimates or invalidate inferences is a violation
External Validity
What makes an experimental study invalid?
We said that randomization solves the selection problem, but....
References
Part 2: Core Methods
Regression
Assumptions
Given these strong assumptions, why OLS?
Under these assumptions, OLS is consistent, unbiased, and efficient: by the Gauss-Markov theorem it is the best (minimum-variance) linear unbiased estimator
T-test to OLS
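The link in the title can be verified numerically: regressing scores on a small-class dummy gives a slope identical to the difference in group means that a two-sample t-test compares. The DGP below is made up (true effect of 5 points).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
small = rng.integers(0, 2, size=n)               # 1 = assigned to a small class
score = 70 + 5.0 * small + rng.normal(0, 10, n)  # made-up DGP, true effect 5

# OLS of score on a constant and the dummy
X = np.column_stack([np.ones(n), small])
beta = np.linalg.lstsq(X, score, rcond=None)[0]

# The slope equals the difference in means the t-test examines
diff_in_means = score[small == 1].mean() - score[small == 0].mean()
```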
What about parental support?
Consistency and Bias
Omitted Variable Bias
|         | support | income |
|---------|---------|--------|
| support | 1.0     | 0.8    |
| income  | 0.8     | 1.0    |

Correlation Matrix
Data Generating Process
Omitted Variable Bias (cont)
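The bias can be computed directly from a simulation. In this made-up DGP, support and income are correlated at 0.8 as in the correlation matrix, and the assumed coefficients (2 on support, 3 on income) let us check the short regression against the omitted variable bias formula: true slope plus (omitted coefficient times the auxiliary slope of income on support).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
# support and income correlated at 0.8, as in the correlation matrix
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
support, income = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
score = 75 + 2.0 * support + 3.0 * income + rng.normal(0, 5, n)  # assumed coefficients

def ols(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_long = ols(score, support, income)[1]  # support coefficient, income included
b_short = ols(score, support)[1]         # support coefficient, income omitted
# OVB formula predicts: b_short ≈ 2.0 + 3.0 * 0.8 = 4.4
```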
Part I Recap
More detail on our approach...
Problem/Question
What is the problem you are trying to solve? What is the causal relationship that you want to understand?
Data/Method
What data would allow you to answer the question? How was the data collected? Observational? Experimental? What methods can be used? What is our model for the process?
Mother Nature
What process is responsible for producing observations in the data set? Mother nature as a data factory stamping out observations? What levers exist?
Why simulation?
Causal inference requires the analyst to have an explicit model of how the world works
The assumed model can (and will) differ from the true model
By controlling the data generating process, we know the true parameters, so simulation lets us check whether our assumed model and methods recover them
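A data factory for the running example might look like the sketch below. The function name, the randomized assignment, and every coefficient are invented, which is exactly the point: we choose the levers, so we can check whether an estimator recovers them.

```python
import numpy as np

def simulate_classes(n, effect=5.0, seed=0):
    """Hypothetical 'mother nature': stamps out (class size, score)
    observations with a true small-class effect that we chose."""
    rng = np.random.default_rng(seed)
    small = rng.integers(0, 2, size=n)   # randomized assignment
    ability = rng.normal(0, 1, n)        # unobserved student ability
    score = 70 + effect * small + 3 * ability + rng.normal(0, 5, n)
    return small, score

small, score = simulate_classes(1_000)
# With random assignment, a simple difference in means recovers `effect`
naive = score[small == 1].mean() - score[small == 0].mean()
```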
Instrumental Variables
IV is used when our assumed model is wrong in some systematic way:
We introduce an instrument into the model that satisfies two criteria:
The “Classic” Example
Fulton Fish Market
Can we make the case that stormy weather is a valid instrument?
Estimation: Two-Stage Least Squares
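The two stages can be written out explicitly. The DGP is invented: an unobserved confounder u moves both treatment x and outcome y, while the instrument z shifts x and satisfies the exclusion restriction by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
u = rng.normal(0, 1, n)                       # unobserved confounder
z = rng.normal(0, 1, n)                       # instrument: relevant and exogenous
x = 0.8 * z + 0.6 * u + rng.normal(0, 1, n)   # endogenous treatment
y = 1.0 + 2.0 * x + 1.5 * u + rng.normal(0, 1, n)  # true effect of x is 2.0

def ols(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress the treatment on the instrument, keep fitted values
a0, a1 = ols(x, z)
x_hat = a0 + a1 * z
# Stage 2: regress the outcome on the fitted values
beta_2sls = ols(y, x_hat)[1]
beta_ols = ols(y, x)[1]  # plain OLS is biased upward by the confounder
```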
Why endogeneity matters
|            | smallClass | olo |
|------------|------------|-----|
| smallClass | 1.0        | 0.6 |
| olo        | 0.6        | 1.0 |

Correlation Matrix
Data Generating Process
Model
Exogenous, but Irrelevant
|            | smallClass | olo |
|------------|------------|-----|
| smallClass | 1.0        | 0.6 |
| olo        | 0.6        | 1.0 |

Correlation Matrix
Data Generating Process
First Stage
Second Stage
Instrument -- Parent drives a white car
Relevant, but Endogenous
|            | smallClass | olo | enroll |
|------------|------------|-----|--------|
| smallClass | 1.0        | 0.6 | 0.4    |
| olo        | 0.6        | 1.0 | 0.6    |
| enroll     | 0.4        | 0.6 | 1.0    |

Correlation Matrix
Data Generating Process
First Stage
Second Stage
Valid, but Weak Instrument
|            | olo | treatment | instrument |
|------------|-----|-----------|------------|
| olo        | 1.0 | 0.6       | 0.1        |
| treatment  | 0.6 | 1.0       | 0.0        |
| instrument | 0.1 | 0.0       | 1.0        |

Correlation Matrix
Data Generating Process
First Stage
Second Stage
Valid Instrument
|            | olo | treatment | instrument |
|------------|-----|-----------|------------|
| olo        | 1.0 | 0.6       | 0.6        |
| treatment  | 0.6 | 1.0       | 0.0        |
| instrument | 0.6 | 0.0       | 1.0        |

Correlation Matrix
Data Generating Process
First Stage
Second Stage
Instrument -- Percent change in enrollment
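The contrast between the weak and valid instrument slides shows up in the sampling distribution of the estimator. This sketch (made-up DGP, true effect 2.0) compares a strong first stage with a very weak one; both instruments are exogenous by construction.

```python
import numpy as np

rng = np.random.default_rng(5)

def iv_estimates(strength, n=2_000, reps=200):
    """2SLS estimates across simulated samples; `strength` is the
    first-stage coefficient (illustrative DGP, true effect 2.0)."""
    est = []
    for _ in range(reps):
        u = rng.normal(0, 1, n)
        z = rng.normal(0, 1, n)
        x = strength * z + 0.6 * u + rng.normal(0, 1, n)
        y = 2.0 * x + 1.5 * u + rng.normal(0, 1, n)
        # With one instrument, 2SLS reduces to a ratio of covariances
        est.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])
    return np.array(est)

strong = iv_estimates(0.8)
weak = iv_estimates(0.05)
# Both are valid, but the weak-instrument estimates are far more dispersed
```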
Regression Discontinuity
Used when treatment depends on crossing some threshold
Often used with observational data
Two Types
Sharp RDD
Add an indicator for the threshold that determines treatment and regress
Data Generating Process
Model
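A sketch of the sharp design: treatment is a deterministic function of a running variable crossing a cutoff, and the coefficient on the threshold indicator estimates the jump. The running variable (enrollment), the cutoff, and the jump of 4 points are made up; a real analysis would also restrict to a window around the cutoff and worry about functional form.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000
enroll = rng.uniform(0, 100, n)             # running variable (illustrative)
cutoff = 50.0
treated = (enroll >= cutoff).astype(float)  # sharp: crossing fully determines treatment
score = 60 + 0.2 * enroll + 4.0 * treated + rng.normal(0, 3, n)  # true jump: 4.0

# Regress on the running variable plus the threshold indicator
X = np.column_stack([np.ones(n), enroll, treated])
jump = np.linalg.lstsq(X, score, rcond=None)[0][2]
```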
Fuzzy RDD
Use the threshold as an instrument for treatment and estimate with two stage least squares
|       | big | split |
|-------|-----|-------|
| big   | 1.0 | 0.8   |
| split | 0.8 | 1.0   |

Correlation Matrix
Data Generating Process
First Stage
Second Stage
Difference-in-Differences
Used when treatment and control groups differ in ways unrelated to the treatment, provided those differences are stable over time (parallel trends)
Diff-in-Diff Estimation
Data Generating Process
Method 1
Method 2
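The two methods named above, differencing the four group means and running OLS with an interaction term, give identical answers in this saturated case. The DGP is made up: treated schools start 3 points ahead (a group difference unrelated to treatment), everyone gains 1 point over time, and the true treatment effect is 5.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
treat = rng.integers(0, 2, size=n)  # treated group indicator
post = rng.integers(0, 2, size=n)   # post-treatment period indicator
score = 70 + 3.0 * treat + 1.0 * post + 5.0 * treat * post + rng.normal(0, 4, n)

# Method 1: difference of differences in cell means
m = {(t, p): score[(treat == t) & (post == p)].mean()
     for t in (0, 1) for p in (0, 1)}
did_means = (m[1, 1] - m[1, 0]) - (m[0, 1] - m[0, 0])

# Method 2: OLS with an interaction term
X = np.column_stack([np.ones(n), treat, post, treat * post])
did_ols = np.linalg.lstsq(X, score, rcond=None)[0][3]
```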
References for Core Methods
Matching
Another approach to solving the selection bias problem
Definition
Any method that aims to balance the distribution of covariates between two (or more) groups
Objective -- Approximate an RCT with observational data
Brief History
Advantages
Assumptions
4 Key Steps
Example
Goal -- Use observational data to assess the impact of class sizes on test scores
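One concrete version of the example: wealthier districts are more likely to get small classes, so the raw comparison is confounded; matching each treated student to the control student with the closest income largely removes the imbalance. The DGP and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000
income = rng.normal(0, 1, n)
small = (income + rng.normal(0, 1, n) > 0).astype(int)         # selection on income
score = 70 + 5.0 * small + 4.0 * income + rng.normal(0, 3, n)  # true effect: 5.0

naive = score[small == 1].mean() - score[small == 0].mean()    # confounded

# Nearest-neighbor matching (with replacement) on the single covariate
t_idx = np.where(small == 1)[0]
c_idx = np.where(small == 0)[0]
nearest = np.abs(income[t_idx][:, None] - income[c_idx][None, :]).argmin(axis=1)
att = (score[t_idx] - score[c_idx[nearest]]).mean()  # effect on the treated
```

With many covariates, matching on raw distance breaks down, which is where propensity scores come in.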
References for Matching
Part 3: ML and Causal Inference
The late 00’s to the present have seen a wave of methods that apply machine learning to causal inference
Key Developments*
* The following list comes from slides created by Skipper Seabold
ML and Matching
Traditionally, propensity scores have been estimated using a logit or probit model
Why not use some other SL/ML method that can output probabilities?
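The propensity score is just a fitted probability, so any probabilistic classifier can fill the slot. The sketch below fits a logit by plain gradient ascent as a stand-in for whatever SL/ML method you prefer; the covariates, coefficients, and names are made up.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2_000
income = rng.normal(0, 1, n)
support = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), income, support])
# Made-up treatment model: assignment probability depends on both covariates
p_true = 1 / (1 + np.exp(-(0.5 * income + 0.8 * support)))
treat = (rng.uniform(0, 1, n) < p_true).astype(float)

# Logit fit by gradient ascent on the log-likelihood; any classifier that
# outputs calibrated probabilities could replace this step
w = np.zeros(3)
for _ in range(2_000):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (treat - p) / n
propensity = 1 / (1 + np.exp(-X @ w))
```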
Double ML -- Fishing Bans and Coral Health
Goal -- Estimate the effect of a fishing ban on coral reef health
Variables -- treatment, fish biomass, coral health variables (size, height, % sand, % hard coral)
Fishing ban non-randomly assigned
Intervention Objective -- Increase fish population in the short run, improve coral health in the long run
Procedure
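The core of the procedure is partialling out: predict the treatment and the outcome from the confounders, then regress residuals on residuals. This sketch uses plain OLS as the nuisance learner where double ML would use flexible ML models plus cross-fitting; the reef-flavored variable names and all coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4_000
confound = rng.normal(0, 1, n)              # e.g. distance to shore (made up)
ban = 0.7 * confound + rng.normal(0, 1, n)  # non-randomly assigned treatment
coral = 2.0 * ban + 1.5 * confound + rng.normal(0, 1, n)  # true effect: 2.0

def fit_predict(x, y):
    """Nuisance learner: OLS here; double ML swaps in any ML model and
    uses cross-fitting so each prediction comes from held-out data."""
    X = np.column_stack([np.ones(len(x)), x])
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Residual-on-residual regression (partialling out)
ban_res = ban - fit_predict(confound, ban)
coral_res = coral - fit_predict(confound, coral)
theta = (ban_res @ coral_res) / (ban_res @ ban_res)
```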
References for ML and Causal Inference
Future Topics
Extra
Ics and Ings
Statistics, econometrics, statistical learning, and machine learning… what is the difference?
Models and Algorithms
A model is a statement about the data generating process, i.e. how the world works
An algorithm is a way to compute something
When doing econometrics, the objective is to study causal relationships, so we are in the land of models
Types of Data
We will be working with simulated data of each type over the next hour