1 of 52

Econometrics: A Brief Overview

2 of 52

Agenda

  • Background
  • Core Methods
  • Machine Learning and Causal Inference

3 of 52

Running Example

For today, we are all labor economists interested in education policy

Our goal is to estimate the causal relationship between class size and student achievement*

To illustrate different methods, we will play with some of the details of the hypothetical experiment, but the following will stay the same

  • We will be working with simulated data
  • Students are the main unit of observation
  • Test scores are the primary outcome measure

* This is a common framing based on the Tennessee STAR experiment

4 of 52

Part 1: Background

5 of 52

Distributions

Data Generating Process

6 of 52

Populations and Samples

We would like to know the average test score in the population...

But… it is too expensive to track all students, so we take a sample

How should we estimate the population mean from the sample?

  • The ith value from the sample
  • Mode
  • Mean

What makes one estimator better than another?

  • Consistent
  • Unbiased
  • Efficient

The sample mean is...

  • A random variable
  • The Best Linear Unbiased Estimator of the population mean

7 of 52

Law of Large Numbers

As the sample size increases, the sample mean converges (in probability) to the population mean

In other words, the sample mean is a consistent estimator
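
A minimal simulation sketch, assuming test scores drawn from a hypothetical Normal(75, 10) population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: test scores ~ Normal(75, 10)
for n in [10, 100, 10_000, 1_000_000]:
    sample = rng.normal(loc=75, scale=10, size=n)
    print(f"n = {n:>9,}: sample mean = {sample.mean():.3f}")
# The sample mean settles ever closer to 75 as n grows (consistency)
```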

8 of 52

NHST

Null Hypothesis -- The hypothesis to be tested, e.g. the average test score in the population is 75

Test Statistic -- A statistic (any function of the sample) used to assess a hypothesis.

p-value -- The probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true

Power -- The probability that the test correctly rejects the null hypothesis when the alternative is true
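
As a sketch of the mechanics, a one-sample t-test of the null above (population mean of 75) on simulated scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
scores = rng.normal(loc=77, scale=10, size=200)  # true mean is actually 77

# Test H0: population mean = 75 against the two-sided alternative
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value rejects H0
```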

9 of 52

NHST (continued)

10 of 52

References

11 of 52

Causal Inference

Correlation does not equal causation… so how can we ever make a causal claim?

What if we could observe the same student in both a small class and a large class?

Potential Outcomes model of causality (Rubin causal model, 1974)

  • The treatment effect is the difference in test scores between the two scenarios

Obviously we cannot observe an entity in both conditions...

  • Causal inference is fundamentally a missing data problem
  • Instead, we consider the average treatment effect across treated and untreated groups

But what if the two groups are different?

  • This is selection bias, and overcoming it is the fundamental goal of econometrics
  • Randomization solves the selection problem
  • Other methods can be used to make causal claims with quasi-experimental and observational data

Identification Strategy -- The combination of subject matter expertise, data generating process, and statistical methods used to justify a causal claim

12 of 52

What makes a study invalid?

Avoiding these pitfalls is key to having a valid identification strategy

Internal Validity -- Anything that can lead to biased estimates or invalid inferences is a violation

  • Omitted variables
  • Misspecified functional form
  • Measurement Error
  • Selection bias
  • Simultaneous causality
  • Incorrect standard errors

External Validity

  • Non-representative sample
  • Non-representative program or policy

13 of 52

What makes an experimental study invalid?

We said that randomization solves the selection problem, but....

  • Failure to randomize
  • Failure to follow treatment protocol
  • Attrition
  • Experimental Effects
  • Small sample sizes

14 of 52

References

15 of 52

Part 2: Core Methods

16 of 52

Regression

Assumptions

  • The error term has conditional mean zero (predictions are good, on average)
  • Observations (treatment-outcome pairs) are independent and identically distributed across units
  • Large outliers are unlikely

Given these strong assumptions, why OLS?

  • BLUE
  • MVUE

Under these assumptions, OLS provides consistent, unbiased, precise, and efficient estimates relative to other linear estimators

17 of 52

T-test to OLS
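
A minimal sketch of the equivalence on simulated data: a two-sample t-test and OLS of the score on a treatment dummy give the same t statistic.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
small = rng.normal(78, 10, 200)  # scores in small classes
large = rng.normal(75, 10, 200)  # scores in large classes

# Two-sample t-test (equal variances)...
t_stat, p = stats.ttest_ind(small, large)

# ...matches OLS of the score on a small-class dummy
y = np.concatenate([small, large])
d = np.concatenate([np.ones(200), np.zeros(200)])
fit = sm.OLS(y, sm.add_constant(d)).fit()
print(t_stat, fit.tvalues[1])  # identical t statistics
```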

18 of 52

What about parental support?

19 of 52

Consistency and Bias

20 of 52

Omitted Variable Bias

Correlation Matrix

              support   income
  support       1.0       0.8
  income        0.8       1.0

Data Generating Process
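
A sketch of the bias under an assumed DGP in which support and income are correlated at 0.8 (as in the matrix above) and both raise scores:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
cov = [[1.0, 0.8], [0.8, 1.0]]  # correlation structure from the slide
support, income = rng.multivariate_normal([0, 0], cov, size=n).T
score = 70 + 2 * support + 3 * income + rng.normal(0, 5, size=n)

short = sm.OLS(score, sm.add_constant(support)).fit()
both = sm.OLS(score, sm.add_constant(np.column_stack([support, income]))).fit()
print(short.params[1])  # ~4.4: 2 plus the omitted income effect (3 * 0.8)
print(both.params[1])   # ~2.0: unbiased once income is included
```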

21 of 52

Omitted Variable Bias (cont)

22 of 52

Part I Recap

  • Sprinted through STAT-100
  • Defined causal inference through the Potential Outcomes framework
    • Goal of any method is to solve the selection bias problem
    • We desire unbiased, consistent, and low variance estimates for the causal effect under study
  • Talked about what makes a data analysis invalid
  • Explored single and multiple regression
    • Omitting important variables can lead to biased estimates

23 of 52

More detail on our approach...

Problem/Question

What is the problem you are trying to solve? What is the causal relationship that you want to understand?

Data/Method

What data would allow you to answer the question? How was the data collected? Observational? Experimental? What methods can be used? What is our model for the process?

Mother Nature

What process is responsible for producing observations in the data set? Mother nature as a data factory stamping out observations? What levers exist?

24 of 52

Why simulation?

Causal inference requires the analyst to have an explicit model of how the world works

The assumed model can (and will) differ from the true model

By controlling the data generating process, simulation allows us to:

  • Explore the effects of divergence between the assumed model and the true model
  • Better understand our methods
  • Develop an intuition for what could be driving odd results in real data

25 of 52

Instrumental Variables

IV is used when our assumed model is wrong in some systematic way:

  • Omitted variable
  • Simultaneous causality
  • Measurement Error
  • Selection bias (non-random treatment assignment)

We introduce an instrument into the model that satisfies two criteria:

  • Exogeneity -- The instrument is uncorrelated with the error term: it affects the outcome only through the endogenous variable
  • Relevance -- The instrument is correlated with the endogenous variable (the stronger, the better)

26 of 52

The “Classic” Example

Fulton Fish Market

  • How does changing price affect the demand for fish?
    • We observe quantity sold, which depends on both the supply of fish and the demand for fish
  • To understand the demand side, we look for determinants of supply that do not affect demand
    • Stormy days make it harder to catch fish, which reduces supply the next day
  • Therefore, use stormy weather as an instrument for price: it shifts supply without shifting demand, so the induced price variation traces out the demand curve

Can we make the case that stormy weather is a valid instrument?

27 of 52

Estimation: Two-Stage Least Squares

  1. Regress the endogenous (usually treatment) variable on the instrument
  2. Calculate the in-sample predicted values for treatment
  3. Regress the outcome variable on the predicted values from step (2)
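
A manual sketch of the three steps on simulated data (true effect of 2, with an unobserved confounder). In practice, use a packaged IV estimator: the second-stage standard errors below are not corrected for the first stage.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5_000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
d = 0.6 * z + 0.6 * u + rng.normal(size=n)  # endogenous treatment
y = 2.0 * d + 2.0 * u + rng.normal(size=n)  # outcome; true effect = 2

first = sm.OLS(d, sm.add_constant(z)).fit()       # (1) treatment on instrument
d_hat = first.fittedvalues                        # (2) predicted treatment
second = sm.OLS(y, sm.add_constant(d_hat)).fit()  # (3) outcome on predictions

print(sm.OLS(y, sm.add_constant(d)).fit().params[1])  # naive OLS: biased up
print(second.params[1])                               # 2SLS: ~2
```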

28 of 52

Why endogeneity matters

Correlation Matrix

                smallClass   olo
  smallClass       1.0       0.6
  olo              0.6       1.0

Data Generating Process

Model

29 of 52

Exogenous, but Irrelevant

Correlation Matrix

                smallClass   olo
  smallClass       1.0       0.6
  olo              0.6       1.0

Data Generating Process

First Stage

Second Stage

Instrument -- Parent drives a white car

30 of 52

Relevant, but Endogenous

Correlation Matrix

                smallClass   olo   enroll
  smallClass       1.0       0.6     0.4
  olo              0.6       1.0     0.6
  enroll           0.4       0.6     1.0

Data Generating Process

First Stage

Second Stage

31 of 52

Valid, but Weak Instrument

Correlation Matrix

                olo   treatment   instrument
  olo           1.0      0.6         0.0
  treatment     0.6      1.0         0.1
  instrument    0.0      0.1         1.0

Data Generating Process

First Stage

Second Stage

32 of 52

Valid Instrument

Correlation Matrix

                olo   treatment   instrument
  olo           1.0      0.6         0.0
  treatment     0.6      1.0         0.6
  instrument    0.0      0.6         1.0

Data Generating Process

First Stage

Second Stage

Instrument -- Percent change in enrollment

33 of 52

Regression Discontinuity

Used when treatment depends on crossing some threshold

Often used with observational data

Two Types

  • Sharp -- Crossing the threshold guarantees treatment
  • Fuzzy -- Crossing the threshold increases the probability of treatment

34 of 52

Sharp RDD

Add an indicator for the threshold that determines treatment and regress

Data Generating Process

Model
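
A minimal sketch, assuming a hypothetical rule that classes are small when enrollment falls below a cutoff of 30:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2_000
enroll = rng.uniform(20, 40, size=n)  # running variable
small = (enroll < 30).astype(float)   # sharp rule: treated below the cutoff
score = 60 + 0.5 * enroll + 5 * small + rng.normal(0, 3, size=n)

# Regress the outcome on the treatment indicator and the running variable
X = sm.add_constant(np.column_stack([small, enroll - 30]))
fit = sm.OLS(score, X).fit()
print(fit.params[1])  # ~5: the jump in scores at the cutoff
```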

35 of 52

Fuzzy RDD

Use the threshold as an instrument for treatment and estimate with two stage least squares

Correlation Matrix

           big   split
  big      1.0    0.8
  split    0.8    1.0

Data Generating Process

First Stage

Second Stage
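
A minimal sketch, assuming crossing the cutoff only raises the probability of a small class, so the threshold indicator serves as the instrument in two-stage least squares:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5_000
enroll = rng.uniform(20, 40, size=n)
above = (enroll > 30).astype(float)  # crossing the threshold...
small = (above + rng.normal(0, 0.5, size=n) > 0.5).astype(float)  # ...raises P(treatment)
score = 65 + 4 * small + rng.normal(0, 3, size=n)  # true effect = 4

# First stage: treatment on the threshold indicator
d_hat = sm.OLS(small, sm.add_constant(above)).fit().fittedvalues
# Second stage: outcome on predicted treatment
print(sm.OLS(score, sm.add_constant(d_hat)).fit().params[1])  # ~4
```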

36 of 52

Difference in Difference

Used when there are pre-existing differences between the treatment and control groups unrelated to the treatment

  • Baseline data required
  • The ATE estimate with only time-2 data is P2 - S2
  • The diff-in-diff estimate is (P2 - S2) - (P1 - S1)
  • Requires the “parallel trends” assumption

37 of 52

Diff-in-Diff Estimation

Data Generating Process

Method 1

Method 2
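
A sketch of two equivalent estimates on simulated two-period data (assumed DGP: group gap of 3, common trend of 2, treatment effect of 5):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1_000
df = pd.DataFrame({"treated": np.repeat([0, 1], n),
                   "period": np.tile([0, 1], n)})
df["score"] = (70 + 3 * df.treated + 2 * df.period
               + 5 * df.treated * df.period + rng.normal(0, 4, 2 * n))

# Method 1: difference of the four group means
m = df.groupby(["treated", "period"])["score"].mean()
print((m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)]))  # ~5

# Method 2: OLS with a group-by-period interaction
fit = smf.ols("score ~ treated * period", data=df).fit()
print(fit.params["treated:period"])  # ~5
```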

38 of 52

References for Core Methods

39 of 52

Matching

Another approach to solving the selection bias problem

Definition

Any method that aims to balance the distribution of covariates between two (or more) groups

Objective -- Approximate an RCT with observational data

Brief History

  • Initial work in the 1940s
  • Theoretical work began in the 1970s
  • Canonical work on propensity scores in 1983

Advantages

  • Can be used to complement other methods (OLS, IV, Diff-in-Diff)
  • Explicitly highlights insufficient overlap between groups
  • Straightforward diagnostics

Assumptions

  • SUTVA
  • Unconfoundedness

40 of 52

4 Key Steps

  1. Define a measure of closeness
     • Determine what covariates to include -- Goal is to satisfy the unconfoundedness assumption
     • Select a distance measure
       • Exact matching
       • Mahalanobis -- distance from the distribution
       • Propensity Score -- models the probability of treatment
       • Linear Propensity Score
       • Prognosis Score -- models the outcome of each individual under the control condition
  2. Implement a matching method that uses (1)
     • Nearest Neighbor
     • Subclassification
     • Full Matching
     • Weighting
  3. Assess the quality of the matched sample
     • Standardized differences
     • Variance ratios
     • QQ plots, histograms, box plots, and plots of standardized differences
  4. Analyze the outcome and estimate the treatment effect
     • Nearest Neighbor -- Proceed as if the matched sample is the result of simple random sampling; estimate the ATE via a model
     • Subclassification, Full Matching, and Weighting -- Estimate effects within each subclass

41 of 52

Example

Goal -- Use observational data to assess the impact of class sizes on test scores

  • Select age, sex, parental income, and teacher performance ratings as matching criteria
  • Calculate propensity scores, i.e. each student's probability of being in a small class given the covariates
  • Match each student in a small class to their closest match in a large class
  • Compute the standardized difference in means along each covariate
  • Estimate the treatment effect using the matched sample
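
A sketch of the example with hypothetical simulated covariates; the logit and nearest-neighbor choices follow the steps listed above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(8)
n = 2_000
X = rng.normal(size=(n, 4))  # age, sex, parental income, teacher rating
p = 1 / (1 + np.exp(-(X @ np.array([0.5, 0.2, 0.8, 0.3]))))
small = rng.binomial(1, p)   # non-random assignment to small classes
score = 70 + X @ np.array([1.0, 0.5, 2.0, 1.5]) + 4 * small + rng.normal(0, 3, size=n)

# Propensity scores from a logit model
ps = LogisticRegression().fit(X, small).predict_proba(X)[:, 1]

# Match each small-class student to the nearest large-class student
nn = NearestNeighbors(n_neighbors=1).fit(ps[small == 0].reshape(-1, 1))
_, idx = nn.kneighbors(ps[small == 1].reshape(-1, 1))

# Effect estimate on the matched sample (the naive difference is biased up)
print(score[small == 1].mean() - score[small == 0][idx.ravel()].mean())  # ~4
```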

42 of 52

References for Matching

43 of 52

Part 3: ML and Causal Inference

The late 2000s to the present have seen a smattering of methods aimed at applying machine learning to causal inference

Key Developments*

  • Bayesian Additive Regression Trees
  • Post-selection Inference
  • SuperLearner
  • Interpretable Modelling
  • Causal Trees and Forests
  • G-Estimation
  • Double ML

* This list comes from slides created by Skipper Seabold

44 of 52

ML and Matching

Traditionally, propensity scores have been estimated using a logit or probit model

Why not use some other SL/ML method that can output probabilities?

  • CART
  • Random Forest
  • Gradient Boosting Machines
  • SVM
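
A sketch of the swap: any classifier exposing predicted probabilities can stand in for the logit (here a gradient boosting machine from scikit-learn, on hypothetical data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(1_000, 4))  # covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# Fit the classifier and read off P(treatment | X) as the propensity score
gbm = GradientBoostingClassifier().fit(X, treat)
ps = gbm.predict_proba(X)[:, 1]
print(ps[:5])
```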

45 of 52

Double ML -- Fishing Bans and Coral Health

Source Code

Goal -- Estimate the effect of a fishing ban on coral reef health

Variables -- treatment, fish biomass, coral health variables (size, height, % sand, % hard coral)

Fishing ban non-randomly assigned

Intervention Objective -- Increase fish population in the short run, improve coral health in the long run

46 of 52

Procedure

  1. Split the available data into two disjoint sets
  2. Use the first split to estimate the relationship between the fishing ban and the 5 predictors
  3. Compute the residualized propensity scores on the second split
  4. Use the first split to estimate the relationship between biomass and the 4 predictors
  5. Compute residuals on the second split
  6. Reverse the roles of the first and second splits and repeat steps 2-5
  7. Stack the residuals from step 3 and, separately, from step 5; estimate the causal effect
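
A sketch of the procedure under an assumed DGP (random forests as the ML learners, true effect of 3; variable names are illustrative, not from the study). The two folds of KFold implement the role reversal in step 6:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(10)
n = 2_000
X = rng.normal(size=(n, 4))  # coral covariates (assumed)
ban = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] - X[:, 1])))
biomass = 3 * ban + X[:, 0] + 2 * X[:, 2] + rng.normal(0, 1, size=n)

d_res, y_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Steps 2-3: model the ban on the predictors, residualize out of fold
    m_d = RandomForestClassifier().fit(X[train], ban[train])
    d_res[test] = ban[test] - m_d.predict_proba(X[test])[:, 1]
    # Steps 4-5: model biomass on the predictors, residualize out of fold
    m_y = RandomForestRegressor().fit(X[train], biomass[train])
    y_res[test] = biomass[test] - m_y.predict(X[test])

# Step 7: regress stacked outcome residuals on treatment residuals
print(sm.OLS(y_res, d_res).fit().params[0])  # ~3
```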

47 of 52

References for ML and Causal Inference

48 of 52

Future Topics

  • Fixed and Random effects
  • Multilevel/Hierarchical models
  • Synthetic Controls
  • Time series methods
  • Explore other ML methods for causal inference

49 of 52

Extra

50 of 52

Ics and Ings

Statistics, econometrics, statistical learning, and machine learning… what is the difference?

  • All are built on top of probability theory and linear algebra
  • Statistics
    • High purity (emphasis on validity)
    • Objective is usually valid inference
    • Models or algorithms
  • Econometrics
    • Medium purity
    • Heavily focused on causality. Willing to stomach some strong assumptions to get it
    • Model-based
  • Statistical Learning
    • Medium purity
    • Heavily focused on explanation and prediction
    • Algorithms
  • Machine learning
    • Pure practicality
    • Focused on prediction to the exclusion (sometimes) of explanation
    • Algorithms

51 of 52

Models and Algorithms

A model is a statement about the data generating process, i.e. how the world works

An algorithm is a way to compute something

When doing econometrics, the objective is to study causal relationships, so we are in the land of models

52 of 52

Types of Data

  • Experimental
  • Quasi-Experimental
  • Observational

We will be working with simulated data of each type over the next hour