1 of 25

Data 102: Lecture 7

Identification Conditions for Regression

Jacob Steinhardt

UC Berkeley, Spring 2020

2 of 25

Announcements

HW1 due today

HW2 released tonight, due 02/25 (2 weeks)

3 of 25

So far...

Decision theory (Lectures 1-5)

Characterizing sources of error: data collection, modeling

  • Data: unrepresentative data, observational data
  • Modeling: poor model fit (model mis-specification / bias)

Also discussed: controls

[Pipeline diagram: World, Train Data, Model, Test Data, Predictions]

4 of 25

This Time

Identification Theorems

“Under what conditions does our estimator output the correct model parameters?”

  • Important if we want to interpret the parameters themselves
  • Also provides insight into what our estimator is doing

Gauss-Markov theorem (linear regression)

Moment-matching conditions (logistic regression)

5 of 25

But first...

Decision theory in action:

  • You have $1 million to spend on global health interventions (i.e., reducing disease in developing countries). How do you decide what to spend it on?
  • “Disability-adjusted life year” (DALY)

6 of 25

7 of 25

Brainstorming

  • Can you think of issues with using this as a metric for making decisions?

  • How would you go about determining the weights for DALYs?

8 of 25

Identification Conditions

9 of 25

Motivating Question: Robot Dynamics

You are designing a robot. It has actuators (moving the robot), and sensors (knowing where the robot is).

You know Newton’s laws (F = ma), and in an ideal case where all parts act as intended, the force F is a linear function of the actuator inputs and robot position.

Both the sensors and actuators may be noisy.

If we run the robot and fit a linear model to our data, will the noise cause our model parameters to be wrong? Does it matter what type of noise?

10 of 25

Review: Ordinary Least Squares

Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i)∈Rd and y(i)∈R

Ordinary least squares (OLS): minimize ∑i (y(i) − <𝛽, x(i)>)² over 𝛽 ∈ Rd

Alternate notation: minimize ‖y − X𝛽‖², where X ∈ Rn×d stacks the x(i) as rows and y ∈ Rn

OLS solution: 𝛽^ = (XᵀX)⁻¹Xᵀy
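For concreteness, here is a minimal numpy sketch of the OLS solution above (numpy and the synthetic data are my own choices, not part of the slides):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: minimize ||y - X @ beta||^2.

    X: (n, d) design matrix, y: (n,) outputs.
    Returns beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.normal(size=100)
print(ols(X, y))  # close to beta_star
```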

11 of 25

Gauss-Markov: Formal Setting

We formalize linear regression as follows:

Observe data points x(1), …, x(n).

For each x(i), also observe an output y(i).

Assume y(i) = <𝛽*, x(i)> + ε(i), where the errors ε(i) are independent.

Notes:
  • This is often called the “fixed-design” setup (x(i) known, only y(i) are random)
  • Doesn’t imply y is linear in x (ε could depend on x and need not be mean-zero)

12 of 25

Gauss-Markov Theorem

Theorem. Suppose that for each i, E[ε(i) | x(i)] = 0. Then 𝛽^ is an unbiased estimate of 𝛽* and converges to 𝛽* given infinite samples.

Moreover, if Var[ε(i) | x(i)] is the same for all i, then 𝛽^ is the best (minimum-variance) linear unbiased estimate of 𝛽*.
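A small simulation sketch of the first claim (my own illustration, assuming a fixed Gaussian design and heteroscedastic but conditionally mean-zero noise): averaging 𝛽^ over repeated draws of y recovers 𝛽*.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))                 # fixed design: X stays the same
beta_star = np.array([2.0, -1.0, 0.5])

# Noise whose scale depends on x but with E[eps | x] = 0,
# which the unbiasedness part of the theorem still allows.
noise_scale = 0.5 + np.abs(X[:, 0])

estimates = []
for _ in range(2000):
    eps = noise_scale * rng.normal(size=n)
    y = X @ beta_star + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))  # approximately beta_star: unbiased
```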

13 of 25

Gauss-Markov: Proofs (only first part)

Proof 1: algebra (on board)

Proof 2: calculus (on board)

Generalization. Only actually need E[xε] = 0 (signal and noise uncorrelated). Note this only makes sense in the “random-design” case.

14 of 25

Gauss-Markov: Implications

Linear regression can handle complicated noise as long as signal is linear.

Explains what linear regression does: it finds a linear function 𝛽ᵀx that is uncorrelated with the noise ε = y − 𝛽ᵀx.

We don’t need to worry about (zero-mean) noise in measuring y.

Question: Do we need to worry about noise in measuring x?

15 of 25

Noise in measuring x

Suppose y = <𝛽*, x> (no noise at all in y), but we only observe a noisy x′ = x + z, where z is Gaussian white noise (independent across coordinates, each with variance σ²).

What will OLS output?

Basic idea: E[x′x′ᵀ] = E[xxᵀ] + σ²I, since z is mean-zero and independent of x, so the cross terms vanish; similarly E[x′y] = E[xy] (on board)

Therefore, in the infinite-data limit we output (E[xxᵀ] + σ²I)⁻¹E[xy]

  • What is this equivalent to?
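A simulation sketch of this effect (my own illustration; with standardized covariates the population formula above predicts an estimate of 𝛽*/(1 + σ²), i.e., shrunk toward zero, much like ridge regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100_000, 3, 1.0
beta_star = np.array([2.0, -1.0, 0.5])

x = rng.normal(size=(n, d))                       # true covariates, E[x x^T] = I
y = x @ beta_star                                 # no noise at all in y
x_noisy = x + sigma * rng.normal(size=(n, d))     # what we actually observe

beta_hat = np.linalg.solve(x_noisy.T @ x_noisy, x_noisy.T @ y)
print(beta_hat)                     # shrunk toward zero relative to beta_star
print(beta_star / (1 + sigma**2))   # population prediction (E[x x^T] = I here)
```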

16 of 25

Robot Dynamics Revisited

Suppose our model of the robot dynamics is

a = Ax + Bu,

where x is the state, u is the actuator input, and a = d²x/dt² is the acceleration.

  • Which variables will sensor error affect?
  • Which variables will actuator error affect?
  • Assuming the errors are all zero mean, which will affect our estimates of A and B?
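As a sketch of the last question (my own illustration with a scalar state and input, so A and B are just numbers): zero-mean noise added to the measured acceleration leaves the estimates unbiased, while zero-mean noise in the measured state biases the coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
A_true, B_true = -1.5, 0.8

x = rng.normal(size=n)              # true state
u = rng.normal(size=n)              # actuator input
a = A_true * x + B_true * u         # true dynamics: a = A x + B u

def fit(design, target):
    """OLS fit of target on the columns of design."""
    return np.linalg.solve(design.T @ design, design.T @ target)

# Zero-mean noise in the measured acceleration (output noise): still unbiased.
a_meas = a + 0.5 * rng.normal(size=n)
print(fit(np.column_stack([x, u]), a_meas))   # close to (A_true, B_true)

# Zero-mean noise in the measured state (covariate noise): coefficient on x
# is shrunk toward zero; the coefficient on u is unaffected here.
x_meas = x + 0.5 * rng.normal(size=n)
print(fit(np.column_stack([x_meas, u]), a))
```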

17 of 25

Logistic Regression

18 of 25

Review: Logistic Regression

Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i) ∈ Rd and y(i) ∈ {0, 1}

Logistic regression: minimize the negative log-likelihood

∑i [ log(1 + exp(<𝛽, 𝜙(x(i))>)) − y(i) <𝛽, 𝜙(x(i))> ],

where 𝜙(x) is the feature vector and the model is p𝛽(y = 1 | x) = 1 / (1 + exp(−<𝛽, 𝜙(x)>))

Why does this look so much more complicated than OLS?
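A minimal numpy sketch of this loss and its gradient (my own illustration, taking 𝜙(x) = x and y ∈ {0, 1}); the gradient form is what drives the moment-matching result on the next slide.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Negative log-likelihood for y in {0, 1}, with phi(x) = x.

    Uses -log p_beta(y | x) = log(1 + exp(z)) - y * z, where z = <beta, x>.
    """
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def logistic_grad(beta, X, y):
    """Gradient: sum_i (p_beta(y=1 | x_i) - y_i) * x_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)
```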

19 of 25

Logistic Regression: Log-odds Derivation

(On board)

20 of 25

Logistic Regression: Moment Matching Conditions

Let 𝛽^ denote the minimizer of the logistic regression loss.
  • How can we interpret 𝛽^?

Let p𝛽(y|x) be the predictive distribution under the logistic regression model.

Theorem. The parameter 𝛽^ is the unique parameter such that

∑i 𝜙(x(i)) y(i) = ∑i 𝜙(x(i)) p𝛽^(y = 1 | x(i))

Interpretation: logistic regression finds a model whose predicted statistics match the observed statistics according to 𝜙(x).
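A quick numerical check of the theorem (a sketch, again taking 𝜙(x) = x and using scipy's generic optimizer rather than any particular course code):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))                      # here phi(x) = x
beta_true = np.array([1.0, -1.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < p_true).astype(float)

def loss(beta):
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

beta_hat = minimize(loss, np.zeros(d)).x
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))

# Moment matching: sum_i phi(x_i) y_i  ==  sum_i phi(x_i) p_betahat(y=1 | x_i)
print(X.T @ y)
print(X.T @ p_hat)   # agrees with the line above, up to optimizer tolerance
```

Setting the gradient Xᵀ(p − y) from the earlier sketch to zero gives exactly this condition, which is the "by calculus" proof idea.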

21 of 25

Moment matching conditions: proof

By calculus (on board)

Note: logistic loss is special here. Other non-linearities don’t necessarily have this moment-matching property.

22 of 25

Application: Fairness in Classification

Suppose that we run logistic regression and 𝜙(x) contains indicator features for each protected attribute.

What do the moment matching conditions say about how the classifier predictions will vary across groups?

What fairness conditions does this relate to?

23 of 25

Extension: Exponential Families

(On board)

24 of 25

Recap

Linear regression
  • Gauss-Markov theorem: noise in output okay, as long as uncorrelated with signal
  • Noise in covariates: acts as a regularizer

Logistic regression
  • Matches moments of features in observed data
  • Can generalize to non-binary / structured outputs (exponential families)

25 of 25

That’s it for today.