1 of 25

Data 102: Lecture 7

Identification Conditions for Regression

Jacob Steinhardt

UC Berkeley, Spring 2020

2 of 25

Announcements

HW1 due today

HW2 released tonight, due 02/25 (2 weeks)

3 of 25

So far...

Decision theory (Lectures 1-5)

Characterizing sources of error: data collection, modeling

  • Data: unrepresentative data, observational data
  • Modeling: poor model fit (model mis-specification / bias)

Also discussed: controls

[Pipeline diagram: World, Train Data, Model, Test Data, Predictions]

4 of 25

This Time

Identification Theorems

“Under what conditions does our estimator output the correct model parameters?”

  • Important if we want to interpret the parameters themselves
  • Also provides insight into what our estimator is doing

Gauss-Markov theorem (linear regression)

Moment-matching conditions (logistic regression)

5 of 25

But first...

Decision theory in action:

  • You have $1 million to spend on global health interventions (i.e., reducing disease in developing countries). How do you decide what to spend it on?
  • “Disability-adjusted life year” (DALY)

6 of 25

7 of 25

Brainstorming

  • Can you think of issues with using this as a metric for making decisions?

  • How would you go about determining the weights for DALYs?

8 of 25

Identification Conditions

9 of 25

Motivating Question: Robot Dynamics

You are designing a robot. It has actuators (moving the robot), and sensors (knowing where the robot is).

You know Newton’s laws (F = ma), and in an ideal case where all parts act as intended, the force F is a linear function of the actuator inputs and robot position.

Both the sensors and actuators may be noisy.

If we run the robot and fit a linear model to our data, will the noise cause our model parameters to be wrong? Does it matter what type of noise?

10 of 25

Review: Ordinary Least Squares

Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i)∈Rd and y(i)∈R

Ordinary least squares (OLS): minimize ∑i (y(i) − <𝛽, x(i)>)² over 𝛽 ∈ Rd

Alternate notation: minimize ‖y − X𝛽‖², where X ∈ Rn×d stacks the x(i) as rows and y ∈ Rn

OLS solution: 𝛽^ = (XᵀX)⁻¹Xᵀy
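For concreteness, here is a minimal numpy sketch of the OLS solution above (numpy and the synthetic data are my own choices, not part of the slides):

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: minimize ||y - X @ beta||^2.

    X: (n, d) design matrix, y: (n,) outputs.
    Returns beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.normal(size=100)
print(ols(X, y))  # close to beta_star
```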

11 of 25

Gauss-Markov: Formal Setting

We formalize linear regression as follows:

Observe data points x(1), …, x(n).

For each x(i), also observe an output y(i).

Assume y(i) = <𝛽*, x(i)> + ε(i), where the errors ε(i) are independent.

Notes:
  • This is often called the “fixed-design” setup (x(i) known, only y(i) are random)
  • Doesn’t imply y is linear in x (ε could depend on x and need not be mean-zero)

12 of 25

Gauss-Markov Theorem

Theorem. Suppose that for each i, E[ε(i) | x(i)] = 0. Then 𝛽^ is an unbiased estimate of 𝛽* and converges to 𝛽* given infinite samples.

Moreover, if Var[ε(i) | x(i)] is the same for all i, then 𝛽^ is the best (minimum-variance) linear unbiased estimate of 𝛽*.
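A small simulation sketch of the first claim (my own illustration, assuming a fixed Gaussian design and heteroscedastic but conditionally mean-zero noise): averaging 𝛽^ over repeated draws of y recovers 𝛽*.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))                 # fixed design: X stays the same
beta_star = np.array([2.0, -1.0, 0.5])

# Noise whose scale depends on x but with E[eps | x] = 0,
# which the unbiasedness part of the theorem still allows.
noise_scale = 0.5 + np.abs(X[:, 0])

estimates = []
for _ in range(2000):
    eps = noise_scale * rng.normal(size=n)
    y = X @ beta_star + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))  # approximately beta_star: unbiased
```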

13 of 25

Gauss-Markov: Proofs (only first part)

Proof 1: algebra (on board)

Proof 2: calculus (on board)

Generalization. Only actually need E[xε] = 0 (signal and noise uncorrelated). Note this only makes sense in the “random-design” case.

14 of 25

Gauss-Markov: Implications

Linear regression can handle complicated noise as long as signal is linear.

Explains what linear regression does: it finds a linear function 𝛽ᵀx that is uncorrelated with the noise ε = y − 𝛽ᵀx.

We don’t need to worry about (zero-mean) noise in measuring y.

Question: Do we need to worry about noise in measuring x?

15 of 25

Noise in measuring x

Suppose y = <𝛽*, x> (no noise at all in y), but we only observe a noisy x′ = x + z, where z is Gaussian white noise (independent across coordinates, each with variance σ²).

What will OLS output?

Basic idea: E[x′x′ᵀ] = E[xxᵀ] + σ²I, since z is mean-zero and independent of x, so the cross terms vanish; similarly E[x′y] = E[xy] (on board)

Therefore, in the infinite-data limit we output (E[xxᵀ] + σ²I)⁻¹E[xy]

  • What is this equivalent to?
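A simulation sketch of this effect (my own illustration; with standardized covariates the population formula above predicts an estimate of 𝛽*/(1 + σ²), i.e., shrunk toward zero, much like ridge regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100_000, 3, 1.0
beta_star = np.array([2.0, -1.0, 0.5])

x = rng.normal(size=(n, d))                       # true covariates, E[x x^T] = I
y = x @ beta_star                                 # no noise at all in y
x_noisy = x + sigma * rng.normal(size=(n, d))     # what we actually observe

beta_hat = np.linalg.solve(x_noisy.T @ x_noisy, x_noisy.T @ y)
print(beta_hat)                     # shrunk toward zero relative to beta_star
print(beta_star / (1 + sigma**2))   # population prediction (E[x x^T] = I here)
```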

16 of 25

Robot Dynamics Revisited

Suppose our model of the robot dynamics is

a = Ax + Bu,

where x is the state, u is the actuator input, and a = d²x/dt² is the acceleration.

  • Which variables will sensor error affect?
  • Which variables will actuator error affect?
  • Assuming the errors are all zero mean, which will affect our estimates of A and B?
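As a sketch of the last question (my own illustration with a scalar state and input, so A and B are just numbers): zero-mean noise added to the measured acceleration leaves the estimates unbiased, while zero-mean noise in the measured state biases the coefficient on x.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
A_true, B_true = -1.5, 0.8

x = rng.normal(size=n)              # true state
u = rng.normal(size=n)              # actuator input
a = A_true * x + B_true * u         # true dynamics: a = A x + B u

def fit(design, target):
    """OLS fit of target on the columns of design."""
    return np.linalg.solve(design.T @ design, design.T @ target)

# Zero-mean noise in the measured acceleration (output noise): still unbiased.
a_meas = a + 0.5 * rng.normal(size=n)
print(fit(np.column_stack([x, u]), a_meas))   # close to (A_true, B_true)

# Zero-mean noise in the measured state (covariate noise): coefficient on x
# is shrunk toward zero; the coefficient on u is unaffected here.
x_meas = x + 0.5 * rng.normal(size=n)
print(fit(np.column_stack([x_meas, u]), a))
```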

17 of 25

Logistic Regression

18 of 25

Review: Logistic Regression

Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i) ∈ Rd and y(i) ∈ {0, 1}

Logistic regression: minimize the negative log-likelihood

∑i [ log(1 + exp(<𝛽, 𝜙(x(i))>)) − y(i) <𝛽, 𝜙(x(i))> ],

where 𝜙(x) is the feature vector and the model is p𝛽(y = 1 | x) = 1 / (1 + exp(−<𝛽, 𝜙(x)>))

Why does this look so much more complicated than OLS?
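A minimal numpy sketch of this loss and its gradient (my own illustration, taking 𝜙(x) = x and y ∈ {0, 1}); the gradient form is what drives the moment-matching result on the next slide.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Negative log-likelihood for y in {0, 1}, with phi(x) = x.

    Uses -log p_beta(y | x) = log(1 + exp(z)) - y * z, where z = <beta, x>.
    """
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def logistic_grad(beta, X, y):
    """Gradient: sum_i (p_beta(y=1 | x_i) - y_i) * x_i."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)
```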

19 of 25

Logistic Regression: Log-odds Derivation

(On board)

20 of 25

Logistic Regression: Moment Matching Conditions

Let 𝛽^ denote the minimizer of the logistic regression loss.
  • How can we interpret 𝛽^?

Let p𝛽(y|x) be the predictive distribution under the logistic regression model.

Theorem. The parameter 𝛽^ is the unique parameter such that

∑i 𝜙(x(i)) y(i) = ∑i 𝜙(x(i)) p𝛽^(y = 1 | x(i))

Interpretation: logistic regression finds a model whose predicted statistics match the observed statistics according to 𝜙(x).
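A quick numerical check of the theorem (a sketch, again taking 𝜙(x) = x and using scipy's generic optimizer rather than any particular course code):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))                      # here phi(x) = x
beta_true = np.array([1.0, -1.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=n) < p_true).astype(float)

def loss(beta):
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

beta_hat = minimize(loss, np.zeros(d)).x
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))

# Moment matching: sum_i phi(x_i) y_i  ==  sum_i phi(x_i) p_betahat(y=1 | x_i)
print(X.T @ y)
print(X.T @ p_hat)   # agrees with the line above, up to optimizer tolerance
```

Setting the gradient Xᵀ(p − y) from the earlier sketch to zero gives exactly this condition, which is the "by calculus" proof idea.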

21 of 25

Moment matching conditions: proof

By calculus (on board)

Note: logistic loss is special here. Other non-linearities don’t necessarily have this moment-matching property.

22 of 25

Application: Fairness in Classification

Suppose that we run logistic regression and 𝜙(x) contains indicator features for each protected attribute.

What do the moment matching conditions say about how the classifier predictions will vary across groups?

What fairness conditions does this relate to?

23 of 25

Extension: Exponential Families

(On board)

24 of 25

Recap

Linear regression
  • Gauss-Markov theorem: noise in output okay, as long as uncorrelated with signal
  • Noise in covariates: acts as a regularizer

Logistic regression
  • Matches moments of features in observed data
  • Can generalize to non-binary / structured outputs (exponential families)

25 of 25

That’s it for today.