Data 102: Lecture 7
Identification Conditions for Regression
Jacob Steinhardt
UC Berkeley, Spring 2020
Announcements
HW1 due today
HW2 released tonight, due 02/25 (2 weeks)
So far...
Decision theory (Lectures 1-5)
Characterizing sources of error: data collection, modeling
Also discussed: controls
[Pipeline diagram: World → Train Data / Test Data → Model → Predictions]
This Time
Identification Theorems
“Under what conditions does our estimator output the correct model parameters?”
Gauss-Markov theorem (linear regression)
Moment-matching conditions (logistic regression)
But first...
Decision theory in action:
Brainstorming
Identification Conditions
Motivating Question: Robot Dynamics
You are designing a robot. It has actuators (which move the robot) and sensors (which tell you where the robot is).
You know Newton’s laws (F = ma), and in an ideal case where all parts act as intended, the force F is a linear function of the actuator inputs and robot position.
Both the sensors and actuators may be noisy.
If we run the robot and fit a linear model to our data, will the noise cause our model parameters to be wrong? Does it matter what type of noise?
Review: Ordinary Least Squares
Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i) ∈ Rᵈ and y(i) ∈ R
Ordinary least squares (OLS): minimize Σᵢ (y(i) − <𝛽, x(i)>)²
Alternate notation: minimize ‖X𝛽 − y‖₂², where X ∈ Rⁿˣᵈ stacks the x(i) as rows
OLS solution: 𝛽^ = (XᵀX)⁻¹Xᵀy
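A minimal numpy sketch of the OLS solution above (not from the lecture; all data and numbers are synthetic, made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))             # rows are the x(i)
beta_star = np.array([1.0, -2.0, 0.5])  # made-up true parameters
y = X @ beta_star + rng.normal(scale=0.1, size=n)

# Closed form from the slide: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice a least-squares solver is more numerically stable.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)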
Gauss-Markov: Formal Setting
We formalize linear regression as follows:
Observe data points x(1), …, x(n).
For each x(i), also observe an output y(i).
Assume y(i) = <𝛽*, x(i)> + ε(i), where the errors ε(i) are independent.
Notes:
- This is often called the “fixed-design” setup (the x(i) are known; only the y(i) are random)
- Doesn’t imply y is linear in x (ε could depend on x and need not be mean-zero)
Gauss-Markov Theorem
Theorem. Suppose that for each i, E[ε(i) | x(i)] = 0. Then 𝛽^ is an unbiased estimate of 𝛽* and converges to 𝛽* given infinite samples.
Moreover, if Var[ε(i) | x(i)] is the same for all i, then 𝛽^ is the best linear unbiased estimator of 𝛽* (minimum variance among all linear unbiased estimators).
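A simulation sketch of the first claim (my own check, not from the lecture): with conditionally mean-zero but heteroskedastic noise, the average of 𝛽^ over many trials stays close to 𝛽*.

import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 100, 2, 2000
X = rng.normal(size=(n, d))        # fixed design: same X in every trial
beta_star = np.array([2.0, -1.0])

estimates = np.empty((trials, d))
for t in range(trials):
    # Noise scale depends on x(i): heteroskedastic, but E[eps | x] = 0.
    eps = rng.normal(size=n) * (1.0 + np.abs(X[:, 0]))
    y = X @ beta_star + eps
    estimates[t], *_ = np.linalg.lstsq(X, y, rcond=None)

print(estimates.mean(axis=0))      # close to beta_star = [2.0, -1.0]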
Gauss-Markov: Proofs (only first part)
Proof 1: algebra (on board)
Proof 2: calculus (on board)

Generalization. We only actually need E[xε] = 0 (signal and noise uncorrelated). Note this only makes sense in the “random-design” case.
Gauss-Markov: Implications
Linear regression can handle complicated noise, as long as the signal is linear.
Explains what linear regression does: it finds a linear function 𝛽ᵀx that is uncorrelated with the noise ε = y − 𝛽ᵀx.
We don’t need to worry about (zero-mean) noise in measuring y.
Question: Do we need to worry about noise in measuring x?
Noise in measuring x
Suppose y = <𝛽*, x> (no noise at all in y), but we only observe a noisy x′ = x + z, where z is Gaussian white noise (independent across coordinates, with variance σ² in each).
What will OLS output?
Basic idea: E[x′x′ᵀ] = E[xxᵀ] + σ²I (on board)
Therefore, we output (E[xxᵀ] + σ²I)⁻¹E[xy]; the extra σ²I term shrinks the estimate toward zero, just like ridge regularization.
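A quick numerical check of this formula (a sketch with made-up numbers): here E[xxᵀ] = I, so with σ = 1 the limit is 𝛽*/2.

import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100_000, 2, 1.0
x = rng.normal(size=(n, d))            # E[x x^T] = I
beta_star = np.array([3.0, -1.5])
y = x @ beta_star                      # no noise at all in y

x_obs = x + rng.normal(scale=sigma, size=(n, d))   # noisy covariates
beta_hat, *_ = np.linalg.lstsq(x_obs, y, rcond=None)

# Predicted limit: (E[x x^T] + sigma^2 I)^{-1} E[x y] = beta_star / (1 + sigma^2).
print(beta_hat, beta_star / (1 + sigma**2))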
Robot Dynamics Revisited
Suppose our model of the robot dynamics is
a = Ax + Bu,
where x is the state, u is the actuator input, and a = d²x/dt² is the acceleration.
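A hypothetical sketch (the dynamics and noise level below are my own, for illustration) of fitting A and B by least squares from samples of (x, u, a). Noise on the measured state plays the role of covariate noise from the previous slide, so the estimated A comes out attenuated while B is unaffected.

import numpy as np

rng = np.random.default_rng(3)
n = 50_000
A = np.array([[0.0, 1.0], [-2.0, -0.5]])   # made-up dynamics
B = np.array([[0.0], [1.0]])

x = rng.normal(size=(n, 2))                # states
u = rng.normal(size=(n, 1))                # actuator inputs
a = x @ A.T + u @ B.T                      # exact accelerations

x_meas = x + rng.normal(scale=0.5, size=x.shape)   # noisy position sensor
Z = np.hstack([x_meas, u])                 # regressors [x, u]
AB_hat, *_ = np.linalg.lstsq(Z, a, rcond=None)
print(AB_hat.T)                            # compare to [A | B]: A is shrunk, B is not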
Logistic Regression
Review: Logistic Regression
Observe data (x(1), y(1)), …, (x(n), y(n)) where x(i) ∈ Rᵈ and y(i) ∈ {0, 1}
Logistic regression: minimize −Σᵢ [y(i) log σ(<𝛽, x(i)>) + (1 − y(i)) log(1 − σ(<𝛽, x(i)>))], where σ(z) = 1 / (1 + e⁻ᶻ)
Why does this look so much more complicated than OLS?
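A minimal numpy sketch of minimizing this loss by gradient descent (synthetic data; the step size and iteration count are arbitrary choices of mine):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_star = np.array([1.0, -1.0, 2.0])
y = (rng.uniform(size=n) < sigmoid(X @ beta_star)).astype(float)   # labels in {0, 1}

beta = np.zeros(d)
for _ in range(2000):
    p = sigmoid(X @ beta)            # predicted P(y = 1 | x)
    beta -= 0.5 * X.T @ (p - y) / n  # gradient step on the average loss
print(beta)                          # roughly recovers beta_star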
Logistic Regression: Log-odds Derivation
(On board)
Logistic Regression: Moment Matching Conditions
Let 𝛽^ denote the minimizer of the logistic regression loss.
- How can we interpret 𝛽^?
Let p𝛽(y|x) be the predictive distribution under the logistic regression model.
Theorem. The parameter 𝛽^ is the unique parameter such that Σᵢ 𝜙(x(i)) y(i) = Σᵢ 𝜙(x(i)) p𝛽^(y = 1 | x(i)).
Interpretation: logistic regression finds a model whose predicted statistics match the observed statistics according to 𝜙(x).
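A numerical check of the theorem (a sketch; I take 𝜙(x) = x for concreteness): after fitting, the observed and predicted feature moments agree.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < sigmoid(X @ np.array([1.0, -1.0, 2.0]))).astype(float)

beta = np.zeros(d)
for _ in range(5000):
    beta -= 0.5 * X.T @ (sigmoid(X @ beta) - y) / n

print(X.T @ y / n)                  # (1/n) sum_i y(i) phi(x(i))
print(X.T @ sigmoid(X @ beta) / n)  # (1/n) sum_i p_beta(y=1|x(i)) phi(x(i))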
Moment matching conditions: proof
By calculus (on board)
Note: logistic loss is special here. Other non-linearities don’t necessarily have this moment-matching property.
Application: Fairness in Classification
Suppose that we run logistic regression and 𝜙(x) contains indicator features for each protected attribute.
What do the moment matching conditions say about how the classifier predictions will vary across groups?
What fairness conditions does this relate to?
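A sketch of the first question above (the group structure and coefficients are made up): with an intercept and a group indicator in 𝜙(x), moment matching forces the average predicted probability within each group to equal that group's observed positive rate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
n = 2000
group = rng.integers(0, 2, size=n)                  # protected attribute
X = np.column_stack([np.ones(n), group, rng.normal(size=n)])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-0.5, 1.0, 1.5]))).astype(float)

beta = np.zeros(3)
for _ in range(5000):
    beta -= 0.5 * X.T @ (sigmoid(X @ beta) - y) / n

p = sigmoid(X @ beta)
for g in (0, 1):
    print(g, y[group == g].mean(), p[group == g].mean())   # equal per group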
Extension: Exponential Families
(On board)
Recap
Linear regression
- Gauss-Markov theorem: noise in the output is okay, as long as it is uncorrelated with the signal
- Noise in covariates: acts as a regularizer

Logistic regression
- Matches moments of features in the observed data
- Can generalize to non-binary / structured outputs (exponential families)
That’s it for today.