Data 102: Lecture 26
Robustness and Model Mis-specification
Jacob Steinhardt
UC Berkeley, Fall 2021
Announcements
- 2 Vitamins due Sunday (12.1: already released, 12.2: released Thurs. night)
- Midterm 2 grades have been released. Regrades open tomorrow at noon, due next Wednesday at noon
- Deadline for extra credit extended to Sunday 11:59pm
- See Ed for changes to Ramesh’s OH (project-only, signups required, one extra hour)
- Grade reports will be released Thursday
Data Science Pipeline
[Pipeline diagram with stages: World, Train Data, Model, Test Data, Predictions]
Maxims for Statistics and Life
“All models are wrong, but some models are useful.” -George Box
“The map is not the territory.” -Alfred Korzybski
“The menu is not the meal.” -Alan Watts
The Traffic Map is not the Traffic...
Brainstorming
Think of real-world use cases of regression models. (2 minutes)
- How is the model used? (What decision is being made?)
Uses of Regression for Decisions
Case Study: Framingham risk score
Study setup:
- Cohort of 5209 subjects from Framingham, MA
- Monitored for 10 years
- See which patients develop heart disease during 10-year time frame
Risk score:
- Build (Cox) regression model to predict risk of heart disease
Brainstorming: what features would you use? (2 minutes)
Features in Framingham Risk Score
- Age
- Gender
- Total and HDL cholesterol
- Blood pressure
- Hypertension treatment
- Diabetes
- Smoking
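As a concrete illustration, here is a minimal sketch of fitting a Cox proportional-hazards model on features like these, assuming the lifelines library; the file name and column names are hypothetical, not the actual Framingham code or data schema.

```python
# A minimal sketch, assuming the lifelines library and hypothetical column names
# (this is NOT the actual Framingham code or data schema).
import pandas as pd
from lifelines import CoxPHFitter

# Assumed: one row per subject, follow-up time in years, and an event indicator
# (1 = developed heart disease during the 10-year window).
df = pd.read_csv("framingham_cohort.csv")  # hypothetical file

features = ["age", "gender", "total_chol", "hdl_chol",
            "blood_pressure", "htn_treatment", "diabetes", "smoking"]

cph = CoxPHFitter()
cph.fit(df[features + ["followup_years", "heart_disease"]],
        duration_col="followup_years", event_col="heart_disease")

cph.print_summary()                    # hazard ratio for each feature
risk = cph.predict_partial_hazard(df)  # relative risk score per subject
```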
Limitations (2 minute brainstorm)
What are possible limitations of the Framingham risk score?
From what part of the data science pipeline (e.g. data collection, modeling) do they arise?
Issue: Distribution Shift (Race)
Framingham, MA is predominantly white.
The risk score makes inaccurate predictions for other races (e.g. Hispanics).
How could we fix this?
Distribution Shift: Conceptual Picture
Two issues: bias (model mis-specification) and variance.
- Low-complexity model (blue): underfits, biased extrapolations
- High-complexity model (green): fits data but potentially high variance off-distribution
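A minimal sketch of this picture on synthetic data (the data-generating process and degrees are illustrative assumptions): fit a low- and a high-degree polynomial on data from one region, then evaluate both on a shifted region to see how they extrapolate.

```python
# Synthetic illustration of extrapolation under distribution shift
# (all choices here are assumptions for the sketch).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 30)                       # training data lives in [0, 1]
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=30)

x_shift = np.linspace(1.0, 1.5, 20)                   # shifted (off-distribution) inputs
y_shift = np.sin(2 * np.pi * x_shift)

for degree in (1, 15):                                # low- vs high-complexity model
    coefs = np.polyfit(x_train, y_train, degree)
    mse = np.mean((np.polyval(coefs, x_shift) - y_shift) ** 2)
    print(f"degree {degree:2d}: off-distribution MSE = {mse:.2f}")
```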
Take-aways
Think about where your data come from!
Think about possible sources of distribution shift.
Is your model overfitting or underfitting?
Case Study: Salt and Blood Pressure
Intersalt study:
- 52 centers each recruited ~200 subjects
- Each center measured salt, blood pressure, and age of each subject
- Regress blood pressure on age to get rate of increase with age (slope)
- Compare this slope to average salt intake (across centers)
- Result: rate of increase positively correlated with salt intake
- Conclusion: consume less salt to avoid high blood pressure
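A rough sketch of the two-stage analysis under assumed column names (not the original Intersalt code): regress blood pressure on age within each center, then compare the resulting slopes to average salt intake across centers.

```python
# Two-stage Intersalt-style analysis; file and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("intersalt.csv")  # hypothetical: columns center, age, bp, salt

rows = []
for center, grp in df.groupby("center"):
    slope, _ = np.polyfit(grp["age"], grp["bp"], 1)  # BP increase per year of age
    rows.append({"center": center,
                 "bp_slope": slope,
                 "mean_salt": grp["salt"].mean()})
centers = pd.DataFrame(rows)

# Second stage: how does the rate of BP increase relate to average salt intake?
r = centers["bp_slope"].corr(centers["mean_salt"])
print(f"Correlation across {len(centers)} centers: r = {r:.2f}")
```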
Brainstorm: potential issues with this study (2 minutes)
The Actual Data
Outliers: two Brazilian tribes, Papua New Guinea, and Kenya
Case Study: Sleeping Pills
Numerous studies observe that patients who take sleeping pills incur higher rates of mortality, both overall and from specific causes (including cancer, heart disease, car accidents, and suicide).
This effect seems to persist even after controlling for various confounders.
However, Patorno et al. (2017) find that the effect goes away if we control for 300 different confounders at once.
What should we make of this?
An Aside on Controls
Often, “we controlled for A” just means that A was added as a feature in the regression model. So, “we measured the effect of X on Y controlling for A, B, C” is just code for looking at the coefficient β_X in the regression
Y = β_X · X + β_A · A + β_B · B + β_C · C
Usually wrong to interpret β_X as measuring a causal effect (more on this in the causality unit). Better to think of it as “attempting to adjust for A, B, C”.
There are fancier ways of “controlling for A” but they have similar issues.
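A small illustration on synthetic data (not from any study) of what “controlling” means operationally: β_X changes a lot depending on which confounders are included as extra regression columns, and adjusting this way is not a causal guarantee.

```python
# Synthetic example: "controlling for A, B, C" = adding them as columns to OLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
A, B, C = rng.normal(size=(3, n))            # hypothetical confounders
X = 0.5 * A + rng.normal(size=n)             # X is driven partly by A
Y = 2.0 * A + 0.0 * X + rng.normal(size=n)   # Y depends on A, not on X

designs = [(np.column_stack([X]), "no controls"),
           (np.column_stack([X, A, B, C]), "controlling for A, B, C")]
for cols, label in designs:
    fit = sm.OLS(Y, sm.add_constant(cols)).fit()
    print(f"{label}: beta_X = {fit.params[1]:.2f}")  # coefficient on X
```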
Sleeping Pills Analysis: Warning Signs
Data is observational: heavily confounded.
Medication has a specific biological mechanism, so adverse effects are likely to be concentrated in specific causes of death rather than spread across all causes.
On the other hand, most confounders (e.g. stress level) would have effects spread across causes.
BUT: Controlling incorrectly could also have problems. (What are they?)
Take-aways
Think (again!) about how the data was collected.
- Observational or RCT?
Look at the data! How much of the variance is explained?
Controlling isn’t magic.
Structured outputs (cause of death vs. overall mortality) can yield valuable sanity checks.
Recap
The map is not the territory: data and model both differ from reality.
Predictions can be inaccurate if data is skewed (Framingham score).
Look at the data! (Intersalt study)
Be very careful with causal conclusions from observational data (sleeping pills).
Next: How to make models more robust?
Part II: Revisiting Bias-Variance
Distribution Shift: Conceptual Picture
Two issues: bias (model mis-specification) and variance.
- Low-complexity model (blue): underfits, biased extrapolations
- High-complexity model (green): fits data but potentially high variance off-distribution
Visualizing Bias and Variance
Visualizing Bias and Variance - Label Noise
Strange Behavior - Double Descent
Recap: Bias-Variance
Non-parametric models tend to have:
- Monotonically decreasing bias (more expressive -> less bias)
- Unimodal variance (worst when we “barely fit” the data)
Error = bias + variance can have unusual shapes:
- Monotonic (bigger = better)
- Unimodal (bigger = worse, then better)
- Double descent
Higher in-distribution accuracy often => better out-of-distribution accuracy
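A quick synthetic sketch of the usual picture behind this recap (the data and degrees are illustrative assumptions): sweep polynomial degree and watch train error fall monotonically while held-out error typically rises again once the model is too expressive.

```python
# Train vs. held-out error as a function of model complexity (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + 0.2 * rng.normal(size=40)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (1, 3, 8, 15):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```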
Part III: Pre-training
Pre-training - Motivation
Suppose we want to train a classifier to predict the political slant of news
Common situation:
- Lots of unlabeled data (all text on the internet)
- Few labeled data (hand-label ~1000 random articles)
- New instances might be OOD (news changes over time)
How do we handle all the unlabeled data?
- Solution: “pre-training”
Pre-training - Idea
First pre-train model on unlabeled text (predict next word)
Then fine-tune model on labeled text (predict label)
Pre-training provides a good inductive bias vs. training from scratch
- The learned representation of the neural net is crucial here
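A hedged sketch of the pre-train-then-fine-tune recipe, assuming the Hugging Face transformers library (the lecture does not prescribe a specific toolkit): start from GPT-2, which was pre-trained on next-word prediction, and fine-tune a classification head on a few hypothetical labeled articles.

```python
# Sketch only: fine-tuning a pre-trained next-word model for classification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

texts = ["Article text ...", "Another article ..."]   # hypothetical labeled data
labels = torch.tensor([0, 2])                         # e.g. 0 = left, 2 = right

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                    # a few fine-tuning steps
    out = model(**batch, labels=labels)               # cross-entropy against labels
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```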
Pre-training - Accuracy and Robustness
Zero-Shot Learning
Few-Shot Accuracy
Few-Shot Accuracy
That’s it for today.