1 of 34

Data 102: Lecture 26

Robustness and Model Mis-specification

Jacob Steinhardt

UC Berkeley, Fall 2021

2 of 34

Announcements

- 2 Vitamins due Sunday (12.1: already released, 12.2: released Thurs. night)

- Midterm 2 grades have been released. Regrades open tomorrow at noon, due next Wednesday at noon

- Deadline for extra credit extended to Sunday 11:59pm

- See Ed for changes to Ramesh’s OH (project-only, signups required, one extra hour)

- Grade reports will be released Thursday

3 of 34

Data Science Pipeline

(Pipeline diagram: World, Train Data, Model, Test Data, Predictions)

4 of 34

Maxims for Statistics and Life

“All models are wrong, but some models are useful.” - George Box

“The map is not the territory.” - Alfred Korzybski

“The menu is not the meal.” - Alan Watts

5 of 34

The Traffic Map is not the Traffic...

6 of 34

Brainstorming

Think of real-world use cases of regression models. (2 minutes)
- How is the model used? (What decision is being made?)

7 of 34

Uses of Regression for Decisions

  • Make predictions about individual cases (assess risk of heart disease, recidivism, etc.)
  • Predict the effect of an intervention (Does smoking cause cancer? Do sleeping pills increase morbidity? Does salt increase blood pressure?)

8 of 34

Case Study: Framingham risk score

Study setup:
- Cohort of 5,209 subjects from Framingham, MA
- Monitored for 10 years
- See which patients develop heart disease during the 10-year time frame

Risk score:
- Build (Cox) regression model to predict risk of heart disease

Brainstorming: what features would you use? (2 minutes)

9 of 34

Features in Framingham Risk Score

Age
Gender
Total and HDL cholesterol
Blood pressure
Hypertension treatment
Diabetes
Smoking
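To make the modeling step concrete, here is a minimal sketch of fitting a Cox regression risk model with the lifelines library. This is not the actual study code: the cohort table below is synthetic and the column names are stand-ins for the features listed above.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Synthetic stand-in for the cohort table: one row per subject, numeric/encoded
# features, follow-up time in years, and a heart-disease event indicator.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.integers(30, 70, n),
    "male": rng.integers(0, 2, n),
    "total_chol": rng.normal(200, 30, n),
    "hdl_chol": rng.normal(50, 10, n),
    "systolic_bp": rng.normal(130, 15, n),
    "bp_treatment": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
    "followup_years": rng.uniform(1, 10, n),
    "chd_event": rng.integers(0, 2, n),
})

# Fit the proportional-hazards model on all features.
cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="chd_event")
cph.print_summary()  # hazard ratio per feature

# Estimated 10-year risk for each subject: 1 - S(10 | features)
features = df.drop(columns=["followup_years", "chd_event"])
risk_10yr = 1 - cph.predict_survival_function(features, times=[10]).iloc[0]
```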

10 of 34

Limitations (2 minute brainstorm)

What are possible limitations of the Framingham risk score?

From what part of the data science pipeline (e.g. data collection, modeling) do they arise?

11 of 34

Issue: Distribution Shift (Race)

Framingham, MA is predominantly white.

The risk score makes inaccurate predictions on other races (e.g. Hispanics).

How could we fix this?

12 of 34

Distribution Shift: Conceptual Picture

Two issues: bias (model mis-specification) and variance.

- Low-complexity model (blue): underfits, biased extrapolations
- High-complexity model (green): fits the data but potentially high variance off-distribution
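As a small illustration of these two failure modes (synthetic data, not from the lecture), the sketch below fits a low-degree and a high-degree polynomial on one input range and then evaluates both on a shifted range:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

x_train = rng.uniform(0, 3, size=30)                 # in-distribution inputs
y_train = true_fn(x_train) + rng.normal(0, 0.1, size=30)
x_shift = np.linspace(3, 5, 50)                      # shifted (off-distribution) inputs

for degree in [1, 9]:
    coefs = np.polyfit(x_train, y_train, degree)
    pred_shift = np.polyval(coefs, x_shift)
    err = np.mean((pred_shift - true_fn(x_shift)) ** 2)
    print(f"degree {degree}: off-distribution MSE = {err:.2f}")

# Typically: degree 1 is biased (underfits the curve everywhere), while degree 9
# fits the training range but extrapolates wildly, i.e. high variance off-distribution.
```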

13 of 34

Take-aways

Think about where your data come from!

Think about possible sources of distribution shift.

Is your model overfitting or underfitting?

14 of 34

Case Study: Salt and Blood Pressure

Intersalt study:
- 52 centers each recruited ~200 subjects
- Each center measured salt intake, blood pressure, and age of each subject
- Regress blood pressure on age to get rate of increase with age (slope)
- Compare this slope to average salt intake (across centers)
- Result: rate of increase positively correlated with salt intake
- Conclusion: consume less salt to avoid high blood pressure
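A minimal sketch of the two-stage analysis described above, using a synthetic subject-level table with hypothetical column names (center, age, systolic_bp, sodium):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic stand-in for the subject-level data (52 centers, ~200 subjects each).
rng = np.random.default_rng(0)
rows = []
for center in range(52):
    salt = rng.uniform(50, 250)                 # center's typical sodium intake
    for _ in range(200):
        age = rng.uniform(20, 60)
        bp = 110 + 0.002 * salt * age + rng.normal(0, 10)
        rows.append({"center": center, "age": age,
                     "systolic_bp": bp, "sodium": salt + rng.normal(0, 20)})
subjects = pd.DataFrame(rows)

# Stage 1: within each center, regress blood pressure on age; keep the slope.
def center_summary(g):
    slope, _intercept = np.polyfit(g["age"], g["systolic_bp"], 1)
    return pd.Series({"bp_age_slope": slope, "mean_sodium": g["sodium"].mean()})

centers = subjects.groupby("center").apply(center_summary)

# Stage 2: across centers, correlate the slope with average salt intake.
r, p = pearsonr(centers["mean_sodium"], centers["bp_age_slope"])
print(f"slope vs. salt intake: r = {r:.2f}, p = {p:.3g}")
```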

Brainstorm: potential issues with this study (2 minutes)

15 of 34

The Actual Data

Outliers: two Brazilian tribes, Papua New Guinea, and Kenya

16 of 34

Case Study: Sleeping Pills

Numerous studies observe that patients who take sleeping pills incur higher rates of mortality, both overall and from specific causes (including cancer, heart disease, car accidents, and suicide).

This effect seems to persist even after controlling for various confounders.

However, Patorno et al. (2017) find that the effect goes away if we control for 300 different confounders at once.

What should we make of this?

17 of 34

An Aside on Controls

Often, “we controlled for A” just means that A was added as a feature in the regression model. So, “we measured the effect of X on Y controlling for A, B, C” is just code for looking at the coefficient 𝛽_X in the regression

Y = 𝛽_X X + 𝛽_A A + 𝛽_B B + 𝛽_C C

It is usually wrong to interpret 𝛽_X as measuring a causal effect (more on this in the causality unit). Better to think of it as “attempting to adjust for A, B, C”.

There are fancier ways of “controlling for A” but they have similar issues.
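A small synthetic example of the point above: “controlling for A” here just means adding A as a regressor and reading off the coefficient on X (statsmodels is used for convenience; any OLS routine would do, and the data-generating process is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
A = rng.normal(size=n)                       # confounder
X = 0.8 * A + rng.normal(size=n)             # "treatment", partly driven by A
Y = 2.0 * A + 0.0 * X + rng.normal(size=n)   # outcome: no direct effect of X

df = pd.DataFrame({"X": X, "A": A, "Y": Y})

naive = smf.ols("Y ~ X", data=df).fit()          # no controls
adjusted = smf.ols("Y ~ X + A", data=df).fit()   # "controlling for A"

print("beta_X without control:   ", naive.params["X"])     # confounded, far from 0
print("beta_X controlling for A: ", adjusted.params["X"])  # close to 0

# Adjustment works here only because A is observed; with unobserved confounders,
# or the wrong controls, the adjusted coefficient can still be misleading.
```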

18 of 34

Sleeping Pills Analysis: Warning Signs

Data is observational: heavily confounded.

The medication has a specific biological mechanism, so adverse effects are likely to show up as specific causes of death rather than being spread across all causes.

On the other hand, most confounders (e.g. stress level) would have effects spread across causes.

BUT: Controlling incorrectly could also have problems. (What are they?)

19 of 34

Take-aways

Think (again!) about how the data was collected.
- Observational or RCT?

Look at the data! How much of the variance is explained?

Controlling isn’t magic.

Structured outputs (cause of death vs. overall mortality) can yield valuable sanity checks.

20 of 34

Recap

The map is not the territory: data and model both differ from reality.

Predictions can be inaccurate if data is skewed (Framingham score).

Look at the data! (Intersalt study)

Be very careful with causal conclusions from observational data (sleeping pills).

Next: How to make models more robust?

21 of 34

Part II: Revisiting Bias-Variance

22 of 34

Distribution Shift: Conceptual Picture

Two issues: bias (model mis-specification) and variance.

- Low-complexity model (blue): underfits, biased extrapolations
- High-complexity model (green): fits the data but potentially high variance off-distribution

23 of 34

Visualizing Bias and Variance

24 of 34

Visualizing Bias and Variance - Label Noise

25 of 34

Strange Behavior - Double Descent

26 of 34

Recap: Bias-Variance

Non-parametric models tend to have:
- Monotonically decreasing bias (more expressive -> less bias)
- Unimodal variance (worst when we “barely fit” the data)

Error = bias + variance can have unusual shapes:
- Monotonic (bigger = better)
- Unimodal (bigger = worse, then better)
- Double descent

Higher in-distribution accuracy often => better out-of-distribution accuracy
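A rough simulation of the bias/variance shapes above, using polynomial regression on synthetic data as a stand-in (the exact shapes depend on the model class, and double descent typically requires much richer models than this):

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * x)
x_test = 1.3                       # fixed test point
n_train, n_trials = 20, 500

for degree in [1, 3, 5, 9]:
    preds = []
    for _ in range(n_trials):
        # Redraw a training set each trial to estimate bias and variance.
        x = rng.uniform(0, 3, n_train)
        y = true_fn(x) + rng.normal(0, 0.3, n_train)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = (preds.mean() - true_fn(x_test)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")

# Bias typically falls as degree grows; variance tends to peak when the model
# "barely fits" the data, matching the shapes described above.
```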

27 of 34

Part III: Pre-training

28 of 34

Pre-training - Motivation

Suppose we want to train a classifier to predict the political slant of news

Common situation:
- Lots of unlabeled data (all text on the internet)
- Little labeled data (hand-label ~1,000 random articles)
- New instances might be OOD (news changes over time)

How do we handle all the unlabeled data?
- Solution: “pre-training”

29 of 34

Pre-training - Idea

First pre-train model on unlabeled text (predict next word)

Then fine-tune model on labeled text (predict label)

Pre-training provides good inductive bias vs. training from scratch
- Learned representation of neural net is crucial here
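For a sense of what this looks like in practice, here is a hedged sketch using Hugging Face Transformers: start from a model already pre-trained on unlabeled text and fine-tune it on a small labeled set. The model name, toy dataset, and three-way label scheme are placeholders, not the lecture's actual setup.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a model that was already pre-trained on unlabeled text.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # e.g. left / center / right slant

# Tiny placeholder labeled set; in practice ~1,000 hand-labeled articles.
train_ds = Dataset.from_dict({
    "text": ["Example article text one ...", "Example article text two ..."],
    "label": [0, 2],
})

def encode(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

train_ds = train_ds.map(encode, batched=True)

# Fine-tune: the pre-trained representation is reused, so only a small labeled
# set is needed to adapt the model to the new prediction task.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slant-model", num_train_epochs=3),
    train_dataset=train_ds,
)
trainer.train()
```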

30 of 34

Pre-training - Accuracy and Robustness

31 of 34

Zero-Shot Learning

32 of 34

Few-Shot Accuracy

33 of 34

Few-Shot Accuracy

34 of 34

That’s it for today.