1 of 27

Out-of-Distribution Generalization via Risk Extrapolation (REx)

David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Remi Le Priol, Aaron Courville

2 of 27

3 of 27

Notes on OOD reading group presentation

Started out a bit weak

Can use Yoshua’s kind of OOD story? (“you go to another country….”)
Can use the canonical cow photos

IRM story for “not paying attention to color” is unclear… shouldn’t gradients on the training envs want you to pay attention to color??
Unclear: distributions vs. risk functions

Probabilistic view + Leon’s talk

Picture before equations!
2 Goals of REx should maybe come earlier (where/how?)
Need frequencies = 1,2,3 for sine wave example!

Also an equation saying “test risk != ½(risk1 + risk2)”

Gradient vector fields still don’t have a good place…
It’s a LOT of content overall; needs to be pared down
Double check RL spurious feature construction
Experiments should focus on TAKE-AWAYS! (write those down on the slides!)

4 of 27

The problem of spurious features

Real world example:

Background=grass vs. Background=water

Synthetic example:

Color=green

Vs.

Color=red

environments

6 of 27

The Invariant Risk Minimization (IRM) solution:

IRM says: “If the classifier wants to pay more attention to a feature in some environments than others, then that feature is spurious”

features

classifier

ERM term regularizer
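
The labels above annotate an equation that did not survive the export. Assuming the slide showed the practical IRMv1 objective from the IRM paper (an assumption; the exported text does not say which form was used), it can be written as follows, with Φ the feature extractor ("features"), w a fixed scalar dummy classifier ("classifier"), R^e the risk in training environment e, and λ the penalty weight:

```latex
\min_{\Phi} \; \sum_{e \in \mathcal{E}_{\mathrm{tr}}}
  \underbrace{R^{e}(\Phi)}_{\text{ERM term}}
  \; + \; \lambda \,
  \underbrace{\bigl\lVert \nabla_{w \mid w = 1.0} \, R^{e}(w \cdot \Phi) \bigr\rVert^{2}}_{\text{regularizer}}
```

The regularizer is zero only when the same classifier is simultaneously optimal in every training environment, which is the formal version of the quoted intuition.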

7 of 27

Leon Bottou, in a talk about IRM:

Why we need IRM

Training environments:

P1,P2,P3,P4

8 of 27

The Risk Extrapolation (REx) solution:

The “robust approach” Leon was just talking about:

Our generalization, MM-REx:

A simpler variant, V-REx:

Good performance

Consistent performance
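
The three objectives on this slide were images and are missing from the export. As a rough sketch, assuming the definitions from the REx paper (robust optimization takes the worst training risk; MM-REx allows mixture weights down to a negative `lambda_min`; V-REx adds `beta` times the variance of the training risks to their sum), they can be computed from a list of per-environment risks like this. The function names, default values, and the example risks are illustrative assumptions, and the population variance is used here as one possible reading of "variance":

```python
import numpy as np

def robust_risk(risks):
    """Robust optimization: worst case over convex combinations, i.e. the max risk."""
    return float(np.max(risks))

def mm_rex_risk(risks, lambda_min=-1.0):
    """MM-REx: max over weights with sum 1 and each weight >= lambda_min.

    Closed form (valid while 1 - m * lambda_min >= 0):
    (1 - m * lambda_min) * max_e R_e + lambda_min * sum_e R_e
    """
    risks = np.asarray(risks, dtype=float)
    m = risks.size
    return float((1.0 - m * lambda_min) * risks.max() + lambda_min * risks.sum())

def v_rex_risk(risks, beta=1.0):
    """V-REx: beta * Var(risks) + sum of risks (population variance assumed)."""
    risks = np.asarray(risks, dtype=float)
    return float(beta * risks.var() + risks.sum())

# Hypothetical per-environment training risks, for illustration only.
example_risks = [0.2, 0.4, 0.9]
print(robust_risk(example_risks))
print(mm_rex_risk(example_risks, lambda_min=-1.0))
print(v_rex_risk(example_risks, beta=10.0))
```

With `lambda_min=0` MM-REx reduces to the robust objective, and with `beta=0` V-REx reduces to the summed training risk. When all training risks are equal, both REx objectives add no extra cost beyond the usual risk terms, which matches the "consistent performance" goal.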

9 of 27

The Risk Extrapolation (REx) solution:

Convex combination

Affine combination
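
In symbols, and assuming the notation of the REx paper (m training environments with risks R_e and a hyperparameter λ_min, neither of which is visible in the exported slide text), the two cases differ only in the lower bound on the mixture weights:

```latex
\begin{aligned}
\text{Convex:}\quad
  & \max_{\lambda} \sum_{e=1}^{m} \lambda_e R_e(\theta)
  && \text{s.t. } \sum_{e=1}^{m} \lambda_e = 1,\ \lambda_e \ge 0 \\
\text{Affine:}\quad
  & \max_{\lambda} \sum_{e=1}^{m} \lambda_e R_e(\theta)
  && \text{s.t. } \sum_{e=1}^{m} \lambda_e = 1,\ \lambda_e \ge \lambda_{\min},\ \lambda_{\min} < 0
\end{aligned}
```

Allowing negative weights lets the worst case range beyond the convex hull of the training risks, which is where the "extrapolation" in the name comes from.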

14 of 27

Why does REx work?

Goals of Risk Extrapolation:

Lower average risk
Similarity of risks across environments


Training environments

Test environment

Someone should have a question around now….

Are these extrapolations correct??

Answer: No.

15 of 27

Counterexample: sine wave regression

Train 1

Test

Train 2

Suppose my model predicts Y=sin(x)...

Risk > 0

Risk = 0

Risk > 0
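
A small numerical check of this counterexample. The slide does not give the frequencies, so the values 0.5, 1.0, and 1.5 below are assumptions chosen so that the test environment sits exactly halfway between the two training environments. The input grid and the squared-error loss are also assumptions:

```python
import numpy as np

# Inputs shared by all environments (hypothetical grid).
x = np.linspace(0.0, 2.0 * np.pi, 1000)

# The model from the slide: Y = sin(x).
prediction = np.sin(x)

def risk(frequency):
    """Mean squared error of the model in an environment where Y = sin(frequency * x)."""
    target = np.sin(frequency * x)
    return float(np.mean((prediction - target) ** 2))

risk_train1 = risk(0.5)  # Train 1 (assumed frequency)
risk_test = risk(1.0)    # Test (assumed frequency, matches the model)
risk_train2 = risk(1.5)  # Train 2 (assumed frequency)

average_train_risk = 0.5 * (risk_train1 + risk_train2)
print(risk_train1, risk_test, risk_train2, average_train_risk)
```

Even though the test frequency is the midpoint of the training frequencies, the test risk is zero while the average of the two training risks is strictly positive, so risks do not interpolate linearly between environments in general.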

16 of 27

A Probabilistic view on REx:

Negative probabilities!?!?

...or positive probabilities with a different loss function?

17 of 27

Results: CMNIST

18 of 27

Results: CMNIST

A note on methodology…

OOD is mostly about unknown unknowns ⇒ No tuning allowed on test distribution!

Solution:

Use development tasks for tuning.
“pre-register” evaluation experiments.

19 of 27

MM-REx vs. V-REx

Our generalization, MM-REx:

A simpler variant, V-REx:

Goals of Risk Extrapolation:

Lower average risk
Similarity of risks across environments

21 of 27

Results: VLCS and PACS

VLCS: V (VOC2007), L (LabelMe), C (Caltech), S (SUN09).

PACS: P (photo), A (art), C (cartoon), S (sketch).

22 of 27

Results: VLCS and PACS

23 of 27

Results: RL

Spurious feature construction:

Copy part of the original state
Add (environment-dependent) noise to the original state

Result:

ERM wants to use copied state
REx doesn’t
IRM fails for some reason


24 of 27

Structural equation models (from IRM paper)

Good:

Bad:

25 of 27

OMG more results! Financial prediction task

26 of 27

Why is IRM difficult to train?

Memorization minimizes IRM and REx penalties!

Timing is everything:

Hypothesis: To train IRM/REx:
1) learn predictive features
2) throw out spurious ones
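
One way to act on this hypothesis is a two-phase schedule for the penalty coefficient: keep it small at first so that predictive features are learned, then raise it so that spurious ones get penalized. A minimal sketch follows. The function name, the step count, and the two weights are illustrative assumptions, not values from the talk:

```python
def penalty_weight(step, anneal_steps=100, low=1.0, high=1e4):
    """Return the IRM/REx penalty coefficient for a given training step.

    Phase 1 (step < anneal_steps): small weight, so training behaves like ERM.
    Phase 2 (step >= anneal_steps): large weight, so the penalty dominates.
    All default values are hypothetical.
    """
    return low if step < anneal_steps else high

# Hypothetical use inside a training loop:
#   loss = erm_term + penalty_weight(step) * penalty_term
for step in (0, 99, 100, 500):
    print(step, penalty_weight(step))
```

If the switch comes too late, the model may already have memorized the training data, at which point both penalties are close to zero and give little signal.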

27 of 27

Remaining Questions/Directions:

Can REx deal with changing levels of label noise?
When is linearly extrapolating risks valid/sensible?
Is there a causal interpretation of REx?
Improving OOD evaluation

Better benchmarks
Better methodology

How relevant are spurious features in practice?

What are YOUR questions?