1 of 27

Out-of-Distribution Generalization via Risk Extrapolation (REx)

3 of 27

Notes on OOD reading group presentation

  • Started out a bit weak
    • Can use Yoshua’s kind of OOD story? (“you go to another country….”)
    • Can use the canonical cow photos
  • IRM story for “not paying attention to color” is unclear… shouldn’t gradients on the training envs want you to pay attention to color??
  • Unclear: distributions vs. Risk functions
    • Probabilistic view + Leon’s talk
  • Picture before equations!
  • 2 Goals of REx should maybe come earlier (where/how?)
  • Need frequencies = 1,2,3 for sine wave example!
    • Also an equation saying “test risk != ½(risk1 + risk2)”
  • Gradient vector fields still don’t have a good place…
  • It’s a LOT of content overall; needs to be pared down
  • Double check RL spurious feature construction
  • Experiments should focus on TAKE-AWAYS! (write those down on the slides!)

4 of 27

The problem of spurious features

Real world example:

Background=grass vs. Background=water

Synthetic example:

Color=green vs. Color=red

environments

6 of 27

The Invariant Risk Minimization (IRM) solution:

  • IRM says: “If the classifier wants to pay more attention to a feature in some environments than in others, then that feature is spurious.”

Equation annotations: features, classifier, ERM term, regularizer
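
For reference, a sketch of the practical IRM objective (IRMv1, from the IRM paper) that the annotations on this slide refer to. The notation follows the paper and is an assumption about what the slide showed: $\Phi$ is the feature extractor ("features"), $w$ is a fixed scalar dummy classifier ("classifier"), $R^e$ is the risk in training environment $e$, and $\lambda$ is the penalty weight.

```latex
\min_{\Phi} \sum_{e \in \mathcal{E}_{tr}}
  \underbrace{R^e(\Phi)}_{\text{ERM term}}
  + \lambda \,
  \underbrace{\left\| \nabla_{w \mid w = 1.0} R^e(w \cdot \Phi) \right\|^2}_{\text{regularizer}}
```

The regularizer is zero when the dummy classifier $w = 1.0$ is already a stationary point of every environment's risk, i.e. when no environment "wants" to rescale the features.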

7 of 27

Leon Bottou, in a talk about IRM:

Why we need IRM

Training environments:

P1,P2,P3,P4

8 of 27

The Risk Extrapolation (REx) solution:

The “robust approach” Leon was just talking about:

Our generalization, MM-REx:

A simpler variant, V-REx:

Good performance

Consistent performance
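
A minimal sketch of the three objectives named on this slide, written as functions of a vector of per-environment training risks. The function names, the use of the population variance, and the example numbers are assumptions for illustration; the formulas follow the REx paper's definitions as commonly stated (MM-REx maximizes over weights that sum to 1 with each weight at least `lambda_min`; V-REx adds `beta` times the variance of the risks to their sum).

```python
import numpy as np

def robust_risk(risks):
    """Robust optimization: the worst case over convex combinations is the max risk."""
    return float(np.max(np.asarray(risks, dtype=float)))

def mm_rex_risk(risks, lambda_min):
    """MM-REx: max over weights with sum 1 and every weight >= lambda_min.

    Closed form: (1 - m * lambda_min) * max_e R_e + lambda_min * sum_e R_e,
    valid when m * lambda_min <= 1. lambda_min = 0 recovers robust_risk.
    """
    risks = np.asarray(risks, dtype=float)
    m = risks.size
    return float((1.0 - m * lambda_min) * risks.max() + lambda_min * risks.sum())

def v_rex_risk(risks, beta):
    """V-REx: beta * Var(risks) + sum(risks), using the population variance (assumption)."""
    risks = np.asarray(risks, dtype=float)
    return float(beta * risks.var() + risks.sum())

# Hypothetical risks for two training environments.
risks = [0.2, 0.6]
print(robust_risk(risks))               # lambda_min = 0 case
print(mm_rex_risk(risks, lambda_min=-1.0))
print(v_rex_risk(risks, beta=10.0))
```

Both REx variants reward low risks ("good performance") and penalize differences between the risks ("consistent performance"); V-REx does so smoothly through the variance term.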

9 of 27

The Risk Extrapolation (REx) solution:

Convex combination

Affine combination
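
To make the convex vs. affine distinction concrete, here is a small sketch that enumerates the extreme points of the weight set {λ : Σλ = 1, λ_e ≥ `lambda_min`} and picks the worst-case combination of risks. The helper names and the two example risks are hypothetical. With `lambda_min = 0` the weights stay in [0, 1] (convex combination); with `lambda_min < 0` some weights go negative and others exceed 1 (affine combination), which is what extrapolates the risks.

```python
import numpy as np

def extreme_weights(m, lambda_min):
    """Vertices of the weight set: every env gets lambda_min, one env gets the remainder."""
    verts = np.full((m, m), lambda_min, dtype=float)
    np.fill_diagonal(verts, 1.0 - (m - 1) * lambda_min)
    return verts

def worst_case_combination(risks, lambda_min):
    """Return the vertex weights that maximize the weighted risk, and that value."""
    risks = np.asarray(risks, dtype=float)
    verts = extreme_weights(risks.size, lambda_min)
    values = verts @ risks          # a linear objective is maximized at a vertex
    best = int(np.argmax(values))
    return verts[best], float(values[best])

risks = [0.2, 0.6]  # hypothetical training risks
convex_w, convex_val = worst_case_combination(risks, lambda_min=0.0)
affine_w, affine_val = worst_case_combination(risks, lambda_min=-1.0)
print(convex_w, convex_val)
print(affine_w, affine_val)
```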

10 of 27

Why does REx work?

Goals of Risk Extrapolation:

  1. Lower average risk
  2. Similarity of risks across environments

Training environments

Test environment

14 of 27

Why does REx work?

Goals of Risk Extrapolation:

  • Lower average risk
  • Similarity of risks across environments

Training environments

Test environment

Someone should have a question around now…

Are these extrapolations correct?

Answer: No.

15 of 27

Counterexample: sine wave regression

Train1

Test

Train 2

Suppose my model predicts Y=sin(x)...

Risk > 0

Risk = 0

Risk > 0
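
A numerical sketch of this kind of counterexample. The exact setup of the slide is not recoverable from the text, so everything here is an assumption: environments are Y = sin(f·x) with f = 1 (train 1), f = 2 (test), f = 3 (train 2), and the hypothetical predictor matches the test frequency. The point is only that the test risk is not the average of the two training risks, even though the test environment sits "between" them.

```python
import numpy as np

# Evaluate on one full period so the sine waves are (numerically) orthogonal.
x = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)

def risk(model_freq, env_freq):
    """Mean squared error of sin(model_freq * x) on an environment with Y = sin(env_freq * x)."""
    return float(np.mean((np.sin(model_freq * x) - np.sin(env_freq * x)) ** 2))

model_freq = 2                      # hypothetical predictor: sin(2x)
risk_train1 = risk(model_freq, 1)   # > 0
risk_test = risk(model_freq, 2)     # = 0
risk_train2 = risk(model_freq, 3)   # > 0

# Linear interpolation of the training risks does not recover the test risk.
interpolated = 0.5 * (risk_train1 + risk_train2)
print(risk_train1, risk_test, risk_train2, interpolated)
```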

16 of 27

A Probabilistic view on REx:

Negative probabilities!?!?

...or positive probabilities with a different loss function?
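
One way to read this slide, sketched under assumed notation (each environment $e$ has a data distribution $P_e$, the model is $f_\theta$, and the loss $\ell$ is shared): a weighted sum of risks equals the risk under the weighted mixture of the environment distributions.

```latex
\sum_{e} \lambda_e R_e(\theta)
  = \sum_{e} \lambda_e \, \mathbb{E}_{(x,y) \sim P_e}\!\left[ \ell(f_\theta(x), y) \right]
  = \int \ell(f_\theta(x), y) \, \mathrm{d}\Big( \sum_{e} \lambda_e P_e \Big)(x, y),
  \qquad \sum_{e} \lambda_e = 1 .
```

With convex weights ($\lambda_e \ge 0$), $\sum_e \lambda_e P_e$ is a valid probability distribution. With affine weights, some $\lambda_e$ may be negative, so the "mixture" is only a signed measure and can assign negative mass to some regions.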

17 of 27

Results: CMNIST

18 of 27

Results: CMNIST

A note on methodology…

OOD is mostly about unknown unknowns ⇒ No tuning allowed on test distribution!

Solution:

  • Use development tasks for tuning.
  • “pre-register” evaluation experiments.

19 of 27

MM-REx vs. V-REx

Our generalization, MM-REx:

A simpler variant, V-REx:

Goals of Risk Extrapolation:

  1. Lower average risk
  2. Similarity of risks across environments

21 of 27

Results: VLCS and PACS

VLCS: V (VOC2007), L (LabelMe), C (Caltech), S (SUN09).

PACS: P (photo), A (art), C (cartoon), S (sketch).

23 of 27

Results: RL

Spurious feature construction:

  • Copy part of the original state
  • Add (environment-dependent) noise to the original state

Result:

  • ERM wants to use copied state
  • REx doesn’t
  • IRM fails for some reason

24 of 27

Structural equation models (from IRM paper)

Good:

Bad:

25 of 27

OMG more results! Financial prediction task

26 of 27

Why is IRM difficult to train?

Memorization minimizes IRM and REx penalties!

Timing is everything:

Hypothesis: To train IRM/REx:

  1. Learn predictive features
  2. Throw out spurious ones
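
A small sketch of both points on this slide. First, if the model memorizes the training data, every training risk is zero, so a variance penalty on the risks is zero too. Second, one possible way to act on the "timing" hypothesis is a step schedule for the penalty weight: keep it small while predictive features are learned, then raise it so spurious ones get thrown out. The schedule, the function names, and all numbers are hypothetical, not taken from the slides.

```python
import numpy as np

def v_rex_penalty(risks):
    """Variance of the per-environment training risks (population variance)."""
    return float(np.var(np.asarray(risks, dtype=float)))

def penalty_weight(step, anneal_steps, beta):
    """Hypothetical step schedule: weight 1.0 before anneal_steps, beta afterwards."""
    return 1.0 if step < anneal_steps else beta

# Memorization: zero risk in every training environment gives zero penalty.
memorized = v_rex_penalty([0.0, 0.0])

# Phase 1 (learn predictive features) vs. phase 2 (throw out spurious ones).
early = penalty_weight(step=10, anneal_steps=100, beta=1e4)
late = penalty_weight(step=500, anneal_steps=100, beta=1e4)
print(memorized, early, late)
```

If the penalty is switched on too late, the model may already have memorized the training set, at which point the penalty gives no signal.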

27 of 27

Remaining Questions/Directions:

  • Can REx deal with changing levels of label noise?
  • When is linearly extrapolating risks valid/sensible?
  • Is there a causal interpretation of REx?
  • Improving OOD evaluation
    • Better benchmarks
    • Better methodology
  • How relevant are spurious features in practice?

What are YOUR questions?