Out-of-Distribution Generalization via Risk Extrapolation (REx)
Notes on OOD reading group presentation
The problem of spurious features
Real world example:
Background=grass vs. Background=water
Synthetic example:
Color=green
Vs.
Color=red
These settings define different training environments.
The Invariant Risk Minimization (IRM) solution:
Split the model into a feature extractor and a classifier on top.
Objective = ERM term + invariance regularizer (the classifier must be simultaneously optimal in every training environment).
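The IRMv1 regularizer can be sketched for squared loss with a scalar "dummy" classifier fixed at w = 1; the function and variable names below are mine, and the gradient is computed analytically rather than with autograd.

```python
import numpy as np

def irmv1_penalty(f, y):
    """Sketch of the IRMv1 penalty: squared gradient of the
    per-environment risk R_e(w) = mean((w*f - y)**2) at w = 1.

    dR/dw = mean(2*(w*f - y)*f), evaluated here at w = 1.
    """
    grad = np.mean(2.0 * (f - y) * f)
    return float(grad ** 2)
```

When the features already predict the targets perfectly (f = y), the gradient term vanishes and the penalty is zero; any mismatch makes w = 1 suboptimal and the penalty positive.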
Leon Bottou, in a talk about IRM:
Why we need IRM
Training environments:
P1, P2, P3, P4
The Risk Extrapolation (REx) solution:
The “robust approach” Leon was just talking about:
Our generalization, MM-REx:
A simpler variant, V-REx:
Good performance
Consistent performance
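The simpler variant can be sketched in a few lines (naming is mine): V-REx adds the variance of the per-environment risks to the usual sum of risks, trading off good average performance against consistent performance across environments.

```python
import numpy as np

def v_rex_loss(risks, beta=10.0):
    """Sketch of the V-REx objective: sum of per-environment empirical
    risks plus beta times the variance of those risks."""
    risks = np.asarray(risks, dtype=float)
    return float(risks.sum() + beta * risks.var())
```

Equal risks across environments incur zero penalty, so the penalty pushes toward the "consistent performance" goal while the sum term keeps performance good.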
The Risk Extrapolation (REx) solution:
Robust optimization uses a convex combination of training risks; MM-REx uses an affine combination, allowing negative weights to extrapolate beyond the convex hull.
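The affine combination has a closed form worth writing out (a sketch under my own naming): maximizing sum(w_e * R_e) over weights that sum to 1 with each w_e >= -lam_min puts weight 1 + (m - 1)*lam_min on the worst risk and -lam_min on each of the others.

```python
import numpy as np

def mm_rex_objective(risks, lam_min=1.0):
    """Closed form of the max over affine combinations of m training
    risks, with weights summing to 1 and lower-bounded by -lam_min."""
    risks = np.asarray(risks, dtype=float)
    m = len(risks)
    return float((1 + m * lam_min) * risks.max() - lam_min * risks.sum())
```

Setting lam_min = 0 recovers the convex (robust optimization) case, i.e. just the worst-case training risk; larger lam_min extrapolates further past the convex hull.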
Why does REx work?
Goals of Risk Extrapolation:
1) Reduce training risks; 2) increase the similarity of risks across training environments.
Training environments
Test environment
Someone should have a question around now….
Are these extrapolations correct?
Answer: No.
Counterexample: sine wave regression
Suppose my model predicts Y = sin(x)…
Train 1: Risk > 0
Test: Risk = 0
Train 2: Risk > 0
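A toy version of this counterexample, with numbers of my own choosing: the model predicts sin(x), the training targets are offset from sin(x) by ±0.5, and the test targets are exactly sin(x).

```python
import numpy as np

def risk(x, y):
    # MSE of the fixed model y_hat = sin(x) against targets y.
    return float(np.mean((np.sin(x) - y) ** 2))

x1 = np.linspace(-4.0, -2.0, 100); y1 = np.sin(x1) + 0.5   # Train 1
x2 = np.linspace(2.0, 4.0, 100);   y2 = np.sin(x2) - 0.5   # Train 2
xt = np.linspace(-1.0, 1.0, 100);  yt = np.sin(xt)         # Test

r1, r2, rt = risk(x1, y1), risk(x2, y2), risk(xt, yt)
# r1 == r2 == 0.25 while rt == 0: every affine combination
# a*r1 + (1 - a)*r2 equals 0.25, so no extrapolation of the
# training risks predicts the zero test risk.
```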
A Probabilistic view on REx:
Negative probabilities!?!?
...or positive probabilities with a different loss function?
Results: CMNIST
A note on methodology…
OOD is mostly about unknown unknowns ⇒ No tuning allowed on test distribution!
Solution:
MM-REx vs. V-REx
Our generalization, MM-REx:
A simpler variant, V-REx:
Results: VLCS and PACS
VLCS: V(VOC2007), L(LabelMe), C(Caltech), S(SUN09).
PACS: P(photo), A(art), C(cartoon), S(sketch).
Results: RL
Spurious feature construction:
Result: (figure)
Structural equation models (from IRM paper)
Good:
Bad:
OMG more results! Financial prediction task
Why is IRM difficult to train?
Memorization minimizes IRM and REx penalties!
Timing is everything:
Hypothesis: To train IRM/REx: 1) learn predictive features, 2) throw out the spurious ones.
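The memorization point and the two-phase hypothesis can be sketched together; the schedule numbers below are illustrative assumptions of mine, not values from the paper.

```python
import numpy as np

def v_rex_penalty(risks):
    # Variance of per-environment risks (the V-REx penalty term).
    # A memorizing model drives every training risk to ~0, so this
    # penalty is also ~0 -- the penalty alone cannot rule it out.
    return float(np.var(np.asarray(risks, dtype=float)))

def penalty_weight(step, warmup_steps=500, beta=100.0):
    # Hypothetical warmup schedule: near-plain ERM while predictive
    # features are learned, then a large penalty weight to throw out
    # the spurious ones.
    return 1.0 if step < warmup_steps else beta
```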
Remaining Questions/Directions:
What are YOUR questions?