The Devil is in the Detail
Simple Tricks Improve Systematic Generalization of Transformers
Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber
EMNLP 2021
Systematic generalization
Probably one of the major obstacles on the path toward general AI
1+2 → 3
3*3 → 9
(1+1)*2 → 4
(1+2)*3 → ?
How do we measure it?
Train set: 1+2 → 3, 3*3 → 9, (1+1)*2 → 4
Test set: (1+2)*3 → ?
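The split above can be sketched as code; a minimal toy illustration of a systematic train/test split (helper names are ours, not from the paper):

```python
# Toy systematic-generalization split: the test item combines only
# primitives (+, *, parentheses) already seen in training, but in a
# novel composition never seen during training.
def evaluate(expr: str) -> int:
    return eval(expr)  # fine for these trusted toy expressions

train_set = {e: evaluate(e) for e in ["1+2", "3*3", "(1+1)*2"]}
test_set = {e: evaluate(e) for e in ["(1+2)*3"]}  # novel combination

# Every operator in the test set appears somewhere in training...
train_ops = {c for e in train_set for c in e if c in "+*()"}
test_ops = {c for e in test_set for c in e if c in "+*()"}
assert test_ops <= train_ops
# ...but the exact expressions are unseen (out-of-distribution).
assert not (set(test_set) & set(train_set))
```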
Existing methods
Revisiting the basic Transformers
So why do they perform so badly?
Do the current results reflect the full potential of NNs?
- basic model configurations
- training details, hyper-parameter tuning, …
We revisit the basic model and training configurations.
Are the current SoTA settings optimal?
Have all relevant, existing techniques been tested?
Underexplored Transformer augmentations
They are relevant for systematic generalization
Hypothesis 1
The layers should be shared (as in Universal Transformers)
[Figure: learned operations (+, *) are reused at different depths to compute (a+b)*c from inputs a, b, c]
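A minimal sketch of the weight-sharing idea behind Universal Transformers, with a toy linear+ReLU map standing in for a Transformer layer (all names illustrative, numpy only):

```python
import numpy as np

# Hypothesis 1 sketch: instead of N distinct layers, one layer's
# weights are applied N times, so the same learned "operation" can
# be reused at every depth.
rng = np.random.default_rng(0)
d = 8

def layer(x, W):
    return np.maximum(x @ W, 0.0)  # toy stand-in for a Transformer layer

# Standard Transformer: a separate weight matrix per layer.
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
# Universal-Transformer-style: one shared matrix reused at every depth.
W_shared = rng.normal(scale=0.1, size=(d, d))

x = rng.normal(size=(4, d))
y_standard = x
for W in Ws:
    y_standard = layer(y_standard, W)
y_shared = x
for _ in range(6):
    y_shared = layer(y_shared, W_shared)  # same weights, 6 iterations

# Sharing divides the layer parameter count by the number of layers.
n_params_standard = sum(W.size for W in Ws)
n_params_shared = W_shared.size
assert n_params_standard == 6 * n_params_shared
```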
Hypothesis 2
Use relative positional encodings
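A hedged sketch of a relative positional bias in the spirit of Shaw et al.-style encodings (the exact scheme in the paper may differ): attention logits depend only on the distance j − i between query and key, so behavior learned at short lengths transfers to longer sequences.

```python
import numpy as np

# One learned bias per (clipped) relative offset; added to the
# attention logits instead of encoding absolute positions.
max_rel = 4
rel_bias = np.random.default_rng(0).normal(size=2 * max_rel + 1)

def relative_bias_matrix(n):
    idx = np.arange(n)
    rel = np.clip(idx[None, :] - idx[:, None], -max_rel, max_rel)
    return rel_bias[rel + max_rel]  # (n, n) bias for attention logits

# The same diagonal structure appears at any length: entry (i, j)
# depends only on j - i, so longer test sequences reuse the biases
# trained on shorter ones.
B5, B9 = relative_bias_matrix(5), relative_bias_matrix(9)
assert np.allclose(np.diag(B5, 1), rel_bias[max_rel + 1])
```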
The EOS decision problem
From Newman et al., 2020
*A length cutoff of 26 is interesting because of certain biases in SCAN; see Newman et al., 2020 for more details.
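A toy sketch of the length-based split behind the EOS decision problem (the data below is synthetic; only the cutoff of 26 comes from the slide): training on short outputs means the model only ever sees the EOS token at early positions, biasing it to terminate too soon on longer test outputs.

```python
# Length-generalization split in the spirit of Newman et al., 2020:
# train on outputs up to a length cutoff, test on strictly longer ones.
CUTOFF = 26

examples = [("cmd%d" % n, ["TOK"] * n) for n in range(1, 41)]  # toy data
train = [(i, o) for i, o in examples if len(o) <= CUTOFF]
test = [(i, o) for i, o in examples if len(o) > CUTOFF]

# With <EOS> supervision, the model never sees <EOS> beyond position
# CUTOFF + 1 during training, which biases it to end sequences early
# on the longer test outputs.
max_train_eos_pos = max(len(o) + 1 for _, o in train)
assert max_train_eos_pos == CUTOFF + 1
assert all(len(o) > CUTOFF for _, o in test)
```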
Revise and improve the basics
- Bad correlation between IID validation loss and test accuracy
- Especially problematic for early stopping
IID performance of different models on different datasets (OOD performance in parentheses)
2. Early stopping using the IID validation is sub-optimal
Training would stop here if early stopping were used
3. Loss is not a good indicator of accuracy
CFQ MCD 1 dataset. Color: train iteration
3. Loss is not a good indicator of accuracy (cont'd)
[Figure: loss over training iterations, with good and bad regions annotated]
Validation loss is not a good indicator of accuracy
Test loss and accuracy on PCFG
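A minimal illustration of why checkpoint selection by validation loss can mislead when loss and accuracy diverge, as on PCFG above (the numbers are invented purely to show the selection logic, not taken from the paper):

```python
# When loss rises while accuracy keeps improving, picking the
# checkpoint with the lowest validation loss selects a worse model
# than picking the one with the highest sequence-level accuracy.
history = [
    # (iteration, val_loss, val_accuracy) -- made-up numbers
    (1000, 0.90, 0.20),
    (2000, 0.40, 0.55),
    (3000, 0.55, 0.70),  # loss rises, accuracy keeps improving
    (4000, 0.70, 0.80),
]

best_by_loss = min(history, key=lambda h: h[1])
best_by_acc = max(history, key=lambda h: h[2])

assert best_by_loss[0] == 2000  # early stopping on loss stops here...
assert best_by_acc[0] == 4000   # ...but accuracy peaks much later
```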
How can we fix this issue?
The scaling of embeddings
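A hedged numpy sketch of the embedding-scaling issue: with small-variance token embeddings, the unit-scale sinusoidal positional encoding dominates the sum; upscaling the tokens or downscaling the positions rebalances them. The formulas follow the standard Transformer convention and are not necessarily the paper's exact variants.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
tok = rng.normal(scale=1.0 / np.sqrt(d), size=(10, d))  # small-init embeddings
pos = np.sin(np.arange(10)[:, None] / 10.0 ** (np.arange(d)[None, :] / d))

naive = tok + pos             # positions drown out the tokens
teu = np.sqrt(d) * tok + pos  # upscale token embeddings
ped = tok + pos / np.sqrt(d)  # downscale positional encodings

def ratio(t, p):
    return np.std(t) / np.std(p)  # token-to-position magnitude ratio

# Both fixes bring the token signal to a comparable scale with positions.
assert ratio(tok, pos) < 0.7                # tokens much smaller than positions
assert ratio(np.sqrt(d) * tok, pos) > 1.0   # upscaling rebalances
```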
Putting it together (SCAN)
IN: walk twice after look opposite left OUT: I_TURN_LEFT I_TURN_LEFT I_LOOK I_WALK I_WALK
IN: run twice after look OUT: I_LOOK I_RUN I_RUN
IN: jump right and jump right OUT: I_TURN_RIGHT I_JUMP I_TURN_RIGHT I_JUMP
…
Putting it together (CFQ)
"Did Debora Caprioglio marry The Night Heaven Fell's German writer"
"SELECT count(*) WHERE { ?x0 ns:film.writer.film ns:m.02x9q6y . ?x0 ns:people.person.nationality ns:m.0345h . FILTER ( ns:m.02qj61v != ?x0 ) . ns:m.02qj61v ns:people.person.spouse_s/ns:people.marriage.spouse|ns:fictional_universe.fictional_character.married_to/ns:fictional_universe.marriage_of_fictional_characters.spouses ?x0 }",
Putting it together (PCFG)
IN: echo shift remove_second K15 K16 T9 , F16 A13 A2 Y6 OUT: K16 T9 K15 K15
IN: copy remove_second shift R8 N18 H14 D8 D3 , R5 R17 P8 R12 B4 OUT: N18 H14 D8 D3 R8
IN: reverse shift reverse X14 L16 O6 G3 OUT: G3 X14 L16 O6
Putting it together (COGS)
IN: Emma cleaned the boy .
OUT: * boy ( x _ 3 ) ; clean . agent ( x _ 1 , Emma ) AND clean . theme ( x _ 1 , x _ 3 )
IN: A dog scoffed .
OUT: dog ( x _ 1 ) AND scoff . agent ( x _ 2 , x _ 1 )
Putting it together (Mathematics dataset)
IN: Subtract -0.01 from -0.055165. OUT: -0.045165
IN: What is the distance between -1296 and 4? OUT: 1300
IN: What is the tens digit of 88? OUT: 8
IN: What is the thousands digit of 1814? OUT: 1
Concluding remarks
Thank you for your attention!
Please stay tuned for our future work on systematic generalization with Transformers.
More to come soon...
github.com/robertcsordas/transformer_generalization