Pretrained Transformers as Universal Computation Engines (Lu et al., 2021)
Pravish Sainath
Directions - Compare Representations
Comparison of ventral visual stream (VVS) and CNN layer representations
How do pretrained transformers generalize to other modalities with minimal fine-tuning?
Background
Transformer - Architecture Recap
Objective
An investigation of pretrained language models (transformer models that predict the next token)
Hypothesis
The self-attention layers of transformers pretrained on a data-rich modality are useful for arbitrary data sequences, enabling downstream transfer to different modalities.
Pretrained LMs generalize to other modalities
Are language models inherently capable of universal computation?
Universal computation - the ability to learn representations for predictive learning across diverse modalities
Method
Tasks
To test universal computation
Tasks - Numerical Computation
Bit Memory
Tasks - Numerical Computation
Bit XOR
Tasks - Numerical Computation
ListOps
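The bit tasks are synthetic sequence problems: Bit Memory asks the model to reconstruct a previously shown bitstring from a partially masked copy, and Bit XOR asks for the elementwise XOR of two bitstrings. A minimal data-generation sketch in NumPy (the string lengths and masking probability here are illustrative and smaller than the paper's; function names are mine):

```python
import numpy as np

def bit_xor_batch(batch_size=32, n_bits=5, seed=0):
    """Bit XOR: the input is two bitstrings; the target is their elementwise XOR."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=(batch_size, n_bits))
    b = rng.integers(0, 2, size=(batch_size, n_bits))
    x = np.concatenate([a, b], axis=1)   # sequence fed to the model
    y = np.bitwise_xor(a, b)             # per-position target
    return x, y

def bit_memory_batch(batch_size=32, n_strings=5, n_bits=100, mask_prob=0.5, seed=0):
    """Bit Memory: show n_strings bitstrings, then a masked copy of one of them;
    the target is the original (unmasked) string."""
    rng = np.random.default_rng(seed)
    strings = rng.integers(0, 2, size=(batch_size, n_strings, n_bits))
    idx = rng.integers(0, n_strings, size=batch_size)
    target = strings[np.arange(batch_size), idx]        # string to recover
    mask = rng.random((batch_size, n_bits)) < mask_prob
    query = np.where(mask, -1, target)                  # -1 marks masked bits
    x = np.concatenate([strings.reshape(batch_size, -1), query], axis=1)
    return x, target
```

ListOps is the remaining numerical task: prefix-notation expressions over operators such as MAX, MIN, and MEDIAN that the model must evaluate to a single digit.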
Tasks - Image Classification
CIFAR-10
CIFAR-10 LRA (Long Range Arena Benchmark)
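Before the frozen language model can process images, each image has to be flattened into a token sequence. A minimal tokenization sketch, assuming NumPy: 4x4 image patches as tokens for the plain CIFAR-10 setting (as in the paper), and one grayscale pixel per token for the LRA-style setting (the function names are mine):

```python
import numpy as np

def patches_4x4(img):
    """Tokenize a 32x32x3 CIFAR-10 image into 64 patch tokens of dim 4*4*3 = 48."""
    h, w, c = img.shape                    # 32, 32, 3
    p = 4
    patches = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * c)  # (64, 48)

def pixel_tokens_gray(img):
    """LRA-style tokenization: grayscale the image and emit one token per pixel,
    giving a length-1024 sequence."""
    gray = img.mean(axis=-1)
    return gray.reshape(-1, 1)             # (1024, 1)
```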
Tasks - Homology
Remote Homology Detection
Architecture
Pretrained GPT-2 Model (Base): 768-dim embeddings, 12 layers
Operation inside a Transformer
Universality of Operation inside a Transformer
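Concretely, the operation applied at every layer (and to every modality) is multi-head self-attention followed by a position-wise feedforward block; sketched in standard notation for a single head, omitting residual connections and layer norm:

```latex
% Self-attention over an input sequence X \in \mathbb{R}^{n \times d}:
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,
\qquad
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```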
Frozen Pretrained Transformer (FPT)
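A minimal sketch of the FPT setup, assuming PyTorch and the Hugging Face GPT-2 implementation: the self-attention and feedforward blocks stay frozen, and only the layer norms, positional embeddings, and the small modality-specific input and output layers are trained. The class and attribute names are mine:

```python
import torch.nn as nn
from transformers import GPT2Model

class FrozenPretrainedTransformer(nn.Module):
    """GPT-2 with frozen self-attention/feedforward blocks; only the input
    projection, output head, layer norms, and positional embeddings train."""

    def __init__(self, input_dim, num_classes, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)  # base: 12 layers, 768-dim
        hidden = self.gpt2.config.n_embd

        # Freeze everything, then re-enable the groups finetuned in the paper:
        # layer norms ("ln_*") and positional embeddings ("wpe").
        for name, param in self.gpt2.named_parameters():
            param.requires_grad = ("ln" in name) or ("wpe" in name)

        self.input_proj = nn.Linear(input_dim, hidden)      # trained, per modality
        self.output_proj = nn.Linear(hidden, num_classes)   # trained, per task

    def forward(self, x):
        # x: (batch, seq_len, input_dim), already tokenized for the modality
        h = self.gpt2(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_proj(h)  # per-position logits; a task head typically
                                    # reads out one position

# e.g. fpt = FrozenPretrainedTransformer(input_dim=48, num_classes=10)  # CIFAR-10 patches
```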
Comparison
Performance of different architectures on these tasks
Results
Performance comparison of FPT with other architectures
1 Can PLMs transfer to different modalities?
2 Importance of the pretraining modality?
3 Transformer vs LSTM Architecture
4 Compute Efficiency over random initialization
5 Are Frozen attention layers modality-specific?
6 Freezing the Transformer - overfitting or underfitting?
7 Scaling of performance with model size
8 Is performance better because of better init?
9 Finetuning Transformer Output layer only?
10 Finetuning Transformer Output layer only?
With finetuning layer norm
Without finetuning layer norm
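The difference between these two settings is just which parameter groups are handed to the optimizer. A small helper sketch, reusing the hypothetical FrozenPretrainedTransformer class from the earlier sketch:

```python
import torch

def output_only_parameters(fpt, finetune_layer_norm=True):
    """Parameters for the output-only ablation: always the output head,
    optionally also the GPT-2 layer-norm parameters."""
    params = list(fpt.output_proj.parameters())
    if finetune_layer_norm:
        params += [p for n, p in fpt.gpt2.named_parameters() if "ln" in n]
    return params

# optimizer = torch.optim.Adam(output_only_parameters(fpt, finetune_layer_norm=False))
```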
11 Does performance improve as more params are trained?
12 Which model params to finetune?
Discussion
Transformers in multimodal settings
Transformers in transfer settings
Self Attention Layers as Optimization Steps
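The link drawn here is the one from modern Hopfield networks (Ramsauer et al., 2020): one step of the network's retrieval dynamics has the same form as the self-attention operation, so each attention layer can be read as a single optimization (retrieval) step on the Hopfield energy. Sketched in that paper's notation:

```latex
% One retrieval step for a state (query) \xi given stored patterns
% X = (x_1, \dots, x_N) and inverse temperature \beta:
\xi^{\text{new}} = X \,\operatorname{softmax}\!\left(\beta\, X^{\top} \xi\right)
% Applied to all queries at once with \beta = 1/\sqrt{d_k}, this matches
% \operatorname{softmax}\!\left(Q K^{\top}/\sqrt{d_k}\right) V up to the learned
% projections W_Q, W_K, W_V.
```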
Types of fixed points in a Hopfield net
Determined by how a pattern x_i is separated from the other patterns:
Fixed points close to a single, well-separated stored pattern
Metastable states averaging over a subset of similar patterns
A global fixed point averaging over all patterns
Global Workspace Theory
Reservoir Computing
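Reservoir computing is the closest classical analogue of FPT: a fixed, randomly initialized recurrent network (the reservoir) produces features, and only a linear readout is trained. A minimal echo-state-network sketch, assuming NumPy; the reservoir size, scaling, and ridge readout are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed random reservoir (analogous to the frozen transformer body).
n_in, n_res = 1, 200
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep spectral radius < 1

def reservoir_states(u):
    """Run an input sequence u of shape (T, n_in) through the fixed reservoir."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W_in @ u_t + W @ x)
        states.append(x)
    return np.stack(states)

def train_readout(states, targets, ridge=1e-3):
    """Only the linear readout is trained (here via ridge regression)."""
    A = states.T @ states + ridge * np.eye(n_res)
    return np.linalg.solve(A, states.T @ targets)

# Usage: W_out = train_readout(reservoir_states(u_train), y_train)
```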
Points of Discussion