1 of 50

Pretrained Transformers as Universal Computation Engines (Lu et al., 2021)

Pravish Sainath

2 of 50

Directions - Compare Representations

Comparison of VVS (ventral visual stream) and CNN layer representations

3 of 50

How do pretrained transformers generalize to other modalities with minimal fine tuning?

4 of 50

Background

5 of 50

Transformer - Architecture Recap

6 of 50

Objective

An investigation of pretrained language models (transformer models trained to predict the next token):

  1. These models work well on other language tasks that they were not explicitly trained for
  2. Could there be a more generic computational mechanism learned by such networks?
  3. How can we analyze the patterns of computation performed in transformers while decoupling them from the effect of the learned representations?

7 of 50

Hypothesis

8 of 50

The self-attention layers of transformers pretrained on a data-rich modality are useful for arbitrary data sequences, enabling downstream transfer to different modalities.

Pretrained LMs generalize to other modalities

9 of 50

Are language models inherently capable of universal computation?

universal computation - the ability to learn representations for predictive learning across diverse modalities

10 of 50

Method

11 of 50

Tasks

To test universal computation

  • Numerical Computation
  • Image Classification
  • Homology

12 of 50

Tasks - Numerical Computation

Bit Memory

  • The model is shown 5 bitstrings, each of length 1000
  • Finally, the model is shown a masked version of one of the bitstrings
  • It is expected to produce the original bitstring
  • The bitstrings are broken up into sequences of length 50, so the models are fed 120 tokens of dimension 50
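As a concrete illustration, here is a minimal numpy sketch of how a Bit Memory example could be constructed and tokenized; the masking rate and the mask marker value are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def make_bit_memory_example(n_strings=5, length=1000, token_dim=50, rng=np.random):
    strings = rng.randint(0, 2, size=(n_strings, length))   # 5 bitstrings of length 1000
    target = strings[rng.randint(n_strings)]                 # the string to be recalled
    masked = target.copy()
    masked[rng.rand(length) < 0.5] = -1                      # assumed: mask half the bits with -1
    flat = np.concatenate([strings.reshape(-1), masked])     # 5 * 1000 + 1000 = 6000 bits
    tokens = flat.reshape(-1, token_dim)                     # 120 tokens of dimension 50
    return tokens, target

tokens, target = make_bit_memory_example()
print(tokens.shape)  # (120, 50)
```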

13 of 50

Tasks - Numerical Computation

Bit XOR

  • The model is shown 2 bitstrings, each of length 5
  • It is expected to produce the bitwise XOR of the two bitstrings
  • The models are fed 10 tokens of dimension 1 (one bit at a time)
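A similarly minimal sketch of a Bit XOR example, with the two bitstrings simply concatenated into 10 one-bit tokens (the exact input layout is an assumption).

```python
import numpy as np

def make_bit_xor_example(length=5, rng=np.random):
    a = rng.randint(0, 2, size=length)
    b = rng.randint(0, 2, size=length)
    tokens = np.concatenate([a, b]).reshape(2 * length, 1)   # 10 tokens of dimension 1
    target = a ^ b                                            # element-wise XOR is the label
    return tokens, target

tokens, target = make_bit_xor_example()
print(tokens.shape, target)   # (10, 1) and a length-5 XOR result
```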

14 of 50

Tasks - Numerical Computation

ListOps

  • The model is shown a sequence of list operations (MAX, MIN, etc.)
  • It is expected to parse the expression and produce the output
  • The models are fed 512 tokens of dimension 15 (one token at a time)

Example: Input = [MAX 4 3 [MIN 2 3 ] 1 0 ], Output = 4
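To make the task concrete, here is a small reference evaluator for ListOps-style expressions. The operator set (MAX, MIN, MED, and SM for sum modulo 10) follows the original ListOps dataset; this is an illustrative sketch, not the dataset's own parser.

```python
def eval_listops(tokens):
    """Evaluate a tokenized ListOps expression, e.g. "[MAX 4 3 [MIN 2 3 ] 1 0 ]".split()."""
    ops = {
        "MAX": max,
        "MIN": min,
        "MED": lambda xs: sorted(xs)[len(xs) // 2],
        "SM": lambda xs: sum(xs) % 10,           # sum modulo 10
    }
    stack = []
    for tok in tokens:
        if tok.startswith("["):                  # '[MAX' opens a sub-expression
            stack.append(ops[tok.lstrip("[")])
            stack.append([])
        elif tok == "]":                         # ']' closes it: apply the operator to its args
            args, op = stack.pop(), stack.pop()
            value = op(args)
            if not stack:
                return value
            stack[-1].append(value)
        else:
            stack[-1].append(int(tok))

print(eval_listops("[MAX 4 3 [MIN 2 3 ] 1 0 ]".split()))   # -> 4
```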

15 of 50

Tasks - Image Classification

CIFAR-10

  • The model is shown a sequence of 4x4 image patches
  • It is expected to produce the image category (0-9) as output
  • The models are fed 64 tokens of dimension 16 (one patch at a time)

CIFAR-10 LRA (Long Range Arena benchmark)

  • Similar to the above, but grayscale and flattened into 1-dim tokens (a longer sequence without much spatial inductive bias)
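A minimal sketch of turning an image into the patch-token sequence described above. Since a 32x32 image in 4x4 patches gives 64 tokens, the stated dimension of 16 corresponds to a flattened single-channel 4x4 patch, so this sketch assumes a grayscale (single-channel) view of the image.

```python
import numpy as np

def image_to_patch_tokens(img, patch=4):
    """img: (32, 32) single-channel array -> (64, 16) array of flattened 4x4 patches."""
    h, w = img.shape
    tokens = (img.reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)          # group the two patch-grid axes together
                 .reshape(-1, patch * patch))
    return tokens

img = np.random.rand(32, 32)                      # stand-in for a grayscale CIFAR-10 image
print(image_to_patch_tokens(img).shape)           # (64, 16)
```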

16 of 50

Tasks - Homology

Remote Homology Detection

  • The model is shown a sequence of amino acids (a protein)
  • It is expected to produce the fold prediction as output
  • The models are fed 1024 tokens of dimension 25 (one token at a time)
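A small sketch of encoding a protein for this task: one integer token per amino acid with a 25-symbol vocabulary, padded to 1024 positions. The exact symbol set and padding scheme here are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYXUBZO"          # 25 symbols (20 standard + ambiguity codes); assumed set
AA_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_protein(sequence, max_len=1024, pad_id=0):
    ids = [AA_TO_ID[aa] for aa in sequence[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))    # pad/truncate to 1024 tokens
                                                    # (a real tokenizer would reserve a separate pad id)

print(len(encode_protein("MKTAYIAKQR")))            # 1024
```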

17 of 50

Architecture

Pretrained GPT-2 model (Base: 768-dim embeddings, 12 layers)

  • Output layer (finetuned)
  • Input layer (finetuned)
  • Layer norm (finetuned)
  • Positional embeddings (finetuned)
  • Self-attention & feed-forward layers (frozen)
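A minimal sketch of this setup, assuming the HuggingFace `transformers` GPT-2 backbone; the parameter-name matching ("ln", "wpe") follows that library's naming, and the input/output projections are illustrative placeholders rather than the paper's exact code.

```python
import torch.nn as nn
from transformers import GPT2Model

class FPT(nn.Module):
    def __init__(self, input_dim, num_classes, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)        # base: 12 layers, 768-dim
        d = self.gpt2.config.n_embd
        self.input_proj = nn.Linear(input_dim, d)                 # finetuned: input layer
        self.output_proj = nn.Linear(d, num_classes)              # finetuned: output layer

        # Freeze everything, then re-enable only the layer norms ("ln") and positional
        # embeddings ("wpe"); self-attention and feed-forward weights stay frozen.
        for name, p in self.gpt2.named_parameters():
            p.requires_grad = ("ln" in name) or ("wpe" in name)

    def forward(self, x):                                         # x: (batch, seq, input_dim)
        h = self.gpt2(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.output_proj(h[:, -1])                         # e.g. read out from the last token

model = FPT(input_dim=16, num_classes=10)                         # e.g. CIFAR-10 patch tokens
print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```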

18 of 50

Operation inside a Transformer

19 of 50

Universality of Operation inside a Transformer

20 of 50

Frozen Pretrained Transformer (FPT)

21 of 50

Comparison

Performance of different architectures on these tasks:

  • FPT
  • Full Transformer
  • Full LSTM

22 of 50

Results

23 of 50

Performance comparison of FPT with others

24 of 50

1 Can PLMs transfer to different modalities?

  • FPT achieves performance comparable to the fully trained transformer
  • This indicates that the internal computation of a transformer is modality-agnostic when combined with learned input/output representations
  • Near-perfect (100%) performance on the bit tasks indicates the large memory capacity of FPT

25 of 50

2 Importance of the pretraining modality?

  • Disentangles the significance of pretraining vs. the transformer architecture itself
  • Pretraining on language is significantly better than most other pretraining schemes
  • A pretrained ViT fares slightly better on vision tasks, but worse on homology
  • Any pretraining is better than a randomly initialized transformer

26 of 50

3 Transformer vs LSTM Architecture

  • An LSTM is finetuned in the same way as FPT
  • Transformers (FPT) outperform LSTMs across all tasks
  • LSTMs augmented with architectural improvements (positional embeddings, residual connections, etc.) become comparable to transformers on some tasks
  • Demonstrates the power of the self-attention mechanism

27 of 50

4 Compute Efficiency over random initialization

  • Efficiency is measured as the number of gradient steps to converge for FPT vs. randomly initialized transformers
  • FPT converges faster
  • Language pretraining offers compute benefits for non-language tasks

28 of 50

5 Are Frozen attention layers modality-specific?

  • FPT attends to semantically meaningful patterns in the data
  • Visualizing the first-layer attention weights (softmax of the query-key dot product) helps understand this relationship
  • FPT yields interpretable attention patterns despite the self-attention layers themselves not being trained
  • The attention patterns on the bit tasks are more interpretable than on the other tasks
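A short sketch of how the first-layer attention maps can be extracted for this kind of visualization, again assuming the HuggingFace GPT-2 backbone; the random embeddings here are just stand-ins for real embedded task tokens.

```python
import torch
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
x = torch.randn(1, 10, gpt2.config.n_embd)               # stand-in for 10 embedded task tokens
out = gpt2(inputs_embeds=x, output_attentions=True)
first_layer = out.attentions[0]                           # (batch, heads, seq, seq)
print(first_layer[0, 0])                                  # softmax(QK^T / sqrt(d_head)) for head 0
```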

29 of 50

5 Are Frozen attention layers modality-specific?

30 of 50

5 Are Frozen attention layers modality-specific?

31 of 50

5 Are Frozen attention layers modality-specific?

32 of 50

6 Freezing the Transformer - overfitting / underfitting?

  • FPT underfits the data relative to full transformers => it can improve further by expanding model capacity
  • FPT provides generalizable task representations
  • Fully trained transformers tend to overfit in low-data regimes

33 of 50

7 Scaling of performance with model size

  • Most parameters are in the transformer (self-attention and feed-forward) layers
  • The full transformer exhibits more overfitting and training divergence as model size increases
  • For FPT, increasing model size stably increases model capacity
  • FPT is likely to scale well as we move towards larger models and higher-data regimes

34 of 50

8 Is performance better because of better init?

  • The layer-wise mean and standard deviation of the pretrained model are used to initialize a random transformer, to test whether better initialization statistics alone improve performance
  • Using pretrained statistics improves performance, but not across all tasks
  • Pretrained language models remain superior across tasks
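A sketch of this "statistics only" baseline, assuming the HuggingFace GPT-2 implementation: a freshly initialized model of the same architecture is redrawn layer by layer from a normal distribution matching each pretrained layer's mean and standard deviation, so only the initialization statistics are transferred.

```python
import torch
from transformers import GPT2Config, GPT2Model

pretrained = GPT2Model.from_pretrained("gpt2")
stats_only = GPT2Model(GPT2Config())                       # same architecture, random weights

with torch.no_grad():
    for (name, src), (_, dst) in zip(pretrained.named_parameters(),
                                     stats_only.named_parameters()):
        # Redraw each parameter tensor with the pretrained layer's mean and std.
        dst.normal_(mean=src.mean().item(), std=src.std().item())
```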

35 of 50

8 Is performance better because of better init?


36 of 50

9 Finetuning Transformer Output layer only?

  • Fix a randomly initialized input layer and freeze all parts of the model except the output layer (similar to reservoir computing / echo state networks)
  • Significant speedups, but notable performance differences are observed
  • Performance degrades significantly and the models also exhibit overfitting (likely due to the lack of regularization/dropout)
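A sketch of this reservoir-style ablation, with illustrative module names: a random, frozen input projection, a frozen GPT-2 backbone, and a single trained readout.

```python
import torch.nn as nn
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
input_proj = nn.Linear(16, gpt2.config.n_embd)             # random and left untrained
output_proj = nn.Linear(gpt2.config.n_embd, 10)            # the only trained module

for module in (gpt2, input_proj):
    for p in module.parameters():
        p.requires_grad = False                             # freeze backbone and input layer

trainable = list(output_proj.parameters())                  # readout weight + bias only
```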

37 of 50

10 Role of model depth in token mixing?

  • Tokens need to mix and form interesting representations useful for downstream tasks
  • Highlights the importance of transformer depth for generating representations that "mix" tokens
  • With fewer layers and random parameters, tokens are less likely to be mixed well; increasing the number of layers increases the chance of mixing

38 of 50

10 Finetuning Transformer Output layer only?

With finetuning layer norm

39 of 50

10 Finetuning Transformer Output layer only?

Without finetuning layer norm

40 of 50

11 Performance improves as more params trained?

  • In practical applications, it can be better to choose a more specialized finetuning scheme or to add more trainable parameters
  • Investigate additionally finetuning the self-attention and feed-forward layers, which were previously frozen
  • These are added to the list of finetuned parameters without changing the optimization or learning-rate scheme
  • Finetuning the feed-forward layers can improve performance, but finetuning the attention layers can lead to divergence (see the sketch below)
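A sketch of what this ablation amounts to on the frozen backbone, using HuggingFace GPT-2 parameter names: the feed-forward (".mlp.") blocks are unfrozen alongside the layer norms and positional embeddings, while self-attention stays frozen.

```python
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("gpt2")
for name, p in gpt2.named_parameters():
    # layer norms + positional embeddings (as in FPT) + feed-forward blocks
    p.requires_grad = ("ln" in name) or ("wpe" in name) or (".mlp." in name)
```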

41 of 50

11 Performance improves as more params trained?

42 of 50

12 Which model params to finetune?

  • Orthogonal initialization is important when the input parameters are not trained (see the sketch below)
  • The layer norm parameters are the most important to finetune
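As a small illustration of the first point, an untrained input projection could be given an orthogonal initialization like this (the layer shape is just an example):

```python
import torch.nn as nn

input_proj = nn.Linear(16, 768)                 # e.g. CIFAR-10 patch tokens -> GPT-2 width
nn.init.orthogonal_(input_proj.weight)          # orthogonal init for the frozen input layer
input_proj.weight.requires_grad = False         # left untrained in this ablation
input_proj.bias.requires_grad = False
```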

43 of 50

Discussion

44 of 50

Transformers in multimodal settings

  1. A single model could learn a variety of multimodal tasks with an attention architecture; current systems use distinct transformers to embed different modalities, e.g. transformers for multimodal predictive tasks over images and text, such as ViLBERT and CLIP
  2. OpenAI found that some neurons learned by CLIP are activated by a particular semantic concept, regardless of whether the input is text or an image
  3. FPT is similar in spirit to DALL-E, which uses a single transformer to embed both the image and text modalities, suggesting a "universal latent space" that projects any type of input into a single latent space
  4. Such a latent space would be useful for a model that could learn from many sources of supervision

45 of 50

Transformers in transfer settings

  • Many works address in-modality transfer, such as ViT and T5
  • CLIP showed that training on text in addition to images enables zero-shot classification by providing the downstream labels as text
  • Hernandez et al. (2021) provide a thorough investigation of transfer with language pretraining, notably showing transfer from English to Python
  • Pretraining and finetuning of transformer models, e.g. using adapter networks

46 of 50

Self Attention Layers as Optimization Steps

  • A single transformer self-attention block can be trained to perform an optimization step towards finding a stationary point, representing the solution to the task
  • The self-attention layer corresponds to a gradient step in a (modern) Hopfield network with a learning rate of 1; transformers are thus capable of storing and retrieving a large number of patterns via an implicit energy function
  • Similar to function overloading in programming, where the key-value pair plays the role of the function signature
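A tiny numpy sketch of this modern-Hopfield view: a single softmax-attention update over stored patterns retrieves the pattern closest to a noisy query (beta stands in for the attention scaling / inverse temperature).

```python
import numpy as np

def hopfield_retrieve(X, query, beta=8.0):
    """One attention-style update: X is (n_patterns, d), query is (d,)."""
    scores = beta * X @ query                      # dot-product similarities
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over stored patterns
    return weights @ X                             # convex combination ~ retrieved pattern

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(5, 64))                     # 5 stored bipolar patterns
noisy = X[2] * rng.choice([1.0, 1.0, 1.0, -1.0], size=64)     # corrupt ~25% of the bits
retrieved = np.sign(hopfield_retrieve(X, noisy))
print(np.mean(retrieved == X[2]))                  # ~1.0 => pattern recovered
```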

47 of 50

Self Attention Layers as Optimization Steps

Types of fixed points in Hopfield Net

Determined by how a pattern x_i is separated from the other patterns:

  1. A global fixed point: no separation of a pattern from the others
  2. A fixed point close to a single pattern: the pattern is well separated from the other patterns
  3. A metastable state: some patterns are similar to each other and well separated from all other vectors

48 of 50

Global Workspace Theory

  • Finetuning the input and output layers to perform generalizable computation is similar to the Global Workspace Theory (Baars, 1993)
  • GWT: there is a "blackboard" that different parts of the brain send data to; the frozen language model acts as the blackboard in this setting
  • Language might also be a natural choice of model for this blackboard

49 of 50

Reservoir Computing

  • ESN: a random RNN is frozen and only the output readout layer is trained. Advantage: very fast to train, as it is unnecessary to backpropagate through time
  • ESNs are recurrent, allowing the outputs of the random frozen network to modulate future inputs
  • In FPT, the input and positional embeddings are finetuned, which allows the inputs to the frozen network to adapt to a particular modality, i.e. a query to the frozen network is learned
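For contrast, a minimal numpy echo state network sketch: the random recurrent reservoir is never trained, and only a linear readout is fit (plain least squares here; real ESNs typically use ridge regression).

```python
import numpy as np

def run_reservoir(inputs, n_reservoir=200, spectral_radius=0.9, seed=0):
    rng = np.random.default_rng(seed)
    W_in = rng.normal(size=(n_reservoir, inputs.shape[1]))        # frozen input weights
    W = rng.normal(size=(n_reservoir, n_reservoir))               # frozen recurrent weights
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))   # scale for the echo-state property
    h, states = np.zeros(n_reservoir), []
    for x in inputs:                                              # the recurrence is never trained
        h = np.tanh(W @ h + W_in @ x)
        states.append(h)
    return np.array(states)

inputs = np.random.rand(500, 3)
targets = np.roll(inputs[:, 0], 1)                                # toy task: recall the previous input
H = run_reservoir(inputs)
W_out, *_ = np.linalg.lstsq(H, targets, rcond=None)               # train only the readout
print(np.mean((H @ W_out - targets) ** 2))                        # small training error
```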

50 of 50

Points of Discussion

  • General reasoning abilities of transformer pretrained language models
  • Unified representations in transformer pretrained language models
  • Is there something computationally special about language? Does this emergence of universal computation arise because a language model was used for pretraining, or would any sufficiently diverse sequence dataset yield similar results?
  • Language as a medium for cognition? Neurons in the brain encode generic patterns and can specialize