1 of 28

Transformer Feed-Forward Layers Are Key-Value Memories [1]

Nils Rethmeier

January, 2021

Key insights:

  • feed-forward (FF) layers in transformer LMs act as key-value memories
  • each key correlates with textual patterns in the training data
  • each value induces a distribution over the output vocabulary
  • lower layers favor shallow/syntactic patterns, higher layers favor ‘semantic’ patterns
  • moving upwards, the layers compose memories and increasingly refine the prediction
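As a minimal sketch of this key-value reading of an FF layer (my own illustration with assumed shapes, not code from [1]): the first FF weight matrix supplies the keys, the second supplies the values.

```python
import numpy as np

def ffn_as_memory(x, K, V):
    """Read a position-wise FF layer as a key-value memory (sketch).

    x: (d,)     hidden state of one token position
    K: (dm, d)  keys, i.e. the rows of the first FF weight matrix W1
    V: (dm, d)  values, i.e. the columns of the second FF weight matrix W2 (rows of W2^T)
    """
    m = np.maximum(x @ K.T, 0.0)  # memory coefficients: ReLU(x . k_i)
    return m @ V                  # output = coefficient-weighted sum of the values

# toy shapes: d = 8, dm = 4 * d (the usual 4x inner expansion)
rng = np.random.default_rng(0)
d, dm = 8, 32
x, K, V = rng.normal(size=d), rng.normal(size=(dm, d)), rng.normal(size=(dm, d))
print(ffn_as_memory(x, K, V).shape)  # -> (8,)
```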

2 of 28

3 of 28

[Figure: self-attention setup — 2 token embeddings ("cookie", "monster") are projected by query, key, and value weight matrices (linear layers pretrained via the SSL objective)]

4 of 28

Encoder attention (figure from https://medium.com/@b.terryjack/deep-learning-the-transformer-9ae5e9c5a190):

  • q = the current position-word vector in the input sequence → all q together make up Q
  • K = all the position-word vectors in the input sequence
  • V = all the position-word vectors in the input sequence

5 of 28

Scaled dot-product attention: scores are computed for every (q, k) combination, divided by √dk (a gradient stabilizer), and softmax-normalized into self-attention scores that weight V.

Decoder attention: Q, K, and V all come from the output sequence.
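For reference, a minimal NumPy sketch of single-head scaled dot-product attention (my own illustration; no masking or dropout):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention sketch.

    Q, K, V: (seq_len, d_k) matrices of position-word vectors.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scores for every (q, k) combo; sqrt(d_k) stabilizes gradients
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)  # softmax -> self-attention scores
    return weights @ V                # weighted sum of value vectors

# toy: 2 token embeddings ("cookie", "monster"), d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
print(scaled_dot_product_attention(X, X, X).shape)  # -> (2, 4)
```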

6 of 28

Multi-head attention: each self-attention (SA) head i has its own projection matrices WiQ, WiK, WiV (the subscript i indexes a single SA head).

7 of 28

[Figure: multi-head attention — the per-head outputs are concatenated and passed through a final linear layer]
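A hedged sketch of that concat-then-linear step, reusing `scaled_dot_product_attention` and the `numpy` import from the sketch above (all names and shapes here are my assumptions):

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Multi-head self-attention sketch: project per head, attend, concat, linear.

    X:             (seq_len, d_model) token representations
    W_q, W_k, W_v: lists with one (d_model, d_k) projection matrix per head
    W_o:           (num_heads * d_k, d_model) final linear layer applied to the concat
    """
    heads = [
        scaled_dot_product_attention(X @ Wq_i, X @ Wk_i, X @ Wv_i)
        for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, then linear
```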

8 of 28


9 of 28

Key patterns, value sub-vocabularies [1]

  • the transformer position-wise FFN layer [2]
  • the persistent-memory self-attention (which replaces the FFN) [3]

Idea: [3] showed that the FFN sublayer of [2] and the persistent memory vectors of [3] can learn the same function, and [3] is a key-value memory network. So is the transformer FFN studied in [1] a key-value memory?

[Figure: the point-wise FFN of [2] applied to the self-attention output z1 = x: linear W1 (4x the size of z1), ReLU + Dropout, linear W2 — shown next to [3]'s architecture, with panels "Transformer Self-attention" vs. "Memory Self-attention"]

[2] “Attention Is All You Need”, [3] “Augmenting Self-attention with Persistent Memory”
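A minimal sketch of the position-wise FFN of [2] as described in the figure (NumPy, my own illustration; the dropout handling is simplified):

```python
import numpy as np

def position_wise_ffn(z1, W1, b1, W2, b2, drop_p=0.0, rng=None):
    """Position-wise FFN sublayer of [2] (sketch): linear -> ReLU (+ dropout) -> linear.

    z1: (seq_len, d_model)       self-attention output (x in the slide's notation)
    W1: (d_model, 4 * d_model)   inner layer is 4x the size of z1
    W2: (4 * d_model, d_model)
    """
    h = np.maximum(z1 @ W1 + b1, 0.0)             # ReLU
    if drop_p > 0.0 and rng is not None:          # dropout mask (training only; rescaling omitted)
        h = h * (rng.random(h.shape) > drop_p)
    return h @ W2 + b2
```

Read as a memory, the columns of W1 act as keys and the rows of W2 as values (next slides).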


12 of 28

Key patterns, value sub-vocabularies

Question: is the FFN in [2] the same kind of key-value memory as [3]?

  • the values v1..vdm are distributions over output vocabulary words (topics)
  • the keys k1..kdm correlate with text patterns in the input; the corresponding value (e.g. v2) puts probability on the words that tend to follow those patterns
  • an input vector (e.g. x5) is multiplied by the keys k1..kdm to produce memory coefficients, e.g. m2 = 1.5 for value v2
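A small sketch of the memory-coefficient computation from the figure (my illustration; x5 corresponds to `x`, m2 to one entry of `m`):

```python
import numpy as np

def memory_coefficients(x, K):
    """Memory coefficients for one input vector (sketch).

    x: (d,)     input vector, e.g. x5 in the slide's example
    K: (dm, d)  keys k1..kdm as rows
    Returns m with m[i] = ReLU(x . k_{i+1}); e.g. m[1] = 1.5 would be the slide's m2 for value v2.
    """
    return np.maximum(K @ x, 0.0)

def memory_output(m, V):
    """The layer output is the coefficient-weighted sum of the values v1..vdm (V: (dm, d))."""
    return m @ V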


15 of 28

Key patterns, value sub-vocabularies

Approach (key analysis):

  • collect the text patterns that each key ki responds to
  • collect the training-set sentence prefixes S with the largest memory coefficients m for that key
  • ask humans to identify the text patterns shared by the prefixes in S

16 of 28

Key patterns, value sub-vocabularies

Setup:

  • they analyze a 16-layer WikiText-103 transformer LM (https://arxiv.org/abs/1809.10853)
  • 10 keys are sampled per layer (160 keys in total)
  • per key, the 25 prefixes xi with the highest memory coefficients ReLU(xi · ki) are collected
  • a pattern must be shared by at least 3 of the prefixes xi
  • results are compared across layers, from shallow to deeper layers
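A hedged sketch of the trigger-example retrieval (variable names are hypothetical; the prefix representations xi would come from running the LM over the training set):

```python
import numpy as np

def top_trigger_prefixes(prefix_reprs, prefixes, k_i, top_n=25):
    """Retrieve the top trigger prefixes for one key (sketch).

    prefix_reprs: (num_prefixes, d) hidden representations xi of training-set prefixes
    prefixes:     list of the corresponding prefix strings
    k_i:          (d,) one key vector of the FFN layer
    Returns the top_n prefixes with the largest memory coefficients ReLU(xi . k_i).
    """
    coeffs = np.maximum(prefix_reprs @ k_i, 0.0)
    order = np.argsort(-coeffs)[:top_n]
    return [(prefixes[j], float(coeffs[j])) for j in order]
```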

17 of 28

Key pattern results

Insight 1: keys ‘cluster’ patterns

  • human annotators found at least one pattern for every key k
  • 3.6 patterns per key on average
  • most of a key's top-25 prefixes belong to one of its patterns

Insight 2: lower layers are shallow, upper layers semantic

  • lower layers capture shallow patterns
    • the prefixes often just share their last word
  • higher layers capture more semantic patterns

18 of 28

[Figure: a per-layer breakdown of the identified patterns further confirms Insight 2]

19 of 28

Values = output vocabulary distributions

Approach (value analysis):

  • keep the ki key patterns and their trigger examples from before
  • convert each value vi to word probabilities pi = softmax(vi · E), where E is the output embedding matrix
  • get the top-ranked word v* = argmax(pi) for each memory dimension and layer
  • get the word w* that follows the top-1 trigger example (x*, ki), i.e. the prefix x* with the maximal memory coefficient m
  • measure the agreement v* = w*
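A minimal sketch of the value-to-vocabulary projection and the agreement check (my illustration; `vocab` and the helper names are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def value_top_prediction(v_i, E, vocab):
    """Project one value vector onto the output vocabulary (sketch).

    v_i:   (d,)             a value vector of the FFN layer
    E:     (vocab_size, d)  output embedding matrix
    vocab: list of vocabulary strings, len(vocab) == E.shape[0]
    """
    p_i = softmax(E @ v_i)                 # pi = softmax(vi . E)
    v_star = vocab[int(np.argmax(p_i))]    # top-ranked word v*
    return p_i, v_star

# agreement check: w_star is the word that follows the key's top-1 trigger prefix
def agrees(v_star, w_star):
    return v_star == w_star
```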


23 of 28

Values = output vocabulary distributions (results)

The agreement v* = w* is measured per layer (and plotted against |p| on the x-axis).

Insights:

3. in the higher layers, memory cells recall how to predict the next word: the agreement rate jumps in the upper layers
4. values v* that are more likely agree more with their key patterns k


25 of 28

Inference = composed memories

  • recall: upper layers are more semantic
  • no single memory predicts the output → memory composition is required to predict outputs
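A hedged sketch of how one could test whether a layer output is a genuine composition rather than a single memory's prediction (my illustration, not the paper's exact procedure):

```python
import numpy as np

def output_is_composition(m, V, E):
    """Check whether a layer output's top word matches any single memory's top word (sketch).

    m: (dm,)            memory coefficients for one input
    V: (dm, d)          value vectors
    E: (vocab_size, d)  output embedding matrix (argmax of E @ h equals argmax of softmax(E @ h))
    """
    y = m @ V                                        # composed layer output
    top_out = int(np.argmax(E @ y))                  # top word of the composition
    active = np.nonzero(m > 0)[0]                    # memories with non-zero coefficient
    top_vals = {int(np.argmax(E @ V[i])) for i in active}
    return top_out not in top_vals                   # True -> no single memory predicts the output
```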

26 of 28

Layer-wise prediction refinement

Residual connections r sequentially compose predictions to produce the final output o_last.

Insights:

5. layers refine predictions via the residual r
6. the hard decision on the output token is made in the upper layers
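A minimal sketch of probing the prediction after every layer through the residual stream (my illustration; it assumes the per-layer hidden states can be read off and projected with E):

```python
import numpy as np

def layer_top_predictions(residuals, ffn_outputs, E):
    """Probe the prediction after every layer via the residual stream (sketch).

    residuals:   list of (d,) residual vectors r entering each layer's FFN
    ffn_outputs: list of (d,) FFN (memory) outputs of each layer
    E:           (vocab_size, d) output embedding matrix
    Returns the top word id per layer, i.e. how the prediction is refined towards o_last.
    """
    tops = []
    for r, ffn in zip(residuals, ffn_outputs):
        o = r + ffn                        # the residual composes with the FFN output
        tops.append(int(np.argmax(E @ o)))
    return tops
```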

27 of 28

Residuals and FFN compose outputs

Measured: the % of layer outputs whose top prediction top(o_layer) matches the top prediction of:

  • the FFN alone
  • the residual alone
  • both (they agree)
  • neither of them (a true composition)

Insights:

7. the residual decides most predictions
8. the FFN's top prediction alone almost never determines the layer output
9. residual and FFN compose the remaining layer / model predictions
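A hedged sketch of this four-way breakdown for a single example (my illustration; the category names are mine):

```python
import numpy as np

def composition_breakdown(o_layer, r, ffn_out, E):
    """Classify one layer's top prediction as residual / FFN / agreement / composition (sketch).

    o_layer, r, ffn_out: (d,) layer output, residual input, and FFN output
    E: (vocab_size, d)   output embedding matrix
    """
    top = lambda h: int(np.argmax(E @ h))
    t_o, t_r, t_f = top(o_layer), top(r), top(ffn_out)
    if t_r == t_f == t_o:
        return "agreement"    # residual and FFN both already predict the output token
    if t_o == t_r:
        return "residual"     # the residual's prediction carries through
    if t_o == t_f:
        return "ffn"          # the FFN's prediction wins (rare)
    return "composition"      # neither alone matches -> composed prediction
```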

28 of 28


End