1 of 68

The Illustrated Transformer

Slide version of the article The Illustrated Transformer by Jay Alammar

Date: April 13, 2020

2 of 68

A High Level Look

Looking at the model as a single black box

In a machine translation application, it would take a sentence in one language, and output its translation in another.

3 of 68

A High Level Look

Popping open that Optimus Prime goodness,

we see an encoding component, a decoding component, and connections between them.

4 of 68

A High Level Look

The encoding component is a stack of encoders. The decoding component is a stack of decoders of the same number.

5 of 68

A High Level Look

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sublayers:

6 of 68

A High Level Look

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later.

7 of 68

A High Level Look

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

8 of 68

A High Level Look

The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

9 of 68

Bringing The Tensors Into The Picture

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.
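As a minimal sketch of that lookup, assuming a hypothetical two-word vocabulary and a toy embedding size of 4 (the model described here uses 512), with random numbers standing in for learned embeddings:

```python
import random

# Hypothetical toy setup: these names and sizes are illustrative only.
EMBED_DIM = 4
vocab = {"Thinking": 0, "Machines": 1}

random.seed(0)
# One row of EMBED_DIM numbers per vocabulary entry.
# In a trained model these values are learned, not random.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(words):
    """Return the embedding vector for each word, in order."""
    return [embedding_table[vocab[w]] for w in words]

x = embed(["Thinking", "Machines"])
print(len(x), len(x[0]))  # 2 words, each a vector of EMBED_DIM numbers
```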

10 of 68

Bringing The Tensors Into The Picture

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder that would be the word embeddings, but in other encoders it would be the output of the encoder directly below. The size of this list is a hyperparameter we can set; basically, it would be the length of the longest sentence in our training dataset.

11 of 68

Bringing The Tensors Into The Picture

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

12 of 68

Bringing The Tensors Into The Picture

Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
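The independence of the feed-forward paths can be sketched as follows. This assumes the two-linear-layer network with a ReLU in between that the original paper describes; the tiny weights and sizes are made up for illustration. Changing the vectors at other positions leaves the output at position 0 untouched, which is why the positions can be processed in parallel.

```python
def feed_forward(vec, w1, b1, w2, b2):
    """Position-wise network: linear -> ReLU -> linear, applied to one vector."""
    hidden = [
        max(0.0, sum(v * w for v, w in zip(vec, unit)) + b)
        for unit, b in zip(w1, b1)
    ]
    return [sum(h * w for h, w in zip(hidden, unit)) + b for unit, b in zip(w2, b2)]

# Illustrative weights: 2 inputs -> 3 hidden units -> 2 outputs.
w1 = [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]]
b1 = [0.0, 0.1, 0.0]
w2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b2 = [0.0, 0.0]

positions = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
outputs = [feed_forward(p, w1, b1, w2, b2) for p in positions]

# Same first position, different neighbours: the first output does not change.
altered = [positions[0], [9.0, 9.0], [-4.0, 2.0]]
outputs_altered = [feed_forward(p, w1, b1, w2, b2) for p in altered]
print(outputs[0] == outputs_altered[0])
```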

13 of 68

Now We’re Encoding! - An Example

An encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends the output upward to the next encoder.

14 of 68

Self-Attention at a High Level

Say the following sentence is an input sentence we want to translate:

“The animal didn't cross the street because it was too tired”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

15 of 68

Self-Attention at a High Level

16 of 68

Self-Attention at a High Level

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

17 of 68

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

18 of 68

Self-Attention in Detail

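The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors: a "query", a "key", and a "value". Multiplying x₁ by the W^Q weight matrix produces q₁, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.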
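A small sketch of these projections, using toy sizes (an embedding of size 4 and q/k/v vectors of size 3, rather than 512 and 64). The weight values below are invented for illustration; in the model they are learned during training.

```python
def project(x, w):
    """Multiply a row vector x (length d_model) by a matrix w (d_model x d_k)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

x1 = [1.0, 0.0, 1.0, 0.0]  # toy embedding for the first word

# Illustrative 4 x 3 weight matrices.
W_Q = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_K = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
W_V = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

q1 = project(x1, W_Q)  # [1.0, 0.0, 2.0]
k1 = project(x1, W_K)  # [0.0, 1.0, 1.0]
v1 = project(x1, W_V)  # [1.0, 2.0, 3.0]
print(q1, k1, v1)
```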

19 of 68

Self-Attention in Detail

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you read how attention is calculated in the following slides, you’ll know pretty much all you need to know about the role each of these vectors plays.

20 of 68

Self-Attention in Detail

The second step in calculating self-attention is to calculate a score.

Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

21 of 68

Self-Attention in Detail

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q₁ and k₁. The second score would be the dot product of q₁ and k₂.
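In code, the scoring step is just a pair of dot products. The query and key values here are toy numbers chosen for illustration:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q1 = [1.0, 0.0, 2.0]  # query for the word in position #1
k1 = [0.0, 1.0, 1.0]  # key for the word in position #1
k2 = [4.0, 4.0, 0.0]  # key for the word in position #2

scores = [dot(q1, k1), dot(q1, k2)]
print(scores)  # [2.0, 4.0]
```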

22 of 68

Self-Attention in Detail

The third and fourth steps are to divide the scores by 8, then pass the result through a softmax operation. The 8 is the square root of the dimension of the key vectors used in the paper (64); this scaling leads to more stable gradients. Other values are possible here, but this is the default. Softmax normalizes the scores so they’re all positive and add up to 1.
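A short sketch of the scaling and softmax steps. The raw scores 112 and 96 are illustrative inputs, not values computed elsewhere in these slides; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(values):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                # dimension of the key vectors
scores = [112.0, 96.0]  # illustrative raw scores for two words

scaled = [s / math.sqrt(d_k) for s in scores]  # [14.0, 12.0]
weights = softmax(scaled)                      # roughly [0.88, 0.12]
print(scaled, [round(w, 2) for w in weights])
```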

23 of 68

Self-Attention in Detail

This softmax score determines how much each word will be expressed at this position. The word at this position will often have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

24 of 68

Self-Attention in Detail

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

25 of 68

Self-Attention in Detail

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
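Steps five and six together amount to a weighted sum of the value vectors. A minimal sketch, with softmax weights and value vectors chosen for illustration:

```python
weights = [0.88, 0.12]  # illustrative softmax scores for two words
values = [
    [1.0, 2.0, 3.0],    # v1
    [2.0, 8.0, 0.0],    # v2
]

# Step 5: scale each value vector by its softmax score.
weighted = [[w * x for x in v] for w, v in zip(weights, values)]

# Step 6: add the weighted vectors element by element.
z1 = [sum(column) for column in zip(*weighted)]
print([round(x, 2) for x in z1])  # approximately [1.12, 2.72, 2.64]
```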

26 of 68

Self-Attention in Detail

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network.

In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

27 of 68

Self-Attention in Detail

Summary for Vector Calculation

Create three vectors from each of the encoder’s input vectors.
Calculate a score using dot product of a Query and Keys.
Divide the scores by 8 (√d_k, where d_k = 64 is the dimension of the key vectors).
Pass the result through a softmax operation.
Multiply each value vector by the softmax score.
Sum up the weighted value vectors.

28 of 68

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (W^Q, W^K, W^V).

Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure).
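The packing step can be sketched with plain lists of rows. The sizes mirror the boxes in the figure (embeddings of size 4, q/k/v vectors of size 3) rather than the real 512 and 64, and the weight values are invented for illustration:

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# One row per word in the input sentence.
X = [
    [1.0, 0.0, 1.0, 0.0],  # "Thinking"
    [0.0, 2.0, 0.0, 2.0],  # "Machines"
]

# Illustrative 4 x 3 weight matrices (learned in a real model).
W_Q = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
W_K = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
W_V = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

Q = matmul(X, W_Q)  # [[1.0, 0.0, 2.0], [2.0, 2.0, 2.0]]
K = matmul(X, W_K)  # [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0]]
V = matmul(X, W_V)  # [[1.0, 2.0, 3.0], [2.0, 8.0, 0.0]]
print(Q, K, V)
```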

29 of 68

Matrix Calculation of Self-Attention

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
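Written out, that formula is Z = softmax(Q·Kᵀ / √d_k)·V, with the softmax applied to each row. Here is a rough sketch of it in plain Python; the helper names and the toy Q, K, and V values are assumptions made for this example, not part of the original article.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    """Apply softmax to every row so each row is positive and sums to 1."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                 # step 2
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # step 3
    weights = softmax_rows(scaled)                                   # step 4
    return matmul(weights, V)                                        # steps 5 and 6

# Toy matrices: 2 words, q/k/v vectors of size 3.
Q = [[1.0, 0.0, 2.0], [2.0, 2.0, 2.0]]
K = [[0.0, 1.0, 1.0], [4.0, 4.0, 0.0]]
V = [[1.0, 2.0, 3.0], [2.0, 8.0, 0.0]]

Z = self_attention(Q, K, V)
print(Z)  # one output row per word, each a weighted mix of the rows of V
```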

30 of 68

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

It expands the model’s ability to focus on different positions.
It gives the attention layer multiple representation subspaces.

31 of 68

The Beast With Many Heads

It expands the model’s ability to focus on different positions.

Yes, in the example above, z₁ contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.

32 of 68

The Beast With Many Heads

It gives the attention layer multiple “representation subspaces”.

33 of 68

The Beast With Many Heads

As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder).

Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

34 of 68

The Beast With Many Heads

With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the W^Q/W^K/W^V matrices to produce Q/K/V matrices.

35 of 68

The Beast With Many Heads

If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

36 of 68

The Beast With Many Heads

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concatenate the matrices, then multiply the result by an additional weight matrix W^O.
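Putting the multi-head pieces together, a rough sketch might look like this. It keeps the eight heads but shrinks the other sizes (a model width of 16 and q/k/v vectors of size 2, instead of 512 and 64), and uses random matrices in place of trained weights; all helper names are assumptions for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def attention(Q, K, V):
    """Z = softmax(Q K^T / sqrt(d_k)) V, with softmax applied per row."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    Z = []
    for row in matmul(Q, K_t):
        scaled = [s / math.sqrt(d_k) for s in row]
        mx = max(scaled)
        exps = [math.exp(s - mx) for s in scaled]
        total = sum(exps)
        Z.append([e / total for e in exps])
    return matmul(Z, V)

def rand_matrix(rows, cols):
    """Random stand-in for a trained weight matrix."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N_HEADS, D_MODEL, D_K = 8, 16, 2  # toy sizes, except for the eight heads
X = rand_matrix(2, D_MODEL)       # two input words

heads = []
for _ in range(N_HEADS):
    # Each head has its own W^Q, W^K, W^V.
    W_Q, W_K, W_V = (rand_matrix(D_MODEL, D_K) for _ in range(3))
    heads.append(attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)))

# Concatenate the eight Z matrices side by side, one row per word.
Z_concat = [sum((head[i] for head in heads), []) for i in range(len(X))]

# Project back to a single matrix with one vector of size D_MODEL per word.
W_O = rand_matrix(N_HEADS * D_K, D_MODEL)
Z = matmul(Z_concat, W_O)
print(len(Z), len(Z[0]))  # 2 words, each a vector of size D_MODEL
```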

37 of 68

The Beast With Many Heads

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.

38 of 68

The Beast With Many Heads

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". In a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".

39 of 68

The Beast With Many Heads

If we add all the attention heads to the picture, however, things can be harder to interpret:

40 of 68