
Transformers

Dr. Dinesh Kumar Vishwakarma

PROFESSOR, DEPARTMENT OF INFORMATION TECHNOLOGY

DELHI TECHNOLOGICAL UNIVERSITY, DELHI.

Webpage: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

Email: dinesh@dtu.ac.in


Introduction

    • A neural network architecture consisting of an encoder and a decoder, introduced in 2017 by Vaswani et al.
    • Designed to overcome the limitations of RNNs and LSTMs on NLP tasks, e.g. machine translation and text generation.
    • Based on the self-attention mechanism.
    • Unlike RNNs, which process input sequentially, the Transformer can process the entire input sequence at once, making it faster and more efficient.


Overview of Transformer Model


Embedding


An embedding is a numerical representation of a word (or other entity) in a high-dimensional space, where words with similar meanings lie close to each other.
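A minimal sketch of this idea, with made-up vectors (not values from any trained model): closeness in the embedding space can be measured with cosine similarity.

import numpy as np

# Toy 4-dimensional embeddings; the numbers are invented for illustration.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "apple": np.array([0.1, 0.2, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; closer to 1 means more similar.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # higher
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower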


Embedding…


Problem: where do we put the word "apple"? Its meaning depends on context (the fruit or the technology company), so a single fixed embedding cannot place it well for every sentence.


Attention


Attention captures the context.


Attention…

    • What about the other words?


Multi-Head Attention

    • Is one embedding enough?

    • Which embedding is good?
    • Creating multiple embeddings takes a lot of time and work.
    • Solution: build new embeddings by modifying the existing ones.


Linear Transformation

    • Get a new embedding from an existing one by multiplying it with a weight matrix (see the sketch below the figure).


[Figure: candidate transformed embeddings rated Bad, Okay, and Good]
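A minimal sketch of this step, with made-up numbers: the new embedding is just the old one transformed by a weight matrix.

import numpy as np

# A linear transformation: multiply an existing embedding by a weight
# matrix W to derive a new embedding. All values here are invented.
W = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])

old_embedding = np.array([0.9, 0.1, 0.3])  # hypothetical existing embedding
new_embedding = W @ old_embedding          # new embedding from the old one
print(new_embedding)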


Score

    • Good embeddings get a high score.


Multi-head Attention


Why do we need Transformers?

    • Problems with RNNs:
      • Vanishing gradients.
      • Sequential processing of data.

    • Problems with LSTMs:
      • Slow to train.
      • Words are fed in sequentially, and outputs are also generated sequentially.
      • Hard to parallelize the processing of sentences, since they are handled word by word; moreover, long- and short-range dependencies are not modelled explicitly.


Applications


Sequential data processing

    • Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning.
    • A sequence-to-sequence model takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. A trained model works like this:


Neural Machine Translation

    • In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words:


Encoder and Decoder

    • The model is composed of an encoder and a decoder.
    • The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.


Encoder and Decoder…

    • The same applies to machine translation.
    • In machine translation, the context is a vector (an array of numbers, basically), and the encoder and decoder both tend to be recurrent neural networks.


Context

    • The context is a vector of floats. In the visualization, the vector is shown in color, with brighter colors assigned to cells with higher values.

    • You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real-world applications the context vector would have a size like 256, 512, or 1024.


Setting up the RNN: Word embeddings

    • An RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence) and a hidden state.
    • Each word is represented by a vector, and "word embeddings" are used for this. They capture a lot of the meaning/semantic information of words (e.g. king − man + woman ≈ queen); a toy illustration follows this list.
    • We can use pre-trained embeddings or train our own embeddings on our dataset. Embedding vectors of size 200 or 300 are typical; a vector of size four is shown for simplicity.
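A constructed toy illustration of that arithmetic: the 2-d vectors below are chosen by hand so the analogy holds exactly, whereas real embeddings learn such structure from data.

import numpy as np

# Hand-crafted 2-d "embeddings": axis 0 encodes royalty, axis 1 maleness.
king  = np.array([1.0, 1.0])
man   = np.array([0.0, 1.0])
woman = np.array([0.0, 0.0])
queen = np.array([1.0, 0.0])

result = king - man + woman
print(result, np.allclose(result, queen))  # [1. 0.] True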


Setting up the RNN…

    • The next RNN step takes the second input vector and hidden state #1 to create the output of that time step. Later in these slides, an animation like this is used to describe the vectors inside a neural machine translation model.


RNN as Encoder and Decoder

    • In this visualization, each pulse for the encoder or decoder is that RNN processing its inputs and generating an output for that time step.
    • Since the encoder and decoder are both RNNs, at each time step one of the RNNs does some processing and updates its hidden state based on its current input and the previous inputs it has seen.
    • Looking at the hidden states of the encoder, notice how the last hidden state is actually the context we pass along to the decoder.


RNN unrolled

    • The decoder also maintains a hidden state that it passes from one time step to the next.
    • To visualize a sequence-to-sequence model, an "unrolled" view is used: instead of showing the one decoder, a copy of it is shown for each time step. This makes the static graphics describing these models easier to understand, since the inputs and outputs of each time step can be seen.


RNN with Attention

    • The context vector turned out to be a bottleneck for these models: it made it challenging for them to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called "attention", which greatly improved the quality of machine translation systems.
    • Attention allows the model to focus on the relevant parts of the input sequence as needed.
    • At time step 7, for example, the attention mechanism enables the decoder to focus on the word "étudiant" ("student" in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.


RNN with Attention…

    • Looking at attention models at a high level of abstraction, an attention model differs from a classic seq2seq model in two main ways:
      • First, the encoder passes a lot more data to the decoder. Instead of passing only the last hidden state of the encoding stage, the encoder passes all of its hidden states to the decoder.
      • Second, an attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following (a sketch follows this list):
        • Look at the set of encoder hidden states it received; each encoder hidden state is most associated with a certain word in the input sentence.
        • Give each hidden state a score.
        • Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores and drowning out hidden states with low scores.
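A minimal sketch of these three sub-steps with made-up numbers; dot-product scoring is used here as one common scoring choice.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical encoder hidden states (3 input words, hidden size 4)
# and one decoder hidden state; all values are invented.
encoder_states = np.array([[0.1, 0.3, 0.2, 0.4],
                           [0.5, 0.1, 0.6, 0.2],
                           [0.9, 0.7, 0.1, 0.3]])
decoder_state = np.array([0.4, 0.2, 0.8, 0.1])

scores = encoder_states @ decoder_state   # one score per hidden state
weights = softmax(scores)                 # softmaxed scores
context = weights @ encoder_states        # weighted sum of hidden states
print(weights, context)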


Attention at Time step 4

    • This scoring exercise is done at each time step on the decoder side.


RNN with Attention (Encoder/Decoder)

    • Let us now bring the whole thing together and look at how the attention process works:

      • The attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state.
      • The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
      • Attention step: we use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
      • We concatenate h4 and C4 into one vector.
      • We pass this vector through a feedforward neural network (one trained jointly with the model).
      • The output of the feedforward neural network indicates the output word of this time step.
      • Repeat for the next time steps. (A sketch of one such decoder step follows this list.)
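A minimal sketch of one decoder time step, under made-up sizes and random parameters: W_ff is a hypothetical single-layer stand-in for the jointly trained feedforward network, and dot-product scoring is assumed for the attention step.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Invented sizes: 5 encoder hidden states of size 4, a toy vocabulary of 8.
hidden, vocab = 4, 8
encoder_states = rng.normal(size=(5, hidden))
h4 = rng.normal(size=hidden)                  # new decoder hidden state
W_ff = rng.normal(size=(2 * hidden, vocab))   # stand-in feedforward weights

# Attention step: score, softmax, weighted sum -> context vector C4.
c4 = softmax(encoder_states @ h4) @ encoder_states

# Concatenate h4 and C4, pass through the feedforward layer, and pick
# the highest-scoring vocabulary entry as this time step's output word.
logits = np.concatenate([h4, c4]) @ W_ff
print(int(np.argmax(logits)))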


RNN with Attention (Encoder/Decoder)…

    • It can also be visualized another way:


Visualization of attention

    • You can see how the model paid attention correctly when outputting "European Economic Area". In French, the order of these words is reversed ("européenne économique zone") compared to English. Every other word in the sentence is in a similar order.


Transformer


Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).



A High Level Look

    • In a machine translation application, the Transformer takes a sentence in one language and outputs its translation in another, viewed here as a black box.


A High Level Look…

    • Opening the black box, the Transformer consists of:
      • an encoding component and a decoding component.


A High Level Look…

      • The encoding component is a stack of encoders (the paper stacks six of them on top of each other; there is nothing magical about the number six, and one can definitely experiment with other arrangements).
      • The decoding component is a stack of the same number of decoders.


Conclusion

    • Attention helps improve neural machine translation.
    • The Transformer is a model that uses attention to boost the speed with which such models can be trained.


Self-Attention Mechanism

    • A self-attention module takes in n inputs and returns n outputs.
    • The self-attention mechanism allows the inputs to interact with each other ("self") and find out which ones they should pay more attention to ("attention").
    • The outputs are aggregates of these interactions, weighted by the attention scores.


Steps involved in Self-Attention

  1. Prepare inputs
  2. Initialise weights
  3. Derive key, query, and value
  4. Calculate attention scores for Input 1
  5. Calculate softmax
  6. Multiply scores with values
  7. Sum weighted values to get Output 1
  8. Repeat steps 4–7 for Input 2 and Input 3


Steps involved in Self-Attention…

Step 1: Consider 3 inputs with dimension 4:

Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]

Steps involved in Self-Attention…

    • Step 2: Initialise weights
      • Every input must have three representations; these are called key (orange), query (red), and value (purple).
      • For this example, let's say we want these representations to have dimension 3.
      • Because every input has dimension 4, each set of weights must have shape 4×3.

Steps involved in Self-Attention…

    • Step 2: Initialise weights
      • To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for queries, and a set of weights for values.
      • In our example, we initialise the three sets of weights as follows:

Weights for key:
[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:
[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:
[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

Steps involved in Self-Attention…

    • Step 3: Derive key, query, and value
      • Now that we have the three sets of weights, let's obtain the key, query, and value representations for every input.
      • Key representation for each input:

Input 1: [1, 0, 1, 0] × W_key = [0, 1, 1]
Input 2: [0, 2, 0, 2] × W_key = [4, 4, 0]
Input 3: [1, 1, 1, 1] × W_key = [2, 3, 1]

A faster way is to vectorise the above operations by stacking the inputs into a matrix:

[[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]] × W_key = [[0, 1, 1], [4, 4, 0], [2, 3, 1]]

Steps involved in Self-Attention…

    • Step 3: Derive key, query, and value

Derive key representations from each input (the vectorised computation shown above).

Steps involved in Self-Attention…

    • Step 3: Derive key, query, and value
      • Let's do the same to obtain the value representations for every input:

[[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]] × W_value = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]

Steps involved in Self-Attention…

    • Step 3: Derive key, query, and value
      • Let's do the same to obtain the query representations for every input:

[[1, 0, 1, 0], [0, 2, 0, 2], [1, 1, 1, 1]] × W_query = [[1, 0, 2], [2, 2, 2], [2, 1, 3]]

Steps involved in Self-Attention…

    • Step 4: Calculate attention scores for Input 1

To obtain attention scores, we take the dot product of Input 1's query (red) with all the keys (orange), including its own. With the keys arranged as columns:

[1, 0, 2] × [[0, 4, 2], [1, 4, 3], [1, 0, 1]] = [2, 4, 4]

Steps involved in Self-Attention…

    • Step 4: Calculate attention scores for Input 2

[2, 2, 2] × [[0, 4, 2], [1, 4, 3], [1, 0, 1]] = [4, 16, 12]

    • Step 4: Calculate attention scores for Input 3

[2, 1, 3] × [[0, 4, 2], [1, 4, 3], [1, 0, 1]] = [4, 12, 10]

Steps involved in Self-Attention…

    • Step 5: Calculate the softmax of Input 1's attention scores

softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]

(The exact values are [0.06, 0.47, 0.47]; they are rounded here to keep the arithmetic readable.)

Steps involved in Self-Attention…

    • Step 6: Multiply scores with values for Input 1

The softmaxed attention scores for each input (blue) are multiplied by the corresponding value (purple). This results in 3 alignment vectors (yellow), referred to here as weighted values.

1: 0.0 × [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 × [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 × [2, 6, 3] = [1.0, 3.0, 1.5]

Steps involved in Self-Attention…

    • Step 7: Sum weighted values to get Output 1

[0.0, 0.0, 0.0] + [1.0, 4.0, 0.0] + [1.0, 3.0, 1.5] = [2.0, 7.0, 1.5]

The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, based on the query representation from Input 1 interacting with all the keys, including its own.

Steps involved in Self-Attention…

    • Step 8: Repeat for Input 2 and Input 3

We repeat Steps 4 to 7 to obtain Output 2 and Output 3; a code sketch of the full procedure follows.
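To tie Steps 1–8 together, here is a minimal NumPy sketch using exactly the inputs and weights from this example; it reproduces the numbers above up to the rounding used in the slides (the exact softmax of [2, 4, 4] is [0.06, 0.47, 0.47] rather than [0.0, 0.5, 0.5]).

import numpy as np

# Step 1: the three inputs, each of dimension 4.
X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]], dtype=float)

# Step 2: the weight matrices from the slides (shape 4x3 each).
W_key   = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
W_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
W_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]], dtype=float)

# Step 3: derive keys, queries, and values.
K, Q, V = X @ W_key, X @ W_query, X @ W_value

# Step 4: attention scores = dot product of every query with every key.
scores = Q @ K.T

# Step 5: softmax over each row of scores.
e = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)

# Steps 6-8: weighted sums of the values give all three outputs at once.
outputs = weights @ V
print(outputs)  # first row is approximately [2.0, 7.0, 1.5]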

Example

    • Consider the sentence "I am going to university". Assume the Transformer represents each word with a 2-dimensional embedding as follows:

Word         Embedding (vector)
I            [1, 0]
am           [0, 1]
going        [1, 1]
to           [0, 2]
university   [2, 1]

    • We will compute self-attention for the word "going", using simplified Query (Q), Key (K), and Value (V) matrices (assumed) defined below.

Example…

    • Query, key, and value are computed using Q = X·W_Q, K = X·W_K, and V = X·W_V, where X is the matrix of all word embeddings.
    • Compute the attention output for the word "going" using the scaled dot-product attention formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

    • Q, K, and V all come out equal to X here, because the weight matrices assumed in this example are the same (effectively identity matrices).

Example…

Word         Q (same as K, V)
I            [1, 0]
am           [0, 1]
going        [1, 1]
to           [0, 2]
university   [2, 1]

Dot products of Q_going = [1, 1] with every key K_i:

Word         K_i       Q_going · K_i
I            [1, 0]    1·1 + 1·0 = 1
am           [0, 1]    1·0 + 1·1 = 1
going        [1, 1]    1·1 + 1·1 = 2
to           [0, 2]    1·0 + 1·2 = 2
university   [2, 1]    1·2 + 1·1 = 3

S = [1, 1, 2, 2, 3]

Example…

Scale the scores by √d_k (here d_k = 2, so √d_k ≈ 1.414) and exponentiate:

Word         S_i / √d_k   exp(S_i / √d_k)
I            0.707        2.028
am           0.707        2.028
going        1.414        4.113
to           1.414        4.113
university   2.121        8.340

Normalising gives the softmax weights:

Word         Softmax weight
I            0.0984
am           0.0984
going        0.1995
to           0.1995
university   0.4042

Example…

    • Step 6: Compute the final attention output for "going" as the softmax-weighted sum of the values:

Output_going = 0.0984·[1, 0] + 0.0984·[0, 1] + 0.1995·[1, 1] + 0.1995·[0, 2] + 0.4042·[2, 1] ≈ [1.11, 1.10]

Example…

    • Final attention outputs:

Word         Output (Ox, Oy)   Interpretation
I            (1.21, 0.90)      A mix of itself and "university": subject linked to goal.
am           (0.63, 1.28)      Emphasizes grammatical information (high y component).
going        (1.11, 1.10)      Balanced between action and context.
to           (0.44, 1.53)      Grammatical connector (high y weight).
university   (1.53, 1.00)      Semantically rich; conceptually dominant.
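As a cross-check of this worked example, a small NumPy sketch under the example's assumption that Q = K = V = X:

import numpy as np

# Embeddings for "I am going to university"; the assumed weight matrices
# are identities, so Q = K = V = X.
X = np.array([[1, 0],    # I
              [0, 1],    # am
              [1, 1],    # going
              [0, 2],    # to
              [2, 1]], dtype=float)  # university

d_k = X.shape[1]                            # key dimension, here 2
scores = X @ X.T / np.sqrt(d_k)             # scaled dot-product scores
e = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)  # row-wise softmax
outputs = weights @ X                       # attention outputs
print(outputs.round(2))  # the "going" row is approximately [1.11, 1.10]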


Summary

    • The dimensions of the query and the key must always match, because the score function is a dot product between them.
    • However, the dimension of the value may differ from that of the query and key; the output then follows the dimension of the value, as the shape check below illustrates.
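A quick shape check of this point, with arbitrarily chosen dimensions (the softmax is omitted since it does not change any shapes):

import numpy as np

n, d_k, d_v = 3, 4, 7    # 3 inputs; query/key dimension 4, value dimension 7
Q = np.ones((n, d_k))
K = np.ones((n, d_k))    # K must share d_k with Q for the dot product
V = np.ones((n, d_v))    # the value dimension is free to differ

scores = Q @ K.T         # shape (n, n): requires matching d_k
outputs = scores @ V     # shape (n, d_v): output follows the value dimension
print(outputs.shape)     # (3, 7)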


Thank You

Contact: dinesh@dtu.ac.in
Mobile: +91-9971339840