Transformers
Dr. Dinesh Kumar Vishwakarma
PROFESSOR, DEPARTMENT OF INFORMATION TECHNOLOGY
DELHI TECHNOLOGICAL UNIVERSITY, DELHI.
Webpage: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php
Email: dinesh@dtu.ac.in
Introduction
Overview of Transformer Model
Embedding
An embedding is a numerical representation of a word (or other entity) as a vector in a high-dimensional space, arranged so that words with similar meanings lie close to each other.
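As a rough illustration, "closeness" can be measured with cosine similarity. The tiny 2-D vectors below are made up for this sketch; real embeddings are learned and typically have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 2-D embeddings, hand-picked for illustration only;
# real models learn these vectors during training.
embeddings = {
    "apple":  np.array([0.9, 0.1]),
    "orange": np.array([0.8, 0.2]),
    "phone":  np.array([0.1, 0.9]),
}

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), ~0 = unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["apple"], embeddings["orange"]))  # ~0.99: close
print(cosine_similarity(embeddings["apple"], embeddings["phone"]))   # ~0.22: far apart
```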
Embedding…
Embedding…
Where do we put the apple?
Problem
Attention
Attention captures the Context
What about the other words?
Attention
Multi-Head Attention
Linear Transformation
[Figure: candidate arrangements scored as Good, Okay, or Bad]
Multi-Head Attention
Why do we need it?
Applications
Sequential data processing
Neural Machine Translation
Encoder and Decoder
Encoder and Decoder…
Context
RNN Setup: Word Embedding
RNN Setup
RNN as Encoder and Decoder
RNN unrolled
RNN with Attention
RNN with Attention…
Attention at Time Step 4
RNN with Attention (Encoder/Decoder)
RNN with Attention (Encoder/Decoder)…
Visualization of attention
Transformer
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
A High-Level Look
A High-Level Look…
A High-Level Look…
Conclusion
Self-Attention Mechanism
Steps involved in Self-Attention
Steps involved in Self-Attention…

Step 1: Consider 3 inputs, each of dimension 4:

Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]
Steps involved in Self-Attention…

Step 2: Initialise the weight matrices for key, query, and value (each 4×3).

Weights for key:
[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:
[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:
[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]
Steps involved in Self-Attention…

Step 3: Derive key representations from each input.

Input 1: [1, 0, 1, 0] × W_key = [0, 1, 1]
Input 2: [0, 2, 0, 2] × W_key = [4, 4, 0]
Input 3: [1, 1, 1, 1] × W_key = [2, 3, 1]

A faster way is to vectorise the above operations: stack the three inputs into a 3×4 matrix X, so that all keys come from a single product:

X × W_key = [[0, 1, 1],
             [4, 4, 0],
             [2, 3, 1]]
Steps involved in Self-Attention…

Derive value representations from each input:

X × W_value = [[1, 2, 3],
               [2, 8, 0],
               [2, 6, 3]]
Steps involved in Self-Attention…

Derive query representations from each input:

X × W_query = [[1, 0, 2],
               [2, 2, 2],
               [2, 1, 3]]
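The three derivations above are plain matrix products, so they can be checked in a few lines of NumPy. This is a minimal sketch; the variable names X, W_key, W_query, W_value are mine, while the numbers are the ones on the slides.

```python
import numpy as np

# Step 1: the 3 inputs (dimension 4), stacked into a 3x4 matrix X.
X = np.array([[1, 0, 1, 0],
              [0, 2, 0, 2],
              [1, 1, 1, 1]], dtype=float)

# Step 2: the given 4x3 weight matrices.
W_key   = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]], dtype=float)
W_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
W_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]], dtype=float)

# Step 3: one matrix product per representation.
K = X @ W_key    # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
Q = X @ W_query  # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
V = X @ W_value  # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```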
Steps involved in Self-Attention…

Step 4: To obtain attention scores, we take the dot product between Input 1's query and all keys, including its own. The 3×3 matrix below is the transpose of the key matrix (keys stacked as columns):

              [0, 4, 2]
[1, 0, 2]  ×  [1, 4, 3]  =  [2, 4, 4]
              [1, 0, 1]
Steps involved in Self-Attention…

Similarly for the queries of Input 2 and Input 3:

              [0, 4, 2]
[2, 2, 2]  ×  [1, 4, 3]  =  [4, 16, 12]
              [1, 0, 1]

              [0, 4, 2]
[2, 1, 3]  ×  [1, 4, 3]  =  [4, 12, 10]
              [1, 0, 1]
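Continuing the NumPy sketch, all nine attention scores follow from a single product of the queries with the transposed keys:

```python
# Step 4: attention scores; row i holds Input i's query against all keys.
scores = Q @ K.T
# [[ 2.,  4.,  4.],
#  [ 4., 16., 12.],
#  [ 4., 12., 10.]]
```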
Steps involved in Self-Attention…

Step 5: Apply softmax to the attention scores (values rounded to one decimal place):

softmax([2, 4, 4]) = [0.0, 0.5, 0.5]
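A row-wise softmax turns each score vector into weights that sum to 1. A sketch, continuing from above; note the slides omit the 1/√d_k scaling that the full Transformer applies before the softmax:

```python
# Step 5: softmax each row of the scores.
def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

attn = softmax(scores)
# attn[0] = [0.063, 0.468, 0.468], which rounds to [0.0, 0.5, 0.5]
```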
Steps involved in Self-Attention…

Step 6: The softmaxed attention scores for each input are multiplied by their corresponding values. This produces 3 alignment vectors, which we will refer to as weighted values.

1: 0.0 × [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 × [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 × [2, 6, 3] = [1.0, 3.0, 1.5]
Steps involved in Self-Attention…

Step 7: Sum the weighted values element-wise. The resulting vector [2.0, 7.0, 1.5] is Output 1, which is based on the query representation from Input 1 interacting with all keys, including its own.
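Steps 6 and 7 in the same sketch. With the unrounded softmax weights the output is roughly [1.9, 6.7, 1.6]; the slide's [2.0, 7.0, 1.5] follows from the rounded weights [0.0, 0.5, 0.5]:

```python
# Step 6: weight each value vector by Input 1's attention weights.
weighted = attn[0][:, None] * V   # three "weighted value" vectors

# Step 7: sum the weighted values element-wise to get Output 1.
output_1 = weighted.sum(axis=0)   # ~[1.9, 6.7, 1.6] with exact weights
```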
Steps involved in Self-Attention…

We repeat Steps 4 to 7 for Output 2 and Output 3.
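Vectorised over all queries, the whole procedure collapses to two matrix products around a row-wise softmax. This single line, reusing the sketch above, yields Outputs 1-3 as rows:

```python
# Steps 4-7 for every input at once: row i is Output i+1.
outputs = softmax(Q @ K.T) @ V
```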
Example
Consider the sentence "I am going to university" with toy 2-D embeddings:

| Word | Embedding (vector) |
|---|---|
| I | [1, 0] |
| am | [0, 1] |
| going | [1, 1] |
| to | [0, 2] |
| university | [2, 1] |
Example…

For simplicity, the query, key, and value vectors are taken to be the embeddings themselves (no learned projections):

| Word | Q (same as K, V) |
|---|---|
| I | [1, 0] |
| am | [0, 1] |
| going | [1, 1] |
| to | [0, 2] |
| university | [2, 1] |
Attention scores for the query "going" (Q_going = [1, 1]) against every key:

| Word | K_i | Dot product Q_going · K_i |
|---|---|---|
| I | [1, 0] | 1·1 + 1·0 = 1 |
| am | [0, 1] | 1·0 + 1·1 = 1 |
| going | [1, 1] | 1·1 + 1·1 = 2 |
| to | [0, 2] | 1·0 + 1·2 = 2 |
| university | [2, 1] | 1·2 + 1·1 = 3 |

S = [1, 1, 2, 2, 3]
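A small NumPy check of the score computation. The array E and the word list are my naming; the vectors are the toy embeddings above:

```python
import numpy as np

words = ["I", "am", "going", "to", "university"]
E = np.array([[1, 0], [0, 1], [1, 1], [0, 2], [2, 1]], dtype=float)

# Dot product of the query "going" with every key (here K = E).
S = E @ E[words.index("going")]
# S = [1., 1., 2., 2., 3.]
```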
Example…

Scale the scores by √d (here d = 2, so √d ≈ 1.414) and exponentiate:

| Word | Scaled score S_i/√d | exp(S_i/√d) |
|---|---|---|
| I | 0.707 | 2.028 |
| am | 0.707 | 2.028 |
| going | 1.414 | 4.113 |
| to | 1.414 | 4.113 |
| university | 2.121 | 8.340 |

Dividing each exponential by their sum (≈ 20.62) gives the softmax attention weights:

| Word | Attention weight |
|---|---|
| I | 0.0984 |
| am | 0.0984 |
| going | 0.1995 |
| to | 0.1995 |
| university | 0.4042 |
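The scaling and softmax steps, continuing the same sketch (d = 2 is the embedding dimension):

```python
# Scale by sqrt(d), exponentiate, then normalise to get the weights.
scaled = S / np.sqrt(2)                          # [0.707, 0.707, 1.414, 1.414, 2.121]
weights = np.exp(scaled) / np.exp(scaled).sum()  # sums to 1
# weights ~ [0.0984, 0.0984, 0.1995, 0.1995, 0.4042]
```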
Example…

Each output is the attention-weighted sum of the value vectors. For "going":

O_going = 0.0984·[1, 0] + 0.0984·[0, 1] + 0.1995·[1, 1] + 0.1995·[0, 2] + 0.4042·[2, 1] ≈ (1.11, 1.10)

Repeating the same computation with each word's own query gives:

| Word | Output (Ox, Oy) | Interpretation |
|---|---|---|
| I | (1.21, 0.90) | Learns a mix of itself and "university": subject linked to goal. |
| am | (0.63, 1.28) | Emphasizes grammatical information (high y-axis). |
| going | (1.11, 1.10) | Balanced between action and context. |
| to | (0.44, 1.53) | Grammatical connector (high y weight). |
| university | (1.53, 1.00) | Semantically rich; conceptually dominant. |
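The full table can be reproduced in one shot by letting every word act as the query. Continuing the sketch above; the √2 scaling matches the earlier step, and rows follow sentence order:

```python
# All pairwise scaled scores, row-wise softmax, then weighted values.
scores2 = E @ E.T / np.sqrt(2)
W = np.exp(scores2)
W /= W.sum(axis=1, keepdims=True)  # row-wise softmax
outputs = W @ E
# rows ~ (1.21, 0.90), (0.63, 1.28), (1.11, 1.10), (0.44, 1.53), (1.53, 1.00)
```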
Summary
Reference

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).
Thank You

Contact: dinesh@dtu.ac.in
Mobile: +91-9971339840