Outline
01 | What we will learn |
02 | Predicting the next word |
03 | Simple approaches |
04 | Word embeddings |
05 | Transformers |
06 | Back to predicting the next word |
07 | Different types of transformers |
08 | Wrap-up |
What we will learn
What we will not learn
What I assume you know
Predicting the next word
What this enables, tokens and sequences
Predicting the next word
“I wandered lonely as a cloud that floats on high o'er vales and hills, when all at once …”
→ “I wandered lonely as a cloud that floats on high o'er vales and hills, when all at once I saw a crowd, a host of golden daffodils.”
William Wordsworth, 1807
Words as tokens
the cat sat on the mat → [the, cat, sat, on, the, mat]
quick brown fox jumped over → [quick, brown, fox, jump, ed, over]
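The two splits above can be reproduced with a minimal sketch. It assumes whitespace splitting and a single hypothetical suffix rule ("ed"); real tokenizers learn their subword vocabulary from data, so this only illustrates the idea of turning text into tokens and tokens into integer ids.

```python
def word_tokens(text):
    """Split text into word-level tokens on whitespace."""
    return text.split()


def subword_tokens(text, suffixes=("ed",)):
    """Split each word into a stem and a suffix when it ends in a known suffix.

    The suffix list is a hypothetical stand-in for a learnt subword vocabulary.
    """
    tokens = []
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                tokens.extend([word[: -len(suffix)], suffix])
                break
        else:
            # No suffix matched: keep the word as a single token.
            tokens.append(word)
    return tokens


def to_ids(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return [vocab[token] for token in tokens], vocab


words = word_tokens("the cat sat on the mat")
pieces = subword_tokens("quick brown fox jumped over")
ids, vocab = to_ids(words)
```

Both occurrences of "the" receive the same id, which is why a sequence is best thought of as a list of ids drawn from a fixed vocabulary.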
Sequences
Simple approaches
Naïve models that are wrong, and what we can learn from them
Simple approaches: Markov models
Example: “The more you know, the more you realize you don't know” (attributed to Aristotle)
Transition probabilities (rows: from, columns: to)
From \ To | the | more | you | know | realize | don’t |
the | 0.00 | 0.96 | 0.01 | 0.01 | 0.01 | 0.01 |
more | 0.02 | 0.00 | 0.9 | 0.01 | 0.03 | 0.04 |
you | 0.05 | 0.05 | 0.00 | 0.3 | 0.3 | 0.3 |
know | 0.9 | 0.02 | 0.03 | 0.00 | 0.03 | 0.02 |
realize | 0.05 | 0.1 | 0.7 | 0.1 | 0.00 | 0.05 |
don’t | 0.01 | 0.01 | 0.14 | 0.5 | 0.34 | 0.00 |
Generating the sentence one word at a time, using only the row of the most recent word at each step:
The more
The more you
The more you know
The more you know the
The more you know the more
The more you know the more you
The more you know the more you realize
The more you know the more you realize you
The more you know the more you realize you don’t
The more you know the more you realize you don’t know
Simple approaches: Markov models
“know” = (0,0,0,1,0,0)
P(“know” | “you”) = 0.3, P(“realize” | “you”) = 0.3, P(“don’t” | “you”) = 0.3
So the model could also have generated:
‘The more you don’t the more you realize you don’t realize’
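A minimal sketch of this Markov model, using the transition table from the slides (rows are the current word, columns the next word). The function names and the random seed are assumptions made for illustration. It samples a sequence and computes how probable the quote and the alternative sentence are under the table.

```python
import random

WORDS = ["the", "more", "you", "know", "realize", "don't"]

# Transition probabilities from the table: row = current word, column = next word.
P = [
    [0.00, 0.96, 0.01, 0.01, 0.01, 0.01],  # the
    [0.02, 0.00, 0.90, 0.01, 0.03, 0.04],  # more
    [0.05, 0.05, 0.00, 0.30, 0.30, 0.30],  # you
    [0.90, 0.02, 0.03, 0.00, 0.03, 0.02],  # know
    [0.05, 0.10, 0.70, 0.10, 0.00, 0.05],  # realize
    [0.01, 0.01, 0.14, 0.50, 0.34, 0.00],  # don't
]
INDEX = {word: i for i, word in enumerate(WORDS)}


def generate(start, length, rng):
    """Sample a sequence; each word depends only on the word before it."""
    sequence = [start]
    while len(sequence) < length:
        row = P[INDEX[sequence[-1]]]
        sequence.append(rng.choices(WORDS, weights=row, k=1)[0])
    return sequence


def sequence_probability(sequence):
    """Probability of the whole sequence, given its first word."""
    prob = 1.0
    for current, nxt in zip(sequence, sequence[1:]):
        prob *= P[INDEX[current]][INDEX[nxt]]
    return prob


quote = "the more you know the more you realize you don't know".split()
variant = "the more you don't the more you realize you don't realize".split()

sample = generate("the", len(quote), random.Random(0))
p_quote = sequence_probability(quote)
p_variant = sequence_probability(variant)
print(" ".join(sample))
print(f"P(quote) = {p_quote:.2e}, P(variant) = {p_variant:.2e}")
```

Both sentences have non-zero probability, so repeated sampling can produce either; nothing in the model knows which one makes sense.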
Simple approaches: Markov models
Problems:
1. Our vocabulary is large, e.g. 170000 – 700000 words
2a. We only take into account the preceding word
→ “Many adults drink beer”, “Many newborn babies drink beer”
In this model, “beer” has the same probability of occurring after the word “drink” in both sentences
2b. Expanding to transitions between pairs of words increases the state space
e.g. (many, adults) → (adults, drink) → (drink, beer)
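To put rough numbers on problems 1 and 2b, here is a small arithmetic sketch. It assumes we store one probability per (context, next word) combination, where the context is the previous word or the previous pair of words; the function name is hypothetical.

```python
def transition_table_entries(vocab_size, order):
    """Number of probabilities to store when the next word depends on the
    previous `order` words: one row per context, one column per next word."""
    contexts = vocab_size ** order
    return contexts * vocab_size


for vocab_size in (170_000, 700_000):
    single = transition_table_entries(vocab_size, order=1)
    pairs = transition_table_entries(vocab_size, order=2)
    print(f"V = {vocab_size:>7,}: single words {single:.2e}, pairs of words {pairs:.2e}")
```

Each extra word of context multiplies the table size by the vocabulary size, which is why simply lengthening the context does not scale.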
Word embeddings
Correcting our mistakes
“the” = (1,0,0,0,0,0)
“more” = (0,1,0,0,0,0)
“you” = (0,0,1,0,0,0)
“know” = (0,0,0,1,0,0)
“realize” = (0,0,0,0,1,0)
“don’t” = (0,0,0,0,0,1)
Map: 6-dimensional one-hot vectors → 3-dimensional embedding vectors
[Figure: the, more, you, realize, don’t and know shown as points in the 3-dimensional space]
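One way to picture the map is as a matrix multiplication: a 6-dimensional one-hot vector times a 6 x 3 matrix picks out one 3-dimensional row. The sketch below assumes a made-up embedding matrix `E`; in a real model these numbers are learnt.

```python
WORDS = ["the", "more", "you", "know", "realize", "don't"]

# Hypothetical 6 x 3 embedding matrix: one 3-dimensional row per word.
# The values are invented for illustration only.
E = [
    [0.1, 0.0, 0.9],  # the
    [0.7, 0.2, 0.1],  # more
    [0.3, 0.8, 0.2],  # you
    [0.2, 0.6, 0.8],  # know
    [0.3, 0.5, 0.7],  # realize
    [0.6, 0.1, 0.4],  # don't
]


def one_hot(word):
    """6-dimensional vector with a 1 in the position of `word`."""
    return [1.0 if w == word else 0.0 for w in WORDS]


def embed(one_hot_vector):
    """Multiply the 1 x 6 one-hot vector by the 6 x 3 matrix E."""
    dim = len(E[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vector, E)) for j in range(dim)]


know = embed(one_hot("know"))
```

Because only one entry of the one-hot vector is non-zero, the product is simply a row lookup, which is how embedding layers are usually implemented.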
Word embeddings
All of the English language: 170000 – 700000 dimensional
Map to a lower-dimensional embedding:
Word2Vec: 300-dimensional
GPT-3 (Davinci): 12288-dimensional
Llama 2-70B: 8192-dimensional
GPT-4.5: 3072-dimensional
Word embeddings
Embedding vectors (columns: tokens; rows: components of the embedding vector)
Index | the | cat | sat | on | the | mat |
1 | 2.4249 × 10^-2 | 4.7426 × 10^-2 | -8.1093 × 10^-2 | -2.0204 × 10^-2 | 2.4249 × 10^-2 | -7.3271 × 10^-2 |
2 | 4.8235 × 10^-3 | -4.2203 × 10^-2 | -6.3660 × 10^-2 | -6.6599 × 10^-2 | 4.8235 × 10^-3 | -8.9358 × 10^-2 |
3 | 1.8411 × 10^-2 | 2.8491 × 10^-2 | 3.6398 × 10^-2 | 2.5621 × 10^-2 | 1.8411 × 10^-2 | 1.0698 × 10^-1 |
4 | 1.1867 × 10^-2 | -4.4481 × 10^-2 | 3.3358 × 10^-2 | 1.7926 × 10^-2 | 1.1867 × 10^-2 | 1.1881 × 10^-1 |
5 | 1.9167 × 10^-2 | -2.6467 × 10^-2 | 1.1537 × 10^-1 | 8.4262 × 10^-23 | 1.9167 × 10^-2 | 5.6133 × 10^-3 |
… | … | … | … | … | … | … |
300 | 3.4856 × 10^-3 | 1.3847 × 10^-1 | 9.3088 × 10^-2 | 5.2301 × 10^-3 | 3.4856 × 10^-3 | 1.8160 × 10^-1 |
Word2Vec model = “fasttext-wiki-news-subwords-300”
Code example
Working with word embeddings
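A common first operation on word embeddings is comparing two vectors with cosine similarity. The sketch below uses only the first five components of four of the vectors in the table above, so the numbers it produces are not meaningful similarities; with the full 300-dimensional vectors the same functions would apply unchanged. The function names are assumptions for illustration.

```python
import math

# First 5 of the 300 components from the table above (truncated for illustration).
VECTORS = {
    "the": [2.4249e-2, 4.8235e-3, 1.8411e-2, 1.1867e-2, 1.9167e-2],
    "cat": [4.7426e-2, -4.2203e-2, 2.8491e-2, -4.4481e-2, -2.6467e-2],
    "sat": [-8.1093e-2, -6.3660e-2, 3.6398e-2, 3.3358e-2, 1.1537e-1],
    "mat": [-7.3271e-2, -8.9358e-2, 1.0698e-1, 1.1881e-1, 5.6133e-3],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other words, sorted from most to least similar to `word`."""
    others = [w for w in VECTORS if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(VECTORS[word], VECTORS[w]),
        reverse=True,
    )


for other in most_similar("cat"):
    print(other, round(cosine_similarity(VECTORS["cat"], VECTORS[other]), 3))
```

Note that the vector for "the" is the same wherever the word appears in a sentence: these embeddings carry no information about context.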
Transformers
Creating context-aware embeddings
Transforming embeddings
[Figure: embedding vectors for the tokens “the | cat | sat | on | the”]
Predict the next word: ?
Input the vector from the preceding token into a multi-class classifier
Attention
[Figure: input embeddings for “the | cat | sat | on | the” → updated embeddings for the same tokens]
Self-Attention
Attention: Masking
| the | cat | sat | on | the |
the | 1.0 | 0 | 0 | 0 | 0 |
cat | 0.2 | 0.8 | 0 | 0 | 0 |
sat | 0.35 | 0.15 | 0.5 | 0 | 0 |
on | 0.08 | 0.12 | 0.34 | 0.46 | 0 |
the | 0.2 | 0.25 | 0.30 | 0.05 | 0.2 |
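A table of this shape can be produced by scaled dot-product self-attention with a causal mask. The sketch below is a simplification under stated assumptions: the queries, keys and values are the input embeddings themselves (no learnt projection matrices), and the 2-dimensional embeddings are made up, so the weights will not match the numbers in the table.

```python
import math


def softmax(scores):
    """Turn scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Row i holds the weights token i puts on tokens 0..i.

    Restricting the softmax to the first i + 1 keys is equivalent to setting
    the scores of later tokens to minus infinity, so their weights are 0.
    """
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys[: i + 1]]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights


def attend(weights, values):
    """Each updated embedding is a weighted average of the value vectors."""
    dim = len(values[0])
    return [[sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)] for row in weights]


# Hypothetical 2-dimensional embeddings for "the cat sat on the".
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.3, 0.7], [1.0, 0.0]]
W = causal_attention_weights(X, X)
updated = attend(W, X)
```

As in the table, every row sums to 1, the upper triangle is 0, and the first token can only attend to itself.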
Attention Head
[Figure: input embedding vector for token → Attention Head → output embedding vector for token]
Multi-headed Self-Attention
[Figure: input embedding vector for token → three Attention Heads in parallel → combination operator → output embedding vector for token]
Combination operator: concatenate the head outputs, or e.g. take their average
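A minimal sketch of the multi-headed idea, assuming the concatenation variant. To stay short it makes two simplifications that are not how trained models work: each head simply gets its own slice of the embedding instead of learnt projections, and no output projection follows the concatenation. The embeddings are made up.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X):
    """Unmasked self-attention with queries = keys = values = X (no learnt weights)."""
    d = len(X[0])
    outputs = []
    for q in X:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X])
        outputs.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return outputs


def multi_head_attention(X, num_heads):
    """Run one head per slice of the embedding, then concatenate per token."""
    d = len(X[0])
    assert d % num_heads == 0, "embedding size must divide evenly across heads"
    size = d // num_heads
    head_outputs = [
        attention_head([x[h * size : (h + 1) * size] for x in X])
        for h in range(num_heads)
    ]
    # Concatenate the outputs of all heads for each token position t.
    return [[value for head in head_outputs for value in head[t]] for t in range(len(X))]


# Hypothetical 4-dimensional embeddings for three tokens.
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [0.5, 0.5, 1.0, 0.0]]
two_heads = multi_head_attention(X, num_heads=2)
```

Concatenation keeps the output the same size as the input, so the result can be passed straight on to the next sub-layer.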
Neural Network Sub-Layer
[Figure: input embeddings for “the | cat | sat | on | the” → Multi-headed attention → output embeddings]
[Figure: embedding vector → Neural Network → output embedding vector]
The parameters of the neural network are learnt as part of the overall model training
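A minimal sketch of such a sub-layer, assuming a two-layer network with a ReLU in between that is applied to each token's embedding vector separately. The weights below are fixed, made-up numbers; in a real model they are learnt during training.

```python
def relu(x):
    return max(0.0, x)


def linear(vector, weights, bias):
    """One output per row of `weights`: dot product with the input plus a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def feed_forward(vector, W1, b1, W2, b2):
    """Expand to a hidden layer, apply ReLU, project back to the embedding size."""
    hidden = [relu(h) for h in linear(vector, W1, b1)]
    return linear(hidden, W2, b2)


# Hypothetical parameters: 2 -> 3 -> 2 dimensions.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
b2 = [0.0, 0.0]

# Hypothetical embeddings; the first and last are identical.
embeddings = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
outputs = [feed_forward(e, W1, b1, W2, b2) for e in embeddings]
```

The same parameters are used at every position, so identical input vectors give identical outputs; mixing information between tokens is left to the attention sub-layer.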
The Transformer Block
[Figure: input embeddings for “the | cat | sat | on | the” → Transformer Block (Multi-headed attention, Neural Network) → output embeddings]
What are these gaps?
Technical transformations
Output vector = Input vector + Transformation(Input vector)
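The formula above translates almost directly into code. In this sketch `halve` is a hypothetical stand-in for the real transformation (attention or the neural network).

```python
def residual(transformation, vector):
    """Output vector = Input vector + Transformation(Input vector)."""
    return [x + t for x, t in zip(vector, transformation(vector))]


def halve(vector):
    """Hypothetical transformation, used only to show the addition."""
    return [0.5 * x for x in vector]


out = residual(halve, [2.0, -4.0])
```

If the transformation outputs zeros, the input passes through unchanged, so each sub-layer only has to learn an adjustment to the embedding rather than a whole new one.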
The Transformer Block
[Figure: input embeddings for “the | cat | sat | on | the” → Transformer Block containing Multi-headed attention, Neural Network and two Layer Normalization steps → output embeddings]
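Layer normalization rescales each embedding vector on its own so that its components have mean 0 and variance close to 1. The sketch below assumes the usual formulation with an optional learnt gain and bias. Applying it after the residual addition (`add_and_norm`) follows the arrangement in the original transformer paper, which is an assumption here since the slides do not fix the order.

```python
import math


def layer_norm(vector, gain=None, bias=None, eps=1e-5):
    """Normalize one embedding vector across its components."""
    n = len(vector)
    mean = sum(vector) / n
    variance = sum((x - mean) ** 2 for x in vector) / n
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    return [
        g * (x - mean) / math.sqrt(variance + eps) + b
        for x, g, b in zip(vector, gain, bias)
    ]


def add_and_norm(transformation, vector):
    """Residual connection followed by layer normalization (assumed order)."""
    summed = [x + t for x, t in zip(vector, transformation(vector))]
    return layer_norm(summed)


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Keeping every vector on a similar scale helps when many blocks are stacked on top of each other.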
The Transformer Model
[Figure: input embeddings for “the | cat | sat | on | the” → six stacked Transformer Blocks → output embeddings]
Back to predicting the next word
Back to our goal
Predicting the next word from context embeddings
[Figure: context embedding vectors for the tokens “the | cat | sat | on | the”]
Predict the next word: ?
Input the vector from the preceding token into a multi-class classifier
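A minimal sketch of that classifier, assuming the common choice of one linear layer followed by a softmax over the vocabulary. The five-word vocabulary, the weights and the context vector are all made up for illustration.

```python
import math

VOCAB = ["the", "cat", "sat", "on", "mat"]


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_distribution(context_vector, W, b):
    """Linear layer plus softmax: one probability per vocabulary word."""
    logits = [
        sum(w * x for w, x in zip(row, context_vector)) + bias
        for row, bias in zip(W, b)
    ]
    return dict(zip(VOCAB, softmax(logits)))


# Hypothetical classifier weights: one row per vocabulary word.
W = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1], [0.1, 0.1], [1.5, 0.9]]
b = [0.0, 0.0, 0.0, 0.0, 0.0]

# Stands in for the context embedding of the final "the".
context = [1.0, 2.0]

probs = next_word_distribution(context, W, b)
prediction = max(probs, key=probs.get)
```

Instead of always taking the most probable word, a model can also sample from `probs`, which gives more varied text.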
Training
Predicting the next word from context embeddings
[Figure: the tokens “the | cat | sat | on | the” are fed into the Transformer Model, which predicts “mat”; at every position the target is the next token, “cat | sat | on | the | mat”, with <eos> marking the end of the sequence]
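For training, the targets are just the input tokens shifted along by one, with an end-of-sequence token appended. The sketch below builds those pairs and shows the loss at a single position, assuming the usual cross-entropy loss; the function names are hypothetical.

```python
import math


def training_pairs(tokens):
    """Inputs are the tokens; targets are the same tokens shifted by one,
    so the target at each position is the token that comes next."""
    padded = tokens + ["<eos>"]
    return padded[:-1], padded[1:]


def cross_entropy(predicted_probs, target):
    """Loss at one position: minus the log of the probability given to the true next token."""
    return -math.log(predicted_probs[target])


inputs, targets = training_pairs(["the", "cat", "sat", "on", "the", "mat"])
for seen, expected in zip(inputs, targets):
    print(f"after {seen!r} predict {expected!r}")

# Hypothetical model output at the position of the second "the".
loss = cross_entropy({"mat": 0.5, "cat": 0.3, "sat": 0.2}, "mat")
```

With masking in place, one sentence provides a training example at every position at once, because each position only sees the tokens before it.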
Positional encoding
[Figure: the sequence “the | cat | sat | on | the” and the reordered sequence “sat | cat | the | on | the”]
The input embeddings, and hence the context embeddings, don’t take into account the position of the token in the context (other than through masking)!
[Figure: a position encoding vector is added to each token's input embedding]
“the” + position encoding for i=1 = “the @ i=1”
“cat” + position encoding for i=2 = “cat @ i=2”
“the” + position encoding for i=5 = “the @ i=5”
[Figure: input embeddings for “the | cat | sat | on | the” + position encodings “i=1 | i=2 | i=3 | i=4 | i=5” → six stacked Transformer Blocks → output embeddings]
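A minimal sketch of adding position encodings to input embeddings. It assumes the sinusoidal encoding from ‘Attention is all you need’; the slides do not say which encoding is used, and learnt position vectors are a common alternative. The 4-dimensional token embeddings are made up.

```python
import math


def position_encoding(position, dim):
    """Sinusoidal encoding: even components use sine, odd components cosine."""
    encoding = []
    for j in range(dim):
        angle = position / (10000 ** (2 * (j // 2) / dim))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(embeddings):
    """Add the encoding for position i (counted from 1) to the i-th embedding."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, position_encoding(i, dim))]
        for i, vector in enumerate(embeddings, start=1)
    ]


# Hypothetical 4-dimensional input embeddings.
EMBED = {
    "the": [0.1, 0.2, 0.3, 0.4],
    "cat": [0.5, -0.1, 0.0, 0.2],
    "sat": [-0.3, 0.4, 0.1, 0.0],
    "on": [0.0, 0.1, -0.2, 0.3],
}
tokens = ["the", "cat", "sat", "on", "the"]
inputs = [EMBED[t] for t in tokens]
with_positions = add_positions(inputs)
```

The two occurrences of "the" start from the same embedding but end up with different vectors, so the model can tell position 1 from position 5.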
Different types of transformers
Encoders, Decoders, Cross-attention
[Figure: context embedding vector → “mat”]
Encoder model, without masking
Decoder model, with masking
“the cat is on the table” → “le chat est sur la table”
[Figure: the encoder-decoder architecture, highlighting in turn the encoder model, the decoder model and cross attention]
‘Attention is all you need’, Vaswani et al., 2017, arXiv:1706.03762
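Cross attention differs from self-attention only in where the vectors come from: the queries come from the decoder, while the keys and values come from the encoder output. The sketch below makes the same simplification as a bare attention head (no learnt projection matrices) and uses made-up 2-dimensional vectors.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Each decoder vector attends over all encoder vectors.

    No mask is needed here: the whole source sentence is already known.
    """
    d = len(encoder_states[0])
    outputs = []
    for q in decoder_states:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states]
        )
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, encoder_states)) for j in range(d)]
        )
    return outputs


# Hypothetical encoder output for "the cat is on the table" (6 tokens).
encoder_states = [[1.0, 0.0], [0.2, 0.9], [0.5, 0.5], [0.4, 0.1], [1.0, 0.0], [0.1, 0.8]]
# Hypothetical decoder states for the translation so far, "le chat" (2 tokens).
decoder_states = [[0.9, 0.1], [0.1, 1.0]]

mixed = cross_attention(decoder_states, encoder_states)
```

The output has one vector per decoder token, each a weighted average of the encoder vectors, which is how the translation being generated consults the source sentence.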
Code example
Using PyTorch to build transformer layers
Wrap-up