1 of 71

1

2 of 71

Outline

2

01	What we will learn
02	Predicting the next word
03	Simple approaches
04	Word embeddings
05	Transformers
06	Back to predicting the next word
07	Different types of transformers
08	Wrap-up


5 of 71

What we will learn

The mathematical challenges of next word prediction

The basic mathematical ideas behind Transformers

How Transformers predict words and sentences

To get comfortable, at a high level, with the mathematics of how LLMs work

To lose your fear of the science of LLMs

5

6 of 71

What we will not learn

How to train foundational models

How to build transformer-based chatbots

How to prompt

Agentic AI

How to make $$$Millions from AI

6

7 of 71

What I assume you know

Basic probability, e.g. conditional probability

Basic linear algebra, e.g. matrix multiplication as a transformation

Basic machine learning, e.g. training a model by minimizing a loss function

7

8 of 71

Predicting the next word

What this enables, tokens and sequences

8

9 of 71

Predicting the next word

What are we trying to achieve?

The cat sat on the ….

9

10 of 71

Predicting the next word

What are we trying to achieve?

The cat sat on the mat

10

11 of 71

Predicting the next word

What are we trying to achieve?

The quick brown fox jumped over the …….

11

12 of 71

Predicting the next word

What are we trying to achieve?

The quick brown fox jumped over the lazy

The quick brown fox jumped over the lazy …..

12

13 of 71

Predicting the next word

What are we trying to achieve?

The quick brown fox jumped over the lazy

The quick brown fox jumped over the lazy dog

13

14 of 71

Predicting the next word

14

15 of 71

Predicting the next word

What are we trying to achieve?

“I wandered lonely as a cloud that floats on high o'er vales and hills,

when all at once …….”

→ “I wandered lonely as a cloud that floats on high o'er vales and hills, when all at once I saw a crowd, a host of golden daffodils.”

15

William Wordsworth, 1807

16 of 71

Words as tokens

Words are the natural units of a sentence.

We break down a sentence into these units:

the cat sat on the mat → [the, cat, sat, on, the, mat]

We call these units “tokens”

Tokens don’t have to be complete words:

quick brown fox jumped over → [quick, brown, fox, jump, ed, over]
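
As a rough illustration of the two splits above, here is a toy tokenizer. The suffix list `SUFFIXES` and both function names are hypothetical; real LLM tokenizers learn their subword vocabulary from data rather than using a hand-written rule like this.

```python
# Toy tokenizer: split on whitespace, then peel off a few known suffixes.
# SUFFIXES is a hypothetical stand-in for a learned subword vocabulary.
SUFFIXES = ("ed", "ing")


def word_tokens(sentence):
    """Split a sentence into lower-cased word tokens."""
    return sentence.lower().split()


def subword_tokens(sentence):
    """Split each word into stem + suffix when a known suffix matches."""
    tokens = []
    for word in word_tokens(sentence):
        for suffix in SUFFIXES:
            # Only split if a stem of at least three characters remains.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                tokens.extend([word[: -len(suffix)], suffix])
                break
        else:
            # No suffix matched: keep the whole word as one token.
            tokens.append(word)
    return tokens


print(word_tokens("the cat sat on the mat"))
print(subword_tokens("quick brown fox jumped over"))
```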

16

17 of 71

Sequences

A natural language sentence is just a sequence of tokens

Our modelling challenge is “predict the next token given the preceding sequence of tokens”

We find sequences of tokens everywhere, not just in language

I will almost always be referring to words in a sentence
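
To make the modelling challenge concrete, the sketch below lists every (preceding sequence, next token) pair contained in one sentence. The function name `next_token_examples` is made up for this illustration.

```python
def next_token_examples(tokens):
    """Return (context, next_token) pairs for every position in the sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_examples(tokens):
    # Each line is one prediction problem: given the context, predict the target.
    print(context, "->", target)
```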

17

18 of 71

Simple approaches

Naïve models that are wrong and what we can learn from that

18

19 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

19

Transition probabilities (rows: from word, columns: to word)

From \ To	the	more	you	know	realize	don’t
the	0.00	0.96	0.01	0.01	0.01	0.01
more	0.02	0.00	0.90	0.01	0.03	0.04
you	0.05	0.05	0.00	0.30	0.30	0.30
know	0.90	0.02	0.03	0.00	0.03	0.02
realize	0.05	0.10	0.70	0.10	0.00	0.05
don’t	0.01	0.01	0.14	0.50	0.34	0.00

Generating the sentence one word at a time from the transition matrix:

The more
The more you
The more you know
The more you know the
The more you know the more
The more you know the more you
The more you know the more you realize
The more you know the more you realize you
The more you know the more you realize you don’t
The more you know the more you realize you don’t know

30 of 71

Simple approaches: Markov models

Each distinct word corresponds to one row and one column of the transition matrix, so we can represent it as a one-hot vector:

“know” = (0,0,0,1,0,0)

Small fixed vocabulary ⇒ we are moving around a six-dimensional discrete state space

Model is probabilistic:

P(“know” | “you”) = 0.3 , P(“realize” | “you”) = 0.3 , P(“don’t” | “you”) = 0.3

Could easily have generated,

‘The more you don’t the more you realize you don’t realize’
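
A minimal sketch of sampling from this Markov model, using the transition probabilities from the table (rows are the current word, columns the next word). The function names and the `seed` parameter are assumptions added for the example.

```python
import random

VOCAB = ["the", "more", "you", "know", "realize", "don't"]
# Transition matrix: P[i][j] = P(next word = VOCAB[j] | current word = VOCAB[i]).
P = [
    [0.00, 0.96, 0.01, 0.01, 0.01, 0.01],
    [0.02, 0.00, 0.90, 0.01, 0.03, 0.04],
    [0.05, 0.05, 0.00, 0.30, 0.30, 0.30],
    [0.90, 0.02, 0.03, 0.00, 0.03, 0.02],
    [0.05, 0.10, 0.70, 0.10, 0.00, 0.05],
    [0.01, 0.01, 0.14, 0.50, 0.34, 0.00],
]


def generate(start, length, seed=None):
    """Sample a sequence of `length` words, starting from `start`."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(length - 1):
        row = P[VOCAB.index(words[-1])]  # distribution over the next word
        words.append(rng.choices(VOCAB, weights=row, k=1)[0])
    return words


def sequence_probability(words):
    """Probability of the transitions in `words`, given its first word."""
    prob = 1.0
    for prev, nxt in zip(words, words[1:]):
        prob *= P[VOCAB.index(prev)][VOCAB.index(nxt)]
    return prob


print(" ".join(generate("the", 11, seed=0)))
print(sequence_probability("the more you know the more you realize you don't know".split()))
```

Running `generate` with different seeds gives different sentences, which is the point of the slide: the model only knows which word tends to follow which, not what the sentence means.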

30

31 of 71

Simple approaches: Markov models

Problems:

1. Our vocabulary is large, e.g. 170000 – 700000 words

Large transition matrix

2a. We only take into account the preceding word

→ “Many adults drink beer” , “Many newborn babies drink beer”

“beer” has the same probability of occurring after the word “drink” in this model

2b. Expanding to transitions between pairs of words increases state space

e.g. (many, adults) → (adults, drink) → (drink, beer)
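
The growth in problems 1 and 2b is simple arithmetic, sketched below. The function name is hypothetical, and the count is for a dense matrix over all tuples of words; many of those entries are necessarily zero, because (many, adults) can only move to a pair that starts with "adults".

```python
def transition_matrix_entries(vocab_size, order=1):
    """Entries in a dense transition matrix whose states are `order`-word tuples."""
    n_states = vocab_size ** order  # one state per tuple of `order` words
    return n_states ** 2            # one entry per (from state, to state) pair


for vocab_size in (6, 170_000):
    for order in (1, 2):
        entries = transition_matrix_entries(vocab_size, order)
        print(f"vocabulary={vocab_size:>7}, order={order}: {entries:.3e} entries")
```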

31

32 of 71

Word embeddings

Correcting our mistakes

32

33 of 71

Word embeddings

Reduce to a fixed lower dimensionality

33

“the” = (1,0,0,0,0,0)

“more” = (0,1,0,0,0,0)

“you” = (0,0,1,0,0,0)

“know” = (0,0,0,1,0,0)

“realize” = (0,0,0,0,1,0)

“don’t” = (0,0,0,0,0,1)

6-dimensional

3-dimensional

the

more

you

realize

don’t

know

Map
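
One way to picture the map from 6 dimensions to 3 is as multiplication of the one-hot vector by a 6 × 3 matrix, which simply picks out one row. The numbers in `E` below are made up for illustration; in a real model they are learnt.

```python
VOCAB = ["the", "more", "you", "know", "realize", "don't"]

# Hypothetical 6 x 3 embedding matrix: one 3-dimensional row per word.
E = [
    [0.21, -0.40, 0.05],
    [0.63, 0.10, -0.22],
    [-0.35, 0.48, 0.30],
    [0.07, 0.91, -0.15],
    [0.12, 0.85, -0.05],
    [-0.50, -0.20, 0.66],
]


def one_hot(word):
    """6-dimensional one-hot vector for a word in the vocabulary."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def embed(word):
    """Map the one-hot vector to 3 dimensions: (one-hot row vector) x E."""
    x = one_hot(word)
    return [sum(x[i] * E[i][j] for i in range(len(VOCAB))) for j in range(len(E[0]))]


print(one_hot("know"))
print(embed("know"))  # same numbers as the fourth row of E
```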

34 of 71

Word embeddings

Reduce to a fixed lower dimensionality

34

All of English language

170000 – 700000 dimensional

Word2Vec: 300-dimensional

GPT-3 (Davinci): 12288-dimensional

Llama 2-70B: 8192-dimensional

GPT-4.5: 3072-dimensional

Map

35 of 71

Word embeddings

35

Embedding vectors (rows: vector component index, columns: tokens)

Index	the	cat	sat	on	the	mat
1	2.4249 × 10^-2	4.7426 × 10^-2	-8.1093 × 10^-2	-2.0204 × 10^-2	2.4249 × 10^-2	-7.3271 × 10^-2
2	4.8235 × 10^-3	-4.2203 × 10^-2	-6.3660 × 10^-2	-6.6599 × 10^-2	4.8235 × 10^-3	-8.9358 × 10^-2
3	1.8411 × 10^-2	2.8491 × 10^-2	3.6398 × 10^-2	2.5621 × 10^-2	1.8411 × 10^-2	1.0698 × 10^-1
4	1.1867 × 10^-2	-4.4481 × 10^-2	3.3358 × 10^-2	1.7926 × 10^-2	1.1867 × 10^-2	1.1881 × 10^-1
5	1.9167 × 10^-2	-2.6467 × 10^-2	1.1537 × 10^-1	8.4262 × 10^-23	1.9167 × 10^-2	5.6133 × 10^-3
…	…	…	…	…	…	…
300	3.4856 × 10^-3	1.3847 × 10^-1	9.3088 × 10^-2	5.2301 × 10^-3	3.4856 × 10^-3	1.8160 × 10^-1

Pre-trained embedding model = “fasttext-wiki-news-subwords-300” (fastText, 300-dimensional)

36 of 71

Code example

Working with word embeddings
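
A typical first thing to do with embedding vectors is to compare them. The sketch below uses cosine similarity on made-up 4-dimensional vectors; the words, numbers, and function names are illustrative assumptions, not values from a real pre-trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; a real model would supply 300-dimensional vectors.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.0],
    "mat": [0.1, 0.0, 0.9, 0.7],
}


def most_similar(word, embeddings):
    """Return the other word whose vector is closest in cosine similarity."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))


print(cosine_similarity(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["dog"]))
print(most_similar("cat", TOY_EMBEDDINGS))
```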

36

37 of 71

Transformers

Creating context aware embeddings

37

38 of 71

Transforming embeddings

38

the	cat	sat	on	the

Predict the next word

?

Input vector from preceding token into multi-class classifier

Each token knows nothing about the broader context of the preceding words

Need new vectors for each token that take into account the preceding context

We will make a Transformer

For now, create new vectors that take into account the context of the whole sentence

39 of 71

Attention

39

the	cat	sat	on	the

the	cat	sat	on	the

Input embeddings

Updated embeddings

40 of 71

Self-Attention

40

41 of 71

Self-Attention

41

42 of 71

Self-Attention

42

We can add more flexibility to the model.

Different linear mappings to create the query vectors and to create the key vectors
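
A minimal sketch of self-attention with separate query and key mappings. Several details are assumptions rather than something stated on the slides: the scaling by the square root of the query dimension follows Vaswani et al., the updated embedding is a weighted sum of the input embeddings themselves (no separate value mapping), and each row of `X` is one token, whereas the slides draw tokens as columns.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X, Wq, Wk):
    """Return (attention probabilities, updated embeddings) for token matrix X."""
    Q = matmul(X, Wq)  # query vector for each token
    K = matmul(X, Wk)  # key vector for each token
    scale = math.sqrt(len(Q[0]))
    # scores[i][j]: how strongly token i attends to token j.
    scores = [[sum(q * k for q, k in zip(qi, kj)) / scale for kj in K] for qi in Q]
    weights = [softmax(row) for row in scores]
    # Each updated embedding is a weighted average of the input embeddings.
    return weights, matmul(weights, X)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, 2-dimensional embeddings
Wq = [[1.0, 0.5], [0.0, 1.0]]             # made-up query mapping
Wk = [[0.5, 0.0], [1.0, 1.0]]             # made-up key mapping
weights, updated = self_attention(X, Wq, Wk)
for row in weights:
    print([round(w, 3) for w in row])
```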

43 of 71

Self-Attention

43

44 of 71

Attention: Masking

44

	the	cat	sat	on	the
the	1.0	0	0	0	0
cat	0.2	0.8	0	0	0
sat	0.35	0.15	0.5	0	0
on	0.08	0.12	0.34	0.46	0
the	0.2	0.25	0.30	0.05	0.2
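
The triangular shape of this table can be produced by setting the scores for later tokens to minus infinity before the softmax, so that they receive zero probability. The scores below are arbitrary made-up numbers, so the resulting probabilities differ from the table; only the pattern of zeros is the same.

```python
import math


def softmax(row):
    """Softmax that tolerates -inf entries (they become probability 0)."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Mask out future tokens (column j > row i), then normalize each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]


tokens = ["the", "cat", "sat", "on", "the"]
scores = [[0.1 * (i + j) for j in range(5)] for i in range(5)]  # made-up scores
weights = causal_attention_weights(scores)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```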

45 of 71

Attention Head

45

Attention Head

Input embedding vector for token

Output embedding vector for token

46 of 71

Multi-headed Self-Attention

46

Attention Head

Input embedding vector for token

Attention Head

Output embedding vector for token

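Combination operator, e.g. concatenate (or take the average)

A single attention head can only learn one pattern of token-to-token attention. We solve this by having multiple (H) self-attention heads.
We combine the output vectors of all the attention heads into a single vector.
Typically, we make the output dimension of each attention head d/H, where d is the embedding dimension, so that concatenating the H outputs gives a d-dimensional vector.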
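
A sketch of H heads whose d/H-dimensional outputs are concatenated back into a d-dimensional vector. The value mapping `Wv`, the random weights, and all function names are assumptions for illustration; each row of `X` is one token.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, Wq, Wk, Wv):
    """One head: maps n tokens of dimension d to n vectors of dimension d/H."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(Q[0]))
    weights = [softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in K]) for qi in Q]
    return matmul(weights, V)


def multi_head_attention(X, heads):
    """Run every head, then concatenate the head outputs token by token."""
    outputs = [attention_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return [[value for out in outputs for value in out[i]] for i in range(len(X))]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d, H = 5, 8, 2
X = random_matrix(n_tokens, d, rng)
# Each head gets its own query, key, and value mappings of shape d x (d/H).
heads = [tuple(random_matrix(d, d // H, rng) for _ in range(3)) for _ in range(H)]
output = multi_head_attention(X, heads)
print(len(output), "tokens,", len(output[0]), "dimensions each")
```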

47 of 71

Neural Network Sub-Layer

47

the	cat	sat	on	the

Input embeddings

the	cat	sat	on	the

Output embeddings

Multi-headed attention

Attention process is column-wise (token-to-token)

We haven’t allowed the different rows within a column to interact when producing the updated embedding

48 of 71

Neural Network Sub-Layer

48

Embedding vector

Neural Network

Output Embedding vector

The parameters of the neural network are learnt as part of the overall model training
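
A sketch of the neural network sub-layer applied to a single embedding vector, which lets the components of that vector interact. The two-layer shape with a ReLU in between is an assumption (it is the common choice), and the weights are made up; in a real model they are learnt.

```python
def linear(v, W, b):
    """Affine map: each row of W holds the weights of one output unit."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def relu(v):
    """Element-wise non-linearity."""
    return [max(0.0, x) for x in v]


def feed_forward(x, W1, b1, W2, b2):
    """Embedding vector -> hidden layer -> output embedding vector."""
    hidden = relu(linear(x, W1, b1))
    return linear(hidden, W2, b2)


# Made-up weights: 2-dimensional embedding, 3 hidden units.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]]
b2 = [0.5, 0.0]

# The same network is applied separately to the embedding vector of every token.
embeddings = [[1.0, -2.0], [0.5, 0.5]]
print([feed_forward(x, W1, b1, W2, b2) for x in embeddings])
```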

49 of 71

The Transformer Block

49

Input embeddings

the	cat	sat	on	the

Multi-headed attention

Neural Network

the	cat	sat	on	the

Output embeddings

Transformer Block

What are these gaps?

50 of 71

Technical transformations

We apply two main additional transformations to the data to improve efficiency and stability of the training process:

1. Layer-wise normalization – apply a standardization (location and scale) transformation before each sub-layer

2. Residual connections – the multi-headed attention and neural network transformations model the residuals of the input-output relations:

Output vector = Input vector + Transformation(Input vector)
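
Both transformations fit in a few lines. This sketch assumes the normalization is applied to the input of the sub-layer (as described above) and leaves out the learnt scale and shift parameters that practical layer normalization also has; the function names and the `eps` value are illustrative.

```python
import math


def layer_norm(v, eps=1e-5):
    """Standardize one embedding vector to roughly zero mean and unit variance."""
    mean = sum(v) / len(v)
    variance = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(variance + eps) for x in v]


def residual_sublayer(x, transformation):
    """Output vector = Input vector + Transformation(normalized Input vector)."""
    transformed = transformation(layer_norm(x))
    return [xi + ti for xi, ti in zip(x, transformed)]


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))
# A stand-in transformation (halving) in place of attention or the neural network.
print(residual_sublayer(x, lambda v: [0.5 * vi for vi in v]))
```

If the transformation outputs zeros, the sub-layer passes its input through unchanged, which is why the transformation is said to model the residual.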

50

51 of 71

The Transformer Block

51

Input embeddings

the	cat	sat	on	the

Transformer Block

Neural Network

Layer Normalization

Multi-headed attention

Output embeddings

the	cat	sat	on	the

52 of 71

The Transformer Model

52

Input embeddings

the	cat	sat	on	the

Output embeddings

the	cat	sat	on	the

Transformer Block

53 of 71

Back to predicting the next word

Back to our goal

53

54 of 71

Predicting the next word from context embeddings

54

the	cat	sat	on	the

Predict the next word

?

Input vector from preceding token into multi-class classifier
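
A sketch of the multi-class classifier step: the context embedding of the last token is mapped to one score per vocabulary word, and a softmax turns the scores into probabilities. The tiny vocabulary, the weights `W`, and the context vector are all made up for illustration.

```python
import math

VOCAB = ["mat", "dog", "table"]  # hypothetical three-word vocabulary


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probabilities(context_vector, W, b):
    """One linear score per vocabulary word, then softmax."""
    logits = [sum(w * x for w, x in zip(row, context_vector)) + bias for row, bias in zip(W, b)]
    return dict(zip(VOCAB, softmax(logits)))


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row of weights per vocabulary word
b = [0.0, 0.0, 0.0]
context_vector = [2.0, 0.5]                 # context embedding of the last token

probabilities = next_word_probabilities(context_vector, W, b)
prediction = max(probabilities, key=probabilities.get)
print(probabilities)
print("predicted next word:", prediction)
```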

55 of 71

Training
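
Training can be sketched as follows: the target sequence is the input sequence shifted by one token, and the loss is small when the model puts high probability on each target. Cross-entropy is assumed here as the loss function (it is the usual choice for multi-class classification), and the predicted probabilities are made up.

```python
import math


def shifted_pairs(tokens, eos="<eos>"):
    """Inputs and targets for next-token training: targets are inputs shifted by one."""
    sequence = tokens + [eos]
    return sequence[:-1], sequence[1:]


def cross_entropy(predicted, targets):
    """Average negative log-probability assigned to the correct next tokens."""
    return -sum(math.log(p[t]) for p, t in zip(predicted, targets)) / len(targets)


inputs, targets = shifted_pairs(["the", "cat", "sat", "on", "the", "mat"])
print(inputs)
print(targets)

# Made-up model outputs for the first two positions only.
predicted = [{"cat": 0.5, "dog": 0.5}, {"sat": 0.9, "ran": 0.1}]
print(cross_entropy(predicted, targets[:2]))
```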

55

56 of 71

Predicting the next word from context embeddings

56

the	cat	sat	on	the

“mat”

Transformer Model

cat	sat	on	the	mat

cat	sat	on	the	mat

<eos>

57 of 71

Positional encoding

57

Let’s look at our formula for the attention probabilities

the	cat	sat	on	the

sat	cat	the	on	the

The input embeddings, and hence the context embeddings, don’t take into account the position of the token in the context (other than through masking)!
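
This can be checked numerically. The sketch below computes the attention output for the final “the” in both word orders shown above, using made-up 2-dimensional embeddings and the simplest form of attention (no masking, no learnt mappings, both assumptions for the example). The two results agree up to rounding.

```python
import math

# Made-up embeddings; the same word always gets the same vector.
EMBEDDINGS = {
    "the": [0.1, 0.3],
    "cat": [0.9, -0.2],
    "sat": [-0.4, 0.7],
    "on": [0.5, 0.5],
}


def attention_output_for_last_token(tokens):
    """Unmasked attention output for the last token of the sequence."""
    X = [EMBEDDINGS[t] for t in tokens]
    query = X[-1]
    scores = [sum(q * k for q, k in zip(query, key)) for key in X]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the embeddings: the order of the terms does not matter.
    return [sum(w * x[d] for w, x in zip(weights, X)) for d in range(len(query))]


original = attention_output_for_last_token("the cat sat on the".split())
shuffled = attention_output_for_last_token("sat cat the on the".split())
print(original)
print(shuffled)
```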

58 of 71

Positional encoding

58

We solve this by adding position-dependent vectors to the input:

embedding of “the” + position vector for i=1 = “the @ i=1”
embedding of “cat” + position vector for i=2 = “cat @ i=2”
embedding of “the” + position vector for i=5 = “the @ i=5”

The two occurrences of “the” share the same embedding but get different input vectors.

61 of 71

Positional encoding

61
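
One well-known choice of position-dependent vector is the sinusoidal encoding of Vaswani et al.; it is used here as an assumption, since other schemes (for example learnt position vectors) exist. The embedding values and function names are made up for the example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal position vector: sine on even components, cosine on odd ones."""
    encoding = []
    for k in range(d_model):
        # Components 2m and 2m+1 share the same frequency.
        angle = position / (10000 ** ((2 * (k // 2)) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding, position):
    """Input vector = token embedding + position vector."""
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]


the_embedding = [0.1, 0.3, -0.2, 0.4]  # made-up embedding of "the"
print(add_position(the_embedding, 1))  # "the @ i=1"
print(add_position(the_embedding, 5))  # "the @ i=5"
```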

62 of 71

Positional encoding

62

Putting it all together (finally!)

Input embeddings

the	cat	sat	on	the

Output embeddings

the	cat	sat	on	the

Transformer Block

Position encodings

i=1	i=2	i=3	i=4	i=5

63 of 71

Different types of transformers

Encoders, Decoders, Cross-attention

63

64 of 71

Different types of transformers

What we covered was embedding vector → token (word)

We ‘decoded’ the embedding vector

ChatGPT, Claude, etc. are known as ‘decoder-only’ models

64

“mat”

65 of 71

Different types of transformers

We also have encoders: token → context embedding vector

65

“mat”

Encoders are useful where we already have all the tokens (no prediction); we just need to map them to appropriate vectors

E.g. “the cat is on the table” → “le chat est sur la table”

Encoder model, without masking

Decoder model, with masking

66 of 71

Different types of transformers

66

“the cat is on the table” → “le chat est sur la table”

Encoder model, without masking

Decoder model, with masking

‘Attention Is All You Need’, Vaswani et al., 2017, arXiv:1706.03762

67 of 71

Different types of transformers

Encoder model

68 of 71

Different types of transformers

Decoder model

69 of 71

Different types of transformers

Cross attention
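
In cross attention the queries come from the decoder and the keys and values come from the encoder, so each token being generated can look at the whole source sentence. The sketch below leaves out the learnt query, key, and value mappings and uses random vectors; both are simplifying assumptions.

```python
import math
import random


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_outputs):
    """Each decoder token attends over all encoder tokens (no masking needed)."""
    dim = len(encoder_outputs[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in encoder_outputs])
        for query in decoder_states
    ]
    # Each decoder token receives a weighted mix of the encoder's context embeddings.
    mixed = [
        [sum(w * vec[d] for w, vec in zip(row, encoder_outputs)) for d in range(dim)]
        for row in weights
    ]
    return weights, mixed


rng = random.Random(0)
source = "the cat is on the table".split()  # encoder side
generated = ["le", "chat"]                   # decoder side, so far
encoder_outputs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in source]
decoder_states = [[rng.uniform(-1, 1) for _ in range(4)] for _ in generated]

weights, mixed = cross_attention(decoder_states, encoder_outputs)
for token, row in zip(generated, weights):
    print(token, [round(w, 2) for w in row])
```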

70 of 71

Code example

Using PyTorch to build transformer layers

70

71 of 71

Wrap up

71