1 of 71

2 of 71

Outline

01  What we will learn
02  Predicting the next word
03  Simple approaches
04  Word embeddings
05  Transformers
06  Back to predicting the next word
07  Different types of transformers
08  Wrap-up

5 of 71

What we will learn

  • The mathematical challenges of next word prediction

  • The basic mathematical ideas behind Transformers

  • How Transformers predict words and sentences

  • To get comfortable, at a high level, with the mathematics of how LLMs work

  • To lose your fear of the science of LLMs

6 of 71

What we will not learn

  • How to train foundational models

  • How to build transformer-based chatbots

  • How to prompt

  • Agentic AI

  • How to make $$$Millions from AI

7 of 71

What I assume you know

  • Basic probability, e.g. conditional probability

  • Basic linear algebra, e.g. matrix multiplication as a transformation

  • Basic machine learning, e.g. training a model by minimizing a loss function

8 of 71

Predicting the next word

What this enables, tokens and sequences

9 of 71

Predicting the next word

  • What are we trying to achieve?

    • The cat sat on the ….

10 of 71

Predicting the next word

  • What are we trying to achieve?

    • The cat sat on the mat

11 of 71

Predicting the next word

  • What are we trying to achieve?

    • The quick brown fox jumped over the …….

12 of 71

Predicting the next word

  • What are we trying to achieve?

    • The quick brown fox jumped over the lazy

    • The quick brown fox jumped over the lazy …..

13 of 71

Predicting the next word

  • What are we trying to achieve?

    • The quick brown fox jumped over the lazy

    • The quick brown fox jumped over the lazy dog

14 of 71

Predicting the next word

  •  

15 of 71

Predicting the next word

  • What are we trying to achieve?

“I wandered lonely as a cloud that floats on high o'er vales and hills, when all at once …”

→ “I wandered lonely as a cloud that floats on high o'er vales and hills, when all at once I saw a crowd, a host of golden daffodils.”

William Wordsworth, 1807

16 of 71

Words as tokens

  • Words are the natural units of a sentence.

  • We break a sentence down into these units:

the cat sat on the mat → [the, cat, sat, on, the, mat]

  • We call these units “tokens”

  • Tokens don’t have to be complete words:

quick brown fox jumped over → [quick, brown, fox, jump, ed, over]
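As a rough illustration of the two splits above, here is a minimal Python sketch. The helper names (`word_tokens`, `subword_tokens`) and the single hard-coded suffix rule are assumptions made for this example; real tokenizers learn their sub-word units from data rather than from a hand-written rule.

```python
def word_tokens(sentence):
    """Split a sentence into lower-case word tokens on whitespace."""
    return sentence.lower().split()


def subword_tokens(sentence, suffixes=("ed",)):
    """Toy sub-word tokenizer: peel a known suffix off longer words."""
    tokens = []
    for word in word_tokens(sentence):
        for suffix in suffixes:
            # Only split when a stem of at least three letters is left over.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                tokens.extend([word[: -len(suffix)], suffix])
                break
        else:
            # No suffix matched: keep the whole word as one token.
            tokens.append(word)
    return tokens


print(word_tokens("the cat sat on the mat"))
print(subword_tokens("quick brown fox jumped over"))
```

With this toy rule, "jumped" is the only word that gets split, into "jump" and "ed".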

17 of 71

Sequences

  • A natural language sentence is just a sequence of tokens

  • Our modelling challenge is “predict the next token given the preceding sequence of tokens”

  • We find sequences of tokens everywhere, not just in language

  • I will almost always be referring to words in a sentence
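To make the modelling challenge concrete, the sketch below turns one sentence into (preceding tokens, next token) pairs. The function name `next_token_examples` is a hypothetical helper chosen for this example.

```python
def next_token_examples(tokens):
    """Return (context, next_token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["the", "cat", "sat", "on", "the", "mat"]
examples = next_token_examples(sentence)
for context, target in examples:
    print(" ".join(context), "->", target)
```

Each pair is one instance of the prediction problem: the last pair asks for "mat" given "the cat sat on the".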

18 of 71

Simple approaches

Naïve models that are wrong, and what we can learn from them

19 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

From \ To    the    more   you    know   realize  don’t
the          0.00   0.96   0.01   0.01   0.01     0.01
more         0.02   0.00   0.90   0.01   0.03     0.04
you          0.05   0.05   0.00   0.30   0.30     0.30
know         0.90   0.02   0.03   0.00   0.03     0.02
realize      0.05   0.10   0.70   0.10   0.00     0.05
don’t        0.01   0.01   0.14   0.50   0.34     0.00

20 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more

 

 

21 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you

 

 

22 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know

 

 

23 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the

 

 

24 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more

 

 

25 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more you

 

 

26 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more you realize

 

 

27 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more you realize you

 

 

28 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more you realize you don’t

 

 

29 of 71

Simple approaches: Markov models

Example: “The more you know, the more you realize you don't know” Aristotle

The more you know the more you realize you don’t know

 

 

30 of 71

Simple approaches: Markov models

  • Each distinct word corresponds to one row and one column of the transition matrix, so we can write it as a one-hot vector:

“know” = ( 0,0,0,1,0,0)

  • Small fixed vocabulary ⇒ we are moving through a six-dimensional discrete state space

  • Model is probabilistic:

P(“know” | “you”) = 0.3 , P(“realize” | “you”) = 0.3 , P(“don’t” | “you”) = 0.3

Could easily have generated,

‘The more you don’t the more you realize you don’t realize’
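The sampling process can be sketched in a few lines of Python. The matrix values are the ones from the transition table (rows are the current word, columns the next word); the function names and the fixed random seed are choices made for this example only.

```python
import random

VOCAB = ["the", "more", "you", "know", "realize", "don't"]

# Rows: current ("from") word. Columns: next ("to") word.
TRANSITIONS = [
    [0.00, 0.96, 0.01, 0.01, 0.01, 0.01],  # from "the"
    [0.02, 0.00, 0.90, 0.01, 0.03, 0.04],  # from "more"
    [0.05, 0.05, 0.00, 0.30, 0.30, 0.30],  # from "you"
    [0.90, 0.02, 0.03, 0.00, 0.03, 0.02],  # from "know"
    [0.05, 0.10, 0.70, 0.10, 0.00, 0.05],  # from "realize"
    [0.01, 0.01, 0.14, 0.50, 0.34, 0.00],  # from "don't"
]


def next_word_distribution(word):
    """Return P(next word | current word) as a dictionary."""
    return dict(zip(VOCAB, TRANSITIONS[VOCAB.index(word)]))


def generate(start, length, rng):
    """Sample a sequence of `length` words, one transition at a time."""
    words = [start]
    while len(words) < length:
        row = TRANSITIONS[VOCAB.index(words[-1])]
        words.append(rng.choices(VOCAB, weights=row, k=1)[0])
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
sentence = generate("the", 11, rng)
print(" ".join(sentence))
print(next_word_distribution("you"))
```

Because each step looks only at the current word, different seeds give different sentences, and many of them will not be the original quote.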

31 of 71

Simple approaches: Markov models

Problems:

1. Our vocabulary is large, e.g. 170000 – 700000 words

    • Large transition matrix

2a. We only take into account the preceding word

→ “Many adults drink beer” , “Many newborn babies drink beer”

“beer” has the same probability of following “drink” in both sentences, because the model sees only the preceding word

2b. Expanding to transitions between pairs of words increases the size of the state space

e.g. (many, adults) → (adults, drink) → (drink, beer)
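A quick back-of-the-envelope calculation shows how fast these tables grow. The sketch assumes a dense table with one probability per (state, next word) combination; the function names are hypothetical.

```python
def first_order_entries(vocab_size):
    """Entries in a word -> word transition matrix: V rows by V columns."""
    return vocab_size ** 2


def second_order_entries(vocab_size):
    """Entries when the state is a pair of words.

    There are V * V pair states, and from each pair only the V possible
    next words matter, because the new pair must start with the second
    word of the old pair.
    """
    return vocab_size ** 3


for vocab_size in (6, 170_000, 700_000):
    print(vocab_size, first_order_entries(vocab_size), second_order_entries(vocab_size))
```

For the six-word example this is 36 versus 216 entries; for a 170000-word vocabulary the first-order matrix already has tens of billions of entries.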

32 of 71

Word embeddings

Correcting our mistakes

33 of 71

Word embeddings

  • Reduce to a fixed lower dimensionality

6-dimensional:

“the” = (1,0,0,0,0,0)
“more” = (0,1,0,0,0,0)
“you” = (0,0,1,0,0,0)
“know” = (0,0,0,1,0,0)
“realize” = (0,0,0,0,1,0)
“don’t” = (0,0,0,0,0,1)

Map ↓

3-dimensional:

[Figure: the words the, more, you, realize, don’t, know shown as points in a 3-dimensional space]
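The map from one-hot vectors to a lower-dimensional space can be written as a matrix product: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sketch below shows this for the six-word vocabulary. The 3-dimensional numbers in `EMBEDDING` are invented for illustration; in a real model they are learnt.

```python
VOCAB = ["the", "more", "you", "know", "realize", "don't"]

# Hypothetical 6 x 3 embedding matrix: one 3-dimensional row per word.
EMBEDDING = [
    [0.1, 0.0, 0.3],  # the
    [0.7, 0.2, 0.1],  # more
    [0.2, 0.9, 0.4],  # you
    [0.5, 0.5, 0.8],  # know
    [0.6, 0.4, 0.9],  # realize
    [0.3, 0.6, 0.7],  # don't
]


def one_hot(word):
    """Six-dimensional indicator vector for a word in the vocabulary."""
    return [1 if w == word else 0 for w in VOCAB]


def embed(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    x = one_hot(word)
    dim = len(EMBEDDING[0])
    return [sum(x[i] * EMBEDDING[i][j] for i in range(len(VOCAB))) for j in range(dim)]


print(one_hot("know"))
print(embed("know"))
```

In practice the multiplication is replaced by a direct row lookup, which gives the same result without touching the zeros.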

34 of 71

Word embeddings

  • Reduce to a fixed lower dimensionality

All of English language: 170000 – 700000 dimensional

Map ↓

Word2Vec: 300-dimensional
GPT-3 (Davinci): 12288-dimensional
Llama 2 70B: 8192-dimensional
GPT-4.5: 3072-dimensional

35 of 71

Word embeddings

Embedding vectors (columns: tokens, rows: vector index)

Index   the              cat               sat               on                the              mat
1       2.4249 × 10^-2    4.7426 × 10^-2   -8.1093 × 10^-2   -2.0204 × 10^-2    2.4249 × 10^-2   -7.3271 × 10^-2
2       4.8235 × 10^-3   -4.2203 × 10^-2   -6.3660 × 10^-2   -6.6599 × 10^-2    4.8235 × 10^-3   -8.9358 × 10^-2
3       1.8411 × 10^-2    2.8491 × 10^-2    3.6398 × 10^-2    2.5621 × 10^-2    1.8411 × 10^-2    1.0698 × 10^-1
4       1.1867 × 10^-2   -4.4481 × 10^-2    3.3358 × 10^-2    1.7926 × 10^-2    1.1867 × 10^-2    1.1881 × 10^-1
5       1.9167 × 10^-2   -2.6467 × 10^-2    1.1537 × 10^-1    8.4262 × 10-23    1.9167 × 10^-2    5.6133 × 10^-3
…
300     3.4856 × 10^-3    1.3847 × 10^-1    9.3088 × 10^-2    5.2301 × 10^-3                     1.8160 × 10^-1

Word2Vec model = “fasttext-wiki-news-subwords-300”

36 of 71

Code example

Working with word embeddings
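A minimal sketch of one common operation on word embeddings, comparing words by cosine similarity. The 4-dimensional vectors below are invented for this example so that it runs without downloading anything; with a pretrained model such as the 300-dimensional one named on the previous slide, the same function would be applied to the real vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, made up for illustration only.
EMBEDDINGS = {
    "cat": [0.8, 0.1, 0.6, 0.0],
    "dog": [0.7, 0.2, 0.5, 0.1],
    "mat": [0.0, 0.9, 0.1, 0.4],
}


def most_similar(word):
    """Return the other word whose vector has the highest cosine similarity."""
    others = [w for w in EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))


print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["mat"]))
print(most_similar("cat"))
```

With these made-up vectors, "cat" is closer to "dog" than to "mat".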

37 of 71

Transformers

Creating context-aware embeddings

38 of 71

Transforming embeddings

[Figure: the embedding of the final token of “the cat sat on the” is the input vector into a multi-class classifier that predicts the next word (?)]

  • Each token knows nothing about the broader context of the preceding words

  • Need new vectors for each token that take into account the preceding context

    • We will make a Transformer

  • For now, create new vectors that take into account the context of the whole sentence

39 of 71

Attention

[Figure: input embeddings for “the cat sat on the” mapped to updated embeddings for the same tokens]

40 of 71

Self-Attention

 

41 of 71

Self-Attention

 

42 of 71

Self-Attention

  • We can add more flexibility to the model.
    • Different linear mappings to create the query vectors and to create the key vectors
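The following sketch shows self-attention with separate linear mappings for queries and keys. It follows the standard scaled dot-product formulation from Vaswani et al. (2017), which is an assumption here. The value mapping `w_value`, the scaling by the square root of the key dimension, and all the numbers are likewise assumptions made for the example.

```python
import math


def matvec(matrix, vector):
    """Apply a linear mapping (matrix) to a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings, w_query, w_key, w_value):
    """Return updated embeddings and the attention probabilities."""
    queries = [matvec(w_query, x) for x in embeddings]
    keys = [matvec(w_key, x) for x in embeddings]
    values = [matvec(w_value, x) for x in embeddings]
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        probs = softmax(scores)
        weights.append(probs)
        # Each output is a probability-weighted average of the value vectors.
        outputs.append([sum(p * v[j] for p, v in zip(probs, values))
                        for j in range(len(values[0]))])
    return outputs, weights


# Toy 2-dimensional embeddings for three tokens, with hypothetical mappings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 2.0]]
W_V = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(X, W_Q, W_K, W_V)
for row in weights:
    print([round(p, 3) for p in row])
```

Using different matrices for queries and keys means the score between token A and token B need not equal the score between B and A.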

 

 

43 of 71

Self-Attention

  •  

44 of 71

Attention: Masking

  •  

        the    cat    sat    on     the
the     1.0    0      0      0      0
cat     0.2    0.8    0      0      0
sat     0.35   0.15   0.5    0      0
on      0.08   0.12   0.34   0.46   0
the     0.2    0.25   0.30   0.05   0.2
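A lower-triangular weight matrix like the one above can be produced by letting each token attend only to itself and the tokens before it. The sketch below does this by applying the softmax to the visible scores only, which is equivalent to setting the masked scores to minus infinity first. The raw scores are invented for the example.

```python
import math


def masked_softmax_rows(scores):
    """Row-wise softmax where token i may only attend to tokens 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Later tokens get weight exactly zero.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


# Hypothetical raw attention scores for "the cat sat on the".
SCORES = [[0.5 * (i + 1) - 0.3 * j for j in range(5)] for i in range(5)]
WEIGHTS = masked_softmax_rows(SCORES)
for row in WEIGHTS:
    print([round(w, 2) for w in row])
```

Every row still sums to 1, and the first token can only attend to itself.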

 

45 of 71

Attention Head

[Figure: input embedding vector for token → Attention Head → output embedding vector for token]

46 of 71

Multi-headed Self-Attention

[Figure: input embedding vector for token → several Attention Heads in parallel → combination operator, e.g. concatenate or take average → output embedding vector for token]

  • We solve this by having multiple (H) self-attention heads.
  • We combine the output vectors of all the attention heads into a single vector
  • Typically, we make the output dimension of each attention head d/H
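The dimension bookkeeping can be checked with a small sketch. It assumes the heads are combined by concatenation, so H heads of size d/H give back a vector of size d. The stand-in head outputs are placeholders, not real attention results.

```python
def head_output_dim(d_model, num_heads):
    """Output size of each head when the heads split d evenly."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // num_heads


def concatenate_heads(head_outputs):
    """Join the per-head output vectors for one token into a single vector."""
    combined = []
    for vector in head_outputs:
        combined.extend(vector)
    return combined


d_model, num_heads = 8, 4
head_dim = head_output_dim(d_model, num_heads)

# Stand-in outputs for one token: head h simply emits the constant h.
head_outputs = [[float(h)] * head_dim for h in range(num_heads)]
token_output = concatenate_heads(head_outputs)
print(head_dim, token_output)
```

With d = 8 and H = 4, each head produces 2 numbers and the concatenated output has the original 8.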

47 of 71

Neural Network Sub-Layer

[Figure: input embeddings for “the cat sat on the” → Multi-headed attention → output embeddings for the same tokens]

  • The attention process is column-wise (token-to-token)

  • We haven’t yet allowed the different rows (components) within a column to interact when producing the updated embedding

48 of 71

Neural Network Sub-Layer

[Figure: embedding vector → Neural Network → output embedding vector]

The parameters of the neural network are learnt as part of the overall model training
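A minimal sketch of such a sub-layer: the same small network is applied to each token's embedding vector separately, so the components within one vector can interact. The two-layer shape, the ReLU non-linearity and all parameter values are assumptions for illustration; in a trained model the parameters are learnt.

```python
def linear(weights, bias, x):
    """Affine map: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise max(0, value)."""
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Two-layer network applied to a single token embedding."""
    return linear(w2, b2, relu(linear(w1, b1, x)))


# Hypothetical parameters: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
B2 = [0.1, -0.1]

embeddings = [[1.0, 2.0], [3.0, 1.0]]  # one vector per token
outputs = [feed_forward(x, W1, B1, W2, B2) for x in embeddings]
print(outputs)
```

Note that the tokens do not interact here: each output depends only on its own input vector.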

49 of 71

The Transformer Block

[Figure: Transformer Block. Input embeddings (the, cat, sat, on, the) → Multi-headed attention → Neural Network → Output embeddings (the, cat, sat, on, the)]

What are these gaps?

50 of 71

Technical transformations

  • We apply two main additional transformations to the data to improve efficiency and stability of the training process:

    • Layer-wise normalization – apply a standardization (location and scale) transformation before each sub-layer

    • Residual connections – we use the multi-headed attention and neural network transformations to model the residuals of the input-output relations

Output vector = Input vector + Transformation(Input vector)
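Both transformations fit in a few lines. The sketch below normalizes the input before the sub-layer and then adds the sub-layer's output back onto the input. The learnt scale and shift parameters that usually accompany layer normalization are left out, and `halve` is a hypothetical stand-in for the attention or neural network transformation.

```python
import math


def layer_norm(x, eps=1e-5):
    """Standardize one embedding vector to zero mean and unit scale."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_sublayer(x, transformation):
    """Output vector = input vector + transformation(normalized input)."""
    update = transformation(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


def halve(v):
    """Stand-in for a real sub-layer such as attention."""
    return [0.5 * vi for vi in v]


x = [1.0, 2.0, 3.0, 4.0]
y = residual_sublayer(x, halve)
print(layer_norm(x))
print(y)
```

Because the input is added back, the sub-layer only has to learn the change to the embedding rather than the whole output.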

51 of 71

The Transformer Block

[Figure: Transformer Block. Input embeddings (the, cat, sat, on, the) → Layer Normalization → Multi-headed attention → Layer Normalization → Neural Network → Output embeddings (the, cat, sat, on, the)]

52 of 71

The Transformer Model

[Figure: The Transformer Model. Input embeddings (the, cat, sat, on, the) → six stacked Transformer Blocks → Output embeddings (the, cat, sat, on, the)]

53 of 71

Back to predicting the next word

Back to our goal

54 of 71

Predicting the next word from context embeddings

[Figure: the context embedding of the final token of “the cat sat on the” is the input vector into a multi-class classifier that predicts the next word (?)]
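The multi-class classifier can be sketched as a linear layer followed by a softmax over the vocabulary. The five-word vocabulary, the classifier parameters `W` and `B`, and the 2-dimensional context vector are all invented for this example.

```python
import math

VOCAB = ["the", "cat", "sat", "on", "mat"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probabilities(context_vector, weights, bias):
    """One score per vocabulary word, converted to probabilities."""
    logits = [sum(w * c for w, c in zip(row, context_vector)) + b
              for row, b in zip(weights, bias)]
    return dict(zip(VOCAB, softmax(logits)))


# Hypothetical classifier parameters and context embedding.
W = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [0.2, 0.2], [1.5, 1.0]]
B = [0.0, 0.0, 0.0, 0.0, 0.0]
context = [1.0, 2.0]

probs = next_word_probabilities(context, W, B)
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 3))
```

The model outputs a full probability distribution; picking the most likely word is only one way to choose the next token.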

 

55 of 71

Training
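A common training objective for next-token prediction, assumed here rather than taken from this slide, is the cross-entropy between the predicted distributions and the actual next tokens. The sketch builds the targets by shifting the input one position to the left and appending an end-of-sequence token. The "model output" is a hypothetical placeholder that puts probability 0.5 on each correct token.

```python
import math


def shifted_pairs(tokens, eos="<eos>"):
    """Inputs are the tokens; targets are the same tokens shifted left by one."""
    return tokens, tokens[1:] + [eos]


def cross_entropy(predicted, targets):
    """Average negative log-probability assigned to the correct next tokens."""
    losses = [-math.log(dist[target]) for dist, target in zip(predicted, targets)]
    return sum(losses) / len(losses)


inputs, targets = shifted_pairs(["the", "cat", "sat", "on", "the", "mat"])

# Placeholder predictions: probability 0.5 on the correct token at each position.
predicted = [{target: 0.5} for target in targets]
loss = cross_entropy(predicted, targets)
print(targets)
print(loss)
```

Training adjusts the model parameters to reduce this loss, which pushes more probability onto the tokens that actually came next.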

  •  

 

56 of 71

Predicting the next word from context embeddings

[Figure: the tokens “the cat sat on the” pass through the Transformer Model, which outputs “mat”. Two further token rows are shown: “cat sat on the mat” and “cat sat on the mat <eos>”]

57 of 71

Positional encoding

  • Let’s look at our formula for the attention probabilities

[Figure: the tokens “the cat sat on the” and the same tokens reordered as “sat cat the on the”]

The input embeddings, and hence the context embeddings, don’t take into account the position of the token in the context (other than through masking)!

58 of 71

Positional encoding

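  • We solve this by adding position-dependent vectors to the input

[Figure: each token of “the cat sat on the” has an input embedding and a position vector. Highlighted: the + position vector for i=1 = the @ i=1]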

59 of 71

Positional encoding

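  • We solve this by adding position-dependent vectors to the input

[Figure: each token of “the cat sat on the” has an input embedding and a position vector. Highlighted: cat + position vector for i=2 = cat @ i=2]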

60 of 71

Positional encoding

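  • We solve this by adding position-dependent vectors to the input

[Figure: each token of “the cat sat on the” has an input embedding and a position vector. Highlighted: the + position vector for i=5 = the @ i=5]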
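One way to build position-dependent vectors, assumed here for illustration, is the sinusoidal encoding from Vaswani et al. (2017). The sketch adds it to some invented 4-dimensional embeddings and shows that the two occurrences of "the" end up with different input vectors. Positions are counted from 1 to match the slides, whereas the original paper counts from 0.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position vector: sine on even indices, cosine on odd ones."""
    vector = []
    for k in range(dim):
        angle = position / (10000 ** ((k - k % 2) / dim))
        vector.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vector


def add_positions(embeddings):
    """Add the position vector for i = 1, 2, ... to each input embedding."""
    return [[e + p for e, p in zip(vec, positional_encoding(i, len(vec)))]
            for i, vec in enumerate(embeddings, start=1)]


# Hypothetical input embeddings; both "the" tokens share the same vector.
EMBED = {
    "the": [0.2, 0.1, 0.4, 0.3],
    "cat": [0.9, 0.5, 0.1, 0.7],
    "sat": [0.3, 0.8, 0.6, 0.2],
    "on": [0.5, 0.4, 0.9, 0.1],
}
tokens = ["the", "cat", "sat", "on", "the"]
inputs = [EMBED[t] for t in tokens]
encoded = add_positions(inputs)
print(encoded[0])
print(encoded[4])
```

After the addition, "the @ i=1" and "the @ i=5" are no longer identical, so attention can tell them apart.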

61 of 71

Positional encoding

 

 

62 of 71

Positional encoding

  • Putting it all together (finally!)

[Figure: input embeddings (the, cat, sat, on, the) + position encodings (i=1, i=2, i=3, i=4, i=5) → six stacked Transformer Blocks → output embeddings (the, cat, sat, on, the)]

63 of 71

Different types of transformers

Encoders, Decoders, Cross-attention

64 of 71

Different types of transformers

  • What we covered was embedding vector → token (word)

  • We ‘decoded’ the embedding vector

  • ChatGPT, Claude, etc. are known as ‘decoder-only’ models

[Figure: decoder output “mat”]

65 of 71

Different types of transformers

  • We also have encoders: token → context embedding vector

  • Encoders are useful where we have all the tokens (no prediction) and just need to map them to appropriate vectors

    • E.g. “the cat is on the table” → “le chat est sur la table”

[Figure: Encoder model, without masking; Decoder model, with masking]

66 of 71

Different types of transformers

“the cat is on the table” → “le chat est sur la table”

[Figure: Encoder model, without masking; Decoder model, with masking]

‘Attention is all you need’, Vaswani et al., 2017, arXiv:1706.03762

67 of 71

Different types of transformers

“the cat is on the table” → “le chat est sur la table”

[Figure: Encoder model, without masking; Decoder model, with masking. Highlighted: Encoder model]

68 of 71

Different types of transformers

“the cat is on the table” → “le chat est sur la table”

[Figure: Encoder model, without masking; Decoder model, with masking. Highlighted: Decoder model]

69 of 71

Different types of transformers

“the cat is on the table” → “le chat est sur la table”

[Figure: Encoder model, without masking; Decoder model, with masking. Highlighted: Cross attention]

70 of 71

Code example

Using PyTorch to build transformer layers

71 of 71

Wrap-up