1 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer Architecture



5 of 68

Outline

  • Self-Attention
  • Layer Normalization
  • Feedforward Network

6 of 68

Goals

  • We want each word in a sequence to be transformed into a rich, abstract representation (a contextual embedding) based on a weighted sum of the other words in the same sequence.
  • We want each word to determine, "How much should I be influenced by each of my neighbors?"

7 of 68

Self-Attention

[Figure: input vectors x1, x2, x3, x4 for the words "The brown dog ran" are mapped to output representations z1, z2, z3, z4]

Self-Attention's goal is to create great representations, zi, of the input.

8 of 68

Self-Attention

[Figure: the same diagram, now highlighting z1]

z1 will be based on a weighted contribution of x1, x2, x3, x4.

Self-Attention's goal is to create great representations, zi, of the input.


10 of 68

Self-Attention

[Figure: the same diagram; each input vector xi has three small associated vectors]

Under the hood, each xi has 3 small, associated vectors. For example, x1 has:

  • Query q1
  • Key k1
  • Value v1

11 of 68

Self-Attention

[Figure: each xi shown with its query, key, and value vectors qi, ki, vi]

Step 1: Our Self-Attention Head has just 3 weight matrices, Wq, Wk, Wv, in total. These same 3 weight matrices are multiplied by each xi to create all of the vectors:

qi = Wq xi

ki = Wk xi

vi = Wv xi
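A minimal sketch of Step 1 in NumPy (the dimensions and random embeddings are made up for illustration; the slide's row/column convention isn't recoverable, so the row-vector convention xW is used):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                  # toy sizes; the q/k/v vectors are "small" (d_k < d_model)
X = rng.normal(size=(4, d_model))    # rows are x1..x4 for "The brown dog ran" (made-up embeddings)

# The same three weight matrices are shared by every position
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

# Step 1: qi = Wq xi, ki = Wk xi, vi = Wv xi, computed for all i at once
Q = X @ Wq   # rows are q1..q4
K = X @ Wk   # rows are k1..k4
V = X @ Wv   # rows are v1..v4
```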

12 of 68

Self-Attention

[Figure: q1 is compared with k1, k2, k3, k4]

Step 2: For word x1, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.
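Continuing the sketch above, Step 2 for x1 is a dot product between q1 and every key; dividing by the square root of d_k is the scaling used in the original Transformer (whether the slide's figure shows the scaling is not recoverable):

```python
# Step 2: scores s1..s4 for word x1.
# q1·kj measures how well x1's query matches word j's key.
q1 = Q[0]
s = K @ q1 / np.sqrt(d_k)    # s = [s1, s2, s3, s4]
```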



17 of 68

Self-Attention

[Figure: the scores are normalized into attention weights a1, a2, a3, a4]

Step 3: The scores are passed through a softmax, producing attention weights a1, a2, a3, a4.

Instead of these ai values directly weighting our original xi word vectors, they directly weight our vi vectors.

18 of 68

Self-Attention

[Figure: the attention weights ai scale the value vectors vi, which are summed]

Step 4: Let's weight our vi vectors and simply sum them up! The result is z1.
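Continuing the same toy sketch, Steps 3 and 4 for z1 (softmax, then the weighted sum of the value vectors):

```python
# Step 3: softmax turns the scores into attention weights a1..a4 that sum to 1
a = np.exp(s - s.max())
a = a / a.sum()

# Step 4: weight the vi vectors by a and sum them up -> z1
z1 = a @ V    # same as sum_i a[i] * V[i]
```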

19 of 68

Self-Attention

[Figure: the same procedure applied to the remaining positions]

Step 5: We repeat this for all other words, yielding great, new representations z2, z3, z4.


22 of 68

Let's illustrate another example: computing z2.

Remember, we use the same 3 weight matrices Wq, Wk, Wv as we did for computing z1. This gives us q2, k2, v2.


24 of 68

Self-Attention

[Figure: q2 is compared with k1, k2, k3, k4]

Step 2: For word x2, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.




30 of 68

Self-Attention

[Figure: the attention weights for x2 scale the value vectors, which are summed]

Step 3 and Step 4 work exactly as before: softmax the scores into weights ai, then weight our vi vectors and simply sum them up! The result is z2.

31 of 68

Self-Attention

[Figure: the full self-attention head maps x1, x2, x3, x4 to z1, z2, z3, z4]

Tada! Now we have great, new representations zi via a self-attention head.

32 of 68

Self-Attention

[Figure: a self-attention head mapping x1, x2, x3, x4 to z1, z2, z3, z4]

Tada! Now we have great, new representations zi via a self-attention head.

Takeaway:

Self-Attention allows us to create great, context-aware representations.

Self-Attention's outputs (z) and inputs (x) have the same shape.
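Putting the five steps together, here is a minimal single-head sketch in matrix form (NumPy; sizes are illustrative, and the head width is kept equal to d_model so that Z has the same shape as X, matching the takeaway; in multi-head attention each head is narrower and the heads' outputs are concatenated):

```python
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) input vectors; returns Z with the same shape as X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # Step 1: queries, keys, values
    S = Q @ K.T / np.sqrt(K.shape[-1])             # Step 2: all pairwise scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # Step 3: softmax -> attention weights
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                   # Steps 4-5: weighted sums -> z1..zn

# Toy usage: 4 words ("The brown dog ran"), d_model = 8
rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(4, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Z = self_attention_head(X, Wq, Wk, Wv)   # Z.shape == X.shape == (4, 8)
```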


34 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer Architecture

35 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

36 of 68

Self-Attention

[Figure: inputs x1, x2, x3, x4 for "The brown dog ran", each with its query, key, and value vectors]

Under the hood, each xi has 3 small, associated vectors. For example, x1 has:

  • Query q1
  • Key k1
  • Value v1

Step 1: Our Self-Attention Head has just 3 weight matrices, Wq, Wk, Wv, in total. These same 3 weight matrices are multiplied by each xi to create all of the vectors:

qi = Wq xi

ki = Wk xi

vi = Wv xi

37 of 68

Self-Attention

[Figure: q2 is compared with k1, k2, k3, k4]

Step 2: For word x2, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.



42 of 68

Self-Attention

[Figure: the scores are normalized into attention weights a1, a2, a3, a4]

Instead of these ai values directly weighting our original xi word vectors, they directly weight our vi vectors.

43 of 68

Self-Attention

[Figure: the attention weights scale the value vectors, which are summed]

Step 4: Let's weight our vi vectors and simply sum them up! The result is z2.

44 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

45 of 68

Layer Normalization

  • Normalize within each word embedding (i.e., over the features of each token); see the sketch below
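A minimal sketch of what "normalize within each word embedding" means, assuming the standard LayerNorm with a learned scale and shift (shapes are illustrative):

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (seq_len, d_model). Each row (one word embedding) is normalized independently."""
    mean = X.mean(axis=-1, keepdims=True)   # per-token mean over the features
    var = X.var(axis=-1, keepdims=True)     # per-token variance over the features
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

# Toy usage: 4 tokens, d_model = 8; gamma and beta are learned in practice
X = np.random.default_rng(0).normal(size=(4, 8))
out = layer_norm(X, gamma=np.ones(8), beta=np.zeros(8))
```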

46 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

47 of 68

Feedforward Network

  • Applied to each word embedding separately (position-wise); see the sketch below
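A minimal sketch of the position-wise feedforward network, assuming the original two-layer form with a ReLU and a hidden width of 4×d_model (later models swap in GeLU, as the Transformer++ section notes):

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Applied to each row of X (one word embedding) independently; X: (seq_len, d_model)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # up-projection + ReLU
    return hidden @ W2 + b2                 # down-projection back to d_model

# Toy usage: d_model = 8, hidden width 4 * d_model
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
X = rng.normal(size=(4, d_model))
out = ffn(X,
          rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
```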

48 of 68

Transformer Decoder

[Figure: one Decoder block. The inputs <s>, El, perro, marrón become x1, x2, x3, x4, pass through a Masked Self-Attention Head (per-head outputs z1A…z4C), then residual connections (+ x) with LayerNorm, then a FFNN, yielding r1, r2, r3, r4]
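A rough sketch of the block's data flow, reusing self_attention_head, layer_norm, and ffn from the earlier sketches. The parameter names in p are hypothetical, the residual + LayerNorm placement follows the original (post-norm) Transformer, and both the causal mask and the encoder-decoder attention discussed on later slides are omitted here:

```python
def decoder_block(X, p):
    # Masked self-attention over the tokens generated so far (mask omitted in this sketch)
    Z = self_attention_head(X, p["Wq"], p["Wk"], p["Wv"])
    H = layer_norm(X + Z, p["gamma1"], p["beta1"])     # residual connection (+ x) and LayerNorm
    R = ffn(H, p["W1"], p["b1"], p["W2"], p["b2"])     # position-wise FFNN
    return layer_norm(H + R, p["gamma2"], p["beta2"])  # residual connection and LayerNorm -> r1..r4
```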

49 of 68

Transformer Encoders and Decoders

[Figure: "The brown dog ran" (x1–x4) flows through a stack of Encoders #1–#8; a stack of Decoders #1–#8 then generates the translation "hnědý pes běžel"]

Transformer Encoders produce contextualized embeddings of each word.

Transformer Decoders generate new sequences of text.

50 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: Transformer Decoders are identical to the Encoders, except they have an additional Attention Head in between the Self-Attention and FFNN layers.

This additional Attention Head focuses on parts of the encoder's representations.

51 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: The query vector for a Transformer Decoder's Attention Head (not its Self-Attention Head) comes from the output of the previous decoder layer.

However, the key and value vectors come from the Transformer Encoders' outputs.
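A minimal sketch of that encoder-decoder Attention Head, showing where Q, K, and V come from (weight names and shapes are illustrative):

```python
import numpy as np

def cross_attention_head(decoder_states, encoder_outputs, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder outputs."""
    Q = decoder_states @ Wq        # from the output of the previous decoder layer
    K = encoder_outputs @ Wk       # from the Transformer Encoders' outputs
    V = encoder_outputs @ Wv
    S = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                   # one output per decoder position
```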

52 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: The query, key, and value vectors for a Transformer Decoder's Self-Attention Head (not its Attention Head) all come from the output of the previous decoder layer.

53 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

IMPORTANT: The Transformer Decoders have positional embeddings, too, just like the Encoders.

Critically, each position is only allowed to attend to itself and the previous indices. This masked attention preserves the model as an auto-regressive LM.
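A minimal sketch of the masked (causal) self-attention described above: scores for future positions are set to -inf before the softmax, so their attention weights become zero.

```python
import numpy as np

def masked_self_attention_head(X, Wq, Wk, Wv):
    """Self-attention where position i can only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(K.shape[-1])
    n = S.shape[0]
    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    S = np.where(causal_mask, -np.inf, S)                     # block "future" positions
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```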

54 of 68

https://jalammar.github.io/illustrated-transformer/


57 of 68

  • OpenAI API Price (128K context length)
  • Gemini-1.5 Flash API Price

58 of 68

 

"Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/pdf/1706.03762.pdf

59 of 68

Machine Translation results: state-of-the-art (at the time)


60 of 68

Machine Translation results: state-of-the-art (at the time)

You can train it to translate from Language A to Language B, and then train it to translate from Language B to Language C.

Then, without further training, it can translate from Language A to Language C.

61 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer++

62 of 68

Llama's Changes over the Original Transformer Architecture

  • Layer-Norm -> RMS-Norm
  • GeLU -> SwiGLU
  • Vanilla FFN -> Gated MLP
  • Absolute Pos embed -> Rotary embedding

63 of 68

RMS-Norm

  • Layer-Norm: re-centering + re-scaling
  • RMS-Norm: re-scaling only

(Standard definitions of both are given below.)
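The formula images on this slide did not survive extraction; the standard definitions, for a vector x of dimension d with learned gain g (and, for Layer-Norm, a learned bias b), are:

```latex
\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot g + b,
\qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2

\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g,
\qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}
```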

64 of 68

RMS-Norm

  • RMS-Norm is a simplified Layer-Norm
  • It significantly improves training throughput

65 of 68

RMS-Norm

  • GPU basics:
    • I/O from HBM: 400–600 cycles
    • I/O from L2 cache: ~50 cycles
    • Multiply-and-add: < 5 cycles

66 of 68

RMS-Norm

  • Why is it faster? Normalization is memory-bound (see the GPU numbers above), and Layer-Norm needs three passes over x (mean, variance, normalize) while RMS-Norm needs only two passes over x (mean of squares, normalize), as sketched below.
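A sketch of the pass counting, assuming each statistic requires one read of x from memory (a fused kernel changes the constants but not the comparison):

```python
import numpy as np

def layer_norm_3pass(x, g, b, eps=1e-5):
    mu = x.mean()                                  # pass 1 over x: mean (re-centering statistic)
    var = ((x - mu) ** 2).mean()                   # pass 2 over x: variance
    return g * (x - mu) / np.sqrt(var + eps) + b   # pass 3 over x: normalize

def rms_norm_2pass(x, g, eps=1e-5):
    ms = (x ** 2).mean()                           # pass 1 over x: mean of squares
    return g * x / np.sqrt(ms + eps)               # pass 2 over x: normalize
```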

67 of 68

GeLU & SwiGLU

  • Both are smooth activations, unlike ReLU's hard kink at zero (sketched below)
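A minimal sketch of the activations involved; the tanh form of GeLU is the common approximation, and SiLU ("Swish") is the nonlinearity inside SwiGLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)     # hard kink at 0

def gelu(x):
    # tanh approximation of GeLU: smooth near 0
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU(x) = SiLU(xW) * (xV): a gated pair of linear projections
    return silu(x @ W) * (x @ V)
```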

68 of 68

Llama FFN

[Figure: the original Transformer FFN next to the Llama FFN]

  • Original Transformer: Up proj -> GeLU -> Down proj
  • Llama (gated MLP): Up proj and Gate proj in parallel; the Gate proj output goes through SiLU (SwiGLU gating) and multiplies the Up proj output elementwise, then Down proj
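A minimal sketch contrasting the two FFN blocks, following the standard Llama implementation (biases omitted; weight names are illustrative):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def original_ffn(x, W_up, W_down):
    """Original-Transformer-style FFN as drawn on the slide: Up proj -> GeLU -> Down proj."""
    return gelu(x @ W_up) @ W_down

def llama_ffn(x, W_gate, W_up, W_down):
    """Llama's gated MLP (SwiGLU): SiLU(Gate proj) gates Up proj elementwise, then Down proj."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```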