1 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer Architecture



5 of 68

Outline

  • Self-Attention
  • Layer Normalization
  • Feedforward Network

6 of 68

Goals

  • We want each word in a sequence to be transformed into a rich, abstract representation (a contextual embedding) based on a weighted sum of the other words in the same sequence.
  • We want each word to determine, "How much should I be influenced by each of my neighbors?"

7 of 68

Self-Attention

[Figure: input vectors x1, x2, x3, x4 for the words "The brown dog ran" are mapped to output representations z1, z2, z3, z4]

Self-Attention's goal is to create great representations, zi, of the input.

8 of 68

Self-Attention

[Figure: the same diagram, now highlighting z1]

z1 will be based on a weighted contribution of x1, x2, x3, x4.

Self-Attention's goal is to create great representations, zi, of the input.


10 of 68

Self-Attention

[Figure: the same diagram; each input vector xi has three small associated vectors]

Under the hood, each xi has 3 small, associated vectors. For example, x1 has:

  • Query q1
  • Key k1
  • Value v1

11 of 68

Self-Attention

[Figure: each xi shown with its query, key, and value vectors qi, ki, vi]

Step 1: Our Self-Attention Head has just 3 weight matrices, Wq, Wk, Wv, in total. These same 3 weight matrices are multiplied by each xi to create all of the vectors:

qi = Wq xi

ki = Wk xi

vi = Wv xi
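A minimal sketch of Step 1 in NumPy (the dimensions and random embeddings are made up for illustration; the slide's row/column convention isn't recoverable, so the row-vector convention xW is used):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                  # toy sizes; the q/k/v vectors are "small" (d_k < d_model)
X = rng.normal(size=(4, d_model))    # rows are x1..x4 for "The brown dog ran" (made-up embeddings)

# The same three weight matrices are shared by every position
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

# Step 1: qi = Wq xi, ki = Wk xi, vi = Wv xi, computed for all i at once
Q = X @ Wq   # rows are q1..q4
K = X @ Wk   # rows are k1..k4
V = X @ Wv   # rows are v1..v4
```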

12 of 68

Self-Attention

[Figure: q1 is compared with k1, k2, k3, k4]

Step 2: For word x1, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.
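Continuing the sketch above, Step 2 for x1 is a dot product between q1 and every key; dividing by the square root of d_k is the scaling used in the original Transformer (whether the slide's figure shows the scaling is not recoverable):

```python
# Step 2: scores s1..s4 for word x1.
# q1·kj measures how well x1's query matches word j's key.
q1 = Q[0]
s = K @ q1 / np.sqrt(d_k)    # s = [s1, s2, s3, s4]
```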



17 of 68

Self-Attention

[Figure: the scores are normalized into attention weights a1, a2, a3, a4]

Step 3: The scores are passed through a softmax, producing attention weights a1, a2, a3, a4.

Instead of these ai values directly weighting our original xi word vectors, they directly weight our vi vectors.

18 of 68

Self-Attention

[Figure: the attention weights ai scale the value vectors vi, which are summed]

Step 4: Let's weight our vi vectors and simply sum them up! The result is z1.
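Continuing the same toy sketch, Steps 3 and 4 for z1 (softmax, then the weighted sum of the value vectors):

```python
# Step 3: softmax turns the scores into attention weights a1..a4 that sum to 1
a = np.exp(s - s.max())
a = a / a.sum()

# Step 4: weight the vi vectors by a and sum them up -> z1
z1 = a @ V    # same as sum_i a[i] * V[i]
```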

19 of 68

Self-Attention

[Figure: the same procedure applied to the remaining positions]

Step 5: We repeat this for all other words, yielding great, new representations z2, z3, z4.


22 of 68

Let's illustrate another example: computing z2.

Remember, we use the same 3 weight matrices Wq, Wk, Wv as we did for computing z1. This gives us q2, k2, v2.


24 of 68

Self-Attention

[Figure: q2 is compared with k1, k2, k3, k4]

Step 2: For word x2, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.




30 of 68

Self-Attention

[Figure: the attention weights for x2 scale the value vectors, which are summed]

Step 3 and Step 4 work exactly as before: softmax the scores into weights ai, then weight our vi vectors and simply sum them up! The result is z2.

31 of 68

Self-Attention

[Figure: the full self-attention head maps x1, x2, x3, x4 to z1, z2, z3, z4]

Tada! Now we have great, new representations zi via a self-attention head.

32 of 68

Self-Attention

[Figure: a self-attention head mapping x1, x2, x3, x4 to z1, z2, z3, z4]

Tada! Now we have great, new representations zi via a self-attention head.

Takeaway:

Self-Attention allows us to create great, context-aware representations.

Self-Attention's outputs (z) and inputs (x) have the same shape.
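Putting the five steps together, here is a minimal single-head sketch in matrix form (NumPy; sizes are illustrative, and the head width is kept equal to d_model so that Z has the same shape as X, matching the takeaway; in multi-head attention each head is narrower and the heads' outputs are concatenated):

```python
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) input vectors; returns Z with the same shape as X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # Step 1: queries, keys, values
    S = Q @ K.T / np.sqrt(K.shape[-1])             # Step 2: all pairwise scores
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # Step 3: softmax -> attention weights
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                   # Steps 4-5: weighted sums -> z1..zn

# Toy usage: 4 words ("The brown dog ran"), d_model = 8
rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(4, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Z = self_attention_head(X, Wq, Wk, Wv)   # Z.shape == X.shape == (4, 8)
```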


34 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer Architecture

35 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

36 of 68

Self-Attention

[Figure: inputs x1, x2, x3, x4 for "The brown dog ran", each with its query, key, and value vectors]

Under the hood, each xi has 3 small, associated vectors. For example, x1 has:

  • Query q1
  • Key k1
  • Value v1

Step 1: Our Self-Attention Head has just 3 weight matrices, Wq, Wk, Wv, in total. These same 3 weight matrices are multiplied by each xi to create all of the vectors:

qi = Wq xi

ki = Wk xi

vi = Wv xi

37 of 68

Self-Attention

[Figure: q2 is compared with k1, k2, k3, k4]

Step 2: For word x2, let's calculate the scores s1, s2, s3, s4, which represent how much attention to pay to each respective "word" vi.



42 of 68

Self-Attention

[Figure: the scores are normalized into attention weights a1, a2, a3, a4]

Instead of these ai values directly weighting our original xi word vectors, they directly weight our vi vectors.

43 of 68

Self-Attention

[Figure: the attention weights scale the value vectors, which are summed]

Step 4: Let's weight our vi vectors and simply sum them up! The result is z2.

44 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

45 of 68

Layer Normalization

  • Normalize within each word embedding (i.e., over the features of each token); see the sketch below
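A minimal sketch of what "normalize within each word embedding" means, assuming the standard LayerNorm with a learned scale and shift (shapes are illustrative):

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """X: (seq_len, d_model). Each row (one word embedding) is normalized independently."""
    mean = X.mean(axis=-1, keepdims=True)   # per-token mean over the features
    var = X.var(axis=-1, keepdims=True)     # per-token variance over the features
    return gamma * (X - mean) / np.sqrt(var + eps) + beta

# Toy usage: 4 tokens, d_model = 8; gamma and beta are learned in practice
X = np.random.default_rng(0).normal(size=(4, 8))
out = layer_norm(X, gamma=np.ones(8), beta=np.zeros(8))
```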

46 of 68

Outline

  • Recap Self-Attention
  • Layer Normalization
  • Feedforward Network

47 of 68

Feedforward Network

  • Applied to each word embedding separately (position-wise); see the sketch below
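A minimal sketch of the position-wise feedforward network, assuming the original two-layer form with a ReLU and a hidden width of 4×d_model (later models swap in GeLU, as the Transformer++ section notes):

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """Applied to each row of X (one word embedding) independently; X: (seq_len, d_model)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # up-projection + ReLU
    return hidden @ W2 + b2                 # down-projection back to d_model

# Toy usage: d_model = 8, hidden width 4 * d_model
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
X = rng.normal(size=(4, d_model))
out = ffn(X,
          rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
```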

48 of 68

Transformer Decoder

[Figure: one Decoder block. The inputs <s>, El, perro, marrón become x1, x2, x3, x4, pass through a Masked Self-Attention Head (per-head outputs z1A…z4C), then residual connections (+ x) with LayerNorm, then a FFNN, yielding r1, r2, r3, r4]
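A rough sketch of the block's data flow, reusing self_attention_head, layer_norm, and ffn from the earlier sketches. The parameter names in p are hypothetical, the residual + LayerNorm placement follows the original (post-norm) Transformer, and both the causal mask and the encoder-decoder attention discussed on later slides are omitted here:

```python
def decoder_block(X, p):
    # Masked self-attention over the tokens generated so far (mask omitted in this sketch)
    Z = self_attention_head(X, p["Wq"], p["Wk"], p["Wv"])
    H = layer_norm(X + Z, p["gamma1"], p["beta1"])     # residual connection (+ x) and LayerNorm
    R = ffn(H, p["W1"], p["b1"], p["W2"], p["b2"])     # position-wise FFNN
    return layer_norm(H + R, p["gamma2"], p["beta2"])  # residual connection and LayerNorm -> r1..r4
```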

49 of 68

Transformer Encoders and Decoders

[Figure: "The brown dog ran" (x1–x4) flows through a stack of Encoders #1–#8; a stack of Decoders #1–#8 then generates the translation "hnědý pes běžel"]

Transformer Encoders produce contextualized embeddings of each word.

Transformer Decoders generate new sequences of text.

50 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: Transformer Decoders are identical to the Encoders, except they have an additional Attention Head in between the Self-Attention and FFNN layers.

This additional Attention Head focuses on parts of the encoder's representations.

51 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: The query vector for a Transformer Decoder's Attention Head (not its Self-Attention Head) comes from the output of the previous decoder layer.

However, the key and value vectors come from the Transformer Encoders' outputs.
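A minimal sketch of that encoder-decoder Attention Head, showing where Q, K, and V come from (weight names and shapes are illustrative):

```python
import numpy as np

def cross_attention_head(decoder_states, encoder_outputs, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder outputs."""
    Q = decoder_states @ Wq        # from the output of the previous decoder layer
    K = encoder_outputs @ Wk       # from the Transformer Encoders' outputs
    V = encoder_outputs @ Wv
    S = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                   # one output per decoder position
```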

52 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

NOTE: The query, key, and value vectors for a Transformer Decoder's Self-Attention Head (not its Attention Head) all come from the output of the previous decoder layer.

53 of 68

Transformer Encoders and Decoders

[Figure: the same encoder/decoder stacks]

IMPORTANT: The Transformer Decoders have positional embeddings, too, just like the Encoders.

Critically, each position is only allowed to attend to itself and the previous indices. This masked attention preserves the model as an auto-regressive LM.
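A minimal sketch of the masked (causal) self-attention described above: scores for future positions are set to -inf before the softmax, so their attention weights become zero.

```python
import numpy as np

def masked_self_attention_head(X, Wq, Wk, Wv):
    """Self-attention where position i can only attend to positions <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(K.shape[-1])
    n = S.shape[0]
    causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    S = np.where(causal_mask, -np.inf, S)                     # block "future" positions
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```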

54 of 68

https://jalammar.github.io/illustrated-transformer/


57 of 68

  • OpenAI API Price (128K context length)
  • Gemini-1.5 Flash API Price

58 of 68

 

"Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/pdf/1706.03762.pdf

59 of 68

Machine Translation results: state-of-the-art (at the time)


60 of 68

Machine Translation results: state-of-the-art (at the time)

You can train it to translate from Language A to Language B, and then train it to translate from Language B to Language C.

Then, without further training, it can translate from Language A to Language C.

61 of 68

Introduction to Large Language Model Systems

Zirui “Ray” Liu

University of Minnesota, Twin Cities

Transformer++

62 of 68

Llama's Changes over the Original Transformer Architecture

  • Layer-Norm -> RMS-Norm
  • GeLU -> SwiGLU
  • Vanilla FFN -> Gated MLP
  • Absolute Pos embed -> Rotary embedding

63 of 68

RMS-Norm

  • Layer-Norm: re-centering + re-scaling
  • RMS-Norm: re-scaling only

(Standard definitions of both are given below.)
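The formula images on this slide did not survive extraction; the standard definitions, for a vector x of dimension d with learned gain g (and, for Layer-Norm, a learned bias b), are:

```latex
\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot g + b,
\qquad \mu = \frac{1}{d}\sum_{i=1}^{d} x_i,
\qquad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2

\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot g,
\qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}
```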

64 of 68

RMS-Norm

  • RMS-Norm is a simplified Layer-Norm
  • It significantly improves training throughput

65 of 68

RMS-Norm

  • GPU basics:
    • I/O from HBM: 400–600 cycles
    • I/O from L2 cache: ~50 cycles
    • Multiply-and-add: < 5 cycles

66 of 68

RMS-Norm

  • Why is it faster? Normalization is memory-bound (see the GPU numbers above), and Layer-Norm needs three passes over x (mean, variance, normalize) while RMS-Norm needs only two passes over x (mean of squares, normalize), as sketched below.
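A sketch of the pass counting, assuming each statistic requires one read of x from memory (a fused kernel changes the constants but not the comparison):

```python
import numpy as np

def layer_norm_3pass(x, g, b, eps=1e-5):
    mu = x.mean()                                  # pass 1 over x: mean (re-centering statistic)
    var = ((x - mu) ** 2).mean()                   # pass 2 over x: variance
    return g * (x - mu) / np.sqrt(var + eps) + b   # pass 3 over x: normalize

def rms_norm_2pass(x, g, eps=1e-5):
    ms = (x ** 2).mean()                           # pass 1 over x: mean of squares
    return g * x / np.sqrt(ms + eps)               # pass 2 over x: normalize
```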

67 of 68

GeLU & SwiGLU

  • Both are smooth activations, unlike ReLU's hard kink at zero (sketched below)
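A minimal sketch of the activations involved; the tanh form of GeLU is the common approximation, and SiLU ("Swish") is the nonlinearity inside SwiGLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)     # hard kink at 0

def gelu(x):
    # tanh approximation of GeLU: smooth near 0
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU(x) = SiLU(xW) * (xV): a gated pair of linear projections
    return silu(x @ W) * (x @ V)
```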

68 of 68

Llama FFN

[Figure: the original Transformer FFN next to the Llama FFN]

  • Original Transformer: Up proj -> GeLU -> Down proj
  • Llama (gated MLP): Up proj and Gate proj in parallel; the Gate proj output goes through SiLU (SwiGLU gating) and multiplies the Up proj output elementwise, then Down proj
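A minimal sketch contrasting the two FFN blocks, following the standard Llama implementation (biases omitted; weight names are illustrative):

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    return x / (1.0 + np.exp(-x))

def original_ffn(x, W_up, W_down):
    """Original-Transformer-style FFN as drawn on the slide: Up proj -> GeLU -> Down proj."""
    return gelu(x @ W_up) @ W_down

def llama_ffn(x, W_gate, W_up, W_down):
    """Llama's gated MLP (SwiGLU): SiLU(Gate proj) gates Up proj elementwise, then Down proj."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```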