Introduction to Large Language Models System
Zirui “Ray” Liu
University of Minnesota, Twin Cities
Transformer Architecture
Outline
Goals
Self-Attention
[Figure: the input words "The brown dog ran" enter as input vectors x1, x2, x3, x4 and come out as output representations z1, z2, z3, z4.]
Self-Attention’s goal is to create great representations, zi, of the input
Self-Attention
[Figure: all of the input vectors x1–x4 feed into the output representation z1.]
z1 will be based on a weighted contribution of x1, x2, x3, x4.
Self-Attention
Under the hood, each xi has 3 small, associated vectors. For example, x1 has a query vector q1, a key vector k1, and a value vector v1.
Step 1: Our Self-Attention Head has just 3 weight matrices Wq, Wk, Wv in total. These same 3 weight matrices are multiplied by each xi to create all of the qi, ki, vi vectors:
qi = Wq xi
ki = Wk xi
vi = Wv xi
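A minimal NumPy sketch of Step 1, using toy sizes (the names and dimensions here are illustrative, not those of any real model):

```python
import numpy as np

d_model, d_k = 8, 4                 # toy sizes; real models are much larger
X = np.random.randn(4, d_model)     # rows are x1..x4 for "The brown dog ran"

# One self-attention head owns exactly three weight matrices,
# shared across every position in the sequence.
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)

Q = X @ Wq    # row i is qi (row-vector convention: qi = xi @ Wq)
K = X @ Wk    # row i is ki
V = X @ Wv    # row i is vi
```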
Step 2: For word x1, let's calculate the scores s1, s2, s3, s4 (si = q1 · ki), which represent how much attention to pay to each respective "word" vi.
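Continuing the sketch above, the scores for x1 are dot products of q1 with every key:

```python
s = Q[0] @ K.T    # s = [s1, s2, s3, s4]: how strongly x1 attends to each position
```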
Step 3: Divide each score by √dk (the dimension of the key vectors) and apply a softmax, turning the scores s1, s2, s3, s4 into attention weights a1, a2, a3, a4 that are positive and sum to 1.
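Continuing the sketch, Step 3 in code:

```python
def softmax(v):
    e = np.exp(v - v.max())          # subtract the max for numerical stability
    return e / e.sum()

a = softmax(s / np.sqrt(d_k))        # a = [a1, a2, a3, a4], positive and summing to 1
```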
These ai values do not weight our original xi word vectors directly; instead, they weight our vi vectors.
Step 4: Let's weight our vi vectors by the ai and simply sum them up. The result is z1!
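In code, z1 is just the attention-weighted sum of the value vectors:

```python
z1 = a @ V    # z1 = a1*v1 + a2*v2 + a3*v3 + a4*v4
```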
Step 5: We repeat this for all other words, yielding great new representations z2, z3, z4!
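In practice, all positions are processed at once; a vectorized sketch of the whole head, using the same toy tensors as above:

```python
scores = Q @ K.T / np.sqrt(d_k)                          # (4, 4) matrix of scaled scores
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)                    # row i holds the attention weights for word i
Z = A @ V                                                # rows are z1..z4
# Note: with several heads, the per-head Z's are concatenated and projected
# back to d_model, so the layer's output has the same shape as its input.
```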
Let's illustrate another example: z2.
Remember, we use the same 3 weight matrices Wq, Wk, Wv as we did for computing z1. This gives us q2, k2, v2.
Step 2: For word x2, let's calculate the scores s1, s2, s3, s4 (si = q2 · ki), which represent how much attention to pay to each respective "word" vi.
Step 3: As before, divide the scores by √dk and apply a softmax to get attention weights a1, a2, a3, a4 for x2.
Step 4: Weight the vi vectors by these ai and sum them up. The result is z2!
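In the sketch above, z2 is simply row 1 (0-indexed) of the vectorized computation:

```python
z2 = softmax(Q[1] @ K.T / np.sqrt(d_k)) @ V    # identical to Z[1]
```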
Tada! Now we have great, new representations z1, z2, z3, z4 via a self-attention head.
Takeaway:
Self-Attention allows us to create great, context-aware representations
Self-Attention’s outputs (z) and inputs (x) have the same shape
Layer Normalization
Normalize within each word embedding.
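A minimal sketch of what this means, assuming a single word embedding x and ignoring the learned scale and bias that a real LayerNorm also applies:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize across the features of each word embedding (last axis)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```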
Feedforward Network
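The position-wise feedforward network is applied independently to each word's vector: two linear layers with a nonlinearity in between. A minimal sketch with illustrative toy weights:

```python
d_model, d_ff = 8, 32                    # the hidden layer is typically ~4x wider
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(x):
    h = np.maximum(0.0, x @ W1 + b1)     # nonlinearity (ReLU here; GPT-style models use GeLU)
    return h @ W2 + b2                   # back to d_model, same shape as the input
```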
Transformer Decoder
[Figure: the shifted target sequence "<s> El perro marrón" (x1–x4) passes through a Masked Self-Attention Head and then an FFNN inside the Decoder, with residual connections and LayerNorm ("+ x") around each sub-layer, producing outputs r1–r4; the per-head outputs z1A, z1B, z1C, ..., z4C are concatenated.]
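A sketch of how each sub-layer (masked self-attention, FFNN) is wrapped, using the layer_norm helper sketched earlier and assuming post-norm as in the original Transformer:

```python
def add_and_norm(x, sublayer_out):
    # residual connection ("+ x") followed by LayerNorm
    return layer_norm(x + sublayer_out)

# e.g. r = add_and_norm(h, ffn(h)) for the feedforward sub-layer
```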
Transformer Encoders and Decoders
[Figure: a stack of 8 Encoders feeds a stack of 8 Decoders; "The brown dog ran" (x1–x4) goes into the Transformer and the translation "hnědý pes běžel" comes out.]
Transformer Encoders produce contextualized embeddings of each word.
Transformer Decoders generate new sequences of text.
NOTE: Transformer Decoders are identical to the Encoders, except they have an additional Attention Head in between the Self-Attention and FFNN layers. This additional Attention Head focuses on parts of the encoder's representations.
NOTE: The query vector for a Transformer Decoder's Attention Head (not its Self-Attention Head) comes from the output of the previous decoder layer. However, the key and value vectors come from the Transformer Encoders' outputs.
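A minimal sketch of this encoder-decoder ("cross") attention, continuing the NumPy examples; the weight names are illustrative:

```python
def cross_attention(dec_h, enc_out, Wq, Wk, Wv):
    Q = dec_h @ Wq                      # queries: from the previous decoder layer
    K = enc_out @ Wk                    # keys:    from the encoder stack's outputs
    V = enc_out @ Wv                    # values:  from the encoder stack's outputs
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```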
NOTE: The query, key, and value vectors for a Transformer Decoder's Self-Attention Head (not its Attention Head) all come from the output of the previous decoder layer.
IMPORTANT: The Transformer Decoders have positional embeddings, too, just like the Encoders. Critically, each position is only allowed to attend to itself and earlier positions; this masked Attention is what keeps the decoder an auto-regressive LM.
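A minimal sketch of the causal mask, again in NumPy; each position can only attend to itself and earlier positions:

```python
def masked_self_attention(H, Wq, Wk, Wv):
    # H: (T, d_model) decoder hidden states
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to future positions
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V
```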
https://jalammar.github.io/illustrated-transformer/
https://arxiv.org/pdf/1706.03762.pdf
Machine Translation results: state-of-the-art (at the time)
You can train it to translate from Language A to Language B, then train it to translate from Language B to Language C. Then, without any further training, it can translate from Language A to Language C.
Transformer++
Llama's changes over the original architecture
[Figure: in the Llama block, RMS-Norm replaces LayerNorm at every normalization point.]
LayerNorm requires three passes over x (mean, variance, then normalize); RMS-Norm requires only two passes over x (mean of squares, then normalize).
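A minimal sketch of RMS-Norm alongside the LayerNorm above (the learned scale is omitted for brevity):

```python
def rms_norm(x, eps=1e-5):
    # a single statistic (root mean square) instead of mean and variance
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
```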
GeLU & SwiGLU
Llama FFN
Original Transformer: Up proj → GeLU → Down proj.
Llama: an Up proj and a Gate proj run in parallel; the gate is passed through SiLU and multiplied element-wise with the up projection (SwiGLU), then fed to the Down proj.
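A minimal sketch contrasting the two FFN variants (toy weights such as W_up, W_gate, W_down are illustrative; biases omitted for brevity):

```python
def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    return x / (1.0 + np.exp(-x))       # SiLU / Swish: x * sigmoid(x)

def ffn_original(x, W_up, W_down):
    return gelu(x @ W_up) @ W_down

def ffn_llama(x, W_up, W_gate, W_down):
    # SwiGLU: the SiLU-activated gate modulates the up projection element-wise
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```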