Midterm Review
Sookyung Kim
sookim@ewha.ac.kr
Vanilla RNN
[Figure: the vanilla RNN cell. The input x_t and the previous hidden state h_{t-1} enter f_W to produce the new state h_t and the output y_t, with weights W_xh (input-to-hidden), W_hh (hidden-to-hidden), and W_hy (hidden-to-output): h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t.]
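A minimal sketch of this recurrence in PyTorch, assuming tanh as the nonlinearity and hypothetical dimensions:

```python
import torch

# Hypothetical sizes for illustration.
D_in, D_h, D_out = 8, 16, 4
W_xh = torch.randn(D_h, D_in) * 0.1    # input-to-hidden weights
W_hh = torch.randn(D_h, D_h) * 0.1     # hidden-to-hidden weights
W_hy = torch.randn(D_out, D_h) * 0.1   # hidden-to-output weights

def rnn_step(x_t, h_prev):
    # h_t = f_W(h_{t-1}, x_t) = tanh(W_hh h_{t-1} + W_xh x_t)
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t                   # y_t = W_hy h_t
    return h_t, y_t

h = torch.zeros(D_h)
for x_t in torch.randn(5, D_in):       # unroll over a length-5 input sequence
    h, y = rnn_step(x_t, h)
```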
Review: Attention Idea
[Figure: a query is scored against keys Key1, Key2, Key3; the resulting weights combine Value1, Value2, Value3 into a single attention value.]
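A minimal sketch of this idea as scaled dot-product attention (the √d scaling is the usual Transformer convention; the figure itself only shows query-key matching and the weighted sum of values):

```python
import torch
import torch.nn.functional as F

d = 16                                  # hypothetical key/value dimension
query = torch.randn(d)                  # one query
keys = torch.randn(3, d)                # Key1..Key3
values = torch.randn(3, d)              # Value1..Value3

scores = keys @ query / d ** 0.5        # similarity of the query to each key
weights = F.softmax(scores, dim=0)      # attention weights, sum to 1
attention_value = weights @ values      # weighted sum of the values
```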
Transformer: Main Idea
[Figure: each input x_i is projected into a query Q_i = W_Q x_i, a key K_i = W_K x_i, and a value V_i = W_V x_i; the attention-weighted sum ∑_j w_ij V_j is projected by W_O back into the space of x_i.]
For i = 1: [Figure: the query Q1 = W_Q x1 is scored against the keys K1, K2, K3 (obtained from x1, x2, x3 via W_K), giving attention weights 0.93, 0.01, 0.06; these weight the values V1, V2, V3 (obtained via W_V), and the weighted sum is projected by W_O to produce z1.]
The same procedure is performed for all i = 1, …, N.
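A minimal sketch of the full procedure for all positions at once, under hypothetical sizes; the softmax over scaled dot products plays the role of the weights in the example above:

```python
import torch
import torch.nn.functional as F

N, d_model, d_k = 3, 32, 16              # hypothetical sizes: 3 tokens x1..x3
X = torch.randn(N, d_model)
W_Q = torch.randn(d_model, d_k) * 0.1
W_K = torch.randn(d_model, d_k) * 0.1
W_V = torch.randn(d_model, d_k) * 0.1
W_O = torch.randn(d_k, d_model) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # Q_i, K_i, V_i for every x_i
weights = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # row i: weights of x_i over all keys
Z = (weights @ V) @ W_O                  # z_i = W_O · sum_j weights_ij V_j
```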
Inside the Transformer
Step 1: Input Embedding
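A minimal sketch of Step 1, assuming a learned lookup table (the sizes and token ids are hypothetical):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512          # hypothetical sizes
embed = nn.Embedding(vocab_size, d_model) # lookup table: token id -> vector

tokens = torch.tensor([[5, 42, 7]])       # a toy batch of three token ids
X = embed(tokens)                         # shape (1, 3, 512)
```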
Transformer (Encoder)
Step 2: Contextualizing the Embeddings
At the beginning of training, Q, K, and V are just random projections of the input X.
As training proceeds, W_Q, W_K, and W_V gradually learn mappings of X under which Q, K, and V each serve their intended purpose.
Because each token's own value is included in the weighted sum (a query typically matches its own key strongly), the output Z tends to remain self-dominated.
Having multiple projections to Q, K, V (i.e., multiple attention heads) is beneficial: it allows the model to jointly attend to information from different representation subspaces at different positions, as in the sketch below.
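A minimal sketch of multi-head self-attention under hypothetical sizes: each head has its own W_Q, W_K, W_V and attends in its own subspace, and the concatenated head outputs are projected by W_O:

```python
import torch
import torch.nn.functional as F

N, d_model, n_heads = 3, 32, 4
d_k = d_model // n_heads                 # per-head dimension
X = torch.randn(N, d_model)
# One W_Q/W_K/W_V per head, stored as (n_heads, d_model, d_k).
W_Q = torch.randn(n_heads, d_model, d_k) * 0.1
W_K = torch.randn(n_heads, d_model, d_k) * 0.1
W_V = torch.randn(n_heads, d_model, d_k) * 0.1
W_O = torch.randn(n_heads * d_k, d_model) * 0.1

heads = []
for h in range(n_heads):                 # each head attends in its own subspace
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
    A = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)
    heads.append(A @ V)
Z = torch.cat(heads, dim=-1) @ W_O       # concatenate heads, project back
```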
Step 3: Feed-forward Layer
A residual connection and layer normalization are added after both the multi-head self-attention and the feed-forward (fully connected) layer.
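A minimal sketch of one encoder block in post-norm form (Add & Norm after each sub-layer); the mha argument stands in for the multi-head self-attention above, and all sizes are hypothetical:

```python
import torch
import torch.nn as nn

d_model, d_ff = 32, 128                    # hypothetical sizes
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
ffn = nn.Sequential(                       # position-wise feed-forward layer
    nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
)

def encoder_block(x, mha):
    # mha: any self-attention function mapping (N, d_model) -> (N, d_model)
    x = norm1(x + mha(x))                  # residual + layer norm after attention
    x = norm2(x + ffn(x))                  # residual + layer norm after FFN
    return x

out = encoder_block(torch.randn(3, d_model), mha=lambda x: x)  # identity stand-in
```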
Stacked Self-attention Blocks
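Stacking then amounts to applying such blocks repeatedly (a sketch):

```python
def encoder(x, blocks):
    # Each block further contextualizes the same token representations.
    for block in blocks:
        x = block(x)
    return x
```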
Positional Encoding
Self-attention alone is order-agnostic, so a positional encoding is added to the input embeddings. The sinusoidal encoding, for position pos and dimension index i (of d_model):
$PE_{(pos,\,2i)} = \sin\!\big(pos/10000^{2i/d_{model}}\big), \qquad PE_{(pos,\,2i+1)} = \cos\!\big(pos/10000^{2i/d_{model}}\big)$
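A minimal sketch of this encoding, with hypothetical sequence length and model width:

```python
import torch

max_len, d_model = 50, 32                          # hypothetical sizes
pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
i = torch.arange(0, d_model, 2).float()            # even dimension indices 2i
angle = pos / 10000 ** (i / d_model)               # pos / 10000^(2i/d_model)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(angle)                     # PE(pos, 2i)   = sin(...)
pe[:, 1::2] = torch.cos(angle)                     # PE(pos, 2i+1) = cos(...)
# pe is added to the input embeddings so the model can use token order.
```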
Transformer (Decoder)
Step 4: Decoder Input
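During training, the decoder input is typically the target sequence shifted one position to the right behind a start token (a sketch with hypothetical token ids):

```python
import torch

BOS = 1                                   # hypothetical start-of-sequence id
target = torch.tensor([5, 42, 7, 2])      # hypothetical target token ids
decoder_input = torch.cat([torch.tensor([BOS]), target[:-1]])  # shifted right
```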
Step 5: Masked Multi-head Self-attention
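A minimal sketch of the causal mask used here: position i may only attend to positions ≤ i, implemented by setting future scores to −∞ before the softmax (sizes hypothetical):

```python
import torch
import torch.nn.functional as F

N, d_k = 4, 16                                     # hypothetical sizes
Q, K, V = (torch.randn(N, d_k) for _ in range(3))

scores = Q @ K.T / d_k ** 0.5
mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # hide future positions
weights = F.softmax(scores, dim=-1)                # rows are causal distributions
Z = weights @ V
```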
Step 6: Encoder-Decoder Attention
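A minimal sketch of encoder-decoder (cross) attention, assuming the standard arrangement in which queries come from the decoder and keys/values from the encoder output; the W_K/W_V projections are omitted for brevity:

```python
import torch
import torch.nn.functional as F

d_k = 16                                 # hypothetical dimension
dec = torch.randn(4, d_k)                # decoder states (queries come from here)
enc = torch.randn(6, d_k)                # encoder output (keys/values come from here)

weights = F.softmax(dec @ enc.T / d_k ** 0.5, dim=-1)  # each target position
Z = weights @ enc                        # attends over all source positions
```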
Step 7: Feed-forward Layer
Step 8: Linear Layer
Step 9: Softmax Layer
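A minimal sketch of Steps 8 and 9 together: a linear layer maps each decoder state to vocabulary logits, and softmax turns them into next-token probabilities (sizes hypothetical):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 32, 10000          # hypothetical sizes
to_logits = nn.Linear(d_model, vocab_size)

h = torch.randn(4, d_model)              # decoder output states
probs = torch.softmax(to_logits(h), dim=-1)  # Step 8 (linear) + Step 9 (softmax)
next_token = probs[-1].argmax()          # greedy pick for the last position
```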
Bidirectional Encoder Representations from Transformers (BERT)
Transformer for Image Data
ViT: Vision Transformer
[Figure: the input image is split into P×P patches (P = 16 or 32). Each flattened patch (length P²·C, with C = 3 channels) is mapped to a D-dimensional token (D = 1024) by the patch-embedding linear projection E: P²·C → D. A learnable class token x_class ([CLS]) is prepended to the patch tokens x_p1, …, x_pN, and a positional encoding E_pos: (N+1) → D is added (N = number of patches). The token sequence passes through the Transformer encoder, alternating multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks, and an MLP head on the class token produces the class prediction (the figure shows digit classes 0–9).]
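A minimal sketch of the patch embedding and token assembly shown above, using the figure's sizes (P = 16, C = 3, D = 1024) and a hypothetical 224×224 input:

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 1024                    # patch size, channels, embed dim
img = torch.randn(1, C, 224, 224)        # one hypothetical 224x224 RGB image
N = (224 // P) ** 2                      # number of patches

# Split into P x P patches and flatten each to length P*P*C.
patches = img.unfold(2, P, P).unfold(3, P, P)            # (1, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, P * P * C)

E = nn.Linear(P * P * C, D)              # patch embedding: P^2*C -> D
x_class = nn.Parameter(torch.zeros(1, 1, D))             # learnable [CLS] token
E_pos = nn.Parameter(torch.zeros(1, N + 1, D))           # positional encoding

tokens = torch.cat([x_class, E(patches)], dim=1) + E_pos  # (1, N+1, D)
# tokens feed the Transformer encoder; the MLP head reads tokens[:, 0].
```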
ViT: Position Embeddings