
Midterm Review
Sookyung Kim
sookim@ewha.ac.kr


Announcement

  • Midterm is on Monday 4/27, during class time (12:30 - 1:45 PM)
    • Closed-book; no calculator
    • Solutions may be written in either English or Korean
  • Topics: RNN, attention, Transformer
  • Midterm format
    • 10 multiple-choice questions [20 pts]
    • 4 analytical (written-response) questions [80 pts]


Vanilla RNN

  • Then what should we do inside the RNN cell?
    • One simple way is to take linear transformations (with W_hh and W_xh) of the two inputs (the previous hidden state h_{t-1} and the input x_t),
    • then apply a nonlinearity before updating the hidden state to h_t.
    • For the output, we may apply another linear transformation (W_hy) to h_t, as sketched below.

[Figure: the RNN cell f_W takes the previous hidden state h_{t-1} (via W_hh) and the input x_t (via W_xh), produces the new hidden state h_t, and emits the output y_t via W_hy.]
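A minimal numpy sketch of one vanilla RNN step as described above; the tanh nonlinearity and the bias terms are assumptions, not fixed by the slide:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        # linear transformations of the previous hidden state and the input, then a nonlinearity
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
        # the output is another linear transformation of the new hidden state
        y_t = W_hy @ h_t + b_y
        return h_t, y_t

    # example shapes: x_t (d_in,), h_prev (d_h,), W_xh (d_h, d_in), W_hh (d_h, d_h), W_hy (d_out, d_h)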


Review: Attention Idea

  • Attention function: Attention(Q, K, V) = attention value
    For a query (context) and key-value pairs (references), the attention value is the weighted average of the values, where each weight is proportional to the relevance between the query and the corresponding key (see the sketch after the figure below).
    • Q and K must be comparable (usually of the same dimensionality).
    • V and the attention value have the same dimensionality, obviously.
    • In many applications, all four of these have the same dimensionality.

[Figure: a Query is compared against Key1, Key2, Key3; the resulting weights combine Value1, Value2, Value3 into the attention value.]
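A minimal sketch of the attention function described above; using dot products as the relevance score and softmax normalization are assumptions (any relevance measure giving weights proportional to query-key similarity fits the definition):

    import numpy as np

    def attention(query, keys, values):
        # relevance between the query and each key (dot product is one common choice)
        scores = np.array([query @ k for k in keys])
        # normalize the scores so the result is a weighted average
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # attention value: weighted average of the values
        return sum(w * v for w, v in zip(weights, values))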


Transformer: Main Idea

  • So, what should be the Query, Key, and Value?

[Figure: each token x_i is mapped by W_Q, W_K, W_V to Q_i, K_i, V_i; W_O maps the attention value back to the space of x_i.]

  • With the Transformer, we make them!
    • From the input tokens {x_1, x_2, …, x_N},
    • each token x_i is mapped to its own Query Q_i, Key K_i, and Value V_i vectors by a linear transformation.
    • The linear weights (W_Q, W_K, W_V) are learned parameters, shared by all inputs.
    • W_Q (likewise W_K, W_V) learns how to represent a vector so that it serves as a Query (Key, Value) in general.
  • We need another learnable parameter, W_O, which maps the attention value back to the original space (see below).
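In symbols, the projections described above are (restating the slide; the column-vector convention is an assumption):

\[
Q_i = W_Q\, x_i, \qquad K_i = W_K\, x_i, \qquad V_i = W_V\, x_i, \qquad i = 1, \dots, N .
\]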


Transformer: Main Idea

  • Then, how do we perform attention?
    • Each token x_i becomes the Query when we learn about token i.
      • You are the main character in your life!
    • The references are all tokens {x_1, …, x_N} in the input sequence, including x_i itself.
      • Your friends are a mirror that reflects you.
  • From this, we perform the attention:
    • Each element x_i is represented as a weighted sum over the elements of x, where the weights come from similarities computed with the Keys and the summands are the Values.
    • z_i = w_1 V_1 + … + w_N V_N, where w_j = cos(Q_i, K_j)
    • W_O maps z_i from the Value space back to the original embedding space (see the sketch below).
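A minimal numpy sketch of this procedure for one token i, following the slide; normalizing the similarities so the weights sum to 1 (as in the 0.93 / 0.01 / 0.06 example) is an assumption, and the projection shapes are hypothetical:

    import numpy as np

    def self_attention_for_token(i, X, W_Q, W_K, W_V, W_O):
        # X: (N, d) token embeddings; W_Q, W_K: (d, d_k), W_V: (d, d_v), W_O: (d_v, d)
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        # similarity between the query of token i and every key (cosine, as on the slide)
        sims = np.array([Q[i] @ K[j] / (np.linalg.norm(Q[i]) * np.linalg.norm(K[j]))
                         for j in range(len(X))])
        # turn the similarities into weights that sum to 1
        w = np.exp(sims - sims.max())
        w /= w.sum()
        # weighted sum of the values, then map back to the original embedding space
        z_i = w @ V
        return z_i @ W_O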

[Figure, for i = 1: x_1's Query Q_1 is compared with the Keys K_1, K_2, K_3 (weights 0.93, 0.01, 0.06); the weighted sum of V_1, V_2, V_3 gives z_1, which W_O maps back. The same procedure is performed for all i = 1, …, N.]


Transformer: Main Idea

  • The resulting embedding z_1 tends to be similar to its original one (x_1), because cos(Q_1, K_1) is likely to be much higher than the other cos(Q_1, K_j).
  • The resulting z_1 is still not exactly the same as the original one, being slightly affected by its context (here, x_2, x_3).
  • Usually, this step is repeated multiple times to further contextualize.



Inside the Transformer

Step 1: Input Embedding

  • Input is a sequence of tokens.
    • Each token is a vector of the same size, represented in a modality-specific way.
    • Examples
      • Text: pre-trained word embeddings
      • Image: fixed-size small image patches
      • Video: frame embeddings


Transformer (Encoder)

Step 2: Contextualizing the Embeddings

  • Query, Key, Value representations:
    • For each word, we learn to map it to Q, K, V: instead of using the original embedding directly, we create (usually smaller) representations that work as a query, a key, and a value, via linear transformations.

At the beginning, Q, K, V are just random projections of the input X.

As words are encountered during training, W_Q, W_K, W_V gradually learn to map X so that Q, K, and V each serve their own purpose.


Transformer (Encoder)

Step 2: Contextualizing the Embeddings

  • Self-attention:
    • We will play in this smaller Q, K, V space to attend.
    • For each input word as the query (Q), we compute its similarity with all words in the sequence as keys (K), including the queried word itself.
    • Then, all words as values (V) are weighted-summed. → This is the attention value (Z), the new contextualized word embedding of the same size.

As the query Q itself is included in the weighted sum, Z still tends to be self-dominated.
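In matrix form, this similarity-then-weighted-sum step is the scaled dot-product attention of the original Transformer paper, stated here for reference:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]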


Transformer (Encoder)

Step 2: Contextualizing the word embedding

  • Multi-head Self-attention:

Having multiple projections to Q, K, V is beneficial.

This allows the model to jointly attend to information from different representation subspaces at different positions.


Transformer (Encoder)

Step 2: Contextualizing the word embedding

  • Multi-head Self-attention:
    • Multiple self-attentions output multiple attention values (Z_0, Z_1, …, Z_{k-1}).
    • Simply concatenate them, then linearly transform the result back to the original input size with W_O (see the sketch below).
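A minimal sketch of this combination step, assuming each head has already produced its own attention output of shape (N, d_v):

    import numpy as np

    def combine_heads(Z_heads, W_O):
        # Z_heads: list of k head outputs, each of shape (N, d_v)
        # concatenate along the feature dimension: (N, k * d_v)
        Z_cat = np.concatenate(Z_heads, axis=-1)
        # linearly transform back to the original input size with W_O: (k * d_v, d_model)
        return Z_cat @ W_O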


Transformer (Encoder)

Step 3: Feed-forward Layer

  • Each contextualized embedding goes through an additional FC layer.
    • Applied separately and identically; there is no cross-token dependency.
  • The output is still a contextualized token embedding of the same size.

A residual connection and layer normalization are added at the end of both the multi-head self-attention and the FC layer (see the sketch below).
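A minimal sketch of how one encoder block chains these pieces, assuming mhsa and ffn are callables mapping (N, d_model) arrays to the same shape; the learnable scale and shift of real layer normalization are omitted:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # normalize each token embedding over its feature dimension
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def encoder_block(X, mhsa, ffn):
        # sublayer 1: multi-head self-attention with residual connection and layer norm
        X = layer_norm(X + mhsa(X))
        # sublayer 2: feed-forward layer applied separately and identically to every token
        X = layer_norm(X + ffn(X))
        return X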


Transformer (Encoder)

Stacked Self-attention Blocks

  • Multiple (N) self-attention blocks are stacked.
    • The lowest block takes a pre-trained embedding as input.
    • Blocks stacked after the first one take the output embedding of the previous one.
  • The last layer outputs a sequence of transformed tokens of the same length as the input, where each token is again a contextualized embedding of the same size.


Transformer (Encoder)

Positional Encoding

  • Unlike in RNNs, the tokens so far have had no concept of order.
    • Each token just attended to the other tokens in the sequence.
  • To inject order information, the Transformer adds a positional encoding to the input:
    • Same words at different locations will have different overall representations.
    • With sinusoidal encoding, we can deal with arbitrarily long sequences at test time.


Transformer (Encoder)

Positional Encoding

  • With i = 0, …, d_model/2 - 1:
    • 2i/d_model gradually increases from 0 to 1
    • 1/10000^(2i/d_model) gradually decreases from 1 to 0
    • With higher i (rear indices), the input pos/10000^(2i/d_model) changes slowly. With lower i (front indices), it changes more frequently.
  • No two different positions have the same encoding.
  • Adjacent pos values (the order of a word in the sentence) have similar positional encodings.
    • There is no absolute binding between the position of a word and its role in the sentence. (With a longer subject, the verb is pushed to a later index.)
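For reference, the sinusoidal encoding being described is the standard one from the original Transformer paper:

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]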

[Figure: the positional encoding matrix, indexed by pos (word order) and i (embedding dimension), of width d_model.]


Transformer (Decoder)

Step 4: Decoder Input

  • Given Z = {z_1, …, z_n} (the encoder output), the decoder generates an output sequence auto-regressively.
    • It consumes the previously generated symbols as additional input when generating the next one.
  • Positional encoding is applied in the same manner as in the encoder.


Transformer (Decoder)

Step 5: Masked Multi-head Self-attention

  • The input sequence (here, the output generated so far) is fed into a multi-head self-attention layer as in the encoder.
  • Since we have no idea what comes after the current time step, the future positions are masked out (see the sketch below).
  • Other than this masking, this is exactly the same as the multi-head self-attention layer in the encoder:
    • Each token in the input is contextualized and transformed.
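A minimal sketch of the masking, assuming S is a (T, T) matrix of query-key scores for the T tokens generated so far; keys after each query's time step are set to -inf before the softmax so they receive zero weight:

    import numpy as np

    def masked_attention_weights(S):
        T = S.shape[0]
        # mask out keys that lie after each query's time step
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        S = np.where(future, -np.inf, S)
        # row-wise softmax: each query's weights sum to 1 over the allowed positions
        S = S - S.max(axis=-1, keepdims=True)
        W = np.exp(S)
        return W / W.sum(axis=-1, keepdims=True)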


Transformer (Decoder)

Step 6: Encoder-Decoder Attention

  • Now, the decoder input attends to the encoder output:
    • Q: the query from the decoder
    • K, V: the key and value from the encoder
    • Other than this, it is the same as a multi-head self-attention layer (see the sketch below).
    • No masking in this layer, as it is okay (and necessary) to look at the entire encoded sequence.
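A minimal sketch of this encoder-decoder attention, assuming decoder-side inputs D of shape (T, d), encoder output Z of shape (n, d), and hypothetical projection matrices as parameters:

    import numpy as np

    def encoder_decoder_attention(D, Z, W_Q, W_K, W_V):
        Q = D @ W_Q            # queries come from the decoder
        K = Z @ W_K            # keys come from the encoder output
        V = Z @ W_V            # values come from the encoder output
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # no masking: every decoder position may attend to the entire encoded sequence
        scores = scores - scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A = A / A.sum(axis=-1, keepdims=True)
        return A @ V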


Transformer (Decoder)

Step 7: Feed-forward Layer

  • Same as encoder
  • Residual, layer normalization: same as encoder
  • N stacked blocks: same as encoder
    • The last layer's output is fed back as input at the next time step.

Step 8: Linear Layer

  • Maps the output embedding to class scores.
    • Output size: equal to the vocabulary size


Transformer (Decoder)

Step 9: Softmax Layer

  • Applies softmax to the class scores to turn them into a probability distribution.
  • These scores (which sum to 1) are compared with the one-hot-encoded ground truth.
    • Then we backpropagate with some loss (e.g., cross-entropy); see the sketch below.
  • These decoding steps are repeated until the next word is predicted as [EOS] (End of Sentence).
  • The output sentence may be chosen greedily (always taking the top one), or the decision may be deferred over top-k choices (called beam search).
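A minimal sketch of this step for a single output position, assuming logits is the vocabulary-sized class-score vector and target is the index of the ground-truth token:

    import numpy as np

    def softmax_cross_entropy(logits, target):
        # softmax: turn the class scores into a probability distribution (sums to 1)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # cross-entropy against the one-hot ground truth reduces to -log p[target]
        return -np.log(p[target])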


Bidirectional Encoder Representations from Transformers (BERT)


BERT

  • Bidirectional Encoder Representations from Transformers
    • Large-scale pre-training of word embeddings using the Transformer encoder
    • Self-supervised: no human labeling required
    • Uses the encoder only (bi-directional; no causal masking of future tokens)
  • https://arxiv.org/pdf/1810.04805.pdf


BERT

  • The input sequence consists of two sentences; each token's representation is the sum of three things:
    • Token embedding: a pre-trained word embedding (WordPiece)
      • [CLS]: classification token, always placed at the beginning. The final hidden state for this token is used as the aggregate sequence representation for classification tasks.
      • [SEP]: separator token, used to mark the end of a sentence
    • Segment embedding: a learned embedding indicating which sentence each token belongs to
    • Position embedding: a learned embedding for each position (see the sketch below)
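A minimal sketch of building this input representation, assuming token_emb, segment_emb, and pos_emb are embedding lookup tables (numpy arrays) and token_ids / segment_ids are integer index arrays for the packed sentence pair:

    import numpy as np

    def bert_input(token_ids, segment_ids, token_emb, segment_emb, pos_emb):
        positions = np.arange(len(token_ids))
        # the encoder input is the elementwise sum of the three embeddings
        return token_emb[token_ids] + segment_emb[segment_ids] + pos_emb[positions]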


BERT

  • Training task 1: Masked Language Modeling (MLM)
    • Similar to sentence completion in standard English exams: figuring out the hidden words using the context.
    • Mask 15% of the tokens at random (substituting them with a special [MASK] token).
    • Classify the output embedding at these positions over the vocabulary (see the sketch below).
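A minimal sketch of the masking step, assuming mask_id is the vocabulary id of the [MASK] token; the paper's refinement of sometimes keeping or randomizing the selected tokens is omitted:

    import numpy as np

    def apply_mlm_mask(token_ids, mask_id, mask_prob=0.15, rng=None):
        rng = rng or np.random.default_rng()
        token_ids = np.asarray(token_ids)
        # pick roughly 15% of the positions at random
        is_masked = rng.random(len(token_ids)) < mask_prob
        corrupted = np.where(is_masked, mask_id, token_ids)
        # the loss is computed only at the masked positions,
        # classifying the output embedding there over the vocabulary
        return corrupted, is_masked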


BERT

  • Training task 2: Next Sentence Prediction (NSP)
    • A binary classification problem, predicting if the two sentences in the input are consecutive or not.
    • Half of training data contains two consecutive sentences (B is the actual next sentence of A).
    • The other half contains two sentences randomly chosen from the corpus.
  • According to the authors, their model achieved ~98% accuracy on this task, and this was very beneficial to multiple downstream tasks.
    • Later, this turned out to be less important than MLM.
  • These days, pre-trained BERT is a default choice for word embeddings.


Transformer for Image Data


ViT: Vision Transformer

  • The standard Transformer model is directly applied to images:
    • An image is split into 16×16 patches. (Each token is a 16×16 image patch instead of a word.)
    • The sequence of linear embeddings of these patches is fed into a Transformer.
    • Image patches are treated in the same way as tokens (words).
    • Eventually, an MLP is added on top of the [CLS] token to classify the input image (see the sketch below).
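A minimal sketch of the patch-embedding step, assuming an image array of shape (H, W, C) with H and W divisible by the patch size, and a hypothetical projection matrix E of shape (P*P*C, D):

    import numpy as np

    def patch_embed(image, E, P=16):
        H, W, C = image.shape
        patches = []
        # split the image into non-overlapping P x P patches, row by row
        for r in range(0, H, P):
            for c in range(0, W, P):
                patches.append(image[r:r + P, c:c + P, :].reshape(-1))
        X = np.stack(patches)          # (num_patches, P*P*C)
        return X @ E                   # linear embedding of each patch: (num_patches, D)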


ViT: Vision Transformer

[Figure: ViT overview. The input image is split into numbered patches (1-9). Input tokens are P×P patches (P = 16 or 32); each patch x_p^i is embedded by a linear projection E: P²·C → D (C = 3, D = 1024). A learnable classification token x_class ([CLS]) is prepended, a positional encoding E_pos: (N+1) → D is added (N = #patches, D = 1024), and the sequence passes through a Transformer encoder of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks, with an MLP head on top for classification.]


ViT: Position Embeddings

  • ViT learns to encode distance within the image in the similarity of position embeddings.
    • Closer patches tend to have more similar position embeddings.
  • A row-column structure emerges.
    • Patches in the same row/column have similar embeddings, automatically learned from the data.
  • Hand-crafted 2D-aware embedding variants do not yield improvements for this reason.
