
Midterm Review
Sookyung Kim
sookim@ewha.ac.kr


Announcement

  • Midterm is on Monday 4/27, during class time (12:30 - 1:45 PM)
    • Closed-book; no calculator
    • Solutions may be written in either English or Korean
  • Topics: RNN, attention, Transformer
  • Midterm format
    • 10 multiple-choice questions [20 pts]
    • 4 analytical (written-response) questions [80 pts]


Vanilla RNN

  • Then what should we do inside the RNN cell?
    • One simple way is to take linear transformations (with W_hh and W_xh) of the two inputs (the previous hidden state h_{t-1} and the input x_t),
    • then apply a nonlinearity before updating the hidden state to h_t.
    • For the output, we may apply another linear transformation (W_hy) to h_t, as sketched below.

[Figure: the RNN cell f_W takes the previous hidden state h_{t-1} (via W_hh) and the input x_t (via W_xh), produces the new hidden state h_t, and emits the output y_t via W_hy.]
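A minimal numpy sketch of one vanilla RNN step as described above; the tanh nonlinearity and the bias terms are assumptions, not fixed by the slide:

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
        # linear transformations of the previous hidden state and the input, then a nonlinearity
        h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
        # the output is another linear transformation of the new hidden state
        y_t = W_hy @ h_t + b_y
        return h_t, y_t

    # example shapes: x_t (d_in,), h_prev (d_h,), W_xh (d_h, d_in), W_hh (d_h, d_h), W_hy (d_out, d_h)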


Review: Attention Idea

  • Attention function: Attention(Q, K, V) = attention value
    For a query (context) and key-value pairs (references), the attention value is the weighted average of the values, where each weight is proportional to the relevance between the query and the corresponding key (see the sketch after the figure below).
    • Q and K must be comparable (usually of the same dimensionality).
    • V and the attention value have the same dimensionality, obviously.
    • In many applications, all four of these have the same dimensionality.

[Figure: a Query is compared against Key1, Key2, Key3; the resulting weights combine Value1, Value2, Value3 into the attention value.]
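A minimal sketch of the attention function described above; using dot products as the relevance score and softmax normalization are assumptions (any relevance measure giving weights proportional to query-key similarity fits the definition):

    import numpy as np

    def attention(query, keys, values):
        # relevance between the query and each key (dot product is one common choice)
        scores = np.array([query @ k for k in keys])
        # normalize the scores so the result is a weighted average
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # attention value: weighted average of the values
        return sum(w * v for w, v in zip(weights, values))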


Transformer: Main Idea

  • So, what should be the Query, Key, and Value?

[Figure: each token x_i is mapped by W_Q, W_K, W_V to Q_i, K_i, V_i; W_O maps the attention value back to the space of x_i.]

  • With the Transformer, we make them!
    • From the input tokens {x_1, x_2, …, x_N},
    • each token x_i is mapped to its own Query Q_i, Key K_i, and Value V_i vectors by a linear transformation.
    • The linear weights (W_Q, W_K, W_V) are learned parameters, shared by all inputs.
    • W_Q (likewise W_K, W_V) learns how to represent a vector so that it serves as a Query (Key, Value) in general.
  • We need another learnable parameter, W_O, which maps the attention value back to the original space (see below).
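In symbols, the projections described above are (restating the slide; the column-vector convention is an assumption):

\[
Q_i = W_Q\, x_i, \qquad K_i = W_K\, x_i, \qquad V_i = W_V\, x_i, \qquad i = 1, \dots, N .
\]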


Transformer: Main Idea

  • Then, how do we perform attention?
    • Each token x_i becomes the Query when we learn about token i.
      • You are the main character in your life!
    • The references are all tokens {x_1, …, x_N} in the input sequence, including x_i itself.
      • Your friends are a mirror that reflects you.
  • From this, we perform the attention:
    • Each element x_i is represented as a weighted sum over the elements of x, where the weights come from similarities computed with the Keys and the summands are the Values.
    • z_i = w_1 V_1 + … + w_N V_N, where w_j = cos(Q_i, K_j)
    • W_O maps z_i from the Value space back to the original embedding space (see the sketch below).
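A minimal numpy sketch of this procedure for one token i, following the slide; normalizing the similarities so the weights sum to 1 (as in the 0.93 / 0.01 / 0.06 example) is an assumption, and the projection shapes are hypothetical:

    import numpy as np

    def self_attention_for_token(i, X, W_Q, W_K, W_V, W_O):
        # X: (N, d) token embeddings; W_Q, W_K: (d, d_k), W_V: (d, d_v), W_O: (d_v, d)
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        # similarity between the query of token i and every key (cosine, as on the slide)
        sims = np.array([Q[i] @ K[j] / (np.linalg.norm(Q[i]) * np.linalg.norm(K[j]))
                         for j in range(len(X))])
        # turn the similarities into weights that sum to 1
        w = np.exp(sims - sims.max())
        w /= w.sum()
        # weighted sum of the values, then map back to the original embedding space
        z_i = w @ V
        return z_i @ W_O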

[Figure, for i = 1: x_1's Query Q_1 is compared with the Keys K_1, K_2, K_3 (weights 0.93, 0.01, 0.06); the weighted sum of V_1, V_2, V_3 gives z_1, which W_O maps back. The same procedure is performed for all i = 1, …, N.]


Transformer: Main Idea

  • The resulting embedding z_1 tends to be similar to its original one (x_1), because cos(Q_1, K_1) is likely to be much higher than the other cos(Q_1, K_j).
  • The resulting z_1 is still not exactly the same as the original one, being slightly affected by its context (here, x_2, x_3).
  • Usually, this step is repeated multiple times to further contextualize.



Inside the Transformer

Step 1: Input Embedding

  • Input is a sequence of tokens.
    • Each token is a vector of the same size, represented in a modality-specific way.
    • Examples
      • Text: pre-trained word embeddings
      • Image: fixed-size small image patches
      • Video: frame embeddings


Transformer (Encoder)

Step 2: Contextualizing the Embeddings

  • Query, Key, Value representations:
    • For each word, we learn to map it to Q, K, V: instead of using the original embedding directly, we create (usually smaller) representations that work as a query, a key, and a value, via linear transformations.

At the beginning, Q, K, V are just random projections of the input X.

As words are encountered during training, W_Q, W_K, W_V gradually learn to map X so that Q, K, and V each serve their own purpose.


Transformer (Encoder)

Step 2: Contextualizing the Embeddings

  • Self-attention:
    • We will play in this smaller Q, K, V space to attend.
    • For each input word as the query (Q), we compute its similarity with all words in the sequence as keys (K), including the queried word itself.
    • Then, all words as values (V) are weighted-summed. → This is the attention value (Z), the new contextualized word embedding of the same size.

As the query Q itself is included in the weighted sum, Z still tends to be self-dominated.
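In matrix form, this similarity-then-weighted-sum step is the scaled dot-product attention of the original Transformer paper, stated here for reference:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]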


Transformer (Encoder)

Step 2: Contextualizing the word embedding

  • Multi-head Self-attention:

Having multiple projections to Q, K, V is beneficial.

This allows the model to jointly attend to information from different representation subspaces at different positions.


Transformer (Encoder)

Step 2: Contextualizing the word embedding

  • Multi-head Self-attention:
    • Multiple self-attentions output multiple attention values (Z_0, Z_1, …, Z_{k-1}).
    • Simply concatenate them, then linearly transform the result back to the original input size with W_O (see the sketch below).
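A minimal sketch of this combination step, assuming each head has already produced its own attention output of shape (N, d_v):

    import numpy as np

    def combine_heads(Z_heads, W_O):
        # Z_heads: list of k head outputs, each of shape (N, d_v)
        # concatenate along the feature dimension: (N, k * d_v)
        Z_cat = np.concatenate(Z_heads, axis=-1)
        # linearly transform back to the original input size with W_O: (k * d_v, d_model)
        return Z_cat @ W_O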


Transformer (Encoder)

Step 3: Feed-forward Layer

  • Each contextualized embedding goes through an additional FC layer.
    • Applied separately and identically; there is no cross-token dependency.
  • The output is still a contextualized token embedding of the same size.

A residual connection and layer normalization are added at the end of both the multi-head self-attention and the FC layer (see the sketch below).
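A minimal sketch of how one encoder block chains these pieces, assuming mhsa and ffn are callables mapping (N, d_model) arrays to the same shape; the learnable scale and shift of real layer normalization are omitted:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        # normalize each token embedding over its feature dimension
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def encoder_block(X, mhsa, ffn):
        # sublayer 1: multi-head self-attention with residual connection and layer norm
        X = layer_norm(X + mhsa(X))
        # sublayer 2: feed-forward layer applied separately and identically to every token
        X = layer_norm(X + ffn(X))
        return X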


Transformer (Encoder)

Stacked Self-attention Blocks

  • Multiple (N) self-attention blocks are stacked.
    • The lowest block takes a pre-trained embedding as input.
    • Blocks stacked after the first one take the output embedding of the previous one.
  • The last layer outputs a sequence of transformed tokens of the same length as the input, where each token is again a contextualized embedding of the same size.


Transformer (Encoder)

Positional Encoding

  • Unlike in RNNs, the tokens so far have had no concept of order.
    • Each token just attended to the other tokens in the sequence.
  • To inject order information, the Transformer adds a positional encoding to the input:
    • Same words at different locations will have different overall representations.
    • With sinusoidal encoding, we can deal with arbitrarily long sequences at test time.


Transformer (Encoder)

Positional Encoding

  • With i = 0, …, d_model/2 - 1:
    • 2i/d_model gradually increases from 0 to 1
    • 1/10000^(2i/d_model) gradually decreases from 1 to 0
    • With higher i (rear indices), the input pos/10000^(2i/d_model) changes slowly. With lower i (front indices), it changes more frequently.
  • No two different positions have the same encoding.
  • Adjacent pos values (the order of a word in the sentence) have similar positional encodings.
    • There is no absolute binding between the position of a word and its role in the sentence. (With a longer subject, the verb is pushed to a later index.)
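For reference, the sinusoidal encoding being described is the standard one from the original Transformer paper:

\[
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
\]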

[Figure: the positional encoding matrix, indexed by pos (word order) and i (embedding dimension), of width d_model.]


Transformer (Decoder)

Step 4: Decoder Input

  • Given Z = {z_1, …, z_n} (the encoder output), the decoder generates an output sequence auto-regressively.
    • It consumes the previously generated symbols as additional input when generating the next one.
  • Positional encoding is applied in the same manner as in the encoder.


Transformer (Decoder)

Step 5: Masked Multi-head Self-attention

  • The input sequence (here, the output generated so far) is fed into a multi-head self-attention layer as in the encoder.
  • Since we have no idea what comes after the current time step, the future positions are masked out (see the sketch below).
  • Other than this masking, this is exactly the same as the multi-head self-attention layer in the encoder:
    • Each token in the input is contextualized and transformed.
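A minimal sketch of the masking, assuming S is a (T, T) matrix of query-key scores for the T tokens generated so far; keys after each query's time step are set to -inf before the softmax so they receive zero weight:

    import numpy as np

    def masked_attention_weights(S):
        T = S.shape[0]
        # mask out keys that lie after each query's time step
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        S = np.where(future, -np.inf, S)
        # row-wise softmax: each query's weights sum to 1 over the allowed positions
        S = S - S.max(axis=-1, keepdims=True)
        W = np.exp(S)
        return W / W.sum(axis=-1, keepdims=True)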


Transformer (Decoder)

Step 6: Encoder-Decoder Attention

  • Now, the decoder input attends to the encoder output:
    • Q: the query from the decoder
    • K, V: the key and value from the encoder
    • Other than this, it is the same as a multi-head self-attention layer (see the sketch below).
    • No masking in this layer, as it is okay (and necessary) to look at the entire encoded sequence.
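A minimal sketch of this encoder-decoder attention, assuming decoder-side inputs D of shape (T, d), encoder output Z of shape (n, d), and hypothetical projection matrices as parameters:

    import numpy as np

    def encoder_decoder_attention(D, Z, W_Q, W_K, W_V):
        Q = D @ W_Q            # queries come from the decoder
        K = Z @ W_K            # keys come from the encoder output
        V = Z @ W_V            # values come from the encoder output
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        # no masking: every decoder position may attend to the entire encoded sequence
        scores = scores - scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A = A / A.sum(axis=-1, keepdims=True)
        return A @ V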


Transformer (Decoder)

Step 7: Feed-forward Layer

  • Same as encoder
  • Residual, layer normalization: same as encoder
  • N stacked blocks: same as encoder
    • The last layer's output is fed back as input at the next time step.

Step 8: Linear Layer

  • Maps the output embedding to class scores.
    • Output size: equal to the vocabulary size


Transformer (Decoder)

Step 9: Softmax Layer

  • Applies softmax to the class scores to turn them into a probability distribution.
  • These scores (which sum to 1) are compared with the one-hot-encoded ground truth.
    • Then we backpropagate with some loss (e.g., cross-entropy); see the sketch below.
  • These decoding steps are repeated until the next word is predicted as [EOS] (End of Sentence).
  • The output sentence may be chosen greedily (always taking the top one), or the decision may be deferred over top-k choices (called beam search).
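A minimal sketch of this step for a single output position, assuming logits is the vocabulary-sized class-score vector and target is the index of the ground-truth token:

    import numpy as np

    def softmax_cross_entropy(logits, target):
        # softmax: turn the class scores into a probability distribution (sums to 1)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # cross-entropy against the one-hot ground truth reduces to -log p[target]
        return -np.log(p[target])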


Bidirectional Encoder Representations from Transformers (BERT)


BERT

  • Bidirectional Encoder Representations from Transformers
    • Large-scale pre-training of word embeddings using the Transformer encoder
    • Self-supervised: no human labeling required
    • Uses the encoder only (bi-directional; no causal masking of future tokens)
  • https://arxiv.org/pdf/1810.04805.pdf


BERT

  • The input sequence consists of two sentences; each token's representation is the sum of three things:
    • Token embedding: a pre-trained word embedding (WordPiece)
      • [CLS]: classification token, always placed at the beginning. The final hidden state for this token is used as the aggregate sequence representation for classification tasks.
      • [SEP]: separator token, used to mark the end of a sentence
    • Segment embedding: a learned embedding indicating which sentence each token belongs to
    • Position embedding: a learned embedding for each position (see the sketch below)
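A minimal sketch of building this input representation, assuming token_emb, segment_emb, and pos_emb are embedding lookup tables (numpy arrays) and token_ids / segment_ids are integer index arrays for the packed sentence pair:

    import numpy as np

    def bert_input(token_ids, segment_ids, token_emb, segment_emb, pos_emb):
        positions = np.arange(len(token_ids))
        # the encoder input is the elementwise sum of the three embeddings
        return token_emb[token_ids] + segment_emb[segment_ids] + pos_emb[positions]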


BERT

  • Training task 1: Masked Language Modeling (MLM)
    • Similar to sentence completion in standard English exams: figuring out the hidden words using the context.
    • Mask 15% of the tokens at random (substituting them with a special [MASK] token).
    • Classify the output embedding at these positions over the vocabulary (see the sketch below).
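A minimal sketch of the masking step, assuming mask_id is the vocabulary id of the [MASK] token; the paper's refinement of sometimes keeping or randomizing the selected tokens is omitted:

    import numpy as np

    def apply_mlm_mask(token_ids, mask_id, mask_prob=0.15, rng=None):
        rng = rng or np.random.default_rng()
        token_ids = np.asarray(token_ids)
        # pick roughly 15% of the positions at random
        is_masked = rng.random(len(token_ids)) < mask_prob
        corrupted = np.where(is_masked, mask_id, token_ids)
        # the loss is computed only at the masked positions,
        # classifying the output embedding there over the vocabulary
        return corrupted, is_masked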


BERT

  • Training task 2: Next Sentence Prediction (NSP)
    • A binary classification problem, predicting if the two sentences in the input are consecutive or not.
    • Half of training data contains two consecutive sentences (B is the actual next sentence of A).
    • The other half contains two sentences randomly chosen from the corpus.
  • According to the authors, their model achieved ~98% accuracy on this task, and this was very beneficial to multiple downstream tasks.
    • Later, this turned out to be less important than MLM.
  • These days, pre-trained BERT is a default choice for word embeddings.


Transformer for Image Data


ViT: Vision Transformer

  • The standard Transformer model is directly applied to images:
    • An image is split into 16×16 patches. (Each token is a 16×16 image patch instead of a word.)
    • The sequence of linear embeddings of these patches is fed into a Transformer.
    • Image patches are treated in the same way as tokens (words).
    • Eventually, an MLP is added on top of the [CLS] token to classify the input image (see the sketch below).
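A minimal sketch of the patch-embedding step, assuming an image array of shape (H, W, C) with H and W divisible by the patch size, and a hypothetical projection matrix E of shape (P*P*C, D):

    import numpy as np

    def patch_embed(image, E, P=16):
        H, W, C = image.shape
        patches = []
        # split the image into non-overlapping P x P patches, row by row
        for r in range(0, H, P):
            for c in range(0, W, P):
                patches.append(image[r:r + P, c:c + P, :].reshape(-1))
        X = np.stack(patches)          # (num_patches, P*P*C)
        return X @ E                   # linear embedding of each patch: (num_patches, D)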


ViT: Vision Transformer

[Figure: ViT overview. The input image is split into numbered patches (1-9). Input tokens are P×P patches (P = 16 or 32); each patch x_p^i is embedded by a linear projection E: P²·C → D (C = 3, D = 1024). A learnable classification token x_class ([CLS]) is prepended, a positional encoding E_pos: (N+1) → D is added (N = #patches, D = 1024), and the sequence passes through a Transformer encoder of multi-head self-attention (MSA) and multilayer perceptron (MLP) blocks, with an MLP head on top for classification.]


ViT: Position Embeddings

  • ViT learns to encode distance within the image in the similarity of position embeddings.
    • Closer patches tend to have more similar position embeddings.
  • A row-column structure emerges.
    • Patches in the same row/column have similar embeddings, automatically learned from the data.
  • Hand-crafted 2D-aware embedding variants do not yield improvements for this reason.
