1 of 32

MSAIL Sign-In!

https://forms.gle/6mM42WjKts18Vvta8

2 of 32

Attention and Transformers

MSAIL Reading Group

3/15/2022

Nisreen Bahrainwala

3 of 32

Agenda

  1. Intro to Machine Translation
  2. RNNs
  3. Encoder-Decoder model
  4. Attention
  5. Self-Attention
  6. Multi-head Attention
  7. Transformers

4 of 32

What is Machine Translation?

Human Translation

  • Hinges on a human understanding the syntactic and semantic rules of both languages
  • Uses techniques such as transposition and modulation to find the correct words

Machine Translation

  • Find the target sentence y that maximizes the conditional probability of y given the source sentence x
  • y* = argmax_y p(y | x) (see the formulation below)
  • The machine doesn’t “understand” anything
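
A compact restatement of the objective above, written with the standard chain-rule factorization over target words (the notation y_t, y_{<t} is assumed here, not taken from the slide):

```latex
% The target sentence y* maximizes p(y | x) for a source sentence x; the
% conditional probability factorizes into ordered conditionals over target words.
\[
  y^{*} \;=\; \arg\max_{y} \; p(y \mid x)
        \;=\; \arg\max_{y} \; \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, x\right)
\]
```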

5 of 32

Where it all started: Recurrent Neural Networks (RNNs)

  • Used to model sequenced data (sentences)
  • Similar to a feed-forward network
  • Ability to keep track of the “recent past” through a hidden state that is updated at every time step (see the sketch below)
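
A minimal sketch of the recurrence that gives an RNN its memory of the “recent past”; the weight names, sizes, and toy data below are illustrative assumptions, not from the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a toy sentence of 4 word embeddings (dim 8), hidden size 16.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(16, 8))
W_hh = rng.normal(size=(16, 16))
b_h = np.zeros(16)

h = np.zeros(16)                       # hidden state starts empty
for x_t in rng.normal(size=(4, 8)):    # one embedding per time step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                         # (16,) -- carries information about the sequence so far
```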

6 of 32

Encoder - Decoder Model with RNNs

Encoder: the end result is a vector with all the information from all the hidden states

Decoder: the end result is a vector that has decomposed the joint probability into ordered conditionals (see the sketch below)
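
A hedged sketch of the plain RNN encoder-decoder described above, assuming the functions, weight names, and toy sizes below (they are not from the slides): the encoder compresses the source into a single context vector c, and the decoder factors p(y | x) into ordered conditionals p(y_t | y_<t, c).

```python
import numpy as np

def encode(src_embeddings, W_xh, W_hh):
    """Run the encoder RNN; the final hidden state is the context vector c."""
    h = np.zeros(W_hh.shape[0])
    for x_t in src_embeddings:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h

def decode_step(y_prev, s_prev, c, W_yh, W_sh, W_ch, W_out):
    """One decoder step: update state s and emit p(y_t | y_<t, c)."""
    s = np.tanh(W_yh @ y_prev + W_sh @ s_prev + W_ch @ c)
    logits = W_out @ s
    probs = np.exp(logits - logits.max())
    return s, probs / probs.sum()

rng = np.random.default_rng(1)
c = encode(rng.normal(size=(5, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 16)))
s, p_y = decode_step(np.zeros(8), np.zeros(16), c, rng.normal(size=(16, 8)),
                     rng.normal(size=(16, 16)), rng.normal(size=(16, 16)),
                     rng.normal(size=(100, 16)))
print(c.shape, p_y.sum())  # (16,) and ~1.0: a distribution over a toy 100-word vocabulary
```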

7 of 32

8 of 32

9 of 32

10 of 32

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio - ICLR 2015

11 of 32

Encoder and Decoder (new and improved)

Encoder:

- Same as the RNN encoder, but bi-directional (unrolls the sentence forwards and backwards)

- Produces annotations: each annotation contains information about the words surrounding the i-th word

Decoder:

- Computes a context vector from the annotations (see the sketch below)

- Key difference: the probability is conditioned on a distinct context vector c_i for each target word y_i
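
A hedged sketch of how a distinct context vector c_i can be built for one decoder step: score each annotation against the previous decoder state, softmax the scores into attention weights, and take the weighted sum. This uses the additive scoring form from the Bahdanau et al. paper; all weight names and sizes below are illustrative assumptions.

```python
import numpy as np

def context_vector(s_prev, annotations, W_s, W_h, v):
    """Bahdanau-style additive attention for one decoder step."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in annotations])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()              # attention weights over source positions
    return alphas @ annotations         # c_i: weighted sum of the annotations

rng = np.random.default_rng(2)
annotations = rng.normal(size=(5, 16))  # one annotation per source word
s_prev = np.zeros(16)                   # previous decoder hidden state
W_s, W_h, v = rng.normal(size=(16, 16)), rng.normal(size=(16, 16)), rng.normal(size=16)
print(context_vector(s_prev, annotations, W_s, W_h, v).shape)  # (16,)
```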

12 of 32

13 of 32

General Outline of Attention

14 of 32

Basic Dot Product

General Intuition

“Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query”

“Query attends to the values”

In the example:

Each decoder hidden state (query) attends to all the encoder hidden states (values).
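
A minimal sketch of that intuition with basic dot-product scoring, under the assumption that the query is one decoder hidden state and the values are the encoder hidden states (names and sizes are illustrative):

```python
import numpy as np

def dot_product_attention(query, values):
    """The query attends to the values: softmax(q . v_j) weights their sum."""
    scores = values @ query                    # one dot-product score per value
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the scores
    return weights @ values                    # weighted sum of the values

values = np.random.default_rng(3).normal(size=(6, 8))  # 6 encoder hidden states
query = np.random.default_rng(4).normal(size=8)        # 1 decoder hidden state
print(dot_product_attention(query, values).shape)      # (8,)
```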

15 of 32

16 of 32

17 of 32

18 of 32

19 of 32

20 of 32

21 of 32

Attention is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin

2017

22 of 32

Main Ideas

  • Reduce computational complexity per layer
  • Increase the amount of computation that can be parallelized
  • Reduce path length between long-range dependencies in the network

23 of 32

Key Idea: No more RNNs

Encoder:

  • 6 identical layers
  • Each layer has 2 sub-layers (see the sketch below)
    • Multi-head self-attention
    • Fully connected feed-forward network

Decoder:

  • 6 identical layers
  • Each layer has 3 sub-layers
    • The first two are the same as in the encoder (with the self-attention masked)
    • A third multi-head attention sub-layer over the output of the encoder stack
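
A hedged sketch of one encoder layer’s structure: sub-layer 1 (self-attention) and sub-layer 2 (feed-forward), each wrapped in a residual connection followed by layer normalization. The self-attention sub-layer is passed in as a stand-in callable, and all shapes and weights are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, W1, b1, W2, b2):
    """One encoder layer: residual + norm around each of the two sub-layers."""
    x = layer_norm(x + self_attention(x))            # sub-layer 1: multi-head self-attention
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2       # sub-layer 2: ReLU feed-forward network
    return layer_norm(x + ffn)

rng = np.random.default_rng(5)
x = rng.normal(size=(5, 8))                          # 5 tokens, model dim 8
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(encoder_layer(x, lambda t: t, W1, b1, W2, b2).shape)  # (5, 8); identity stands in for attention
```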

24 of 32

Self Attention - Overview

  1. Create a Query (q), Key (k), and Value (v) vector for each word in the input to the encoder
  2. Calculate the score: Score(q, k) = q · k (dot product)
  3. This score tells us how much focus to give other parts of the sentence w.r.t. the word we are encoding
  4. Apply scaling (divide by √d_k) and pass the scores through a softmax function
  5. Multiply each value vector by its softmax score
  6. Sum up all the weighted value vectors from step 5 (see the sketch below)
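
A hedged sketch of the six steps above applied to a whole (toy) sentence at once; the weight matrices, dimensions, and random inputs are illustrative assumptions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # step 1: q, k, v for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # steps 2-4: dot-product scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # step 4: softmax over each row
    return weights @ V                              # steps 5-6: weight the values and sum them

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 8))                         # 4 words, embedding dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (4, 8): one output vector per word
```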

25 of 32

26 of 32

27 of 32

28 of 32

29 of 32

Multi-head Attention

  • Map the Q, K, V vectors to lower-dimensional spaces, once per head (see the sketch below)
  • Allows the attention mechanism to take different “paths” through the sentence
  • Masking in the decoder ensures that words that come after the i-th word aren’t taken into account when decoding
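
A hedged sketch of multi-head attention: each head projects Q, K, V into a lower-dimensional space, attends there, and the head outputs are concatenated (the paper also applies a final output projection, omitted here). Names and sizes are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) projections, one set per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # lower-dimensional Q, K, V for this head
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot-product attention per head
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)          # concatenate the heads' outputs

rng = np.random.default_rng(7)
X = rng.normal(size=(4, 8))                          # 4 words, model dim 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]  # 2 heads, head dim 4
print(multi_head_attention(X, heads).shape)          # (4, 8) = 2 heads x head dim 4
```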

30 of 32

31 of 32

32 of 32

THANKS

Please keep this slide for attribution

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik