1 of 32

MSAIL Sign-In!

https://forms.gle/6mM42WjKts18Vvta8

2 of 32

Attention and Transformers

MSAIL Reading Group

3/15/2022

Nisreen Bahrainwala

3 of 32

Agenda

  1. Intro to Machine Translation
  2. RNNs
  3. Encoder-Decoder model
  4. Attention
  5. Self-Attention
  6. Multi-head Attention
  7. Transformers

4 of 32

What is Machine Translation?

Human Translation

  • Hinges on a human understanding the syntactic and semantic rules of both languages
  • Uses techniques such as transposition and modulation to find the correct words

Machine Translation

  • Find the target sentence y that maximizes the conditional probability of y given the source sentence x
  • y* = argmax_y p(y | x) (see the formulation below)
  • The machine doesn’t “understand” anything
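
A compact restatement of the objective above, written with the standard chain-rule factorization over target words (the notation y_t, y_{<t} is assumed here, not taken from the slide):

```latex
% The target sentence y* maximizes p(y | x) for a source sentence x; the
% conditional probability factorizes into ordered conditionals over target words.
\[
  y^{*} \;=\; \arg\max_{y} \; p(y \mid x)
        \;=\; \arg\max_{y} \; \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, x\right)
\]
```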

5 of 32

Where it all started: Recurrent Neural Networks (RNNs)

  • Used to model sequenced data (sentences)
  • Similar to a feed-forward network
  • Ability to keep track of the “recent past” through a hidden state that is updated at every time step (see the sketch below)
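
A minimal sketch of the recurrence that gives an RNN its memory of the “recent past”; the weight names, sizes, and toy data below are illustrative assumptions, not from the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Unroll over a toy sentence of 4 word embeddings (dim 8), hidden size 16.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(16, 8))
W_hh = rng.normal(size=(16, 16))
b_h = np.zeros(16)

h = np.zeros(16)                       # hidden state starts empty
for x_t in rng.normal(size=(4, 8)):    # one embedding per time step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                         # (16,) -- carries information about the sequence so far
```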

6 of 32

Encoder - Decoder Model with RNNs

Encoder: the end result is a vector with all the information from all the hidden states

Decoder: the end result is a vector that has decomposed the joint probability into ordered conditionals (see the sketch below)
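
A hedged sketch of the plain RNN encoder-decoder described above, assuming the functions, weight names, and toy sizes below (they are not from the slides): the encoder compresses the source into a single context vector c, and the decoder factors p(y | x) into ordered conditionals p(y_t | y_<t, c).

```python
import numpy as np

def encode(src_embeddings, W_xh, W_hh):
    """Run the encoder RNN; the final hidden state is the context vector c."""
    h = np.zeros(W_hh.shape[0])
    for x_t in src_embeddings:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h

def decode_step(y_prev, s_prev, c, W_yh, W_sh, W_ch, W_out):
    """One decoder step: update state s and emit p(y_t | y_<t, c)."""
    s = np.tanh(W_yh @ y_prev + W_sh @ s_prev + W_ch @ c)
    logits = W_out @ s
    probs = np.exp(logits - logits.max())
    return s, probs / probs.sum()

rng = np.random.default_rng(1)
c = encode(rng.normal(size=(5, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 16)))
s, p_y = decode_step(np.zeros(8), np.zeros(16), c, rng.normal(size=(16, 8)),
                     rng.normal(size=(16, 16)), rng.normal(size=(16, 16)),
                     rng.normal(size=(100, 16)))
print(c.shape, p_y.sum())  # (16,) and ~1.0: a distribution over a toy 100-word vocabulary
```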

7 of 32

8 of 32

9 of 32

10 of 32

Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio - ICLR 2015

11 of 32

Encoder and Decoder (new and improved)

Encoder:

- Same as the RNN encoder, but bi-directional (unrolls the sentence forwards and backwards)

- Produces annotations: each annotation contains information about the words surrounding the i-th word

Decoder:

- Computes a context vector from the annotations (see the sketch below)

- Key difference: the probability is conditioned on a distinct context vector c_i for each target word y_i
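
A hedged sketch of how a distinct context vector c_i can be built for one decoder step: score each annotation against the previous decoder state, softmax the scores into attention weights, and take the weighted sum. This uses the additive scoring form from the Bahdanau et al. paper; all weight names and sizes below are illustrative assumptions.

```python
import numpy as np

def context_vector(s_prev, annotations, W_s, W_h, v):
    """Bahdanau-style additive attention for one decoder step."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in annotations])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()              # attention weights over source positions
    return alphas @ annotations         # c_i: weighted sum of the annotations

rng = np.random.default_rng(2)
annotations = rng.normal(size=(5, 16))  # one annotation per source word
s_prev = np.zeros(16)                   # previous decoder hidden state
W_s, W_h, v = rng.normal(size=(16, 16)), rng.normal(size=(16, 16)), rng.normal(size=16)
print(context_vector(s_prev, annotations, W_s, W_h, v).shape)  # (16,)
```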

12 of 32

13 of 32

General Outline of Attention

14 of 32

Basic Dot Product

General Intuition

“Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query”

“Query attends to the values”

In the example:

Each decoder hidden state (query) attends to all the encoder hidden states (values).
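
A minimal sketch of that intuition with basic dot-product scoring, under the assumption that the query is one decoder hidden state and the values are the encoder hidden states (names and sizes are illustrative):

```python
import numpy as np

def dot_product_attention(query, values):
    """The query attends to the values: softmax(q . v_j) weights their sum."""
    scores = values @ query                    # one dot-product score per value
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the scores
    return weights @ values                    # weighted sum of the values

values = np.random.default_rng(3).normal(size=(6, 8))  # 6 encoder hidden states
query = np.random.default_rng(4).normal(size=8)        # 1 decoder hidden state
print(dot_product_attention(query, values).shape)      # (8,)
```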

15 of 32

16 of 32

17 of 32

18 of 32

19 of 32

20 of 32

21 of 32

Attention is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin

2017

22 of 32

Main Ideas

  • Reduce computational complexity per layer
  • Increase the amount of computation that can be parallelized
  • Reduce path length between long-range dependencies in the network

23 of 32

Key Idea: No more RNNs

Encoder:

  • 6 identical layers
  • Each layer has 2 sub-layers (see the sketch below)
    • Multi-head self-attention
    • Fully connected feed-forward network

Decoder:

  • 6 identical layers
  • Each layer has 3 sub-layers
    • The first two are the same as in the encoder (with the self-attention masked)
    • A third multi-head attention sub-layer over the output of the encoder stack
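
A hedged sketch of one encoder layer’s structure: sub-layer 1 (self-attention) and sub-layer 2 (feed-forward), each wrapped in a residual connection followed by layer normalization. The self-attention sub-layer is passed in as a stand-in callable, and all shapes and weights are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, self_attention, W1, b1, W2, b2):
    """One encoder layer: residual + norm around each of the two sub-layers."""
    x = layer_norm(x + self_attention(x))            # sub-layer 1: multi-head self-attention
    ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2       # sub-layer 2: ReLU feed-forward network
    return layer_norm(x + ffn)

rng = np.random.default_rng(5)
x = rng.normal(size=(5, 8))                          # 5 tokens, model dim 8
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(encoder_layer(x, lambda t: t, W1, b1, W2, b2).shape)  # (5, 8); identity stands in for attention
```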

24 of 32

Self Attention - Overview

  1. Create a Query (q), Key (k), and Value (v) vector for each word in the input to the encoder
  2. Calculate the score: Score(q, k) = q · k (dot product)
  3. This score tells us how much focus to give other parts of the sentence w.r.t. the word we are encoding
  4. Apply scaling (divide by √d_k) and pass the scores through a softmax function
  5. Multiply each value vector by its softmax score
  6. Sum up all the weighted value vectors from step 5 (see the sketch below)
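
A hedged sketch of the six steps above applied to a whole (toy) sentence at once; the weight matrices, dimensions, and random inputs are illustrative assumptions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # step 1: q, k, v for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # steps 2-4: dot-product scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)       # step 4: softmax over each row
    return weights @ V                              # steps 5-6: weight the values and sum them

rng = np.random.default_rng(6)
X = rng.normal(size=(4, 8))                         # 4 words, embedding dim 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)       # (4, 8): one output vector per word
```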

25 of 32

26 of 32

27 of 32

28 of 32

29 of 32

Multi-head Attention

  • Map the Q, K, V vectors to lower-dimensional spaces, once per head (see the sketch below)
  • Allows the attention mechanism to take different “paths” through the sentence
  • Masking in the decoder ensures that words that come after the i-th word aren’t taken into account when decoding
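
A hedged sketch of multi-head attention: each head projects Q, K, V into a lower-dimensional space, attends there, and the head outputs are concatenated (the paper also applies a final output projection, omitted here). Names and sizes are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) projections, one set per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # lower-dimensional Q, K, V for this head
        scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot-product attention per head
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)          # concatenate the heads' outputs

rng = np.random.default_rng(7)
X = rng.normal(size=(4, 8))                          # 4 words, model dim 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]  # 2 heads, head dim 4
print(multi_head_attention(X, heads).shape)          # (4, 8) = 2 heads x head dim 4
```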

30 of 32

31 of 32

32 of 32

THANKS

Please keep this slide for attribution

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik