1 of 39

Assignment 4 and Transformers

CSE 447 / 517

FEB 20TH, 2025 (WEEK 7)

2 of 39

Logistics

  • Assignment 4 (A4) is due on Friday, 2/21 (Extension)
  • Project Checkpoint 3 is due on Monday, 3/03

3 of 39

Agenda

  • Assignment 4
  • Transformers
    • Self Attention
    • Multi-Headed Attention
    • Transformer Architecture
    • Language + Vision Models

4 of 39

Assignment 4

5 of 39

N-Gram Word-Level Language Models

  • Implementing N-gram Language models

  • Fit
    • Create a vocabulary of words
    • Count N-grams
    • Record the possible single-word continuations of each (N-1)-gram in the training data (see the sketch below)
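A minimal sketch of the fit step, assuming sentences are already tokenized into lists of words; the names fit, ngram_counts, context_counts, and continuations are illustrative, not the assignment's required API:

from collections import Counter, defaultdict

def fit(sentences, n):
    # Build the vocabulary and count N-grams and (N-1)-gram contexts.
    vocab = set()
    ngram_counts = Counter()          # N-gram tuple -> count
    context_counts = Counter()        # (N-1)-gram tuple -> count
    continuations = defaultdict(set)  # (N-1)-gram tuple -> set of possible next words
    for words in sentences:
        vocab.update(words)
        for i in range(n - 1, len(words)):
            context = tuple(words[i - n + 1:i])   # the N-1 words before index i
            ngram_counts[context + (words[i],)] += 1
            context_counts[context] += 1
            continuations[context].add(words[i])
    return vocab, ngram_counts, context_counts, continuations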

6 of 39

N-Gram Word-Level Language Models

  • Eval Perplexity
    • For each sentence in eval_data_processed
      • Get log-probability
      • Count the number of words
    • Calculate perplexity as exp(-total log-probability / total number of words) (sketch below)
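A sketch of the perplexity loop under the same assumptions; get_log_prob is the per-sentence log-probability routine described on the next slide:

import math

def eval_perplexity(eval_data_processed, get_log_prob):
    # perplexity = exp(-total log-probability / total number of words)
    total_log_prob = 0.0
    total_words = 0
    for words in eval_data_processed:
        total_log_prob += get_log_prob(words)
        total_words += len(words)
    return math.exp(-total_log_prob / total_words)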

7 of 39

N-Gram Word-Level Language Models

Get log-probability

For each index from N-1 to length of words

      • Get the n-1 gram before current index, and the word at current index
      • Get the counts for both
      • Accumulate log of (ngram count/ n-1 gram count)
      • Use probability 0 if the N-gram doesn’t exist; log(0) evaluates to -inf and raises NumPy’s divide-by-zero RuntimeWarning, which is expected here

Tip: helper function get_word_prob takes an (N-1)-gram and a next word, and returns the probability of that word completing the N-gram (see the sketch below).
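A sketch of the helper and the log-probability loop under the same assumptions (NumPy's log is used so that a probability of 0 yields -inf with the expected RuntimeWarning rather than an exception):

import numpy as np

def get_word_prob(context, word, ngram_counts, context_counts):
    # P(word | context) = count(context + word) / count(context); 0.0 if unseen
    context_count = context_counts.get(context, 0)
    if context_count == 0:
        return 0.0
    return ngram_counts.get(context + (word,), 0) / context_count

def get_log_prob(words, n, ngram_counts, context_counts):
    # Sum log P(word_i | previous N-1 words) over the whole sentence.
    log_prob = 0.0
    for i in range(n - 1, len(words)):
        context = tuple(words[i - n + 1:i])
        p = get_word_prob(context, words[i], ngram_counts, context_counts)
        # np.log(0.0) = -inf and emits the divide-by-zero RuntimeWarning mentioned above.
        log_prob += np.log(p)
    return log_prob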

8 of 39

N-Gram Word-Level Language Models

  • Sample Text
    • While the number of generated words is less than max words (see the sketch below)
      • Get the next-word probabilities for the current (N-1)-gram
      • Normalize the probabilities
      • Sample the next word using the normalized probabilities
      • Append it and increment the count of generated words
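A sketch of the sampling loop, assuming the get_next_word_probs helper described on the next slide and a seed context of at least N-1 words (names are illustrative):

import numpy as np

def sample_text(seed, n, ngram_counts, context_counts, continuations, max_words=20):
    # Repeatedly sample the next word from P(word | last N-1 words).
    generated = list(seed)
    num_generated = 0
    while num_generated < max_words:
        context = tuple(generated[-(n - 1):]) if n > 1 else ()
        probs = get_next_word_probs(context, ngram_counts, context_counts, continuations)
        if not probs:
            break  # no observed continuation for this context
        words = list(probs)
        p = np.array([probs[w] for w in words])
        p = p / p.sum()                        # normalize so the weights sum to 1
        generated.append(np.random.choice(words, p=p))
        num_generated += 1
    return " ".join(generated)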

9 of 39

N-Gram Word-Level Language Models

Get next word probability

    • Look up the possible next words for the (N-1)-gram (the dictionary built during fit maps each (N-1)-gram to its possible next words)
    • Loop through the possible next words
      • Compute each probability as count(N-gram) / count((N-1)-gram)

Tip: helper function get_next_word_probs takes an (N-1)-gram and returns a dictionary mapping each word to its probability of following that (N-1)-gram (see the sketch below).
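A sketch of get_next_word_probs under the same assumptions, using the continuations dictionary built during fit:

def get_next_word_probs(context, ngram_counts, context_counts, continuations):
    # Map each possible next word to count(context + word) / count(context).
    context_count = context_counts.get(context, 0)
    if context_count == 0:
        return {}
    return {word: ngram_counts[context + (word,)] / context_count
            for word in continuations.get(context, ())}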

10 of 39

Laplace and Add-Lambda Smoothing

  • Eval Perplexity, Sample text, same structure as before

Get log-probability

For each index from N-1 to length of words

      • Get the n-1 gram before current index, and the word at current index
      • Get the counts for both
      • Accumulate log of ((N-gram count + k) / ((N-1)-gram count + k * vocabulary size))
      • Use a count of 0 if the N-gram doesn’t exist; with smoothing the probability stays positive, so there is no divide-by-zero RuntimeWarning here

11 of 39

Laplace and Add-Lambda Smoothing

Get next word probability

    • Look up the possible next words for the (N-1)-gram (the dictionary built during fit maps each (N-1)-gram to its possible next words)
    • Loop through the possible next words
      • Compute each probability as (N-gram count + k) / ((N-1)-gram count + k * vocabulary size)
  • What if N = 1? The (N-1)-gram is empty, so its count is the total number of words in the training data (see the sketch below)
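A sketch of the add-k version of the probability (Laplace smoothing when k = 1); for N = 1 the context is empty, so its count is the total number of training tokens, passed here as a hypothetical total_words argument:

def get_word_prob_smoothed(context, word, ngram_counts, context_counts,
                           vocab, total_words, k=1.0):
    # Add-k estimate: (count(context + word) + k) / (count(context) + k * |V|)
    if len(context) == 0:
        context_count = total_words          # N = 1: the "(N-1)-gram" is empty
    else:
        context_count = context_counts.get(context, 0)
    ngram_count = ngram_counts.get(context + (word,), 0)
    return (ngram_count + k) / (context_count + k * len(vocab))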

12 of 39

Transformers

Slides adapted from Hichem Felouat - hichemfel@nii.ac.jp - 2024

13 of 39

Self Attention

  • The following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired."
  • What does "it" in this sentence refer to?
  • Is it referring to the street or to the animal? It’s a simple question to a human but not as simple to an algorithm.
  • When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

14 of 39

Self Attention

  • As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes part of its representation into the encoding of "it".

15 of 39

Self Attention

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
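For reference, in the notation of Attention Is All You Need (WQ, WK, WV are the learned projection matrices and d_k is the key dimension):

  q_i = x_i WQ,   k_i = x_i WK,   v_i = x_i WV
  z_i = sum over j of softmax_j(q_i · k_j / sqrt(d_k)) · v_j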

16 of 39

Self Attention

Dot product

17 of 39

Self Attention

  • Every row in the X matrix corresponds to a word in the input sentence.

18 of 39

Attention Code
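The code from this slide isn't reproduced in the text export; below is a minimal NumPy sketch of scaled dot-product self-attention (illustrative only, not the slide's original listing):

import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted for numerical stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention scores
    return softmax(scores) @ V                # each row is a weighted sum of value vectors

# Tiny usage example with random weights:
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 tokens, d_model = 8
W_Q, W_K, W_V = [rng.normal(size=(8, 8)) for _ in range(3)]
Z = self_attention(X, W_Q, W_K, W_V)                      # (4, 8) contextualized outputs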

19 of 39

Multi-Headed Attention

  • Multi-Headed Attention improves the performance of the attention layer in two ways:
  • It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.
  • It gives the attention layer multiple representation subspaces.
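Concretely, following Attention Is All You Need, each head runs attention with its own projections, and the head outputs are concatenated and projected back to the model dimension:

  head_i = Attention(X WQ_i, X WK_i, X WV_i)
  MultiHead(X) = Concat(head_1, ..., head_h) WO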

20 of 39

Multi-Headed Attention

21 of 39

Multi-Headed Attention

As we encode the word "it", one attention head focuses most on "the animal" while another focuses on "tired"; in a sense, the model's representation of "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things can be harder to interpret.

22 of 39

Transformer

Positional Encoding:

The transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
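For reference, the fixed sinusoidal encoding from Attention Is All You Need is one common choice (learned positional embeddings are another). For position pos and embedding dimension index i:

  PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
  PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))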

23 of 39

Transformer

24 of 39

Transformer

25 of 39

Transformer

Attention Is All You Need

https://arxiv.org/abs/1706.03762

26 of 39

Transformer

27 of 39

Transformer


28 of 39

Vision Transformer (ViT)

Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained from scratch on ImageNet.

29 of 39

Vision Transformer (ViT)

The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased towards recognizing textures rather than shapes; see [1] for an excellent example of such a case.

[1] ImageNet-Trained CNNs Are Biased Towards Texture: https://arxiv.org/pdf/1811.12231

30 of 39

Vision Transformer (ViT) vs CNNs

  • Neuroscience studies (The importance of shape in early lexical learning [1]) showed that object shape is the single most important cue for human object recognition.
  • By studying the human visual pathway for object recognition, researchers found that the perception of object shape is invariant to most perturbations. So, as far as we know, shape is the most reliable cue.
  • Intuitively, the object shape remains relatively stable, while other cues can be easily distorted by all sorts of noise [2].

[1] https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7

[2] https://arxiv.org/abs/1811.12231

31 of 39

Vision Transformer (ViT) vs CNNs

Accuracies and example stimuli for five different experiments without cue conflict.

Source: https://arxiv.org/abs/1811.12231

32 of 39

Vision Transformer (ViT) vs CNNs

  • The texture is not sufficient for determining whether the zebra is rotated. Thus, predicting rotation requires modeling shape, to some extent.
  • The object's shape can be invariant to rotations.

33 of 39

Vision Transformer (ViT) vs CNNs

The self-attention captures long-range dependencies and contextual information in the input data.

The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand.

Raw images (Left) and attention maps of ViT-S/16 with (Right) and without (Middle).

https://arxiv.org/abs/2106.01548

34 of 39

Vision Transformer (ViT) vs CNNs

The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.

[1] Self-Supervised Vision Transformers with DINO: https://arxiv.org/abs/2104.14294

35 of 39

Large Language Models

36 of 39

Language + Vision Models

37 of 39

Language + Vision Models

38 of 39

More on Transformer Code

Follow Sasha Rush’s tutorial: https://nlp.seas.harvard.edu/annotated-transformer/

39 of 39

Questions?

  • Thank you!