1 of 39

Assignment 4 and Transformers

CSE 447 / 517

FEB 20TH, 2025 (WEEK 7)

2 of 39

Logistics

  • Assignment 4 (A4) is due on Friday, 2/21 (Extension)
  • Project Checkpoint 3 is due on Monday, 3/03

3 of 39

Agenda

  • Assignment 4
  • Transformers
    • Self Attention
    • Multi-Headed Attention
    • Transformer Architecture
    • Language + Vision Models

4 of 39

Assignment 4

5 of 39

N-Gram Word-Level Language Models

  • Implementing N-gram Language models

  • Fit
    • Create a vocabulary of words
    • Count N-grams
    • Record the possible single-word continuations of each (N-1)-gram in the training data (see the sketch below)
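A minimal sketch of the fit step, assuming sentences are already tokenized into lists of words; the names fit, ngram_counts, context_counts, and continuations are illustrative, not the assignment's required API:

from collections import Counter, defaultdict

def fit(sentences, n):
    # Build the vocabulary and count N-grams and (N-1)-gram contexts.
    vocab = set()
    ngram_counts = Counter()          # N-gram tuple -> count
    context_counts = Counter()        # (N-1)-gram tuple -> count
    continuations = defaultdict(set)  # (N-1)-gram tuple -> set of possible next words
    for words in sentences:
        vocab.update(words)
        for i in range(n - 1, len(words)):
            context = tuple(words[i - n + 1:i])   # the N-1 words before index i
            ngram_counts[context + (words[i],)] += 1
            context_counts[context] += 1
            continuations[context].add(words[i])
    return vocab, ngram_counts, context_counts, continuations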

6 of 39

N-Gram Word-Level Language Models

  • Eval Perplexity
    • For each sentence in eval_data_processed
      • Get log-probability
      • Count the number of words
    • Calculate perplexity as exp(-total log-probability / total number of words) (sketch below)
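A sketch of the perplexity loop under the same assumptions; get_log_prob is the per-sentence log-probability routine described on the next slide:

import math

def eval_perplexity(eval_data_processed, get_log_prob):
    # perplexity = exp(-total log-probability / total number of words)
    total_log_prob = 0.0
    total_words = 0
    for words in eval_data_processed:
        total_log_prob += get_log_prob(words)
        total_words += len(words)
    return math.exp(-total_log_prob / total_words)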

7 of 39

N-Gram Word-Level Language Models

Get log-probability

For each index from N-1 to length of words

      • Get the n-1 gram before current index, and the word at current index
      • Get the counts for both
      • Accumulate log of (ngram count/ n-1 gram count)
      • Use probability 0 if the N-gram doesn’t exist; log(0) evaluates to -inf and raises NumPy’s divide-by-zero RuntimeWarning, which is expected here

Tip: helper function get_word_prob takes an (N-1)-gram and a next word, and returns the probability of that word completing the N-gram (see the sketch below).
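A sketch of the helper and the log-probability loop under the same assumptions (NumPy's log is used so that a probability of 0 yields -inf with the expected RuntimeWarning rather than an exception):

import numpy as np

def get_word_prob(context, word, ngram_counts, context_counts):
    # P(word | context) = count(context + word) / count(context); 0.0 if unseen
    context_count = context_counts.get(context, 0)
    if context_count == 0:
        return 0.0
    return ngram_counts.get(context + (word,), 0) / context_count

def get_log_prob(words, n, ngram_counts, context_counts):
    # Sum log P(word_i | previous N-1 words) over the whole sentence.
    log_prob = 0.0
    for i in range(n - 1, len(words)):
        context = tuple(words[i - n + 1:i])
        p = get_word_prob(context, words[i], ngram_counts, context_counts)
        # np.log(0.0) = -inf and emits the divide-by-zero RuntimeWarning mentioned above.
        log_prob += np.log(p)
    return log_prob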

8 of 39

N-Gram Word-Level Language Models

  • Sample Text
    • While the number of generated words is less than max words (see the sketch below)
      • Get the next-word probabilities for the current (N-1)-gram
      • Normalize the probabilities
      • Sample the next word using the normalized probabilities
      • Append it and increment the count of generated words
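A sketch of the sampling loop, assuming the get_next_word_probs helper described on the next slide and a seed context of at least N-1 words (names are illustrative):

import numpy as np

def sample_text(seed, n, ngram_counts, context_counts, continuations, max_words=20):
    # Repeatedly sample the next word from P(word | last N-1 words).
    generated = list(seed)
    num_generated = 0
    while num_generated < max_words:
        context = tuple(generated[-(n - 1):]) if n > 1 else ()
        probs = get_next_word_probs(context, ngram_counts, context_counts, continuations)
        if not probs:
            break  # no observed continuation for this context
        words = list(probs)
        p = np.array([probs[w] for w in words])
        p = p / p.sum()                        # normalize so the weights sum to 1
        generated.append(np.random.choice(words, p=p))
        num_generated += 1
    return " ".join(generated)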

9 of 39

N-Gram Word-Level Language Models

Get next word probability

    • Look up the possible next words for the (N-1)-gram (the dictionary built during fit maps each (N-1)-gram to its possible next words)
    • Loop through the possible next words
      • Compute each probability as count(N-gram) / count((N-1)-gram)

Tip: helper function get_next_word_probs takes an (N-1)-gram and returns a dictionary mapping each word to its probability of following that (N-1)-gram (see the sketch below).
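A sketch of get_next_word_probs under the same assumptions, using the continuations dictionary built during fit:

def get_next_word_probs(context, ngram_counts, context_counts, continuations):
    # Map each possible next word to count(context + word) / count(context).
    context_count = context_counts.get(context, 0)
    if context_count == 0:
        return {}
    return {word: ngram_counts[context + (word,)] / context_count
            for word in continuations.get(context, ())}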

10 of 39

Laplace and Add-Lambda Smoothing

  • Eval Perplexity, Sample text, same structure as before

Get log-probability

For each index from N-1 to length of words

      • Get the n-1 gram before current index, and the word at current index
      • Get the counts for both
      • Accumulate log of ((N-gram count + k) / ((N-1)-gram count + k * vocabulary size))
      • Use a count of 0 if the N-gram doesn’t exist; with smoothing the probability stays positive, so there is no divide-by-zero RuntimeWarning here

11 of 39

Laplace and Add-Lambda Smoothing

Get next word probability

    • Look up the possible next words for the (N-1)-gram (the dictionary built during fit maps each (N-1)-gram to its possible next words)
    • Loop through the possible next words
      • Compute each probability as (N-gram count + k) / ((N-1)-gram count + k * vocabulary size)
  • What if N = 1? The (N-1)-gram is empty, so its count is the total number of words in the training data (see the sketch below)
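A sketch of the add-k version of the probability (Laplace smoothing when k = 1); for N = 1 the context is empty, so its count is the total number of training tokens, passed here as a hypothetical total_words argument:

def get_word_prob_smoothed(context, word, ngram_counts, context_counts,
                           vocab, total_words, k=1.0):
    # Add-k estimate: (count(context + word) + k) / (count(context) + k * |V|)
    if len(context) == 0:
        context_count = total_words          # N = 1: the "(N-1)-gram" is empty
    else:
        context_count = context_counts.get(context, 0)
    ngram_count = ngram_counts.get(context + (word,), 0)
    return (ngram_count + k) / (context_count + k * len(vocab))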

12 of 39

Transformers

Slides adapted from Hichem Felouat - hichemfel@nii.ac.jp - 2024

13 of 39

Self Attention

  • The following sentence is an input sentence we want to translate: "The animal didn't cross the street because it was too tired."
  • What does "it" in this sentence refer to?
  • Is it referring to the street or to the animal? It’s a simple question to a human but not as simple to an algorithm.
  • When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

14 of 39

Self Attention

  • As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes part of its representation into the encoding of "it".

15 of 39

Self Attention

Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
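For reference, in the notation of Attention Is All You Need (WQ, WK, WV are the learned projection matrices and d_k is the key dimension):

  q_i = x_i WQ,   k_i = x_i WK,   v_i = x_i WV
  z_i = sum over j of softmax_j(q_i · k_j / sqrt(d_k)) · v_j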

16 of 39

Self Attention

Dot product

17 of 39

Self Attention

  • Every row in the X matrix corresponds to a word in the input sentence.

18 of 39

Attention Code
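The code from this slide isn't reproduced in the text export; below is a minimal NumPy sketch of scaled dot-product self-attention (illustrative only, not the slide's original listing):

import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted for numerical stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # X: (seq_len, d_model); W_Q, W_K, W_V: (d_model, d_k)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) attention scores
    return softmax(scores) @ V                # each row is a weighted sum of value vectors

# Tiny usage example with random weights:
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                               # 4 tokens, d_model = 8
W_Q, W_K, W_V = [rng.normal(size=(8, 8)) for _ in range(3)]
Z = self_attention(X, W_Q, W_K, W_V)                      # (4, 8) contextualized outputs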

19 of 39

Multi-Headed Attention

  • Multi-Headed Attention improves the performance of the attention layer in two ways:
  • It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself.
  • It gives the attention layer multiple representation subspaces.
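Concretely, following Attention Is All You Need, each head runs attention with its own projections, and the head outputs are concatenated and projected back to the model dimension:

  head_i = Attention(X WQ_i, X WK_i, X WV_i)
  MultiHead(X) = Concat(head_1, ..., head_h) WO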

20 of 39

Multi-Headed Attention

21 of 39

Multi-Headed Attention

As we encode the word "it", one attention head focuses most on "the animal" while another focuses on "tired"; in a sense, the model's representation of "it" bakes in some of the representation of both "animal" and "tired".

If we add all the attention heads to the picture, however, things can be harder to interpret.

22 of 39

Transformer

Positional Encoding:

The transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
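For reference, the fixed sinusoidal encoding from Attention Is All You Need is one common choice (learned positional embeddings are another). For position pos and embedding dimension index i:

  PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
  PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))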

23 of 39

Transformer

24 of 39

Transformer

25 of 39

Transformer

Attention Is All You Need

https://arxiv.org/abs/1706.03762

26 of 39

Transformer

27 of 39

Transformer


28 of 39

Vision Transformer (ViT)

Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained from scratch on ImageNet.

29 of 39

Vision Transformer (ViT)

The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased towards recognizing textures rather than shapes; see [1] for an excellent example of such a case.

[1] ImageNet-Trained CNNs Are Biased Towards Texture: https://arxiv.org/pdf/1811.12231

30 of 39

Vision Transformer (ViT) vs CNNs

  • Neuroscience studies (The importance of shape in early lexical learning [1]) showed that object shape is the single most important cue for human object recognition.
  • By studying the human visual pathway for object recognition, researchers found that the perception of object shape is invariant to most perturbations. So, as far as we know, shape is the most reliable cue.
  • Intuitively, the object shape remains relatively stable, while other cues can be easily distorted by all sorts of noise [2].

[1] https://psycnet.apa.org/doi/10.1016/0885-2014(88)90014-7

[2] https://arxiv.org/abs/1811.12231

31 of 39

Vision Transformer (ViT) vs CNNs

Accuracies and example stimuli for five different experiments without cue conflict.

Source: https://arxiv.org/abs/1811.12231

32 of 39

Vision Transformer (ViT) vs CNNs

  • The texture is not sufficient for determining whether the zebra is rotated. Thus, predicting rotation requires modeling shape, to some extent.
  • The object's shape can be invariant to rotations.

33 of 39

Vision Transformer (ViT) vs CNNs

The self-attention captures long-range dependencies and contextual information in the input data.

The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand.

Raw images (Left) and attention maps of ViT-S/16 with (Right) and without (Middle).

https://arxiv.org/abs/2106.01548

34 of 39

Vision Transformer (ViT) vs CNNs

The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.

[1] Self-Supervised Vision Transformers with DINO: https://arxiv.org/abs/2104.14294

35 of 39

Large Language Models

36 of 39

Language + Vision Models

37 of 39

Language + Vision Models

38 of 39

More on Transformer Code

Follow Sasha Rush’s tutorial: https://nlp.seas.harvard.edu/annotated-transformer/

39 of 39

Questions?

  • Thank you!