Assignment 4 and Transformers
CSE 447 / 517
FEB 20TH, 2025 (WEEK 7)
Logistics
Agenda
Assignment 4
N-Gram Word-Level Language Models
Get log-probability
For each index i from N-1 to the length of words, add log P(words[i] | the preceding (N-1)-gram) to a running total
Tip: helper function get_word_prob: takes in an (N-1)-gram and a next word, and returns the probability of that word completing the N-gram
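A minimal sketch of this scoring loop, assuming the model's order is stored as self.N and get_word_prob is the helper described above (the name get_log_prob and any start-of-sequence padding are illustrative, not prescribed by the assignment):

import math

def get_log_prob(self, words):
    # Sum log P(words[i] | preceding (N-1)-gram) over the sequence.
    log_prob = 0.0
    for i in range(self.N - 1, len(words)):
        context = tuple(words[i - (self.N - 1):i])      # the (N-1) preceding words
        prob = self.get_word_prob(context, words[i])
        if prob == 0.0:
            return float("-inf")                        # unseen n-gram, no smoothing yet
        log_prob += math.log(prob)
    return log_prob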
N-Gram Word-Level Language Models
Get next word probability
Tip: helper function get_next_word_probs: takes in an (N-1)-gram and returns a dictionary mapping each word to its probability of following that (N-1)-gram
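A sketch under the assumption that counts of observed continuations are stored per context in a dictionary of Counters named self.ngram_counts (a hypothetical data structure; adapt to however you store counts):

from collections import Counter

def get_next_word_probs(self, context):
    # Unsmoothed distribution over words observed after this (N-1)-gram.
    next_counts = self.ngram_counts.get(tuple(context), Counter())
    total = sum(next_counts.values())
    if total == 0:
        return {}                                       # unseen context; smoothing handles this later
    return {word: count / total for word, count in next_counts.items()}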
Laplace and Add-Lambda Smoothing
Get log-probability
For each index i from N-1 to the length of words, add the log of the smoothed probability of words[i] given the preceding (N-1)-gram
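The scoring loop itself is unchanged; only the word-probability estimate changes. A sketch of an add-lambda get_word_prob, assuming hypothetical count stores self.ngram_counts and self.context_counts, a vocabulary self.vocab, and a smoothing weight self.lam (lambda = 1 gives Laplace smoothing):

def get_word_prob(self, context, word):
    # P(w | c) = (count(c, w) + lambda) / (count(c) + lambda * |V|)
    context = tuple(context)
    count_cw = self.ngram_counts.get(context, {}).get(word, 0)
    count_c = self.context_counts.get(context, 0)
    return (count_cw + self.lam) / (count_c + self.lam * len(self.vocab))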
Laplace and Add-Lambda Smoothing
Get next word probability
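With smoothing, every vocabulary word gets nonzero mass, so the next-word distribution is taken over the full vocabulary rather than only observed continuations. A short sketch reusing the smoothed get_word_prob above (self.vocab is assumed to include the end-of-sequence token):

def get_next_word_probs(self, context):
    # Smoothed distribution over the entire vocabulary.
    return {word: self.get_word_prob(context, word) for word in self.vocab}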
Transformers
Slides adapted from Hichem Felouat - hichemfel@nii.ac.jp - 2024
Self Attention
Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
Self Attention
Dot product
Self Attention
Attention Code
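A minimal PyTorch sketch of scaled dot-product self-attention matching the query/key/value picture above; the function and matrix names are illustrative, not the slide's actual code:

import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k) projection matrices.
    Q = x @ W_q                                      # queries
    K = x @ W_k                                      # keys
    V = x @ W_v                                      # values
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (seq_len, seq_len) dot products
    weights = F.softmax(scores, dim=-1)              # one attention distribution per query
    return weights @ V                               # weighted sum of value vectors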
Multi-Headed Attention
As we encode the word "it", one attention head focuses most on "the animal" while another focuses on "tired"; in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
If we add all the attention heads to the picture, however, things can be harder to interpret.
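A compact multi-head variant, sketched under the usual assumption that d_model splits evenly across heads (no masking or dropout; illustrative only, not the assignment's code):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)       # combine the concatenated heads

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)       # (batch, heads, seq_len, d_head)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)             # a separate attention pattern per head
        context = (weights @ v).transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out(context)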
Transformer
Positional Encoding:
The transformer adds a positional vector to each input embedding. These vectors follow a specific pattern (fixed sinusoids in the original paper, or learned position embeddings), which helps the model determine the position of each word and the distance between different words in the sequence.
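One concrete instance is the fixed sinusoidal encoding from "Attention Is All You Need"; a sketch, assuming d_model is even:

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)             # even dimensions
    angle = pos / (10000 ** (i / d_model))                           # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                        # added to the input embeddings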
Transformer
Attention Is All You Need
Transformer
Vision Transformer (ViT)
Performance benchmark comparison of Vision Transformers (ViT) with ResNet and MobileNet when trained from scratch on ImageNet.
Vision Transformer (ViT)
The authors in [1] demonstrated that CNNs trained on ImageNet are strongly biased towards recognizing textures rather than shapes. Below is an excellent example of such a case:
[1] ImageNet-trained CNNs are biased towards texture. https://arxiv.org/pdf/1811.12231
Vision Transformer (ViT) vs CNNs
Accuracies and example stimuli for five different experiments without cue conflict.
Source: https://arxiv.org/abs/1811.12231
Vision Transformer (ViT) vs CNNs
Self-attention captures long-range dependencies and contextual information in the input data.
The self-attention mechanism allows a ViT model to attend to different regions of the input data based on their relevance to the task at hand.
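To attend over an image at all, a ViT first turns it into a token sequence. A hedged sketch of the standard patch-embedding step (a Conv2d with kernel = stride = patch size performs "split into patches and linearly project" in one shot; all sizes and names are illustrative):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=384):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))               # [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model)) # learned positions

    def forward(self, images):                       # images: (batch, 3, H, W)
        x = self.proj(images)                        # (batch, d_model, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)             # (batch, num_patches, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)               # prepend [CLS] to the patch tokens
        return x + self.pos_embed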
Raw images (Left) and attention maps of ViT-S/16 with (Right) and without (Middle).
Vision Transformer (ViT) vs CNNs
The authors in [1] looked at the self-attention of the CLS token on the heads of the last layer. Crucially, no labels are used during the self-supervised training. These maps demonstrate that the learned class-specific features lead to remarkable unsupervised segmentation masks and visibly correlate with the shape of semantic objects in the images.
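A small sketch of how such maps can be read off, assuming you already have the last block's softmaxed attention weights as a tensor of shape (num_heads, num_tokens, num_tokens) with token 0 being [CLS] (a hypothetical layout; how you extract the weights depends on the model code):

import torch

def cls_attention_maps(attn_weights, grid_size):
    # How strongly [CLS] attends to each image patch, per head, as a 2D map.
    cls_to_patches = attn_weights[:, 0, 1:]          # drop attention to [CLS] itself
    num_heads = cls_to_patches.shape[0]
    return cls_to_patches.reshape(num_heads, grid_size, grid_size)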
[1] Emerging Properties in Self-Supervised Vision Transformers (DINO). https://arxiv.org/abs/2104.14294
Large Language Models
Language + Vision Models
More on Transformer Code
Follow Sasha Rush’s tutorial: https://nlp.seas.harvard.edu/annotated-transformer/
Questions?