1 of 44

Vision Transformers

CS5670: Computer Vision

2 of 44

Announcements

  • Final deliverables: Project 5 (Parts A & B, Part A due tomorrow)
  • Final exam, in class May 6

  • Course evaluations coming up, worth a small amount of extra credit

3 of 44

Readings

  • Szeliski 2nd Edition, Chapter 5.5.3
  • Foundations of Computer Vision

4 of 44

Recall: ConvNets

5 of 44

ConvNets assume spatial locality

  • Assume nearby pixels are more important to making decisions than far away pixels (an example of an “inductive bias”)
  • Only after stacking together several convolutional layers with spatial downsampling can distant pixels “talk” to each other
  • As image datasets grow, we can do better by removing the spatial locality assumption and learning how to process images from scratch

6 of 44

An alternative to convolution: Attention

  • Goal: consider long-range relationships between pixels

7 of 44

An alternative to convolution: Attention

Step 1: Break image into patches

8 of 44

An alternative to convolution: Attention

Step 1: Break image into patches
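A minimal NumPy sketch of this step, assuming a square RGB image whose side length is divisible by the patch size:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, 3) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * 3).
    Assumes H and W are both divisible by patch_size.
    """
    H, W, C = image.shape
    return (image
            .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
            .transpose(0, 2, 1, 3, 4)                    # (rows, cols, p, p, C)
            .reshape(-1, patch_size * patch_size * C))   # one row per patch

# Example: a 64x64 RGB image with 16x16 patches -> 16 patches of length 768
image = np.random.rand(64, 64, 3)
print(patchify(image).shape)  # (16, 768)
```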

9 of 44

An alternative to convolution: Attention

Step 2: Map each patch to three vectors:

Query (Q), Key (K), and Value (V)

[Figure: linear mappings take patch 1 to Q1, K1, V1]

10 of 44

An alternative to convolution: Attention

Step 3: For each patch, compare its query vector to all key vectors

[Figure: e.g., query Q1 is compared against keys K1, ..., K5, ..., giving an attention weight (such as 0.2) for each pair]

11 of 44

An alternative to convolution: Attention

Step 3: For each patch, compare its query vector to all key vectors

12 of 44

An alternative to convolution: Attention

Step 4: Compute weighted sum of value vectors

[Figure: the weighted sum of the value vectors gives a new vector y1]

13 of 44

An alternative to convolution: Attention

Step 5: Repeat for all patches

[Figure: output vectors y1 through y16, one per input patch]

14 of 44

An alternative to convolution: Attention

Result: we’ve transformed all of the input patches into new vectors, by comparing vectors derived from all pairs of patches

This operation is called attention – the network can choose, for each patch, which other patches to attend to (i.e., give high weight to)

Unlike in convolution, each patch is allowed to talk to the entire image

Attention is a set-to-set operation – it is equivariant to permuting the patches


15 of 44

An alternative to convolution: Attention

Parameters: weight matrices Wq, Wk, Wv that map input patches to query, key, and value vectors
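Steps 2–5 together amount to (scaled dot-product) self-attention. A minimal NumPy sketch, with random placeholder matrices standing in for the learned Wq, Wk, Wv:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a set of patch vectors.

    X:          (num_patches, d_in)  input patch features (tokens)
    Wq, Wk, Wv: (d_in, d)            learned linear mappings
    Returns Y:  (num_patches, d)     one new vector per patch
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Step 2: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # Step 3: compare each query to all keys
    weights = softmax(scores)                 # each row sums to 1 (attention weights)
    return weights @ V                        # Steps 4-5: weighted sums of values, for all patches

# 16 patches with 768-dim tokens mapped to 64-dim outputs (random placeholder weights)
X = np.random.randn(16, 768)
Wq, Wk, Wv = (np.random.randn(768, 64) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)  # (16, 64)
```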


16 of 44

Details

  • Rather than working with raw RGB image patches, the patches can themselves be features (e.g., produced by a linear mapping from RGB patches, or the output of a CNN). (These vectors are often called tokens)
  • The feature vectors produced by the attention layer are often passed through an MLP (adding more parameters to the system)
  • Each patch can be combined with a positional encoding indicating the spatial location of the patch, enabling spatial reasoning
  • Instead of single Wq, Wk, Wv weight matrices, multiple linear mappings can be learned for an attention layer, and the resulting features concatenated (multi-headed attention); this and the positional encoding are sketched below
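A sketch of two of the details above, assuming (as in ViT) a learned positional embedding added to each token, and two attention heads whose outputs are concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 16, 64

# Tokens, e.g. from a linear mapping of RGB patches (random placeholders here).
tokens = rng.standard_normal((num_patches, d))

# Positional encoding: add a learned per-position vector so the model knows where each patch is.
pos_embed = rng.standard_normal((num_patches, d))     # learned in practice
tokens = tokens + pos_embed

# Multi-headed attention: several (Wq, Wk, Wv) triples; the head outputs are concatenated.
def one_head(X, d_head):
    Wq, Wk, Wv = (rng.standard_normal((X.shape[1], d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

heads = [one_head(tokens, d_head=32) for _ in range(2)]
print(np.concatenate(heads, axis=-1).shape)           # (16, 64)
```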

17 of 44

Transformers

  • Just like any network layer, we can stack attention layers – the output of one becomes the input to the next – to form a bigger network, called a transformer
  • Transformers are very large, powerful learners: freed from convolution's locality assumption, they can represent a broader class of functions than convolutional networks (one block is sketched below)
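A minimal sketch of one such stacked layer (a transformer block) under the usual recipe of self-attention followed by an MLP, each with a residual connection and normalization; the head count and MLP width are simplified here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One block: self-attention and a small MLP, each wrapped in a residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    X = layer_norm(X + A @ V)                        # attention + residual
    X = layer_norm(X + np.maximum(X @ W1, 0) @ W2)   # ReLU MLP + residual
    return X

# Stacking: the output of one block becomes the input to the next.
d = 64
X = np.random.randn(16, d)
blocks = [tuple(0.1 * np.random.randn(d, d) for _ in range(5)) for _ in range(4)]
for Wq, Wk, Wv, W1, W2 in blocks:
    X = transformer_block(X, Wq, Wk, Wv, W1, W2)
print(X.shape)  # (16, 64)
```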

18 of 44

Vision Transformer (ViT)

  • The network defined so far is designed for image classification, and roughly follows:

[Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021]

19 of 44

Vision Transformer (ViT)

How is the output class computed?

20 of 44

Vision Transformer (ViT)

How is the output class computed?

At the time, ViT outperformed CNN-based approaches on image classification tasks
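In the original ViT, the class is read out via a learnable "class token" prepended to the patch tokens; after the transformer layers, that token's output vector is fed to a small classification head. A sketch with placeholder weights (the transformer layers themselves are elided):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d, num_classes = 16, 64, 1000

patch_tokens = rng.standard_normal((num_patches, d))
cls_token = rng.standard_normal((1, d))              # learned, shared across all images
tokens = np.concatenate([cls_token, patch_tokens])   # (17, 64)

# ... the stacked transformer blocks would update `tokens` here ...

W_head = rng.standard_normal((d, num_classes))       # learned classification head
logits = tokens[0] @ W_head                          # read out only the class token
print(logits.shape)  # (1000,)
```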

21 of 44

Vision Transformer (ViT)

  • Note: this is just one possible approach – lots of other variants of transformers for vision tasks exist!
  • (For instance, combinations of transformers and CNNs)

22 of 44

DPT: Dense Prediction Transformers [Ranftl et al., 2021]

  • Predicts an image-shaped output (e.g., segmentation map or depth map) from an image-shaped input

[Figure: DPT architecture, including its Reassemble and Fusion operations]

23 of 44

DPT: Depth prediction results

[Figure: depth prediction results: input image, MiDaS (CNN-based), DPT (Transformer)]

24 of 44

DPT: Attention maps

[Figure: input image, depth prediction, and attention maps for the upper-right and lower-right corners]

25 of 44

Questions?

26 of 44

Other types of transformers

27 of 44

Autoregressive transformer methods

  • Common for LLMs
  • Outputs are generated one by one, where each output (word, patch, etc.) is fed back to the network to generate the next (a decoding loop is sketched below)
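A schematic of that loop, with a hypothetical `next_token_distribution` function standing in for the trained decoder; sampling from the distribution, rather than always taking the most likely token, is the randomness mentioned a few slides later:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, end_token = 50000, 20, 0

def next_token_distribution(tokens_so_far):
    """Hypothetical stand-in for the decoder: returns p(next token | tokens so far)."""
    logits = rng.standard_normal(vocab_size)
    p = np.exp(logits - logits.max())
    return p / p.sum()

tokens = []                                   # start from an empty output
for _ in range(max_len):
    p = next_token_distribution(tokens)
    nxt = int(rng.choice(vocab_size, p=p))    # sample the next token (some randomness)
    tokens.append(nxt)
    if nxt == end_token:                      # stop at an end-of-sequence token
        break
print(tokens)
```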

28 of 44

“Who was the 16th President of the United States?”

[Figure: the question's words ("Who", "was", "the", ..., "States") are fed into "The Encoder"]

29 of 44

“Who was the 16th President of the United States?”

[Figure: "The Encoder" maps the input words to a condensed input representation]

30 of 44

“Who was the 16th President of the United States?”

[Figure: the condensed input representation is passed to "The Decoder", whose output starts out empty]

31 of 44

“Who was the 16th President of the United States?”

[Figure: "The Decoder" emits "Abraham", then "Lincoln", one token at a time]

Token is fed back into the decoder, next token predicted, etc. Some randomness is used to select each next token.

Also called a “sequence to sequence model”

32 of 44

Also works for images

33 of 44

Also works for images

Idea: generate an image by producing each (tokenized) patch in raster-scan order
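A schematic of raster-order generation; `next_patch_token` and `codebook` are hypothetical stand-ins for the learned decoder and for the tokenizer that maps discrete tokens back to RGB patches:

```python
import numpy as np

rng = np.random.default_rng(0)
grid, patch, vocab = 4, 16, 1024                    # 4x4 grid of 16x16 patches

codebook = rng.random((vocab, patch, patch, 3))     # token id -> RGB patch (learned in practice)

def next_patch_token(tokens_so_far):
    """Hypothetical stand-in for the decoder: p(next patch token | patches so far)."""
    logits = rng.standard_normal(vocab)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Generate one token per grid cell, in raster-scan order (row by row, left to right).
tokens = []
for _ in range(grid * grid):
    p = next_patch_token(tokens)
    tokens.append(int(rng.choice(vocab, p=p)))

# Detokenize: place each generated patch back onto the image grid.
image = np.zeros((grid * patch, grid * patch, 3))
for idx, t in enumerate(tokens):
    r, c = divmod(idx, grid)
    image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = codebook[t]
print(image.shape)  # (64, 64, 3)
```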

34 of 44

Parti Text-to-image model

Cross-attention between text prompt and generated image patches

35 of 44

“A frog reading a newspaper”

[Figure: text comes in: the prompt's words ("A", "frog", "reading", ...) are encoded by "The Encoder"; image patches come out of "The Decoder" in raster order]

36 of 44

Autoregressive models

  • Why output tokens one by one instead of all at once?
  • Answer: autoregressive models are better for generative tasks like text-to-image, where there are many correct answers
  • (Diffusion models are also good for such tasks)
  • Autoregressive methods can lead to more diverse, consistent, and high-quality outputs in generative tasks
  • (An important component in the process is some randomness.)

37 of 44

Non-autoregressive models

  • Output all tokens at once (e.g., DPT)
  • Well-suited for when the output is more deterministic (“one correct answer”)
  • Example: view interpolation

38 of 44

Large View Synthesis Model (LVSM)

[Figure: LVSM architecture. Input views & their poses become input tokens, and the target view pose becomes target tokens; LVSM produces updated target tokens, which yield the synthesized target view]

39 of 44

Large View Synthesis Model (LVSM)

[Figure: the same diagram, highlighting that LVSM is one big transformer model]

40 of 44

Tokenize & Detokenize

[Figure: input views & their Plücker rays, and the target view's Plücker rays, are patchified & projected into input tokens and target tokens; LVSM produces updated target tokens, which are projected & unpatchified into the synthesized target view]

41 of 44

LVSM Results

42 of 44

CNNs vs Transformers

  • Transformers have generally been seen as more scalable (benefiting more from larger datasets and compute) and as the more modern architecture

43 of 44

44 of 44

Questions?