1 of 44

Vision Transformers

CS5670: Computer Vision

2 of 44

Announcements

  • Final deliverables: Project 5 (Parts A & B, Part A due tomorrow)
  • Final exam, in class May 6

  • Course evaluations coming up, worth a small amount of extra credit

3 of 44

Readings

  • Szeliski 2nd Edition, Chapter 5.5.3
  • Foundations of Computer Vision

4 of 44

Recall: ConvNets

5 of 44

ConvNets assume spatial locality

  • Assume nearby pixels are more important to making decisions than far away pixels (an example of an “inductive bias”)
  • Only after stacking together several convolutional layers with spatial downsampling can distant pixels “talk” to each other
  • As image datasets grow, we can do better by removing the spatial locality assumption and learning how to process images from scratch

6 of 44

An alternative to convolution: Attention

  • Goal: consider long-range relationships between pixels

7 of 44

An alternative to convolution: Attention

Step 1: Break image into patches

8 of 44

An alternative to convolution: Attention

Step 1: Break image into patches
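A minimal NumPy sketch of this step, assuming a square RGB image whose side length is divisible by the patch size:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, 3) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * 3).
    Assumes H and W are both divisible by patch_size.
    """
    H, W, C = image.shape
    return (image
            .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
            .transpose(0, 2, 1, 3, 4)                    # (rows, cols, p, p, C)
            .reshape(-1, patch_size * patch_size * C))   # one row per patch

# Example: a 64x64 RGB image with 16x16 patches -> 16 patches of length 768
image = np.random.rand(64, 64, 3)
print(patchify(image).shape)  # (16, 768)
```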

9 of 44

An alternative to convolution: Attention

Step 2: Map each patch to three vectors:

Query (Q), Key (K), and Value (V)

[Figure: linear mappings take patch 1 to Q1, K1, V1]

10 of 44

An alternative to convolution: Attention

Step 3: For each patch, compare its query vector to all key vectors

[Figure: e.g., query Q1 is compared against keys K1, ..., K5, ..., giving an attention weight (such as 0.2) for each pair]

11 of 44

An alternative to convolution: Attention

Step 3: For each patch, compare its query vector to all key vectors

12 of 44

An alternative to convolution: Attention

Step 4: Compute weighted sum of value vectors

[Figure: the weighted sum of the value vectors gives a new vector y1]

13 of 44

An alternative to convolution: Attention

Step 5: Repeat for all patches

[Figure: output vectors y1 through y16, one per input patch]

14 of 44

An alternative to convolution: Attention

Result: we’ve transformed all of the input patches into new vectors, by comparing vectors derived from all pairs of patches

This operation is called attention – the network can choose, for each patch, which other patches to attend to (i.e., give high weight to)

Unlike in convolution, each patch is allowed to talk to the entire image

Attention is a set-to-set operation – it is equivariant to permuting the patches


15 of 44

An alternative to convolution: Attention

Parameters: weight matrices Wq, Wk, Wv that map input patches to query, key, and value vectors
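Steps 2–5 together amount to (scaled dot-product) self-attention. A minimal NumPy sketch, with random placeholder matrices standing in for the learned Wq, Wk, Wv:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a set of patch vectors.

    X:          (num_patches, d_in)  input patch features (tokens)
    Wq, Wk, Wv: (d_in, d)            learned linear mappings
    Returns Y:  (num_patches, d)     one new vector per patch
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Step 2: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # Step 3: compare each query to all keys
    weights = softmax(scores)                 # each row sums to 1 (attention weights)
    return weights @ V                        # Steps 4-5: weighted sums of values, for all patches

# 16 patches with 768-dim tokens mapped to 64-dim outputs (random placeholder weights)
X = np.random.randn(16, 768)
Wq, Wk, Wv = (np.random.randn(768, 64) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)  # (16, 64)
```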


16 of 44

Details

  • Rather than working with raw RGB image patches, the patches can themselves be features (e.g., produced by a linear mapping from RGB patches, or the output of a CNN). (These vectors are often called tokens)
  • The feature vectors produced by the attention layer are often passed through an MLP (adding more parameters to the system)
  • Each patch can be combined with a positional encoding indicating the spatial location of the patch, enabling spatial reasoning
  • Instead of single Wq, Wk, Wv weight matrices, multiple linear mappings can be learned for an attention layer, and the resulting features concatenated (multi-headed attention); this and the positional encoding are sketched below
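A sketch of two of the details above, assuming (as in ViT) a learned positional embedding added to each token, and two attention heads whose outputs are concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d = 16, 64

# Tokens, e.g. from a linear mapping of RGB patches (random placeholders here).
tokens = rng.standard_normal((num_patches, d))

# Positional encoding: add a learned per-position vector so the model knows where each patch is.
pos_embed = rng.standard_normal((num_patches, d))     # learned in practice
tokens = tokens + pos_embed

# Multi-headed attention: several (Wq, Wk, Wv) triples; the head outputs are concatenated.
def one_head(X, d_head):
    Wq, Wk, Wv = (rng.standard_normal((X.shape[1], d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

heads = [one_head(tokens, d_head=32) for _ in range(2)]
print(np.concatenate(heads, axis=-1).shape)           # (16, 64)
```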

17 of 44

Transformers

  • Just like any network layer, we can stack attention layers – the output of one becomes the input to the next – to form a bigger network, called a transformer
  • Transformers are very large, powerful learners: freed from convolution's locality assumption, they can represent a broader class of functions than convolutional networks (one block is sketched below)
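A minimal sketch of one such stacked layer (a transformer block) under the usual recipe of self-attention followed by an MLP, each with a residual connection and normalization; the head count and MLP width are simplified here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def transformer_block(X, Wq, Wk, Wv, W1, W2):
    """One block: self-attention and a small MLP, each wrapped in a residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    X = layer_norm(X + A @ V)                        # attention + residual
    X = layer_norm(X + np.maximum(X @ W1, 0) @ W2)   # ReLU MLP + residual
    return X

# Stacking: the output of one block becomes the input to the next.
d = 64
X = np.random.randn(16, d)
blocks = [tuple(0.1 * np.random.randn(d, d) for _ in range(5)) for _ in range(4)]
for Wq, Wk, Wv, W1, W2 in blocks:
    X = transformer_block(X, Wq, Wk, Wv, W1, W2)
print(X.shape)  # (16, 64)
```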

18 of 44

Vision Transformer (ViT)

  • The network defined so far is designed for image classification, and roughly follows:

[Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR 2021]

19 of 44

Vision Transformer (ViT)

How is the output class computed?

20 of 44

Vision Transformer (ViT)

How is the output class computed?

At the time, ViT outperformed CNN-based approaches on image classification tasks
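In the original ViT, the class is read out via a learnable "class token" prepended to the patch tokens; after the transformer layers, that token's output vector is fed to a small classification head. A sketch with placeholder weights (the transformer layers themselves are elided):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d, num_classes = 16, 64, 1000

patch_tokens = rng.standard_normal((num_patches, d))
cls_token = rng.standard_normal((1, d))              # learned, shared across all images
tokens = np.concatenate([cls_token, patch_tokens])   # (17, 64)

# ... the stacked transformer blocks would update `tokens` here ...

W_head = rng.standard_normal((d, num_classes))       # learned classification head
logits = tokens[0] @ W_head                          # read out only the class token
print(logits.shape)  # (1000,)
```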

21 of 44

Vision Transformer (ViT)

  • Note: this is just one possible approach – lots of other variants of transformers for vision tasks exist!
  • (For instance, combinations of transformers and CNNs)

22 of 44

DPT: Dense Prediction Transformers [Ranftl et al., 2021]

  • Predicts an image-shaped output (e.g., segmentation map or depth map) from an image-shaped input

[Figure: DPT architecture, including its Reassemble and Fusion operations]

23 of 44

DPT: Depth prediction results

[Figure: depth prediction results: input image, MiDaS (CNN-based), DPT (Transformer)]

24 of 44

DPT: Attention maps

[Figure: input image, depth prediction, and attention maps for the upper-right and lower-right corners]

25 of 44

Questions?

26 of 44

Other types of transformers

27 of 44

Autoregressive transformer methods

  • Common for LLMs
  • Outputs are generated one by one, where each output (word, patch, etc.) is fed back to the network to generate the next (a decoding loop is sketched below)
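A schematic of that loop, with a hypothetical `next_token_distribution` function standing in for the trained decoder; sampling from the distribution, rather than always taking the most likely token, is the randomness mentioned a few slides later:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, end_token = 50000, 20, 0

def next_token_distribution(tokens_so_far):
    """Hypothetical stand-in for the decoder: returns p(next token | tokens so far)."""
    logits = rng.standard_normal(vocab_size)
    p = np.exp(logits - logits.max())
    return p / p.sum()

tokens = []                                   # start from an empty output
for _ in range(max_len):
    p = next_token_distribution(tokens)
    nxt = int(rng.choice(vocab_size, p=p))    # sample the next token (some randomness)
    tokens.append(nxt)
    if nxt == end_token:                      # stop at an end-of-sequence token
        break
print(tokens)
```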

28 of 44

“Who was the 16th President of the United States?”

[Figure: the question's words ("Who", "was", "the", ..., "States") are fed into "The Encoder"]

29 of 44

“Who was the 16th President of the United States?”

[Figure: "The Encoder" maps the input words to a condensed input representation]

30 of 44

“Who was the 16th President of the United States?”

[Figure: the condensed input representation is passed to "The Decoder", whose output starts out empty]

31 of 44

“Who was the 16th President of the United States?”

[Figure: "The Decoder" emits "Abraham", then "Lincoln", one token at a time]

Token is fed back into the decoder, next token predicted, etc. Some randomness is used to select each next token.

Also called a “sequence to sequence model”

32 of 44

Also works for images

33 of 44

Also works for images

Idea: generate an image by producing each (tokenized) patch in raster-scan order
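A schematic of raster-order generation; `next_patch_token` and `codebook` are hypothetical stand-ins for the learned decoder and for the tokenizer that maps discrete tokens back to RGB patches:

```python
import numpy as np

rng = np.random.default_rng(0)
grid, patch, vocab = 4, 16, 1024                    # 4x4 grid of 16x16 patches

codebook = rng.random((vocab, patch, patch, 3))     # token id -> RGB patch (learned in practice)

def next_patch_token(tokens_so_far):
    """Hypothetical stand-in for the decoder: p(next patch token | patches so far)."""
    logits = rng.standard_normal(vocab)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Generate one token per grid cell, in raster-scan order (row by row, left to right).
tokens = []
for _ in range(grid * grid):
    p = next_patch_token(tokens)
    tokens.append(int(rng.choice(vocab, p=p)))

# Detokenize: place each generated patch back onto the image grid.
image = np.zeros((grid * patch, grid * patch, 3))
for idx, t in enumerate(tokens):
    r, c = divmod(idx, grid)
    image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = codebook[t]
print(image.shape)  # (64, 64, 3)
```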

34 of 44

Parti Text-to-image model

Cross-attention between text prompt and generated image patches

35 of 44

“A frog reading a newspaper”

[Figure: text comes in: the prompt's words ("A", "frog", "reading", ...) are encoded by "The Encoder"; image patches come out of "The Decoder" in raster order]

36 of 44

Autoregressive models

  • Why output tokens one by one instead of all at once?
  • Answer: autoregressive models are better for generative tasks like text-to-image, where there are many correct answers
  • (Diffusion models are also good for such tasks)
  • Autoregressive methods can lead to more diverse, consistent, and high-quality outputs in generative tasks
  • (An important component in the process is some randomness.)

37 of 44

Non-autoregressive models

  • Output all tokens at once (e.g., DPT)
  • Well-suited for when the output is more deterministic (“one correct answer”)
  • Example: view interpolation

38 of 44

Large View Synthesis Model (LVSM)

[Figure: LVSM architecture. Input views & their poses become input tokens, and the target view pose becomes target tokens; LVSM produces updated target tokens, which yield the synthesized target view]

39 of 44

Large View Synthesis Model (LVSM)

[Figure: the same diagram, highlighting that LVSM is one big transformer model]

40 of 44

Tokenize & Detokenize

[Figure: input views & their Plücker rays, and the target view's Plücker rays, are patchified & projected into input tokens and target tokens; LVSM produces updated target tokens, which are projected & unpatchified into the synthesized target view]

41 of 44

LVSM Results

42 of 44

CNNs vs Transformers

  • Transformers have generally been seen as more scalable (benefiting more from larger datasets and compute) and as the more modern architecture

43 of 44

44 of 44

Questions?