Vision Transformers
CS5670: Computer Vision
Announcements
Readings
Recall: ConvNets
Elgendy, Deep Learning for Vision Systems, https://livebook.manning.com/book/grokking-deep-learning-for-computer-vision/chapter-5/v-3/
ConvNets assume spatial locality
An alternative to convolution: Attention
Step 1: Break image into patches
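A minimal NumPy sketch of Step 1, assuming a square RGB image and non-overlapping 16x16 patches (sizes are illustrative, chosen to match ViT-style inputs):

import numpy as np

# Illustrative sizes: a 224x224 RGB image cut into 16x16 patches.
H = W = 224
P = 16
img = np.random.rand(H, W, 3)

# Rearrange into a (num_patches, patch_dim) matrix: one flattened patch per row.
patches = img.reshape(H // P, P, W // P, P, 3)   # split both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4)       # group the two grid axes first
patches = patches.reshape(-1, P * P * 3)         # (196, 768)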
An alternative to convolution: Attention
Step 2: Map each patch to three vectors:
Query (Q), Key (K), and Value (V)
[Diagram: patch 1 is sent through three linear mappings to produce Q1, K1, V1]
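A sketch of Step 2 for a single patch, with randomly initialized stand-ins for the learned matrices Wq, Wk, Wv (dimensions are assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
d_in, d = 768, 64                  # flattened-patch dim, Q/K/V dim (illustrative)
Wq, Wk, Wv = (0.02 * rng.standard_normal((d_in, d)) for _ in range(3))

x1 = rng.standard_normal(d_in)     # flattened patch 1
q1, k1, v1 = x1 @ Wq, x1 @ Wk, x1 @ Wv   # its query, key, and value vectors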
An alternative to convolution: Attention
[Diagram: each patch i is mapped to its own Qi, Ki, Vi by the same linear mappings]
Step 3: For each patch, compare its query vector to all key vectors
[Diagram: dot products of Q1 against each Ki give one attention weight per patch, e.g. 0.2]
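Step 3 as a sketch: dot products of one patch's query against every patch's key, turned into weights with a softmax. The 1/sqrt(d) scaling is the standard scaled dot-product convention, not stated on the slide:

import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 64                       # 16 patches, 64-dim vectors (illustrative)
q1 = rng.standard_normal(d)         # query of patch 1
K = rng.standard_normal((n, d))     # keys of all 16 patches

scores = K @ q1 / np.sqrt(d)        # one similarity score per patch
weights = np.exp(scores)
weights /= weights.sum()            # softmax: nonnegative weights summing to 1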
An alternative to convolution: Attention
Step 4: Compute weighted sum of value vectors
[Diagram: the weighted values combine into a new vector, y1]
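Step 4 continues directly: the weights from Step 3 blend the value vectors into the new vector y1 (a sketch with placeholder numbers):

import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 64
weights = rng.random(n)
weights /= weights.sum()            # stand-in for the Step 3 attention weights
V = rng.standard_normal((n, d))     # values of all 16 patches

y1 = weights @ V                    # weighted average of the value vectors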
An alternative to convolution: Attention
Step 5: Repeat for all patches
[Diagram: attention produces a new vector for each of the 16 patches: y1 … y16]
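Step 5 in matrix form: stacking all queries, keys, and values lets one matrix product compute every yi at once (a sketch; the softmax is applied per row, one row per query):

import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                   # (16, 16): all query-key pairs
scores -= scores.max(axis=1, keepdims=True)     # for numerical stability
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
Y = A @ V                                       # rows of Y are y1 ... y16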
An alternative to convolution: Attention
Result: we’ve transformed all of the input patches into new vectors by comparing vectors derived from all pairs of patches
This operation is called attention – the network can choose, for each patch, which other patches to attend to (i.e., give high weight to)
Unlike convolution, a patch is allowed to talk to the entire image
Attention is a set-to-set operation – it is equivariant to permuting the patches
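The permutation-equivariance claim can be checked numerically; attend below is a hypothetical helper wrapping the attention above:

import numpy as np

def attend(Q, K, V):
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 64)) for _ in range(3))
perm = rng.permutation(16)

# Shuffling the patches shuffles the outputs the same way, and nothing else.
assert np.allclose(attend(Q, K, V)[perm], attend(Q[perm], K[perm], V[perm]))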
An alternative to convolution: Attention
Parameters: weight matrices Wq, Wk, Wv that map input patches to query, key, and value vectors
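Putting the parameters in context: under illustrative ViT-Base-like sizes (768-dim patch embeddings, 64-dim queries/keys/values; assumed numbers, not from the slides), one attention head carries:

d_in, d = 768, 64
per_matrix = d_in * d          # 49,152 weights in each of Wq, Wk, Wv
total = 3 * per_matrix         # 147,456 learnable weights per attention head
print(per_matrix, total)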
Details
Transformers
Vision Transformer (ViT)
[Dosovitskiy et al., “An Image is Worth 16x16 Words”, ICLR 2021]
Vision Transformer (ViT)
How is the output class computed?
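In ViT the answer is a learnable classification token: it is prepended to the patch tokens, and an MLP head reads the class from that token's encoder output. A schematic sketch, with the encoder replaced by an identity stand-in:

import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 196, 768, 1000
patch_tokens = rng.standard_normal((n, d))
cls_token = rng.standard_normal((1, d))             # learnable [CLS] token

tokens = np.concatenate([cls_token, patch_tokens])  # (197, 768)
encoded = tokens                                    # stand-in for the transformer
W_head = 0.02 * rng.standard_normal((d, n_classes))
logits = encoded[0] @ W_head                        # classify from the [CLS] output
predicted_class = logits.argmax()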
At the time, outperformed CNN-based approaches on image classification tasks
Vision Transformer (ViT)
DPT: Dense Prediction Transformers [Ranftl et al., 2021]
DPT architecture
Reassemble operation
Fusion operation
DPT: Depth prediction results
[Figure: depth maps compared side by side: Input | MiDaS (CNN-based) | DPT (Transformer)]
DPT: Attention maps
[Figure: input, depth prediction, and attention maps for the upper-right and lower-right corners]
Questions?
Other types of transformers
Autoregressive transformer methods
“Who was the 16th President of the United States?”
[Diagram: the input tokens (Who, was, the, …, States) feed “The Encoder”, which produces a condensed input representation; “The Decoder”, starting from [Empty], predicts the first output token: Abraham]
Token is fed back into the decoder, next token predicted, etc. Some randomness is used to select each next token.
[Diagram: feeding “Abraham” back in, the decoder predicts the next token: Lincoln]
Also called a “sequence to sequence model”
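A toy sketch of that loop: sample a token, append it, and run the decoder again. next_token_probs is a hypothetical stand-in for a real decoder; the sampling step is the slide's “some randomness”:

import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000

def next_token_probs(tokens):       # hypothetical stand-in for the decoder
    p = rng.random(vocab_size)
    return p / p.sum()

tokens = []                         # the decoder starts from [Empty]
for _ in range(8):
    probs = next_token_probs(tokens)
    tokens.append(int(rng.choice(vocab_size, p=probs)))  # sample the next token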
Also works for images
Idea: generate an image by producing each (tokenized) patch in raster-scan order
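Raster-scan order is just left-to-right, top-to-bottom over the patch grid; a sketch for a hypothetical 4x4 token grid:

# Raster-scan order over a 4x4 grid of image tokens.
rows, cols = 4, 4
order = [(r, c) for r in range(rows) for c in range(cols)]
# Each patch token is predicted conditioned on all earlier positions in `order`.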
Parti Text-to-image model
Cross-attention between text prompt and generated image patches
“A frog reading a newspaper”
[Diagram: the text tokens (A, frog, reading, …) go into “The Encoder”; “The Decoder” emits image patches in raster order, cross-attending to the text]
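Cross-attention is the same machinery with the roles split: queries come from the image tokens being generated, keys and values from the encoded text. A sketch (projection matrices omitted; shapes illustrative):

import numpy as np

rng = np.random.default_rng(0)
d = 64
img_tokens = rng.standard_normal((16, d))   # queries: image patches so far
txt_tokens = rng.standard_normal((6, d))    # keys/values: encoded text prompt

A = np.exp(img_tokens @ txt_tokens.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)           # each image token attends over the text
out = A @ txt_tokens                        # (16, 64): text-conditioned features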
Autoregressive models
Non-autoregressive models
Large View Synthesis Model (LVSM)
[Diagram: the input views & their poses and the target view pose are tokenized into input tokens and target tokens; LVSM produces updated target tokens, which become the synthesized target view]
Big transformer model + Tokenize & Detokenize
[Diagram detail: “Patchify & Project” turns the input views & Plücker rays and the target view’s Plücker rays into input tokens and target tokens; the big LVSM transformer outputs updated target tokens, which “Project & Unpatchify” turns into the synthesized target view]
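The diagram's data flow as a hedged pseudo-structure; every function and shape here is a placeholder, not LVSM's actual implementation:

import numpy as np

rng = np.random.default_rng(0)
d = 256                               # illustrative token width

def patchify_and_project(n_tokens):   # hypothetical "Patchify & Project"
    return rng.standard_normal((n_tokens, d))

def lvsm_transformer(tokens):         # hypothetical stand-in for the big transformer
    return tokens

input_tokens = patchify_and_project(64)    # input views + their Plücker rays
target_tokens = patchify_and_project(16)   # target-view Plücker rays only

updated = lvsm_transformer(np.concatenate([input_tokens, target_tokens]))[-16:]
# "Project & Unpatchify" would turn `updated` into the synthesized target view.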
LVSM Results
CNNs vs Transformers
Questions?