Learning Transferable Visual Models From Natural Language Supervision
Issam Laradji
UBC Alumni
Senior Research Scientist at ServiceNow
© 2022 ServiceNow, Inc. All Rights Reserved. Confidential.
Agenda
What is Zero-Shot Learning?
Why Zero-Shot Learning?
- Conventional classifiers need a new dataset, a new output head, and fine-tuning for every new set of classes
OpenAI introduced CLIP [ICML 2021]
Contrastive Learning
A photo of a dog
painting by numbers
knitting a scarf
A photo of a raccoon
Zero-shot image classification
Text encoder: Bag of Words or Transformer
Figure labels: ×W1, ×W2
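A minimal sketch of the zero-shot classification step, assuming the image and the class prompts have already been turned into embeddings by the two encoders. The toy vectors, the helper names `l2_normalize` and `zero_shot_classify`, and the temperature value are illustrative assumptions, not CLIP's actual API.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Return (predicted class index, class probabilities) for one image embedding."""
    image_emb = l2_normalize(image_emb)
    class_text_embs = l2_normalize(class_text_embs)
    logits = class_text_embs @ image_emb / temperature  # one logit per class
    logits = logits - logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()       # softmax over classes
    return int(np.argmax(probs)), probs

# Toy stand-ins for encoder outputs (hypothetical values).
class_names = ["dog", "cat", "raccoon"]
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.0, 1.0, 0.1],
                      [0.1, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])

pred, probs = zero_shot_classify(image_emb, text_embs)
print(class_names[pred], probs.round(3))
```

No classifier is trained here: adding a class only means adding one more row of text embeddings.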
CLIP: Contrastive Language-Image Pre-training
Dataset
400 million image/text pairs collected
Architecture
ResNet-50 & ViT-B
Training
Maximize the cosine similarity of the N real pairs
Minimize the cosine similarity of the N^2 − N incorrect pairings
Batch size 32,768
32 epochs
12 days to train with 256 V100 GPUs
Half-precision Adam
Similarity matrix per batch: 32,768 × 32,768
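The training objective above can be sketched as a symmetric cross-entropy over the N × N cosine-similarity matrix, where the N real pairs sit on the diagonal. This is a NumPy sketch on random toy embeddings, not the training code; the function names, the temperature value, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = image_embs @ text_embs.T / temperature      # shape (N, N)
    diag = np.arange(logits.shape[0])                     # real pairs are on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
n, d = 8, 16
text = rng.normal(size=(n, d))
aligned_images = text + 0.01 * rng.normal(size=(n, d))   # image i matches text i
shuffled_images = aligned_images[::-1]                   # n is even, so no row keeps its partner

loss_aligned = contrastive_loss(aligned_images, text)
loss_shuffled = contrastive_loss(shuffled_images, text)
print(loss_aligned < loss_shuffled)

# Number of incorrect pairings in one batch of 32,768.
batch = 32_768
incorrect_pairs = batch**2 - batch
print(incorrect_pairs)  # 1073709056
```

With a batch of 32,768, each step contrasts the 32,768 real pairs against roughly a billion incorrect pairings.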
Contrastive Learning Helps
Baseline 1: predict exact caption
Baseline 2: predict Bag of Words
CLIP: Bag of Words + Contrastive
Baselines 1 and 2 vs. CLIP
Zero-shot CLIP is much more robust
Use CLIP for your Datasets
Logistic regression classifier on image features
- L-BFGS
- Only one hyperparameter “C”
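A minimal linear-probe sketch, assuming scikit-learn is available and that image features have already been extracted from a frozen encoder. The Gaussian blobs stand in for those features and are purely hypothetical; `C=1.0` is a placeholder that would normally be tuned on a validation split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins for frozen image features: two Gaussian blobs.
n_per_class, dim = 100, 32
features = np.vstack([
    rng.normal(loc=-1.0, size=(n_per_class, dim)),
    rng.normal(loc=+1.0, size=(n_per_class, dim)),
])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split into train and test sets.
order = rng.permutation(len(labels))
features, labels = features[order], labels[order]
train_x, test_x = features[:150], features[150:]
train_y, test_y = labels[:150], labels[150:]

# C is the inverse regularization strength, the single hyperparameter to tune.
probe = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
probe.fit(train_x, train_y)
accuracy = probe.score(test_x, test_y)
print(f"linear-probe accuracy: {accuracy:.2f}")
```

Only the logistic regression weights are fitted; the features themselves are never updated.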
Use CLIP for your Datasets
Evaluated on 27 image datasets × 65 vision models
satellite images, car models, medical images, city classification, rendered text, aircraft, birds, memes
Use CLIP for your Datasets
Linear Probe on ResNet: ResNet → Linear Model → Label
Zero-Shot Learning with CLIP
Only the logistic regression classifier is trained; the pretrained features stay frozen
CLIP’s features are more robust to task shift
More robust than models pre-trained on ImageNet.
Linear probes trained on the representations of CLIP models achieve higher transfer scores than probes trained on other pretrained models.
Representations of models (like ResNet) trained on ImageNet somewhat overfit to their task.
Zero-shot CLIP performs really well
Prompt Engineering
Performance is sensitive to how the class is described
Contextless Label: cat
Templates:
“A photo of a {label}, a type of {meta-label}”
“A photo of a {label}”
“A photo of a {big/small} {label}”
Ensemble:
Average the text embeddings over these prompts
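The ensemble step can be sketched as follows: fill each template with the label, embed every prompt, and average the embeddings into one class vector. `toy_text_encoder` is a hypothetical stand-in that returns a deterministic pseudo-random vector per string; a real text encoder would take its place.

```python
import zlib
import numpy as np

TEMPLATES = [
    "A photo of a {label}",
    "A photo of a big {label}",
    "A photo of a small {label}",
]

def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for a text encoder: one deterministic vector per string."""
    seed = zlib.crc32(text.encode("utf-8"))
    return np.random.default_rng(seed).normal(size=dim)

def ensemble_class_embedding(label, encoder=toy_text_encoder):
    """Average the unit-normalized prompt embeddings, then re-normalize the mean."""
    embs = np.stack([encoder(t.format(label=label)) for t in TEMPLATES])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

prompts = [t.format(label="cat") for t in TEMPLATES]
cat_embedding = ensemble_class_embedding("cat")
print(prompts)
print(cat_embedding.shape)
```

Because the averaging happens in embedding space, the ensemble costs a single vector per class at inference time, however many templates are used.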
Limitations
Struggles on more abstract or systematic tasks
counting the number of objects
predicting how close the nearest car is in a photo
Poor generalization to images not covered in its pre-training dataset
On the MNIST dataset, zero-shot CLIP achieves only 88% accuracy
Very sensitive to prompt engineering
“A photo of a {label}”
Limitations
Easy to attack
Applications of CLIP
StyleCLIP
(Patashnik et al.)
Steering a GAN Using CLIP
CLIP4Clip
(Luo, Ji, et al.)
Video retrieval using CLIP features
Text-based image generation using CLIP
“A banquet hall”
“Geoffrey Hinton”
“Dogs playing poker”
Zero-shot tracking
Extract proposals using a generic proposal network
Conclusion
Overview
It is based on learning from image captions
Contrastive learning formulation is powerful
Strengths:
Strong Zero-Shot performance
Robust against different types of augmentations
Limitations:
Sensitive to Prompt Engineering
Poor generalization for new types of images and tasks
Similarity-based learning