
Learning Transferable Visual Models From Natural Language Supervision

Issam Laradji

UBC Alumni

Senior Research Scientist at ServiceNow



Agenda

  • What is Zero-Shot Learning?

  • How does CLIP work?

  • What is Contrastive Learning?

  • Does Contrastive Learning Help?

  • Robustness of CLIP

  • Limitations and Applications of CLIP


What is Zero-Shot Learning?

  • Recognizing categories that the model has never seen during training


Why Zero-Shot Learning?

  • Costly datasets
    • ImageNet required 25,000 workers to annotate 14 million images across 22,000 object categories.

  • Fully supervised models learn only a narrow set of classes
    • An ImageNet model is good at predicting the 1,000 ImageNet categories, and that is all it can do “out of the box.”
    • New classes require new datasets, new output heads, and fine-tuning.

  • A zero-shot model can perform a wide variety of tasks “out of the box.”

  • Poor real-world performance
    • In 2015, Microsoft trained a model that surpassed human top-5 accuracy on ImageNet.
    • In the wild, its performance is far lower.
    • Zero-shot models seem more robust.


OpenAI introduced CLIP [ICML 2021]

  • Learns visual concepts from natural language supervision

  • Authors:
    • Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever


OpenAI introduced CLIP [ICML 2021]

  • Works on any visual classification benchmark
    • Just provide the names of the categories

  • Similar to the “zero-shot” capabilities of GPT-2 and GPT-3


Contrastive Learning

  • The resulting embeddings should
    • minimize the distance between positive examples and maximize it between negative examples
    • the distance metric is usually Euclidean
    • this yields high-quality representations


[Figure: images paired with example captions such as “A photo of a dog”, “painting by numbers”, “knitting a scarf”, and “a photo of a raccoon”.]
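A minimal sketch of this idea in PyTorch (the function, margin value, and toy data below are illustrative, not from the talk): positive pairs are pulled together, and negative pairs are pushed at least a margin apart, using Euclidean distance.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(z1, z2, is_positive, margin=1.0):
        """Margin-based contrastive loss on paired embeddings.

        z1, z2      : (batch, dim) embedding tensors
        is_positive : (batch,) float tensor, 1.0 for positive pairs, 0.0 for negatives
        """
        # Euclidean distance between the two embeddings of each pair
        dist = F.pairwise_distance(z1, z2)
        # Positives: minimize the distance. Negatives: push apart up to the margin.
        loss = is_positive * dist.pow(2) + (1 - is_positive) * F.relu(margin - dist).pow(2)
        return loss.mean()

    # Toy usage with random embeddings
    z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
    labels = torch.randint(0, 2, (8,)).float()
    print(contrastive_loss(z1, z2, labels))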


Zero-shot image classification


[Figure: the zero-shot classification pipeline. The text encoder is either a bag-of-words model or a Transformer; learned projection matrices W1 and W2 map the text and image embeddings into the shared space.]

CLIP: Contrastive Language-Image Pre-training

Dataset
  • 400 million image/text pairs collected from the internet

Architecture
  • ResNet-50 and ViT-B image encoders

Training
  • Maximize the cosine similarity of the N real pairs
  • Minimize the cosine similarity of the N² − N incorrect pairings
  • Batch size 32,768, so the similarity matrix is 32,768 × 32,768
  • 32 epochs
  • 12 days to train on 256 V100 GPUs
  • Adam with half-precision (mixed-precision) training
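A sketch of this objective in PyTorch, adapted from the numpy-style pseudocode in the CLIP paper (the fixed temperature and the toy inputs are simplifications; CLIP learns the temperature): compute the N × N matrix of cosine similarities for a batch of N pairs and apply a symmetric cross-entropy so that each image matches its own caption and vice versa.

    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric contrastive loss over a batch of N matched image/text embeddings."""
        # L2-normalize so the dot product is cosine similarity
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (N, N) similarity matrix; entry (i, j) compares image i with text j
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0))  # the correct pairing is the diagonal

        # Cross-entropy over rows (image -> text) and columns (text -> image)
        loss_i = F.cross_entropy(logits, targets)
        loss_t = F.cross_entropy(logits.t(), targets)
        return (loss_i + loss_t) / 2

    # Toy usage: 8 matched pairs of 512-d embeddings
    print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))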


Contrastive Learning Helps

  • Baseline 1: predict the exact caption
  • Baseline 2: predict a bag of words from the caption
  • CLIP: bag of words + contrastive objective

[Figure: zero-shot ImageNet accuracy over training, CLIP vs. Baselines 1 and 2; the contrastive objective is substantially more data-efficient.]


Zero-shot CLIP is much more robust


Use CLIP for your Datasets

Train a logistic regression classifier on CLIP's image features (a linear probe)

  • optimized with L-BFGS
  • only one hyperparameter, the regularization strength “C”
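A sketch of that linear probe with scikit-learn (the feature files are placeholders and the value of C shown here is illustrative; the paper selects C with a hyperparameter sweep):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # train/test features: CLIP image embeddings of shape (n_samples, dim)
    # train/test labels  : integer class labels
    train_features = np.load("train_features.npy")   # illustrative file names
    train_labels = np.load("train_labels.npy")
    test_features = np.load("test_features.npy")
    test_labels = np.load("test_labels.npy")

    # scikit-learn's default solver is L-BFGS; C is the only hyperparameter tuned
    clf = LogisticRegression(C=0.316, max_iter=1000)
    clf.fit(train_features, train_labels)
    print("accuracy:", clf.score(test_features, test_labels))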


Evaluated on 27 image datasets × 65 vision models

  • satellite images, car models, medical images, city classification, rendered text, aircraft, birds, memes


[Figure: linear probe on ResNet (ResNet features → linear model → label) vs. zero-shot learning with CLIP.]


The linear probe trains only the logistic regression layer; the CLIP features stay frozen


CLIP’s features are more robust to task shift

More robust than models pre-trained on ImageNet.

Transfer scores of linear probes trained on CLIP's representations are higher than those of other pre-trained models.

Representations of models (like ResNet) trained on ImageNet somewhat overfit to that task.


Zero-shot CLIP performs very well


Prompt Engineering

Performance is sensitive to how the class is described.

Context-free label: “cat”

Templates:

  • “A photo of a {label}, a type of {meta-label}”
  • “A photo of a {label}”
  • “A photo of a {big/small} {label}”

Ensemble:

  • Average the text embeddings over these prompts
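A sketch of prompt ensembling with the Hugging Face port of CLIP (the checkpoint name and the templates are illustrative): encode the same label with several templates, average the normalized text embeddings, and use the averaged vector as that class's zero-shot classifier weight.

    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    templates = [
        "a photo of a {}",
        "a photo of a small {}",
        "a photo of a big {}",
    ]

    def class_embedding(label):
        prompts = [t.format(label) for t in templates]
        inputs = processor(text=prompts, return_tensors="pt", padding=True)
        with torch.no_grad():
            emb = model.get_text_features(**inputs)    # (num_templates, dim)
        emb = F.normalize(emb, dim=-1).mean(dim=0)     # average over templates
        return F.normalize(emb, dim=-1)                # re-normalize the ensemble

    # One ensembled text embedding per class
    weights = torch.stack([class_embedding(c) for c in ["cat", "dog"]])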


Limitations

Struggles on more abstract or systematic tasks

  • counting the number of objects in an image
  • predicting how close the nearest car is in a photo

Poor generalization to images not covered in its pre-training dataset

  • On MNIST, zero-shot CLIP achieves only 88% accuracy

Very sensitive to prompt engineering

  • e.g., “A photo of a {label}”


Easy to attack, e.g., with typographic attacks where text written on an object changes the prediction


Applications of CLIP

  • StyleCLIP (Patashnik et al.): steering a GAN using CLIP

  • CLIP4Clip (Luo et al.): video retrieval using CLIP features


Text-based image generation using CLIP

Example prompts: “A banquet hall”, “Geoffrey Hinton”, “Dogs playing poker”


Zero-shot tracking

Extract proposals using a generic proposal network
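A sketch of the matching step with the Hugging Face port of CLIP (the proposal boxes, frame file, checkpoint name, and query text are all illustrative, and the proposal network itself is assumed to be external): crop each proposal, embed the crops with CLIP's image encoder, and keep the crop most similar to the text query.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("frame.jpg")
    boxes = [(10, 20, 110, 140), (200, 40, 330, 210)]   # (left, top, right, bottom) from a proposal network
    crops = [image.crop(box) for box in boxes]

    inputs = processor(text=["a person wearing a red jacket"], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_text: similarity of the query against every proposal crop
    scores = outputs.logits_per_text.softmax(dim=-1)[0]
    best = int(scores.argmax())
    print("best proposal:", boxes[best], "score:", scores[best].item())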


Conclusion

Overview:

  • CLIP builds on the idea of learning from image captions (natural language supervision)
  • The contrastive learning formulation is powerful

Strengths:

  • Strong zero-shot performance
  • Robust to different types of distribution shift

Limitations:

  • Sensitive to prompt engineering
  • Poor generalization to new types of images and tasks
  • Similarity-based learning: CLIP selects among provided labels rather than generating descriptions
