1 of 41

Demystifying Self-Supervised Learning for Visual Recognition

Sayak Paul (@RisingSayak)

SciPy Japan 2020

2 of 41

$whoami

  • I call model.fit() @ PyImageSearch
  • I contribute to TensorFlow Hub
  • Netflix Nerd 👀
  • My coordinates are here - sayak.dev

3 of 41

Agenda

  • Representation learning
  • Self-supervised learning for computer vision
  • Some notable self-supervised frameworks (MoCo, SimCLR, SwAV)
  • A recipe for regular deep learning tasks in computer vision
  • Self-training as opposed to self-supervision
  • Q&A

4 of 41

Representation learning

  • Introduction
  • Success
  • Challenges

5 of 41

Representation learning ftw!

Training models to learn representations for tasks like image classification, object detection, semantic segmentation, and so on.

Large pool of data → Train a model to learn representations → Extract the learned representations → Use the representations in a downstream task

6 of 41

The unreasonable success of supervised representation learning

Source: EfficientNet; Tan et al. (2019)

7 of 41

But ...

Often, this success is constrained by

  • The amount of labeled data for pre-training

Source: Big Transfer (BiT); Kolesnikov et al. (2020)

8 of 41

But ...

Often, this success is constrained by

  • The amount of labeled data for pre-training

  • Length of training time

Source: Big Transfer (BiT); Kolesnikov et al. (2020)

9 of 41

But ...

Gathering large amounts of labeled data

  • Is costly
  • Can be error-prone

10 of 41

Self-supervised learning - Intro

  • Constructing a supervised signal from unlabeled data
  • Training models on that signal
  • Using representations learned by these models for downstream tasks
  • These supervised formulations are known as pretext tasks

11 of 41

Pretext task - Examples

  • Predicting the next word from a sequence of words (sounds familiar?)
  • Predicting a masked word in a sentence (sounds familiar?)
  • Predicting the angle of rotation of images (see the sketch after this list)
  • Predicting the next frame in a video
  • Filling out missing pixels in images (image inpainting)
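
A minimal sketch of how the rotation pretext task could turn unlabeled images into a supervised signal, assuming TensorFlow and a tf.data pipeline of image tensors; the names and batch size are illustrative:

```python
import tensorflow as tf

def make_rotation_example(image):
    """Turn one unlabeled image into a (rotated image, rotation label) pair."""
    # Pick one of four rotations: 0, 90, 180, or 270 degrees.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    rotated = tf.image.rot90(image, k=k)
    return rotated, k  # k acts as the pseudo-label

# `unlabeled_ds` is assumed to be a tf.data.Dataset of image tensors.
# pretext_ds = unlabeled_ds.map(make_rotation_example).batch(256)
# A standard classifier trained on pretext_ds learns the pretext objective.
```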

12 of 41

Self-supervised learning - Adaptation

  • Self-supervised learning has been dominating NLP for a while now - Word2Vec, ULMFiT, ELMo, BERT, etc.
  • Computer vision had yet to see a similar revolution, until now ...

13 of 41

Self-supervised learning for computer vision

  • Problem formulation
  • General workflow
  • Typical loss functions

14 of 41

Problem formulation

Instilling a sense of semantic understanding in a model -

  • Predicting the angle from image rotations
  • Predicting the next frame in a video
  • Putting patches of images in the right order
  • Contrasting different views of the same image

For a visual overview, check out The Illustrated Self-Supervised Learning

15 of 41

Problem formulation

Contrasting different views of the same image works very well!

16 of 41

Problem formulation

Why does this formulation matter?

  • The model learns what makes two images visually different, e.g. a cat vs. a mountain.
  • This instills a sense of semantic understanding in the model.

In the literature, this formulation is referred to as contrastive learning.

17 of 41

General workflow

  • Start with a pretext task.
    • Contrastive learning-based paradigms work better than others.
  • Train a model (typically ResNet50) with the pretext task as the training objective.
  • Use the feature backbone to transfer to downstream tasks.

Source: GitHub repositories of SwAV and SimCLR

18 of 41

Typical loss functions

  • NT-Xent loss (sketched below)
  • InfoNCE loss
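
A minimal sketch of the NT-Xent loss, assuming a batch of paired projections where `z_a[i]` and `z_b[i]` come from two views of the same image; this is illustrative, not a reference implementation:

```python
import tensorflow as tf

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent over a batch of paired projections (one positive pair per row)."""
    batch_size = tf.shape(z_a)[0]
    # L2-normalize so dot products become cosine similarities.
    z_a = tf.math.l2_normalize(z_a, axis=1)
    z_b = tf.math.l2_normalize(z_b, axis=1)
    z = tf.concat([z_a, z_b], axis=0)                      # (2N, d)
    sim = tf.matmul(z, z, transpose_b=True) / temperature  # (2N, 2N)
    # Mask out self-similarities so they never act as logits.
    sim = sim - 1e9 * tf.eye(2 * batch_size)
    # Row i's positive sits at i + N (and vice versa).
    labels = tf.concat([tf.range(batch_size) + batch_size,
                        tf.range(batch_size)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, sim, from_logits=True)
    return tf.reduce_mean(loss)
```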

19 of 41

Some notable self-supervised frameworks in vision

  • MoCo-V2, SimCLR, SwAV
  • Evaluation

20 of 41

SimCLR (Chen et al.)

Source: SimCLR; Chen et al. (2020)

  • Form different views with data augmentation techniques (see the sketch below).
  • Contrast different views of images with the NT-Xent loss.
  • Pull together the views coming from the same image.
  • Relies on a pretty large batch size for negative samples.
  • Does not rigorously ensure that two semantically similar images end up closer in the feature space.
  • Computationally heavy.
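
A rough sketch of forming two augmented views per image in the spirit of SimCLR; it assumes 224x224x3 inputs in [0, 1], and the crop and color-distortion parameters are illustrative:

```python
import tensorflow as tf

def augment(image):
    """One stochastic view of a 224x224x3 image with values in [0, 1]."""
    image = tf.image.random_crop(image, size=(160, 160, 3))   # random crop
    image = tf.image.resize(image, (224, 224))                # back to a fixed size
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.4)  # simple color distortion
    image = tf.image.random_contrast(image, lower=0.6, upper=1.4)
    return tf.clip_by_value(image, 0.0, 1.0)

def two_views(image):
    # Two independent augmentations of the same image form a positive pair.
    # Both views then go through the encoder + projection head and into NT-Xent.
    return augment(image), augment(image)
```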

21 of 41

MoCo-V2 (Chen et al.)

Source: MoCo-V2; Chen et al. (2020)

  • Builds on top of MoCo, borrowing improvements from SimCLR.
  • To allow enough negative samples, it maintains a (large enough) queue.
  • Contrasts the representations from a query encoder and a key encoder with the InfoNCE loss.
  • The key encoder is updated with a momentum update rule to keep its representations consistent (see the sketch below).
  • Requires a large enough queue to be maintained.
  • Requires a separate momentum encoder for maintaining consistent representations.
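
A minimal sketch of the momentum (EMA) update rule mentioned above, assuming two Keras models with identical architectures for the query and key encoders:

```python
import tensorflow as tf

def momentum_update(query_encoder, key_encoder, momentum=0.999):
    """Momentum (EMA) update of the key encoder from the query encoder."""
    # Assumes both models have the same architecture, so weights line up.
    for q_w, k_w in zip(query_encoder.weights, key_encoder.weights):
        k_w.assign(momentum * k_w + (1.0 - momentum) * q_w)
```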

22 of 41

SwAV to rule’em all (Caron et al.)!

Source: SwAV; Caron et al. (2020)

  • Operates on cluster assignments instead of feature-wise comparisons.
  • Increases the number of view comparisons with multi-crop.
  • Encourages views of the same image to get mapped to the same cluster assignments via a swapped prediction problem (see the sketch below).
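
A simplified sketch of the swapped prediction loss, assuming the cluster-assignment codes `q_t` and `q_s` have already been computed (e.g., via the Sinkhorn procedure described in the SwAV paper) and `prototypes` is a (feature_dim, num_prototypes) matrix:

```python
import tensorflow as tf

def swapped_prediction_loss(z_t, z_s, q_t, q_s, prototypes, temperature=0.1):
    """Predict view s's code from view t's features, and vice versa."""
    # Scores of each view's features against the learnable prototypes.
    p_t = tf.nn.log_softmax(tf.matmul(z_t, prototypes) / temperature, axis=1)
    p_s = tf.nn.log_softmax(tf.matmul(z_s, prototypes) / temperature, axis=1)
    loss_t = -tf.reduce_mean(tf.reduce_sum(q_s * p_t, axis=1))  # predict q_s from z_t
    loss_s = -tf.reduce_mean(tf.reduce_sum(q_t * p_s, axis=1))  # predict q_t from z_s
    return 0.5 * (loss_t + loss_s)
```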

23 of 41

The frameworks are all over the place!

Base image: SwAV; Caron et al. (2020)

24 of 41

Recipes that show promise

  • Data augmentation operations - random resized crops, color distortions, Gaussian blur, horizontal flips
  • Cosine learning rate decay schedule (see the sketch after this list)
  • Relatively larger batch sizes
  • ResNet50 as the backbone
  • Contrastive formulation
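
For instance, a cosine decay schedule can be set up in Keras along these lines; the learning rate and step count are illustrative, not prescriptive:

```python
import tensorflow as tf

# Cosine decay of the learning rate, as used in most of these pre-training recipes.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.3, decay_steps=100_000)  # illustrative values
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```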

25 of 41

Evaluation - Overview

  • Linear evaluation
    • Freeze the feature backbone and train a linear classifier on top (sketched below)
  • Fine-tuning with 1% and 10% labeled data
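
A minimal sketch of the linear evaluation protocol, assuming a Keras `backbone` that outputs pooled features (e.g., ResNet50 with average pooling); hyperparameters are illustrative:

```python
import tensorflow as tf

def linear_eval_model(backbone, num_classes):
    """Linear evaluation: freeze the pre-trained backbone, train only a linear head."""
    backbone.trainable = False  # features stay fixed
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs, training=False)  # keep BatchNorm in inference mode
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```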

26 of 41

Evaluation - Numbers

Source: SwAV; Caron et al. (2020)

27 of 41

It’s not just image classification

Source: SwAV; Caron et al. (2020)

28 of 41

A recipe to consider in vision these days

  • Don’t have enough labeled data?
  • What to do with extra data?

29 of 41

Don’t have enough labeled data?

  • Gather unlabeled data. Sometimes it’s way cheaper, sometimes it’s not (healthcare).
  • Use these frameworks to capture effective representations.
  • Fine-tune with the “not-enough” labeled data for the downstream task (see the sketch below).
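
A rough sketch of fine-tuning a self-supervised backbone on a small labeled dataset; the low learning rate and other hyperparameters are illustrative:

```python
import tensorflow as tf

def fine_tune(backbone, num_classes, labeled_ds, epochs=10):
    """Fine-tune a pre-trained backbone on a small labeled dataset."""
    backbone.trainable = True  # unlike linear evaluation, the backbone is trainable
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    # A small learning rate helps avoid destroying the pre-trained representations.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(labeled_ds, epochs=epochs)
    return model
```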

30 of 41

Even if you have enough labeled data

  • Consider using self-supervised models as feature extractors (remember embeddings?); see the sketch below.
  • These frameworks are worth a try since they beat their supervised counterparts on object detection.
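
A sketch of using a pre-trained model as a feature extractor via TensorFlow Hub; the module handle below is a placeholder, not a real URL:

```python
import numpy as np
import tensorflow_hub as hub

# Placeholder handle: substitute the URL of whichever self-supervised model
# (SimCLR, SwAV, etc.) you want to pull from TF Hub.
MODULE_HANDLE = "https://tfhub.dev/<publisher>/<self-supervised-model>/1"  # hypothetical

feature_extractor = hub.KerasLayer(MODULE_HANDLE, trainable=False)
images = np.random.rand(8, 224, 224, 3).astype("float32")  # stand-in batch
embeddings = feature_extractor(images)  # use as features for retrieval, kNN, etc.
```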

31 of 41

Final thoughts

  • Challenges in self-supervised learning for vision
  • Self-training vs. self-supervised learning

32 of 41

Challenges

  • Requires a large pool of unlabeled data
  • Requires longer pre-training
  • Requires sophisticated hyperparameter tuning

33 of 41

Self-training as another consideration

Source: Noisy student training (an extension of self-training); Xie et al. (2019)
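
A simplified sketch of one self-training round in the spirit of Noisy Student: the teacher pseudo-labels the unlabeled pool and a student trains on labeled plus pseudo-labeled data. Dataset shapes and dtypes are assumed compatible, and the student's noise (augmentation, dropout, stochastic depth) is only hinted at in comments:

```python
import tensorflow as tf

def self_training_round(teacher, student, labeled_ds, unlabeled_images, epochs=5):
    """One simplified self-training round: teacher pseudo-labels, student trains."""
    # 1. Teacher assigns pseudo-labels to the unlabeled pool.
    pseudo_labels = tf.argmax(teacher.predict(unlabeled_images), axis=1)
    pseudo_ds = tf.data.Dataset.from_tensor_slices(
        (unlabeled_images, pseudo_labels)).batch(256)
    # 2. Student (usually larger and noised with augmentation/dropout in Noisy
    #    Student) trains on labeled + pseudo-labeled data. `labeled_ds` is assumed
    #    to be a batched tf.data.Dataset with a matching element structure.
    combined_ds = labeled_ds.concatenate(pseudo_ds)
    student.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
    student.fit(combined_ds, epochs=epochs)
    return student  # the student can become the next round's teacher
```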

34 of 41

Why this expansion of self-training?

  • More label efficiency

Source: Rethinking Pre-training and Self-training; Zoph et al. (2020)

35 of 41

Why this expansion of self-training?

  • More robustness

Source: Rethinking Pre-training and Self-training; Zoph et al. (2020)

36 of 41

Why this expansion of self-training?

  • More robustness

Source: Noisy student training; Xie et al. (2019)

37 of 41

Some recommended reading

  • Self-supervised visual feature learning with deep neural networks: A survey; Jing et al.
  • PIRL; Misra et al.
  • MoCo; He et al.
  • SimCLR; Chen et al.
  • SwAV; Caron et al.

38 of 41

Minimal implementations

39 of 41

40 of 41

Deck available here: bit.ly/scipy-sp

41 of 41


Let’s get connected on Twitter! I am @RisingSayak.