1 of 41

Demystifying Self-Supervised Learning for Visual Recognition

Sayak Paul (@RisingSayak)

SciPy Japan 2020

2 of 41

$whoami

  • I call model.fit() @ PyImageSearch
  • I contribute to TensorFlow Hub
  • Netflix Nerd 👀
  • My coordinates are here - sayak.dev

3 of 41

Agenda

  • Representation learning
  • Self-supervised learning for computer vision
  • Some notable self-supervised frameworks (MoCo, SimCLR, SwAV)
  • A recipe for regular deep learning tasks in computer vision
  • Self-training as opposed to self-supervision
  • Q&A

4 of 41

Representation learning

  • Introduction
  • Success
  • Challenges

5 of 41

Representation learning ftw!

Training models to learn representations for tasks like image classification, object detection, semantic segmentation, and so on.

Large pool of data → Train a model to learn representations → Extract the learned representations → Use the representations in a downstream task

6 of 41

The unreasonable success of supervised representation learning

Source: EfficientNet; Tan et al. (2019)

7 of 41

But ...

Often, this success is constrained by

  • The amount of labeled data for pre-training

Source: Big Transfer (BiT); Kolesnikov et al. (2020)

8 of 41

But ...

Often, this success is constrained by

  • The amount of labeled data for pre-training

  • Length of training time

Source: Big Transfer (BiT); Kolesnikov et al. (2020)

9 of 41

But ...

Gathering large amounts of labeled data

  • Is costly
  • Can be error-prone

10 of 41

Self-supervised learning - Intro

  • Constructing a supervised signal from unlabeled data
  • Training models on that signal
  • Using representations learned by these models for downstream tasks
  • These supervised formulations are known as pretext tasks

11 of 41

Pretext task - Examples

  • Predicting the next word from a sequence of words (sounds familiar?)
  • Predicting a masked word in a sentence (sounds familiar?)
  • Predicting the angle of rotation of images (see the sketch after this list)
  • Predicting the next frame in a video
  • Filling out missing pixels in images (image inpainting)
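
A minimal sketch of how the rotation pretext task could turn unlabeled images into a supervised signal, assuming TensorFlow and a tf.data pipeline of image tensors; the names and batch size are illustrative:

```python
import tensorflow as tf

def make_rotation_example(image):
    """Turn one unlabeled image into a (rotated image, rotation label) pair."""
    # Pick one of four rotations: 0, 90, 180, or 270 degrees.
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    rotated = tf.image.rot90(image, k=k)
    return rotated, k  # k acts as the pseudo-label

# `unlabeled_ds` is assumed to be a tf.data.Dataset of image tensors.
# pretext_ds = unlabeled_ds.map(make_rotation_example).batch(256)
# A standard classifier trained on pretext_ds learns the pretext objective.
```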

12 of 41

Self-supervised learning - Adaptation

  • Self-supervised learning has been dominating NLP for a while now - Word2Vec, ULMFiT, ELMo, BERT, etc.
  • Computer vision had yet to see a similar revolution, until now ...

13 of 41

Self-supervised learning for computer vision

  • Problem formulation
  • General workflow
  • Typical loss functions

14 of 41

Problem formulation

Instilling a sense of semantic understanding in a model -

  • Predicting the angle from image rotations
  • Predicting the next frame in a video
  • Putting patches of images in the right order
  • Contrasting different views of the same image

For a visual overview, check out The Illustrated Self-Supervised Learning

15 of 41

Problem formulation

Contrasting different views of the same image works very well!

16 of 41

Problem formulation

Why does this formulation matter?

  • The model learns what makes two images visually different, e.g. a cat vs. a mountain.
  • This instills a sense of semantic understanding in the model.

In the literature, this formulation is referred to as contrastive learning.

17 of 41

General workflow

  • Start with a pretext task.
    • Contrastive learning-based paradigms work better than others.
  • Train a model (typically ResNet50) with the pretext task as the training objective.
  • Use the feature backbone to transfer to downstream tasks.

Source: GitHub repositories of SwAV and SimCLR

18 of 41

Typical loss functions

  • NT-Xent loss (sketched below)
  • InfoNCE loss
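
A minimal sketch of the NT-Xent loss, assuming a batch of paired projections where `z_a[i]` and `z_b[i]` come from two views of the same image; this is illustrative, not a reference implementation:

```python
import tensorflow as tf

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent over a batch of paired projections (one positive pair per row)."""
    batch_size = tf.shape(z_a)[0]
    # L2-normalize so dot products become cosine similarities.
    z_a = tf.math.l2_normalize(z_a, axis=1)
    z_b = tf.math.l2_normalize(z_b, axis=1)
    z = tf.concat([z_a, z_b], axis=0)                      # (2N, d)
    sim = tf.matmul(z, z, transpose_b=True) / temperature  # (2N, 2N)
    # Mask out self-similarities so they never act as logits.
    sim = sim - 1e9 * tf.eye(2 * batch_size)
    # Row i's positive sits at i + N (and vice versa).
    labels = tf.concat([tf.range(batch_size) + batch_size,
                        tf.range(batch_size)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, sim, from_logits=True)
    return tf.reduce_mean(loss)
```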

19 of 41

Some notable self-supervised frameworks in vision

  • MoCo-V2, SimCLR, SwAV
  • Evaluation

20 of 41

SimCLR (Chen et al.)

Source: SimCLR; Chen et al. (2020)

  • Form different views with data augmentation techniques (see the sketch below).
  • Contrast different views of images with the NT-Xent loss.
  • Pull together the views coming from the same image.
  • Relies on a pretty large batch size for negative samples.
  • Does not rigorously ensure that two semantically similar images end up closer in the feature space.
  • Computationally heavy.
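
A rough sketch of forming two augmented views per image in the spirit of SimCLR; it assumes 224x224x3 inputs in [0, 1], and the crop and color-distortion parameters are illustrative:

```python
import tensorflow as tf

def augment(image):
    """One stochastic view of a 224x224x3 image with values in [0, 1]."""
    image = tf.image.random_crop(image, size=(160, 160, 3))   # random crop
    image = tf.image.resize(image, (224, 224))                # back to a fixed size
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.4)  # simple color distortion
    image = tf.image.random_contrast(image, lower=0.6, upper=1.4)
    return tf.clip_by_value(image, 0.0, 1.0)

def two_views(image):
    # Two independent augmentations of the same image form a positive pair.
    # Both views then go through the encoder + projection head and into NT-Xent.
    return augment(image), augment(image)
```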

21 of 41

MoCo-V2 (Chen et al.)

Source: MoCo-V2; Chen et al. (2020)

  • Builds on top of MoCo, borrowing improvements from SimCLR.
  • To allow enough negative samples, it maintains a (large enough) queue.
  • Contrasts the representations from a query encoder and a key encoder with the InfoNCE loss.
  • The key encoder is updated with a momentum update rule to keep its representations consistent (see the sketch below).
  • Requires a large enough queue to be maintained.
  • Requires a separate momentum encoder for maintaining consistent representations.
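
A minimal sketch of the momentum (EMA) update rule mentioned above, assuming two Keras models with identical architectures for the query and key encoders:

```python
import tensorflow as tf

def momentum_update(query_encoder, key_encoder, momentum=0.999):
    """Momentum (EMA) update of the key encoder from the query encoder."""
    # Assumes both models have the same architecture, so weights line up.
    for q_w, k_w in zip(query_encoder.weights, key_encoder.weights):
        k_w.assign(momentum * k_w + (1.0 - momentum) * q_w)
```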

22 of 41

SwAV to rule’em all (Caron et al.)!

Source: SwAV; Caron et al. (2020)

  • Operates on cluster assignments instead of feature-wise comparisons.
  • Increases the number of view comparisons with multi-crop.
  • Encourages views of the same image to get mapped to the same cluster assignments via a swapped prediction problem (see the sketch below).
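
A simplified sketch of the swapped prediction loss, assuming the cluster-assignment codes `q_t` and `q_s` have already been computed (e.g., via the Sinkhorn procedure described in the SwAV paper) and `prototypes` is a (feature_dim, num_prototypes) matrix:

```python
import tensorflow as tf

def swapped_prediction_loss(z_t, z_s, q_t, q_s, prototypes, temperature=0.1):
    """Predict view s's code from view t's features, and vice versa."""
    # Scores of each view's features against the learnable prototypes.
    p_t = tf.nn.log_softmax(tf.matmul(z_t, prototypes) / temperature, axis=1)
    p_s = tf.nn.log_softmax(tf.matmul(z_s, prototypes) / temperature, axis=1)
    loss_t = -tf.reduce_mean(tf.reduce_sum(q_s * p_t, axis=1))  # predict q_s from z_t
    loss_s = -tf.reduce_mean(tf.reduce_sum(q_t * p_s, axis=1))  # predict q_t from z_s
    return 0.5 * (loss_t + loss_s)
```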

23 of 41

The frameworks are all over the place!

Base image: SwAV; Caron et al. (2020)

24 of 41

Recipes that show promise

  • Data augmentation operations - random resized crops, color distortions, Gaussian blur, horizontal flips
  • Cosine learning rate decay schedule (see the sketch after this list)
  • Relatively larger batch sizes
  • ResNet50 as the backbone
  • Contrastive formulation
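
For instance, a cosine decay schedule can be set up in Keras along these lines; the learning rate and step count are illustrative, not prescriptive:

```python
import tensorflow as tf

# Cosine decay of the learning rate, as used in most of these pre-training recipes.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.3, decay_steps=100_000)  # illustrative values
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```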

25 of 41

Evaluation - Overview

  • Linear evaluation
    • Freeze the feature backbone and train a linear classifier on top (sketched below)
  • Fine-tuning with 1% and 10% labeled data
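
A minimal sketch of the linear evaluation protocol, assuming a Keras `backbone` that outputs pooled features (e.g., ResNet50 with average pooling); hyperparameters are illustrative:

```python
import tensorflow as tf

def linear_eval_model(backbone, num_classes):
    """Linear evaluation: freeze the pre-trained backbone, train only a linear head."""
    backbone.trainable = False  # features stay fixed
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs, training=False)  # keep BatchNorm in inference mode
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```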

26 of 41

Evaluation - Numbers

Source: SwAV; Caron et al. (2020)

27 of 41

It’s not just image classification

Source: SwAV; Caron et al. (2020)

28 of 41

A recipe to consider in vision these days

  • Don’t have enough labeled data?
  • What to do with extra data?

29 of 41

Don’t have enough labeled data?

  • Gather unlabeled data. Sometimes it’s way cheaper, sometimes it’s not (healthcare).
  • Use these frameworks to capture effective representations.
  • Fine-tune with the “not-enough” labeled data for the downstream task (see the sketch below).
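
A rough sketch of fine-tuning a self-supervised backbone on a small labeled dataset; the low learning rate and other hyperparameters are illustrative:

```python
import tensorflow as tf

def fine_tune(backbone, num_classes, labeled_ds, epochs=10):
    """Fine-tune a pre-trained backbone on a small labeled dataset."""
    backbone.trainable = True  # unlike linear evaluation, the backbone is trainable
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(features)
    model = tf.keras.Model(inputs, outputs)
    # A small learning rate helps avoid destroying the pre-trained representations.
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(labeled_ds, epochs=epochs)
    return model
```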

30 of 41

Even if you have enough labeled data

  • Consider using self-supervised models as feature extractors (remember embeddings?); see the sketch below.
  • These frameworks are worth a try since they beat their supervised counterparts on object detection.
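
A sketch of using a pre-trained model as a feature extractor via TensorFlow Hub; the module handle below is a placeholder, not a real URL:

```python
import numpy as np
import tensorflow_hub as hub

# Placeholder handle: substitute the URL of whichever self-supervised model
# (SimCLR, SwAV, etc.) you want to pull from TF Hub.
MODULE_HANDLE = "https://tfhub.dev/<publisher>/<self-supervised-model>/1"  # hypothetical

feature_extractor = hub.KerasLayer(MODULE_HANDLE, trainable=False)
images = np.random.rand(8, 224, 224, 3).astype("float32")  # stand-in batch
embeddings = feature_extractor(images)  # use as features for retrieval, kNN, etc.
```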

31 of 41

Final thoughts

  • Challenges in self-supervised learning for vision
  • Self-training vs. self-supervised learning

32 of 41

Challenges

  • Requires a large pool of unlabeled data
  • Requires longer pre-training
  • Requires sophisticated hyperparameter tuning

33 of 41

Self-training as another consideration

Source: Noisy student training (an extension of self-training); Xie et al. (2019)
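
A simplified sketch of one self-training round in the spirit of Noisy Student: the teacher pseudo-labels the unlabeled pool and a student trains on labeled plus pseudo-labeled data. Dataset shapes and dtypes are assumed compatible, and the student's noise (augmentation, dropout, stochastic depth) is only hinted at in comments:

```python
import tensorflow as tf

def self_training_round(teacher, student, labeled_ds, unlabeled_images, epochs=5):
    """One simplified self-training round: teacher pseudo-labels, student trains."""
    # 1. Teacher assigns pseudo-labels to the unlabeled pool.
    pseudo_labels = tf.argmax(teacher.predict(unlabeled_images), axis=1)
    pseudo_ds = tf.data.Dataset.from_tensor_slices(
        (unlabeled_images, pseudo_labels)).batch(256)
    # 2. Student (usually larger and noised with augmentation/dropout in Noisy
    #    Student) trains on labeled + pseudo-labeled data. `labeled_ds` is assumed
    #    to be a batched tf.data.Dataset with a matching element structure.
    combined_ds = labeled_ds.concatenate(pseudo_ds)
    student.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
    student.fit(combined_ds, epochs=epochs)
    return student  # the student can become the next round's teacher
```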

34 of 41

Why this expansion of self-training?

  • More label efficiency

Source: Rethinking Pre-training and Self-training; Zoph et al. (2020)

35 of 41

Why this expansion of self-training?

  • More robustness

Source: Rethinking Pre-training and Self-training; Zoph et al. (2020)

36 of 41

Why this expansion of self-training?

  • More robustness

Source: Noisy student training; Xie et al. (2019)

37 of 41

Some recommended reading

  • Self-supervised visual feature learning with deep neural networks: A survey; Jing et al.
  • PIRL; Misra et al.
  • MoCo; He et al.
  • SimCLR; Chen et al.
  • SwAV; Caron et al.

38 of 41

Minimal implementations

39 of 41

40 of 41

Deck available here: bit.ly/scipy-sp

41 of 41


Let’s get connected on Twitter! I am @RisingSayak.