1 of 35

Contrastive Pre-training for unsupervised representation learning

Oğuz Şerbetçi

2 of 35

Why Representation Learning

Better representations:

→ Faster, easier and more data-efficient machine learning

→ Transfer learning

3 of 35

Deep Learning = Representation Learning

Learn a supervised image classifier with labeled data.

→ Transfer learning: use the activations as representations for another task or dataset.

(Figure: an image classifier predicting "dog".)

4 of 35

Deep Learning = Representation Learning

5 of 35

Unsupervised Representation Learning

with Generative Models

Generative models

! Drawback: reconstructing every detail of the data is usually required.

6 of 35

Unsupervised Representation Learning

with Pretext Tasks

Word2Vec and recent attention-based methods use raw text and a pretext task to learn representations.

7 of 35

Unsupervised Representation Learning

with Pretext Tasks

Image colourization

Jigsaw puzzle

8 of 35

Supervised vs Unsupervised

Supervised: labels

Unsupervised: context

9 of 35

Representation learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, Oriol Vinyals

10 of 35

Predictive Coding

11 of 35

Contrastive Predictive Coding

  • Compress the high-dimensional input into a compact latent embedding space.
  • An autoregressive model predicts future steps in this embedding space (see the notation below).
  • A scoring function and negative sampling form the loss function.
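
In the CPC paper's notation, an encoder g_enc compresses each input x_t into a latent z_t, and an autoregressive model g_ar summarizes the past latents into a context c_t:

  z_t = g_enc(x_t)
  c_t = g_ar(z_{≤t})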

12 of 35

  • We want to maximize the mutual information between the input and the context.
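
For reference, the mutual information between a future observation x and the context c, as written in the CPC paper:

  I(x; c) = Σ_{x,c} p(x, c) log [ p(x | c) / p(x) ]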

13 of 35

  • With Monte Carlo sampling over negative examples, we can maximize a lower bound on this mutual information.

14 of 35

Contrastive Predictive Coding

  • Let's have a scoring function f_k for the observation k steps in the future.
  • This score is the exponential of the dot product between the observed embedding and the model's predicted embedding for step k.
  • Finally, we optimize this scoring function with negative sampling, giving the loss written out below.
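
Written out in the paper's notation, where W_k c_t is the linear prediction for step k and X = {x_1, …, x_N} contains one positive and N − 1 negative samples:

  f_k(x_{t+k}, c_t) = exp( z_{t+k}^T W_k c_t )

  L_N = − E_X [ log ( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]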

15 of 35

InfoNCE Loss

  • The mini-batch is made up of one correct sample and a set of negative samples.
  • Intuition behind NCE:
    • Learn a model that can discriminate ("contrast") the data distribution P from the noise distribution Q.
    • Does away with the normalization constant of the distribution!
  • More negative samples → a better estimate of the distribution (see the bound below)
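
The paper also ties this back to mutual information: minimizing the InfoNCE loss L_N maximizes a lower bound that tightens as the number of samples N grows:

  I(x_{t+k}; c_t) ≥ log(N) − L_N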

16 of 35

Experiments in 4 domains

  • Audio, Image, Text
    • Unsupervised pre-training + a linear supervised classifier
  • State representations for RL
    • Contrastive objective added as an auxiliary loss to the RL training loss

17 of 35

Image

Every row shows image patches that activate a certain neuron in the CPC architecture.

The context model is a PixelCNN.

18 of 35

Image

  • Improves on other unsupervised methods such as colourization, jigsaw puzzles, and GANs

19 of 35

Audio

  • The selection of negative samples is important: negatives drawn from the same speaker perform best.
  • Of the 2, 4, 8, 12 and 16 prediction steps tried, 12 worked best (8 and 16 are close).
  • Phone classification performance is close to the supervised baseline when using an MLP classifier.
    • Is the model not good enough to surface linearly separable features, or is this task inherently more complex than images?

20 of 35

Text

  • Encoder (1D convolution + ReLU + mean-pooling) produces the sentence embedding (a minimal sketch follows below).
  • Decoder (GRU) to predict up to 3 future sentences.
    • No word-level generation!
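
A minimal PyTorch sketch of such a sentence encoder; the vocabulary size, embedding dimension and hidden dimension are illustrative assumptions, not the paper's values:

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    # 1D convolution + ReLU + mean-pooling over tokens, as described on the slide.
    # vocab_size, embed_dim, hidden_dim and kernel_size are illustrative choices.
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, token_ids):          # token_ids: (batch, seq_len) of token indices
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))       # (batch, hidden_dim, seq_len)
        return x.mean(dim=2)               # mean-pool over tokens -> sentence embedding

# Example usage: embed a batch of two 7-token "sentences".
encoder = SentenceEncoder()
embeddings = encoder(torch.randint(0, 10000, (2, 7)))   # shape: (2, 256)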

21 of 35

Reinforcement Learning

  • A2C with an auxiliary contrastive loss from predictions up to 30 steps ahead (see the sketch below).
  • Only a linear prediction head is added to the baseline architecture. → a simple addition

No need for memory: a reactive policy solves the task.
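
As a sketch, the auxiliary objective is combined with the RL objective as a weighted sum; the weight λ here is an assumed hyperparameter, not a value from the slide:

  L_total = L_A2C + λ · L_InfoNCE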

22 of 35

Summary

  • A very general approach with no domain-specific pretext tasks, e.g. jigsaw puzzles, colourization, rotation prediction.
  • The sampling policy for negative samples and the number of prediction steps are hyperparameters.

23 of 35

Data-Efficient Image Recognition with Contrastive Predictive Coding

Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord

24 of 35

CPC-v2

Buying data efficiency with compute:

28 million → 305 million parameters

  • Pre-train on the complete ImageNet
  • Fine-tune on a labelled subsample

💸 🌏 🔥

25 of 35

CPC-v2

  • Model Capacity
  • Bottom-up predictions
  • Layer Norm
  • Augmentation: color dropping
  • Horizontal predictions
  • Large patches
  • More patch augmentations

26 of 35

Talking points

  • Inductive biases introduced about content-preserving transformations.
  • Fine-tuning works better than fixed representations.

27 of 35

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith,

Mohammad Norouzi, Geoffrey Hinton

28 of 35

SimCLR: Contrastive Learning with only data augmentations

Augmentations instead of context.

×16 parameters

💸 🌏 🔥

29 of 35

  • Maximize agreement between two augmented views of the same image.
  • Composing multiple data augmentations is crucial.
  • Normalized embeddings + an adjusted temperature in the softmax for the cross-entropy loss (see the loss below).
  • Large models & a batch size of 4096.
  • Contrastive training is done on z, an MLP projection of the embeddings h.
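
The loss referenced above is SimCLR's NT-Xent: for a positive pair of views (i, j) with normalized projections z, cosine similarity sim(·, ·) and temperature τ, summing over the other samples in the batch of 2N views:

  ℓ_{i,j} = − log [ exp(sim(z_i, z_j) / τ) / Σ_{k ≠ i} exp(sim(z_i, z_k) / τ) ]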

30 of 35

Momentum Contrast for Unsupervised Visual Representation Learning

Kaiming He, Haoqi Fan, Yuxin Wu,

Saining Xie, Ross Girshick

31 of 35

Memory Bank for the Negative Samples

Store embeddings from previous mini-batches in a queue for later use as negative samples.

BUT embeddings change fast during training.

→ Use momentum updates with a second encoder:

A second key encoder is updated more slowly, which makes the negative embeddings "smoother" (more consistent).

→ Effectively many more negative samples than a large batch would give: 65,536

(Figure: the query encoder is updated by gradients; the key encoder by a momentum update.)
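
A minimal PyTorch sketch of the momentum update described above; m = 0.999 is the paper's default coefficient, and the two encoders are assumed to be arbitrary nn.Modules with identical architectures:

import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # The key encoder tracks the query encoder as an exponential moving average,
    # which keeps the queued negative embeddings consistent across iterations.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Usage: call momentum_update(encoder_q, encoder_k) after each optimizer step on the query encoder.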

32 of 35

Transfer Learning

33 of 35

  • Uses the augmentations and MLP projection head from SimCLR

34 of 35

More papers:

Contrastive Multiview Coding: https://arxiv.org/abs/1906.05849

Learning Representations by Maximizing Mutual Information Across Views: https://arxiv.org/abs/1906.00910

Self-Supervised Learning of Pretext-Invariant Representations: https://arxiv.org/abs/1912.01991

35 of 35

Thanks!