1 of 35

Contrastive Pre-training for unsupervised representation learning

Oğuz Şerbetçi

2 of 35

Why Representation Learning

Better representations:

→ Faster, easier and more data-efficient machine learning

→ Transfer learning

3 of 35

Deep Learning = Representation Learning

Learn a supervised image classifier with labeled data.

→ Transfer learning: use the activations as representations for another task or dataset.

(Figure: an image classifier predicting "dog".)

4 of 35

Deep Learning = Representation Learning

5 of 35

Unsupervised Representation Learning

with Generative Models

Generative models

! Drawback: reconstructing every detail of the data is usually required.

6 of 35

Unsupervised Representation Learning

with Pretext Tasks

Word2Vec and recent attention-based methods use raw text and a pretext task to learn representations.

7 of 35

Unsupervised Representation Learning

with Pretext Tasks

Image colourization

Jigsaw puzzle

8 of 35

Supervised vs Unsupervised

Supervised: labels

Unsupervised: context

9 of 35

Representation learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, Oriol Vinyals

10 of 35

Predictive Coding

11 of 35

Contrastive Predictive Coding

  • Compress the high-dimensional input into a compact latent embedding space.
  • An autoregressive model predicts future steps in this embedding space (see the notation below).
  • A scoring function and negative sampling form the loss function.
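
In the CPC paper's notation, an encoder g_enc compresses each input x_t into a latent z_t, and an autoregressive model g_ar summarizes the past latents into a context c_t:

  z_t = g_enc(x_t)
  c_t = g_ar(z_{≤t})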

12 of 35

  • We want to maximize the mutual information between the input and the context.
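
For reference, the mutual information between a future observation x and the context c, as written in the CPC paper:

  I(x; c) = Σ_{x,c} p(x, c) log [ p(x | c) / p(x) ]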

13 of 35

  • With Monte Carlo sampling over negative examples, we can maximize a lower bound on this mutual information.

14 of 35

Contrastive Predictive Coding

  • Let's have a scoring function f_k for the observation k steps in the future.
  • This score is the exponential of the dot product between the observed embedding and the model's predicted embedding for step k.
  • Finally, we optimize this scoring function with negative sampling, giving the loss written out below.
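
Written out in the paper's notation, where W_k c_t is the linear prediction for step k and X = {x_1, …, x_N} contains one positive and N − 1 negative samples:

  f_k(x_{t+k}, c_t) = exp( z_{t+k}^T W_k c_t )

  L_N = − E_X [ log ( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]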

15 of 35

InfoNCE Loss

  • The mini-batch is made up of one correct sample and a set of negative samples.
  • Intuition behind NCE:
    • Learn a model that can discriminate ("contrast") the data distribution P from the noise distribution Q.
    • Does away with the normalization constant of the distribution!
  • More negative samples → a better estimate of the distribution (see the bound below)
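
The paper also ties this back to mutual information: minimizing the InfoNCE loss L_N maximizes a lower bound that tightens as the number of samples N grows:

  I(x_{t+k}; c_t) ≥ log(N) − L_N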

16 of 35

Experiments in 4 domains

  • Audio, Image, Text
    • Unsupervised pre-training + a linear supervised classifier
  • State representations for RL
    • Contrastive objective added as an auxiliary loss to the RL training loss

17 of 35

Image

Every row shows image patches that activate a certain neuron in the CPC architecture.

The context model is a PixelCNN.

18 of 35

Image

  • Improves on other unsupervised methods such as colourization, jigsaw puzzles, and GANs

19 of 35

Audio

  • The selection of negative samples is important: negatives drawn from the same speaker perform best.
  • Of the 2, 4, 8, 12 and 16 prediction steps tried, 12 worked best (8 and 16 are close).
  • Phone classification performance is close to the supervised baseline when using an MLP classifier.
    • Is the model not good enough to surface linearly separable features, or is this task inherently more complex than images?

20 of 35

Text

  • Encoder (1D convolution + ReLU + mean-pooling) produces the sentence embedding (a minimal sketch follows below).
  • Decoder (GRU) to predict up to 3 future sentences.
    • No word-level generation!
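
A minimal PyTorch sketch of such a sentence encoder; the vocabulary size, embedding dimension and hidden dimension are illustrative assumptions, not the paper's values:

import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    # 1D convolution + ReLU + mean-pooling over tokens, as described on the slide.
    # vocab_size, embed_dim, hidden_dim and kernel_size are illustrative choices.
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, hidden_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, token_ids):          # token_ids: (batch, seq_len) of token indices
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))       # (batch, hidden_dim, seq_len)
        return x.mean(dim=2)               # mean-pool over tokens -> sentence embedding

# Example usage: embed a batch of two 7-token "sentences".
encoder = SentenceEncoder()
embeddings = encoder(torch.randint(0, 10000, (2, 7)))   # shape: (2, 256)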

21 of 35

Reinforcement Learning

  • A2C with an auxiliary contrastive loss from predictions up to 30 steps ahead (see the sketch below).
  • Only a linear prediction head is added to the baseline architecture. → a simple addition

No need for memory: a reactive policy solves the task.
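
As a sketch, the auxiliary objective is combined with the RL objective as a weighted sum; the weight λ here is an assumed hyperparameter, not a value from the slide:

  L_total = L_A2C + λ · L_InfoNCE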

22 of 35

Summary

  • A very general approach with no domain-specific pretext tasks, e.g. jigsaw puzzles, colourization, rotation prediction.
  • The sampling policy for negative samples and the number of prediction steps are hyperparameters.

23 of 35

Data-Efficient Image Recognition with Contrastive Predictive Coding

Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord

24 of 35

CPC-v2

Buying data efficiency with compute:

28 million → 305 million parameters

  • Pre-train on the complete ImageNet
  • Fine-tune on a labelled subsample

💸 🌏 🔥

25 of 35

CPC-v2

  • Model Capacity
  • Bottom-up predictions
  • Layer Norm
  • Augmentation: color dropping
  • Horizontal predictions
  • Large patches
  • More patch augmentations

26 of 35

Talking points

  • Inductive biases introduced about content-preserving transformations.
  • Fine-tuning works better than fixed representations.

27 of 35

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith,

Mohammad Norouzi, Geoffrey Hinton

28 of 35

SimCLR: Contrastive Learning with only data augmentations

Augmentations instead of context.

×16 parameters

💸 🌏 🔥

29 of 35

  • Maximize agreement between two augmented views of the same image.
  • Composing multiple data augmentations is crucial.
  • Normalized embeddings + an adjusted temperature in the softmax for the cross-entropy loss (see the loss below).
  • Large models & a batch size of 4096.
  • Contrastive training is done on z, an MLP projection of the embeddings h.
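
The loss referenced above is SimCLR's NT-Xent: for a positive pair of views (i, j) with normalized projections z, cosine similarity sim(·, ·) and temperature τ, summing over the other samples in the batch of 2N views:

  ℓ_{i,j} = − log [ exp(sim(z_i, z_j) / τ) / Σ_{k ≠ i} exp(sim(z_i, z_k) / τ) ]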

30 of 35

Momentum Contrast for Unsupervised Visual Representation Learning

Kaiming He, Haoqi Fan, Yuxin Wu,

Saining Xie, Ross Girshick

31 of 35

Memory Bank for the Negative Samples

Store embeddings from previous mini-batches in a queue for later use as negative samples.

BUT embeddings change fast during training.

→ Use momentum updates with a second encoder:

A second key encoder is updated more slowly, which makes the negative embeddings "smoother" (more consistent).

→ Effectively many more negative samples than a large batch would give: 65,536

(Figure: the query encoder is updated by gradients; the key encoder by a momentum update.)
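
A minimal PyTorch sketch of the momentum update described above; m = 0.999 is the paper's default coefficient, and the two encoders are assumed to be arbitrary nn.Modules with identical architectures:

import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # The key encoder tracks the query encoder as an exponential moving average,
    # which keeps the queued negative embeddings consistent across iterations.
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

# Usage: call momentum_update(encoder_q, encoder_k) after each optimizer step on the query encoder.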

32 of 35

Transfer Learning

33 of 35

  • Uses the augmentations and MLP projection head from SimCLR

34 of 35

More papers:

Contrastive Multiview Coding: https://arxiv.org/abs/1906.05849

Learning Representations by Maximizing Mutual Information Across Views: https://arxiv.org/abs/1906.00910

Self-Supervised Learning of Pretext-Invariant Representations: https://arxiv.org/abs/1912.01991

35 of 35

Thanks!