1 of 73

Idea outline

  • Reminder: Contrastive Learning
    • Learn an embedding space in which similar sample pairs are close together.
    • Similar to unsupervised learning.
    • Useful for learning good representations from unlabeled data, which can transfer to supervised formulations.
  • Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) (LeCun 2006)
    • Really early work in contrastive learning (clustering)
    • Shows that, yes, you can train an embedding so that similar points end up close together.
  • Deep Clustering for Unsupervised Learning of Visual Features (2019)
    • Algorithm clusters based on bag-of-features.
    • Encourages separation into clusters by k-means clustering the feature output and using the cluster assignments as pseudo-labels, which are then optimized for by a dense classification layer.
  • Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (2020)
    • Online Network: predicts the target network’s representation of the same image under a different augmented view.
    • Target Network: updated as a slow-moving average of the online network’s weights.
    • SOTA without negative examples; learns from the slow-moving-average network.
  • Masked Autoencoders Are Scalable Vision Learners (2021)
    • Might be good to talk about alongside BEiT
  • BEiT: BERT Pre-Training of Image Transformers (2021)
  • Representation Learning With Contrastive Predictive Coding (2019)
    • You can try to move this one elsewhere. Let me know. I’m not sure where.
  • Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision (2022)
    • Might be a great ending point to talk about at the end of the presentation to wrap it all up.
  • Towards the Generalization of Contrastive Self-Supervised Learning (2022)
    • I don’t really think we should use this one, ngl.

2 of 73

Idea outline

Might be good to have the format:

  • Reminder of what self-supervised learning looks like and why it’s cool (8 minutes)
  • How it can be used simply for transferable feature extraction (6 minutes)
  • How it can be used simply for cross-view generalizability (6 minutes)
  • How it can be used with masking to create a self-supervisory task. (6 minutes)
  • How that naturally leads to its incorporation into BERT-like models. (14 minutes)
  • Whatever the heck the CPC paper is doing (I didn’t look at it lol). (10-20 minutes)
  • (If we have time) circle back and talk about how this stuff seems to work well in practice.
    • Talk about the “Robust” paper. Ask how well the techniques in the presentation help with this (skim it at minimum).

3 of 73

Self-Supervised Learning

for Images

Sijie Ding, Tian Yun, Vadim Kudlay

4 of 73

Recall Contrastive Learning

Recall Self-Supervised

Why not do it on images?

Separate unlabeled datasets into clusters (which can serve as labels)

Using unlabeled data to supervise model training.

5 of 73

Self-Supervised Vision Tasks Aren’t That New

Dimensionality Reduction by Learning an Invariant Mapping (LeCun 2006)
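
To make the idea concrete, here is a minimal sketch (ours, not the paper’s code) of a DrLIM-style margin contrastive loss in PyTorch. It assumes you already have paired embeddings and a binary similarity label per pair.

import torch

def drlim_contrastive_loss(emb_a, emb_b, is_similar, margin=1.0):
    # Margin-based contrastive loss in the spirit of DrLIM.
    # emb_a, emb_b: (batch, dim) embeddings of the two samples in each pair.
    # is_similar:   (batch,) float, 1.0 for similar pairs, 0.0 for dissimilar.
    dist = (emb_a - emb_b).norm(dim=1)                                   # Euclidean distance per pair
    pull = is_similar * dist.pow(2)                                      # similar pairs: pull together
    push = (1 - is_similar) * torch.clamp(margin - dist, min=0).pow(2)   # dissimilar: push apart up to the margin
    return 0.5 * (pull + push).mean()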

6 of 73

Recall Contrastive Learning

Recall Self-Supervised

Why not do it on images?

Separate unlabeled datasets into clusters (which can serve as labels)

Using unlabeled data to supervise model training.

Leverage large amounts of unlabeled data

7 of 73

Data For Self-Supervised Models Is Everywhere

8 of 73

Recall Contrastive Learning

Recall Self-Supervised

Why not do it on images?

Separate unlabeled datasets into clusters (which can serve as labels)

Using unlabeled data to supervise model training.

Leverage large amounts of unlabeled data

Build more robust representations than those trainable on labeled data

9 of 73

Claims Good Generalizability When Done Well

“In particular, one of our key empirical findings is that self-supervised learning on random internet data leads to models that are more fair, less biased and less harmful.

Second, we observe that our model is also able to leverage the diversity of concepts in the dataset to train more robust features, leading to better out-of-distribution generalization.”

10 of 73

Let’s Exploit This To (Try To)

Make Better Models

11 of 73

Transferable Feature Extractors

We want filters that work as good starting points for a variety of tasks.

[Figure: tasks (Classification, Segmentation, Segmentation) paired with datasets (Dataset A, Dataset A, Dataset B)]

12 of 73

Transferable Feature Extractors

We want filters that work as good starting points for a variety of tasks.

[Figure: tasks (Classification, Segmentation, Segmentation) paired with datasets (Dataset A, Dataset A, Dataset B)]

13 of 73

Deep Clustering for Unsupervised Learning of Visual Features (2019)

Use bag-of-features clustering to drive optimization.

14 of 73

Deep Clustering for Unsupervised Learning of Visual Features (2019)

Use bag-of-features clustering to drive optimization.

Avoids optimizing for dataset-specific labels.

Encourages diversity in extracted feature space.
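
A rough sketch of the training loop this describes, with assumptions: a PyTorch encoder and classification head, a loader that yields batches of unlabeled images in a fixed (non-shuffled) order, and an optimizer covering both modules. The real method adds details (PCA-reduced features, classifier re-initialization, empty-cluster handling) omitted here.

import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deepcluster_round(encoder, classifier, loader, optimizer, k=100):
    # One round: cluster current features, then train on the cluster assignments.
    encoder.eval()
    with torch.no_grad():
        feats = torch.cat([encoder(x).flatten(1) for x in loader]).cpu().numpy()
    # k-means the features; assignments become pseudo-labels for every image.
    pseudo = torch.as_tensor(
        KMeans(n_clusters=k, n_init=10).fit_predict(feats), dtype=torch.long)
    # Train encoder + classification head to predict the pseudo-labels.
    encoder.train()
    seen = 0
    for x in loader:
        y = pseudo[seen:seen + x.size(0)]
        seen += x.size(0)
        loss = F.cross_entropy(classifier(encoder(x).flatten(1)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()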

15 of 73

16 of 73

Cross-View Prediction

Multi-view Action Recognition using Cross-view Video Prediction, Vyas et al., 2020

We want images of the same thing to be predicted similarly.

17 of 73

Cross-View Prediction

We want images of the same thing to be predicted similarly.

Augmentation 𝒇(x) ~ x, so I want to enforce that. Naive attempt?


18 of 73

Cross-View Prediction

We want images of the same thing to be predicted similarly.

Augmentation 𝒇(x) ~ x, so I want to enforce that. Naive attempt?


Q1: How might this be handled in a supervised learning formulation?

Q2: Are there any problems with the naive self-supervised attempt above?
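
For reference, the naive attempt above might look like the sketch below (our illustration, with hypothetical encoder/augment arguments): simply penalize the distance between an image’s embedding and the embedding of its augmented view. One issue worth noticing for Q2 is that a collapsed, constant encoder minimizes this objective perfectly.

import torch
import torch.nn.functional as F

def naive_invariance_loss(encoder, augment, x):
    # Naive cross-view objective: the embedding of an image and the embedding
    # of an augmented view of it should match.
    z = encoder(x)
    z_aug = encoder(augment(x))
    # Note: an encoder that outputs the same constant vector for every input
    # ("collapse") drives this loss to zero.
    return F.mse_loss(z_aug, z)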

19 of 73

Bootstrap Your Own Latent (2020)

Apply augmentations

Apply encoder (feature extractor) to get the feature representation

Apply projection

Apply predictor

Apply different augmentations

Target encoder/projection are slow-moving averages of f, g

Pineapple from AgrilPlant dataset

Paper: Data Augmentation for Plant Classification (Pawara et al., 2017)

Current Network Version

“Lagged” Average Version
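
A condensed sketch of the pipeline on this slide, with assumed module names (encoder f, projector g, predictor q) and the EMA target update. The actual paper additionally symmetrizes the loss over the two views and uses specific ResNet/MLP architectures.

import copy
import torch
import torch.nn.functional as F

class BYOLSketch(torch.nn.Module):
    def __init__(self, encoder, projector, predictor, tau=0.996):
        super().__init__()
        # Online branch: encoder f, projector g, predictor q.
        self.f, self.g, self.q, self.tau = encoder, projector, predictor, tau
        # Target branch: copies of f and g updated only by moving average, never by gradients.
        self.f_t, self.g_t = copy.deepcopy(encoder), copy.deepcopy(projector)
        for p in list(self.f_t.parameters()) + list(self.g_t.parameters()):
            p.requires_grad = False

    def loss(self, view1, view2):
        # Online branch predicts the target branch's projection of the other view.
        p = F.normalize(self.q(self.g(self.f(view1))), dim=-1)
        with torch.no_grad():
            z = F.normalize(self.g_t(self.f_t(view2)), dim=-1)
        return 2 - 2 * (p * z).sum(dim=-1).mean()  # MSE between L2-normalized vectors

    @torch.no_grad()
    def update_target(self):
        # "Lagged" target: exponential moving average of the online parameters.
        for online, target in ((self.f, self.f_t), (self.g, self.g_t)):
            for p_o, p_t in zip(online.parameters(), target.parameters()):
                p_t.mul_(self.tau).add_((1 - self.tau) * p_o)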

20 of 73

Deep Clustering

leverages latent space clustering techniques

Bootstrap Your Own Latent

leverages specially-designed model organization.

21 of 73

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Furu Wei

22 of 73

  • Supervised training:
    • Requires annotations
    • Data hungry

23 of 73

24 of 73

How to incorporate MLM into vision?

What are the challenges?

25 of 73

  • Output space is hard to define for image patches
  • A pixel-recovery task may waste modeling capability on short-range dependencies and high-frequency details

26 of 73

  • Output space is hard to define for image patches
  • A pixel-recovery task may waste modeling capability on short-range dependencies and high-frequency details

Masked Image Modeling (MIM)!
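
A hedged sketch of an MIM objective, assuming a frozen pre-trained tokenizer (e.g. a dVAE tokenizer) that returns one discrete visual-token id per patch. The function and argument names here are illustrative assumptions, not BEiT’s actual code.

import torch
import torch.nn.functional as F

def mim_loss(backbone, head, tokenizer, image, patch_embed, mask):
    # Assumed interfaces: tokenizer(image) -> (B, N) discrete visual-token ids;
    # patch_embed(image) -> (B, N, D) patch embeddings; backbone(patches, mask)
    # replaces masked patches with a learnable [MASK] embedding internally and
    # returns (B, N, D) hidden states. mask: (B, N) bool, True = masked patch.
    with torch.no_grad():
        target_tokens = tokenizer(image)             # (B, N) token id per patch
    hidden = backbone(patch_embed(image), mask)
    logits = head(hidden)                            # (B, N, vocab_size)
    # Cross-entropy only at the masked positions, as in masked language modeling.
    return F.cross_entropy(logits[mask], target_tokens[mask])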

27 of 73

28 of 73

29 of 73

30 of 73

31 of 73

32 of 73

Why block-wise masking?

Does this remind you of one of our previous readings?

33 of 73

34 of 73

BEiT & VAE

[Figure: observed image 𝒙 and latent 𝐳]

35 of 73

BEiT & VAE

[Figure: observed image 𝒙 and latent 𝐳]

36 of 73

BEiT & VAE

[Figure: tokenizer maps image 𝒙 to latent visual tokens 𝐳; MIM predicts the masked tokens; decoder reconstructs 𝒙 from 𝐳]

37 of 73

BEiT & VAE

[Figure: tokenizer maps image 𝒙 to latent visual tokens 𝐳; MIM predicts the masked tokens; decoder reconstructs 𝒙 from 𝐳]

38 of 73

Experiments

Task Head

39 of 73

Image Classification

40 of 73

Image Classification

41 of 73

Image Classification

42 of 73

Image Classification

43 of 73

Image Classification

44 of 73

Image Classification

45 of 73

Semantic Segmentation

46 of 73

Ablation Studies

47 of 73

Qualitative Analysis

48 of 73

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

49 of 73

Masked Autoencoder

50 of 73

Masked Autoencoder

Can we do the same on auto-regressive language models?

Why would such a masking strategy work for images? Can you think of other areas that could use such a strategy?
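
As a sketch of the masking step (our approximation of the idea, not the authors’ implementation): randomly keep about 25% of the patch tokens, feed only those to the encoder, and keep a boolean mask so a lightweight decoder knows which positions to reconstruct in pixel space.

import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    # Keep a random subset of patch tokens; return the visible tokens plus a
    # boolean mask marking which positions were dropped (True = masked).
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patch_tokens.device)   # random score per patch
    ids_keep = noise.argsort(dim=1)[:, :n_keep]             # random subset per sample
    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, dtype=torch.bool, device=patch_tokens.device)
    mask.scatter_(1, ids_keep, False)                        # visible positions -> False
    return visible, mask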

51 of 73

52 of 73

53 of 73

Better Results & a Speedup!

54 of 73

Pixel vs. Token

55 of 73

Masking Strategy

56 of 73

Masking Strategy

Why does such a masking approach work for MAE but not BEiT?

BEiT Ablation

57 of 73

Representation Learning with Contrastive Predictive Coding

58 of 73

Introduction

  1. Compress high-dimensional data into a much more compact latent embedding space in which conditional predictions are easier to model.
  2. Use powerful autoregressive models in this latent space to make predictions many steps in the future.
  3. Rely on Noise-Contrastive Estimation for the loss function, in a similar way to how it has been used for learning word embeddings in natural language models, allowing the whole model to be trained end-to-end.

59 of 73

Motivation and Intuitions

  • Learn representations that encode the underlying shared information between different parts of the (high-dimensional) signal, while discarding low-level information and noise that is more local.
  • In time series and high-dimensional modeling, approaches that use next step prediction exploit the local smoothness of the signal.
  • When predicting further in the future, the amount of shared information becomes much lower, and the model needs to infer more global structure.

60 of 73

Challenges

  • Predicting high-dimensional data means that unimodal losses such as mean squared error and cross-entropy are not very useful; powerful conditional generative models, which need to reconstruct every detail in the data, are usually required.
  • These models are computationally intense and waste capacity on modeling the complex relationships in the data x, often ignoring the context c.

61 of 73

Images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories). This suggests that modeling p(x|c) directly may not be optimal for the purpose of extracting shared information between x and c.

When predicting future information, the paper instead encodes the target x (future) and context c (present) into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals x and c, defined as I(x; c) = Σ_{x,c} p(x, c) log [ p(x | c) / p(x) ].

62 of 73

Contrastive Predictive Coding

  • A non-linear encoder g_enc maps the input sequence of observations x_t to a sequence of latent representations z_t = g_enc(x_t), potentially with a lower temporal resolution.

  • An autoregressive model g_ar summarizes all z_{≤t} in the latent space and produces a context latent representation c_t = g_ar(z_{≤t}).
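
A minimal sketch of these two components for a 1-D signal, with illustrative layer sizes (assumptions on our part; the paper uses task-specific encoders, e.g. strided convolutions for audio and a ResNet for image crops):

import torch
import torch.nn as nn

class CPCSketch(nn.Module):
    # g_enc downsamples x_t into z_t; g_ar (a GRU here) summarizes z_{<=t} into c_t.
    def __init__(self, in_ch=1, z_dim=256, c_dim=256):
        super().__init__()
        self.g_enc = nn.Sequential(                 # strided convs -> lower temporal resolution
            nn.Conv1d(in_ch, z_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(z_dim, z_dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        self.g_ar = nn.GRU(z_dim, c_dim, batch_first=True)

    def forward(self, x):                           # x: (B, in_ch, T)
        z = self.g_enc(x).transpose(1, 2)           # (B, T', z_dim)
        c, _ = self.g_ar(z)                         # (B, T', c_dim): c_t summarizes z_{<=t}
        return z, c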

63 of 73

As argued in the previous section, the paper does not predict future observations x_{t+k} directly with a generative model p_k(x_{t+k} | c_t). Instead, it models a density ratio which preserves the mutual information between x_{t+k} and c_t:

f_k(x_{t+k}, c_t) ∝ p(x_{t+k} | c_t) / p(x_{t+k})

where ∝ stands for ’proportional to’ (i.e. up to a multiplicative constant).

The paper uses a simple log-bilinear model:

f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t)

64 of 73

InfoNCE Loss and Mutual Information Estimation

Both the encoder and the autoregressive model are trained to jointly optimize a loss based on NCE, which the paper calls InfoNCE.

Given a set X = {x_1, ..., x_N} of N random samples containing one positive sample from p(x_{t+k} | c_t) and N − 1 negative samples from the ’proposal’ distribution p(x_{t+k}), we optimize:

L_N = −E_X [ log ( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]
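
A compact sketch of this loss with in-batch negatives, using the log-bilinear f_k from the previous slide; the tensor names and batching scheme are our assumptions, not the paper’s code.

import torch
import torch.nn.functional as F

def info_nce(z_future, c_t, W_k):
    # Score every context c against every candidate future z via log f_k = z^T W_k c;
    # the true (positive) pair for each context sits on the diagonal.
    # Shapes: z_future (B, Dz), c_t (B, Dc), W_k (Dz, Dc).
    scores = z_future @ W_k @ c_t.t()                   # (B, B) pairwise log-scores
    targets = torch.arange(z_future.size(0), device=scores.device)
    # Row j of scores.t(): classify which z in the batch belongs to context c_j.
    return F.cross_entropy(scores.t(), targets)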

65 of 73

Experiments

66 of 73

Experiments-Vision

  • From a 256x256 image we extract a 7x7 grid of 64x64 crops with 32 pixels of overlap.
  • Each crop is then encoded by the ResNet-v2-101 encoder.
  • Next, a PixelCNN-style autoregressive model makes predictions about the latent activations in the following rows, top to bottom.
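
For concreteness, a small sketch of the crop extraction described above using tensor unfolding (our illustration; function and argument names are ours):

import torch

def extract_crop_grid(image, crop=64, stride=32):
    # image: (C, 256, 256). With crop 64 and stride 32 (i.e. 32 px overlap),
    # (256 - 64) / 32 + 1 = 7, so this yields a 7x7 grid of crops.
    patches = image.unfold(1, crop, stride).unfold(2, crop, stride)  # (C, 7, 7, 64, 64)
    return patches.permute(1, 2, 0, 3, 4)                            # (7, 7, C, 64, 64)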

67 of 73

Experiments-Vision

68 of 73

Experiments-Vision

69 of 73

Experiments-Audio

70 of 73

71 of 73

Experiments-Natural Language

72 of 73

Experiments-Reinforcement Learning

73 of 73

Conclusion

  • Contrastive Predictive Coding (CPC) is a framework for extracting compact latent representations to encode predictions over future observations.
  • CPC combines autoregressive modeling and noise-contrastive estimation with intuitions from predictive coding to learn abstract representations in an unsupervised fashion.
  • The model’s simplicity and low computational requirements for training are exciting developments toward useful unsupervised learning that applies universally to many more data modalities.