1 of 42

SELF-SUPERVISED LEARNING 

Divyanshu Tak 

Parth Kharwar 

Submitted to : Prof. Chao

2 of 42

Outline

  • Motivation
  • Similarity
  • Siamese Network
  • Trivial Solutions
  • SIMSIAM
  • SimCLRv2
  • MoCo
  • Discriminative vs Generative Models
  • Evaluation
  • Masked Autoencoders (MAE)
  • SwAV
  • BYOL (Bootstrap your Own Latent)

3 of 42

Supervised Learning -> Train a model to learn from "labeled" data. The labels essentially provide the supervision the model needs to differentiate right from wrong. 

Unsupervised Learning -> The learning algorithm aims to discover / learn the underlying distribution of the data. Example – Clustering 

Src : https://analystprep.com/study-notes/cfa-level-2/quantitative-method/supervised-machine-learning-unsupervised-machine-learning-deep-learning/

Motivation

4 of 42

Supervised Learning in Computer Vision has applications like image classification, object detection, etc. 

But all these applications share the same underlying idea: train the model using labeled/annotated data. This way we can steer the model's output in the direction we want. 

Src: https://www.analyticsvidhya.com/blog/2021/05/introduction-to-supervised-deep-learning-algorithms/

Motivation

5 of 42

Then why do we need self-supervised learning?

We have limited labeled/annotated data. In other words, we have far more unlabeled data than labeled data. 

Annotation is Hard.....

ImageNet has 14M images. This took 22 human years to annotate the data. 

This is where self-supervised learning comes in: how do we use the unlabeled data to make our models learn features that can be used for different transfer tasks? 

Supervised learning example -> Take a pretrained ResNet model (trained on ImageNet) and modify it to classify between different breeds of dogs. But how can we do the same with unlabeled data?

Motivation

6 of 42

CASE 1:

CASE 2:

Similarity

7 of 42

Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/

So essentially, the aim is to learn features from the unlabeled data that can be leveraged for further downstream tasks.

Siamese Network

8 of 42

Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/

Since the network weights are shared, it is fairly easy for the network to eventually converge to trivial solutions. 

Trivial solutions occur when the model outputs the same features for different images (representation collapse). 

Much of the research in this field is essentially about devising different solutions to avoid trivial solutions in these sorts of networks.

Trivial Solution

9 of 42

[1] Chen, Xinlei, and Kaiming He. "Exploring simple siamese representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

By using the stop-gradient operation and creating an asymmetric network (a predictor on one branch only), the algorithm prevents trivial solutions; a minimal sketch follows. 
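A minimal PyTorch-style sketch of the stop-gradient idea, adapted loosely from the pseudocode in [1]; the encoder f (backbone + projection MLP) and predictor h are assumed to be defined elsewhere, and D is the negative cosine similarity.

    import torch.nn.functional as F

    def simsiam_loss(f, h, x1, x2):
        """Symmetrized SimSiam loss for two augmented views x1, x2.
        f: shared encoder (backbone + projection MLP), h: prediction MLP."""
        z1, z2 = f(x1), f(x2)              # projections
        p1, p2 = h(z1), h(z2)              # predictions (the asymmetric branch)

        def D(p, z):
            # Negative cosine similarity; detach() implements the stop-gradient,
            # so no gradient flows through the target branch.
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

        return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)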

SIMSIAM

10 of 42

Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR

Overview of the approach

SimCLRv2

11 of 42

Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR

  • Self-supervised pretraining with SimCLRv2

SimCLRv2

12 of 42

  • Self-training/distillation: the fine-tuned network acts as a teacher to train a smaller, task-focused student network

  • Minimization of the distillation loss (shown below), where the teacher's soft labels supervise the student

  • Combination with ground-truth labels when a significant amount of labeled data is present
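A sketch of the distillation objective, following the cited paper, where P^T and P^S are the teacher's and student's temperature-softened output distributions over labels y:

    \mathcal{L}^{\text{distill}} = -\sum_{x_i \in \mathcal{D}} \left[ \sum_{y} P^{T}(y \mid x_i; \tau)\, \log P^{S}(y \mid x_i; \tau) \right]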

Src: Big Self-Supervised Models are Strong Semi-Supervised Learners, Ting Chen, et al.

SimCLRv2

13 of 42

  • Basic idea: build a dynamic dictionary as a queue
    • Keys: encoder-network representations of sampled images

  • Lookup via the InfoNCE contrastive loss (shown below)
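The InfoNCE loss used for the dictionary lookup, with query q, positive key k_+, K negative keys from the queue, and temperature τ:

    \mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}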

Src: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.

Momentum Contrast

14 of 42

  • Keys are randomly sampled yet kept consistent (the key encoder evolves slowly), which keeps the dictionary dynamic

  • The dictionary is large, providing many negative samples

  • The queue decouples the dictionary size from the batch size and replaces outdated samples (a minimal sketch follows)
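A minimal PyTorch-style sketch of the momentum update and queue-based contrastive loss, loosely following the paper's pseudocode; encoder_q, encoder_k, and the queue tensor are assumed to be set up elsewhere.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(encoder_q, encoder_k, m=0.999):
        # The key encoder is a slowly moving average of the query encoder.
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data = m * pk.data + (1.0 - m) * pq.data

    def moco_loss(encoder_q, encoder_k, queue, x_q, x_k, tau=0.07):
        """queue: (dim, K) tensor of cached key features acting as negatives."""
        q = F.normalize(encoder_q(x_q), dim=1)             # queries (N, dim)
        with torch.no_grad():
            k = F.normalize(encoder_k(x_k), dim=1)         # keys    (N, dim)
        l_pos = (q * k).sum(dim=1, keepdim=True)           # (N, 1) positive logits
        l_neg = q @ queue                                   # (N, K) negative logits
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positives sit at index 0
        return F.cross_entropy(logits, labels), k           # enqueue k, dequeue the oldest keys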

Src: Medium: Understanding Contrastive Learning and MoCo, Shuchen Du

Src: YouTube: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.

Momentum Contrast

15 of 42

DISCRIMINATIVE METHODS 

  • Contrastive Learning 
  • Similarity Based Learning 
  • Clustering 

GENERATIVE METHODS 

  • MAE
  • GAN 
  • Variational Autoencoder (VAE) 

Discriminative vs Generative Models

16 of 42

REPRESENTATIVE FEATURES 

  • Object Detection 
  • Instance Segmentation 

  • PASCAL VOC 
  • MS COCO
  • ImageNet

LINEAR CLASSIFICATION

FINE TUNING 

Evaluation

17 of 42

Questions?

18 of 42

Masked Autoencoders Are Scalable Vision Learners

Zheng, C., Cham, T. J., & Cai, J. (2021). TFill: Image completion via a transformer-based architecture. arXiv preprint arXiv:2104.00845.

19 of 42

One approach to train a model to learn features is by providing positive and negative pairs. These pairs are often transformed at the input to make the learning robust; the transformations include random cropping, rotation, blurring, etc. 

Now, what if instead of these transformations we subsample the image? 

One way is to cut out (mask) random patches of the input image before feeding it to the network.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masked Autoencoders Are Scalable Vision Learners   

20 of 42

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masked input patches provide a strong self-supervisory signal for the model. 

The encoder is decoupled from the (lightweight) decoder. 

The encoder operates only on the visible (non-masked) patches, while the decoder operates on all patches, together with positional embeddings; a masking sketch follows.
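A minimal sketch of per-sample random masking over patch embeddings (shuffle and keep the first 25%); tensor names and shapes are illustrative, not the paper's implementation.

    import torch

    def random_masking(patches, mask_ratio=0.75):
        """patches: (N, L, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
        N, L, D = patches.shape
        len_keep = int(L * (1 - mask_ratio))
        noise = torch.rand(N, L, device=patches.device)   # per-patch random scores
        ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
        ids_keep = ids_shuffle[:, :len_keep]               # indices of visible patches
        visible = torch.gather(patches, 1,
                               ids_keep.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(N, L, device=patches.device)     # 1 = masked, 0 = visible
        mask.scatter_(1, ids_keep, 0.0)
        return visible, mask, ids_shuffle                  # the encoder sees only `visible`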

Masked Autoencoders Are Scalable Vision Learners   

21 of 42

MASKING RATIO

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

A high masking ratio of 75% provides the best results (SURPRISING!!)

The model must infer the missing content to produce the output, rather than simply extrapolating from neighboring patches. 

Masked Autoencoders Are Scalable Vision Learners   

22 of 42

DECODER AND RECONSTRUCTION TARGET

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

  • The decoder predicts the pixel values of the masked patches. 

  • An MSE loss is computed between the reconstructed image and the original image, on the masked patches only. 

  • The decoder reconstructs the input from the latent representation of the visible patches along with mask tokens; the loss is shown below.

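A sketch of the reconstruction objective, written per patch, where M is the set of masked patch indices and x̂_i is the reconstructed patch:

    \mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \left\| x_i - \hat{x}_i \right\|_2^2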

Masked Autoencoders Are Scalable Vision Learners   

23 of 42

LINEAR PROBING AND FINE TUNING

Linear probing essentially fits a linear classifier on the frozen output features of the encoder. 

Fine-tuning trains the whole model (non-linear) end-to-end for classification or other tasks; a minimal sketch of both follows. 
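A minimal sketch of the two evaluation protocols, assuming a pretrained encoder module with a feature_dim attribute (hypothetical names):

    import torch.nn as nn

    def linear_probe(encoder, num_classes):
        # Linear probing: freeze the encoder and train only a linear classifier.
        for p in encoder.parameters():
            p.requires_grad = False
        return nn.Sequential(encoder, nn.Linear(encoder.feature_dim, num_classes))

    def fine_tune(encoder, num_classes):
        # Fine-tuning: the whole model stays trainable end-to-end (non-linear).
        return nn.Sequential(encoder, nn.Linear(encoder.feature_dim, num_classes))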

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Experiments are done with ViT-L/16 as the encoder on ImageNet-1K data.

Masked Autoencoders Are Scalable Vision Learners   

24 of 42

MASKING STRATEGY AND PARTIAL FINE TUNING 

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masking is done randomly over the image patches (the image is divided into a grid of patches). 

Sampling patches from a uniform distribution prevents a center bias. 

Partial fine-tuning is a middle ground between linear probing and full fine-tuning:

fine-tune only the last layers and freeze the others (a minimal sketch follows). 
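A minimal sketch of partial fine-tuning, assuming a ViT-style encoder that exposes its transformer blocks as encoder.blocks (hypothetical attribute):

    def partial_fine_tune(encoder, num_trainable_blocks=4):
        # Freeze everything, then unfreeze only the last few transformer blocks.
        for p in encoder.parameters():
            p.requires_grad = False
        for block in encoder.blocks[-num_trainable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
        return encoder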

Experiments are done with ViT-L/16 as the encoder on ImageNet-1K data.

Masked Autoencoders Are Scalable Vision Learners   

25 of 42

RESULTS

Results on COCO Validation Set

Comparisons between MAE and other self-supervised and supervised pre-training methods, fine-tuned on the ImageNet-1K dataset 

Masked Autoencoders Are Scalable Vision Learners   

26 of 42

Questions?

27 of 42

Contrastive Learning

Clustering 

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

Clustering Approach

28 of 42

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

(SwAV)

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.

29 of 42

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

The main idea is to produce the same features for different augmentations of an image: 

cluster an image and its augmented version into the same group/prototype/cluster center

SwAV

30 of 42

Contrastive instance learning 

SwAV

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.

SwAV

31 of 42

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

Instead of assigning hard codes, the algorithm assigns soft codes over the prototypes.

The key idea is that an image and its augmented/cropped version should produce similar codes for the prototypes. 

The loss is computed on the swapped prediction: each view's code is predicted from the other view's features (see below). 
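The swapped-prediction loss from the cited paper: features z_t, z_s of two views are compared through their codes q_t, q_s over the prototypes c_k, with temperature τ:

    L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t), \qquad
    \ell(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}, \qquad
    p_t^{(k)} = \frac{\exp(z_t^{\top} c_k / \tau)}{\sum_{k'} \exp(z_t^{\top} c_{k'} / \tau)}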

SwAV

32 of 42

MULTI CROP

Increasing the number of crops/augmented views of the same image drastically increases memory requirements. 

Multi-crop uses 2 standard-resolution crops along with "V" (a tuned parameter) additional low-resolution crops. 

Low  Resolution crops 

Standard crop 

The loss equation now generalizes to include all the "V" crops (shown below)
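With multi-crop, codes are computed only from the two standard-resolution views, and the loss generalizes (following the cited paper) to:

    L(z_{t_1}, \dots, z_{t_{V+2}}) = \sum_{i \in \{1, 2\}} \; \sum_{v \neq i} \ell\big(z_{t_v}, q_{t_i}\big)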

SwAV

33 of 42

RESULTS

  1. Comparison of SwAV against supervised pretraining of ResNet-50 on the ImageNet dataset

  2. Comparison of SwAV against other self-supervised algorithms with ResNet-50 on the ImageNet dataset 

SwAV

34 of 42

Questions?

35 of 42

Bootstrap Your Own Latent

A New Approach to Self-Supervised Learning

Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271-21284.

36 of 42

  • Idea: use two neural networks, an online network and a target network
  • Bootstrap the representation: the online network learns to predict the target network's representation of another augmented view of the same image
  • Unlike contrastive approaches, collapse is prevented without negative pairs
  • Avoiding negative pairs saves computational cost

BYOL

What is bootstrapping?

A process that starts from a trivial solution and sustains/improves itself without external support

37 of 42

BYOL

Architecture components: encoder → projector → predictor (a minimal sketch follows)
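A minimal PyTorch-style sketch of the online branch (encoder → projector → predictor) and the EMA target branch; module names are illustrative, not the paper's implementation.

    import copy
    import torch

    class BYOL(torch.nn.Module):
        def __init__(self, encoder, projector, predictor, tau=0.996):
            super().__init__()
            self.online = torch.nn.Sequential(encoder, projector)   # f_theta, g_theta
            self.predictor = predictor                               # q_theta
            self.target = copy.deepcopy(self.online)                 # f_xi, g_xi (no predictor)
            for p in self.target.parameters():
                p.requires_grad = False
            self.tau = tau

        @torch.no_grad()
        def update_target(self):
            # xi <- tau * xi + (1 - tau) * theta  (exponential moving average)
            for po, pt in zip(self.online.parameters(), self.target.parameters()):
                pt.data = self.tau * pt.data + (1.0 - self.tau) * po.data

        def forward(self, v1, v2):
            p1 = self.predictor(self.online(v1))   # q_theta(z_theta)
            with torch.no_grad():
                z2 = self.target(v2)               # sg(z'_xi): stop-gradient target
            return p1, z2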

38 of 42

BYOL

Src: BYOL — Bootstrap Your Own Latent. Self-Supervised Approach To Learning | by Mayur Jain | Artificial Intelligence in Plain English

39 of 42

BYOL

Goal: minimize a similarity loss between qθ(zθ) and sg(z′ξ), where sg(·) is the stop-gradient operator

Optimization step (updates the online parameters θ)

Target weight updates (exponential moving average of θ); the corresponding equations are sketched below
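A sketch of the corresponding equations, following the cited paper (q̄ and z̄′ denote l2-normalized vectors, η the learning rate, τ the target decay rate, and the tilde loss symmetrizes by swapping the two views):

    \mathcal{L}_{\theta,\xi} = \big\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \big\|_2^2
      = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \, \|z'_\xi\|_2}

    \theta \leftarrow \text{optimizer}\big(\theta,\, \nabla_\theta (\mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}),\, \eta\big), \qquad
    \xi \leftarrow \tau \xi + (1 - \tau)\theta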

40 of 42

BYOL

Results for linear evaluation

Results for semi-supervised training

41 of 42

Questions?

42 of 42

Thank You