1 of 42

SELF-SUPERVISED LEARNING 

Divyanshu Tak 

Parth Kharwar 

Submitted to : Prof. Chao

2 of 42

Outline

  • Motivation
  • Similarity
  • Siamese Network
  • Trivial Solutions
  • SIMSIAM
  • SimCLRv2
  • MoCo
  • Discriminative vs Generative Models
  • Evaluation
  • Masked Autoencoders (MAE)
  • SwAV
  • BYOL (Bootstrap your Own Latent)

3 of 42

Supervised Learning -> Train a model to learn from "labeled" data. The labels essentially provide the supervision the model needs to differentiate right from wrong. 

Unsupervised Learning -> The learning algorithm aims to discover / learn the underlying distribution of the data. Example – Clustering 

Src : https://analystprep.com/study-notes/cfa-level-2/quantitative-method/supervised-machine-learning-unsupervised-machine-learning-deep-learning/

Motivation

4 of 42

Supervised Learning in Computer Vision has applications like image classification, object detection, etc. 

But all these applications share the same underlying idea: train the model using labeled/annotated data. This way we can steer the model's output in the direction we want. 

Src: https://www.analyticsvidhya.com/blog/2021/05/introduction-to-supervised-deep-learning-algorithms/

Motivation

5 of 42

Then why do we need self-supervised learning?

We have limited labeled/annotated data. In other words, we have far more unlabeled data than labeled data. 

Annotation is Hard.....

ImageNet has 14M images. This took 22 human years to annotate the data. 

This is where self-supervised learning comes in: how do we use the unlabeled data to make our models learn features that can be used for different transfer tasks? 

Supervised learning example -> Take a pretrained ResNet model (trained on ImageNet) and modify it to classify between different breeds of dogs. But how can we do the same with unlabeled data?

Motivation

6 of 42

CASE 1:

CASE 2:

Similarity

7 of 42

Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/

So essentially, the aim is to learn features from the unlabeled data that can be leveraged for further downstream tasks.

Siamese Network

8 of 42

Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/

Since the network weights are shared, it is fairly easy for the network to eventually converge to trivial solutions. 

Trivial solutions occur when the model outputs the same features for different images (representation collapse). 

Much of the research in this field is essentially about devising different solutions to avoid trivial solutions in these sorts of networks.

Trivial Solution

9 of 42

[1] Chen, Xinlei, and Kaiming He. "Exploring simple siamese representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

By using the stop-gradient operation and creating an asymmetric network (a predictor on one branch only), the algorithm prevents trivial solutions; a minimal sketch follows. 
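A minimal PyTorch-style sketch of the stop-gradient idea, adapted loosely from the pseudocode in [1]; the encoder f (backbone + projection MLP) and predictor h are assumed to be defined elsewhere, and D is the negative cosine similarity.

    import torch.nn.functional as F

    def simsiam_loss(f, h, x1, x2):
        """Symmetrized SimSiam loss for two augmented views x1, x2.
        f: shared encoder (backbone + projection MLP), h: prediction MLP."""
        z1, z2 = f(x1), f(x2)              # projections
        p1, p2 = h(z1), h(z2)              # predictions (the asymmetric branch)

        def D(p, z):
            # Negative cosine similarity; detach() implements the stop-gradient,
            # so no gradient flows through the target branch.
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

        return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)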

SIMSIAM

10 of 42

Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR

Overview of the approach

SimCLRv2

11 of 42

Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR

  • Self-supervised pretraining with SimCLRv2

SimCLRv2

12 of 42

  • Self-training/distillation: the fine-tuned network acts as a teacher to train a smaller, task-focused student network

  • Minimization of the distillation loss (shown below), where the teacher's soft labels supervise the student

  • Combination with ground-truth labels when a significant amount of labeled data is present
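A sketch of the distillation objective, following the cited paper, where P^T and P^S are the teacher's and student's temperature-softened output distributions over labels y:

    \mathcal{L}^{\text{distill}} = -\sum_{x_i \in \mathcal{D}} \left[ \sum_{y} P^{T}(y \mid x_i; \tau)\, \log P^{S}(y \mid x_i; \tau) \right]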

Src: Big Self-Supervised Models are Strong Semi-Supervised Learners, Ting Chen, et al.

SimCLRv2

13 of 42

  • Basic idea: build a dynamic dictionary as a queue
    • Keys: encoder-network representations of sampled images

  • Lookup via the InfoNCE contrastive loss (shown below)
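The InfoNCE loss used for the dictionary lookup, with query q, positive key k_+, K negative keys from the queue, and temperature τ:

    \mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}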

Src: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.

Momentum Contrast

14 of 42

  • Keys are randomly sampled yet kept consistent (the key encoder evolves slowly), which keeps the dictionary dynamic

  • The dictionary is large, providing many negative samples

  • The queue decouples the dictionary size from the batch size and replaces outdated samples (a minimal sketch follows)
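A minimal PyTorch-style sketch of the momentum update and queue-based contrastive loss, loosely following the paper's pseudocode; encoder_q, encoder_k, and the queue tensor are assumed to be set up elsewhere.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(encoder_q, encoder_k, m=0.999):
        # The key encoder is a slowly moving average of the query encoder.
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data = m * pk.data + (1.0 - m) * pq.data

    def moco_loss(encoder_q, encoder_k, queue, x_q, x_k, tau=0.07):
        """queue: (dim, K) tensor of cached key features acting as negatives."""
        q = F.normalize(encoder_q(x_q), dim=1)             # queries (N, dim)
        with torch.no_grad():
            k = F.normalize(encoder_k(x_k), dim=1)         # keys    (N, dim)
        l_pos = (q * k).sum(dim=1, keepdim=True)           # (N, 1) positive logits
        l_neg = q @ queue                                   # (N, K) negative logits
        logits = torch.cat([l_pos, l_neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positives sit at index 0
        return F.cross_entropy(logits, labels), k           # enqueue k, dequeue the oldest keys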

Src: Medium: Understanding Contrastive Learning and MoCo, Shuchen Du

Src: YouTube: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.

Momentum Contrast

15 of 42

DISCRIMINATIVE METHODS 

  • Contrastive Learning 
  • Similarity Based Learning 
  • Clustering 

GENERATIVE METHODS 

  • MAE
  • GAN 
  • Variational Autoencoder (VAE) 

Discriminative vs Generative Models

16 of 42

REPRESENTATIVE FEATURES 

  • Object Detection 
  • Instance Segmentation 

  • PASCAL VOC 
  • MS COCO
  • ImageNet

LINEAR CLASSIFICATION

FINE TUNING 

Evaluation

17 of 42

Questions?

18 of 42

Masked Autoencoders Are Scalable Vision Learners

Zheng, C., Cham, T. J., & Cai, J. (2021). TFill: Image completion via a transformer-based architecture. arXiv preprint arXiv:2104.00845.

19 of 42

One approach to train a model to learn features is by providing positive and negative pairs. These pairs are often transformed at the input to make the learning robust; the transformations include random cropping, rotation, blurring, etc. 

Now, what if instead of these transformations we subsample the image? 

One way is to cut out (mask) random patches of the input image before feeding it to the network.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masked Autoencoders Are Scalable Vision Learners   

20 of 42

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masked input patches provide a strong self-supervisory signal for the model. 

The encoder is decoupled from the (lightweight) decoder. 

The encoder operates only on the visible (non-masked) patches, while the decoder operates on all patches, together with positional embeddings; a masking sketch follows.
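A minimal sketch of per-sample random masking over patch embeddings (shuffle and keep the first 25%); tensor names and shapes are illustrative, not the paper's implementation.

    import torch

    def random_masking(patches, mask_ratio=0.75):
        """patches: (N, L, D) patch embeddings; keep a random (1 - mask_ratio) subset."""
        N, L, D = patches.shape
        len_keep = int(L * (1 - mask_ratio))
        noise = torch.rand(N, L, device=patches.device)   # per-patch random scores
        ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
        ids_keep = ids_shuffle[:, :len_keep]               # indices of visible patches
        visible = torch.gather(patches, 1,
                               ids_keep.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(N, L, device=patches.device)     # 1 = masked, 0 = visible
        mask.scatter_(1, ids_keep, 0.0)
        return visible, mask, ids_shuffle                  # the encoder sees only `visible`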

Masked Autoencoders Are Scalable Vision Learners   

21 of 42

MASKING RATIO

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

A high masking ratio of 75% provides the best results (SURPRISING!!)

The model must infer the missing content to produce the output, rather than simply extrapolating from neighboring patches. 

Masked Autoencoders Are Scalable Vision Learners   

22 of 42

DECODER AND RECONSTRUCTION TARGET

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

  • The decoder predicts the pixel values of the masked patches. 

  • An MSE loss is computed between the reconstructed image and the original image, on the masked patches only. 

  • The decoder reconstructs the input from the latent representation of the visible patches along with mask tokens; the loss is shown below.

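A sketch of the reconstruction objective, written per patch, where M is the set of masked patch indices and x̂_i is the reconstructed patch:

    \mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \left\| x_i - \hat{x}_i \right\|_2^2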

Masked Autoencoders Are Scalable Vision Learners   

23 of 42

LINEAR PROBING AND FINE TUNING

Linear probing essentially fits a linear classifier on the frozen output features of the encoder. 

Fine-tuning trains the whole model (non-linear) end-to-end for classification or other tasks; a minimal sketch of both follows. 
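A minimal sketch of the two evaluation protocols, assuming a pretrained encoder module with a feature_dim attribute (hypothetical names):

    import torch.nn as nn

    def linear_probe(encoder, num_classes):
        # Linear probing: freeze the encoder and train only a linear classifier.
        for p in encoder.parameters():
            p.requires_grad = False
        return nn.Sequential(encoder, nn.Linear(encoder.feature_dim, num_classes))

    def fine_tune(encoder, num_classes):
        # Fine-tuning: the whole model stays trainable end-to-end (non-linear).
        return nn.Sequential(encoder, nn.Linear(encoder.feature_dim, num_classes))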

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Experiments are done with ViT-L/16 as the encoder on ImageNet-1K data.

Masked Autoencoders Are Scalable Vision Learners   

24 of 42

MASKING STRATEGY AND PARTIAL FINE TUNING 

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

Masking is done randomly over the image patches (the image is divided into a grid of patches). 

Sampling patches from a uniform distribution prevents a center bias. 

Partial fine-tuning is a middle ground between linear probing and full fine-tuning:

fine-tune only the last layers and freeze the others (a minimal sketch follows). 
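A minimal sketch of partial fine-tuning, assuming a ViT-style encoder that exposes its transformer blocks as encoder.blocks (hypothetical attribute):

    def partial_fine_tune(encoder, num_trainable_blocks=4):
        # Freeze everything, then unfreeze only the last few transformer blocks.
        for p in encoder.parameters():
            p.requires_grad = False
        for block in encoder.blocks[-num_trainable_blocks:]:
            for p in block.parameters():
                p.requires_grad = True
        return encoder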

Experiments are done with ViT-L/16 as the encoder on ImageNet-1K data.

Masked Autoencoders Are Scalable Vision Learners   

25 of 42

RESULTS

Results on COCO Validation Set

Comparisons between MAE and other self-supervised and supervised pre-training methods, fine-tuned on the ImageNet-1K dataset 

Masked Autoencoders Are Scalable Vision Learners   

26 of 42

Questions?

27 of 42

Contrastive Learning

Clustering 

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

Clustering Approach

28 of 42

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

(SwAV)

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.

29 of 42

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

The main idea is to produce the same features for different augmentations of an image: 

cluster an image and its augmented version into the same group/prototype/cluster center

SwAV

30 of 42

Contrastive instance learning 

SwAV

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.

SwAV

31 of 42

Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s

Instead of assigning hard codes, the algorithm assigns soft codes over the prototypes.

The key idea is that an image and its augmented/cropped version should produce similar codes for the prototypes. 

The loss is computed on the swapped prediction: each view's code is predicted from the other view's features (see below). 
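The swapped-prediction loss from the cited paper: features z_t, z_s of two views are compared through their codes q_t, q_s over the prototypes c_k, with temperature τ:

    L(z_t, z_s) = \ell(z_t, q_s) + \ell(z_s, q_t), \qquad
    \ell(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}, \qquad
    p_t^{(k)} = \frac{\exp(z_t^{\top} c_k / \tau)}{\sum_{k'} \exp(z_t^{\top} c_{k'} / \tau)}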

SwAV

32 of 42

MULTI CROP

Increasing the number of crops/augmented views of the same image drastically increases memory requirements. 

Multi-crop uses 2 standard-resolution crops along with "V" (a tuned parameter) additional low-resolution crops. 

Low  Resolution crops 

Standard crop 

The loss equation now generalizes to include all the "V" crops (shown below)
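With multi-crop, codes are computed only from the two standard-resolution views, and the loss generalizes (following the cited paper) to:

    L(z_{t_1}, \dots, z_{t_{V+2}}) = \sum_{i \in \{1, 2\}} \; \sum_{v \neq i} \ell\big(z_{t_v}, q_{t_i}\big)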

SwAV

33 of 42

RESULTS

  1. Comparison of SwAV against supervised pretraining of ResNet-50 on the ImageNet dataset

  2. Comparison of SwAV against other self-supervised algorithms with ResNet-50 on the ImageNet dataset 

SwAV

34 of 42

Questions?

35 of 42

Bootstrap Your Own Latent

A New Approach to Self-Supervised Learning

Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271-21284.

36 of 42

  • Idea: use two neural networks, an online network and a target network
  • Bootstrap the representation: the online network learns to predict the target network's representation of another augmented view of the same image
  • Unlike contrastive approaches, collapse is prevented without negative pairs
  • Avoiding negative pairs saves computational cost

BYOL

What is bootstrapping?

A process that starts from a trivial solution and sustains/improves itself without external support

37 of 42

BYOL

Architecture components: encoder → projector → predictor (a minimal sketch follows)
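A minimal PyTorch-style sketch of the online branch (encoder → projector → predictor) and the EMA target branch; module names are illustrative, not the paper's implementation.

    import copy
    import torch

    class BYOL(torch.nn.Module):
        def __init__(self, encoder, projector, predictor, tau=0.996):
            super().__init__()
            self.online = torch.nn.Sequential(encoder, projector)   # f_theta, g_theta
            self.predictor = predictor                               # q_theta
            self.target = copy.deepcopy(self.online)                 # f_xi, g_xi (no predictor)
            for p in self.target.parameters():
                p.requires_grad = False
            self.tau = tau

        @torch.no_grad()
        def update_target(self):
            # xi <- tau * xi + (1 - tau) * theta  (exponential moving average)
            for po, pt in zip(self.online.parameters(), self.target.parameters()):
                pt.data = self.tau * pt.data + (1.0 - self.tau) * po.data

        def forward(self, v1, v2):
            p1 = self.predictor(self.online(v1))   # q_theta(z_theta)
            with torch.no_grad():
                z2 = self.target(v2)               # sg(z'_xi): stop-gradient target
            return p1, z2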

38 of 42

BYOL

Src: BYOL — Bootstrap Your Own Latent. Self-Supervised Approach To Learning | by Mayur Jain | Artificial Intelligence in Plain English

39 of 42

BYOL

Goal: minimize a similarity loss between qθ(zθ) and sg(z′ξ), where sg(·) is the stop-gradient operator

Optimization step (updates the online parameters θ)

Target weight updates (exponential moving average of θ); the corresponding equations are sketched below
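A sketch of the corresponding equations, following the cited paper (q̄ and z̄′ denote l2-normalized vectors, η the learning rate, τ the target decay rate, and the tilde loss symmetrizes by swapping the two views):

    \mathcal{L}_{\theta,\xi} = \big\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \big\|_2^2
      = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \, \|z'_\xi\|_2}

    \theta \leftarrow \text{optimizer}\big(\theta,\, \nabla_\theta (\mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}),\, \eta\big), \qquad
    \xi \leftarrow \tau \xi + (1 - \tau)\theta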

40 of 42

BYOL

Results for linear evaluation

Results for semi-supervised training

41 of 42

Questions?

42 of 42

Thank You