SELF-SUPERVISED LEARNING
Divyanshu Tak
Parth Kharwar
Submitted to : Prof. Chao
Outline
Supervised Learning -> Train a model to learn from "labeled" data. The labels provide the supervision the model needs to differentiate right from wrong.
Unsupervised Learning -> The learning algorithm aims to discover / learn the underlying distribution of the data. Example – Clustering
Src : https://analystprep.com/study-notes/cfa-level-2/quantitative-method/supervised-machine-learning-unsupervised-machine-learning-deep-learning/
Motivation
Supervised Learning in Computer Vision has applications such as image classification, object detection, etc.
All of these applications share the same underlying idea: train the model on labeled/annotated data. This way we can steer the model's output in the direction we want.
Src: https://www.analyticsvidhya.com/blog/2021/05/introduction-to-supervised-deep-learning-algorithms/
Motivation
Then why do we need self-supervised learning?
We have limited labeled/annotated data; in other words, there is far more unlabeled data than labeled data.
Annotation is hard...
ImageNet has 14M images. Annotating them took roughly 22 human-years.
This is where self-supervised learning comes in. How do we use the unlabeled data to make our models learn features that can be used for different transfer tasks?
Supervised learning example -> Take a pretrained ResNet model (trained on ImageNet) and modify it to classify between different breeds of dogs. But how can we do the same with unlabeled data?
Motivation
CASE 1:
CASE 2:
Similarity
Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/
So essentially, the aim is to learn features from the unlabeled data that can then be leveraged for downstream tasks.
Siamese Network
Src: https://pyimagesearch.com/2020/11/30/siamese-networks-with-keras-tensorflow-and-deep-learning/
Since the network weights are shared, it is fairly easy for the network to eventually converge to trivial solutions.
Trivial solutions occur when the model outputs the same features for different images (representation collapse).
Much of the research in this field is essentially about devising different ways to avoid trivial solutions in these kinds of networks.
Trivial Solution
[1] Chen, Xinlei, and Kaiming He. "Exploring simple siamese representation learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
By using the stop-grad operation and creating an asymmetric network, the algorithm prevents trivial solutions (a minimal sketch follows below).
SIMSIAM
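A minimal PyTorch sketch of this idea, assuming a ResNet-50 backbone and illustrative layer sizes rather than the paper's exact architecture: two augmented views share one encoder, a small predictor head makes the branches asymmetric, and the target branch is detached (stop-gradient).

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SimSiamSketch(nn.Module):
    def __init__(self, feat_dim=2048, pred_dim=512):
        super().__init__()
        # Shared backbone: the same weights process both augmented views.
        self.encoder = torchvision.models.resnet50(num_classes=feat_dim)
        # Asymmetric predictor head, applied on one branch at a time.
        self.predictor = nn.Sequential(
            nn.Linear(feat_dim, pred_dim), nn.ReLU(inplace=True),
            nn.Linear(pred_dim, feat_dim))

    def forward(self, x1, x2):
        z1, z2 = self.encoder(x1), self.encoder(x2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Stop-gradient (detach) on the target branch is what prevents the
        # trivial solution where every image maps to the same feature.
        loss = -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                 + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2
        return loss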
Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR
Overview of the approach
SimCLRv2
Src: Google AI blog, Advancing Self-Supervised and Semi-Supervised Learning with SimCLR
SimCLRv2
Src: Big Self-Supervised Models are Strong Semi-Supervised Learners, Ting Chen, et al.
SimCLRv2
Src: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.
Momentum Contrast
Src: Medium: Understanding Contrastive Learning and MoCo, Shuchen Du
Src: YouTube: Momentum Contrast for Unsupervised Visual Representation Learning, Kaiming He, et al.
Momentum Contrast
DISCRIMINATIVE METHODS
GENERATIVE METHODS
Discriminative vs Generative Models
REPRESENTATIVE FEATURES
LINEAR CLASSIFICATION
FINE TUNING
Evaluation
Questions?
Masked Autoencoders Are Scalable Vision Learners
Zheng, C., Cham, T. J., & Cai, J. (2021). TFill: Image completion via a transformer-based architecture. arXiv preprint arXiv:2104.00845.
One approach to train a model and learn features is to provide positive and negative pairs. These pairs are often transformed at the input to make the learning robust; the transformations include random cropping, rotation, blurring, etc.
Now, what if instead of these transformations we subsample the image?
One way is to cut random patches out of the input image before feeding it to the network (see the sketch below).
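A rough sketch of that patch-subsampling step, assuming 16x16 patches and a 75% mask ratio purely for illustration:

import torch

def random_patch_subsample(img, patch=16, mask_ratio=0.75):
    # img: (C, H, W); split into non-overlapping patch tokens.
    C, H, W = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)
    # Randomly shuffle the patch indices and keep only a fraction of them.
    n = patches.shape[0]
    keep = int(n * (1 - mask_ratio))
    idx = torch.randperm(n)
    visible_idx = idx[:keep]
    return patches[visible_idx], visible_idx  # visible patches + their positions

x = torch.randn(3, 224, 224)
visible, pos = random_patch_subsample(x)
print(visible.shape)  # torch.Size([49, 768]): 49 of 196 patches kept at 75% masking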
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
Masked Autoencoders Are Scalable Vision Learners
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
Masked input patches provide a strong self-supervisory signal for the model.
The encoder is decoupled from the (lightweight) decoder.
The encoder operates only on the non-masked patches, while the decoder operates on all patches together with positional embeddings (see the sketch below).
Masked Autoencoders Are Scalable Vision Learners
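A hedged sketch of this asymmetric design, with made-up dimensions and depths (not the paper's ViT configuration): the encoder runs only on the visible patch tokens, and a small decoder reconstructs the full sequence from the encoded tokens, learned mask tokens, and positional embeddings.

import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, n_patches=196, patch_dim=768, enc_dim=256, dec_dim=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, enc_dim)
        self.pos = nn.Parameter(torch.zeros(n_patches, enc_dim))
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Lightweight decoder, decoupled from the encoder.
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(n_patches, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.to_pixels = nn.Linear(dec_dim, patch_dim)

    def forward(self, visible_patches, visible_idx, n_patches=196):
        # Encoder sees only the visible patches (cheap at high mask ratios).
        tokens = self.embed(visible_patches) + self.pos[visible_idx]
        encoded = self.encoder(tokens.unsqueeze(0))[0]
        # Decoder sees the full sequence: encoded tokens + mask tokens + positions.
        full = self.mask_token.expand(n_patches, -1).clone()
        full[visible_idx] = self.enc_to_dec(encoded)
        full = full + self.dec_pos
        decoded = self.decoder(full.unsqueeze(0))[0]
        return self.to_pixels(decoded)  # reconstructed pixels for every patch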
MASKING RATIO
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
A surprisingly high masking ratio of 75% gives the best results.
The model has to infer the missing content, rather than simply extrapolating from nearby patches.
Masked Autoencoders Are Scalable Vision Learners
DECODER AND RECONSTRUCTION TARGET
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
MSE loss, computed only on the masked patches (sketched below)
Masked Autoencoders Are Scalable Vision Learners
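A short sketch of how such a reconstruction loss can be computed; following the paper, the MSE is taken only over the masked patches. Tensor names here are placeholders.

import torch
import torch.nn.functional as F

def mae_reconstruction_loss(pred, target, visible_idx):
    # pred, target: (n_patches, patch_dim); visible_idx: indices the encoder saw.
    n_patches = target.shape[0]
    masked = torch.ones(n_patches, dtype=torch.bool)
    masked[visible_idx] = False
    # The loss is computed only on the masked (hidden) patches.
    return F.mse_loss(pred[masked], target[masked])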
LINEAR PROBING AND FINE TUNING
Linear probing essentially fits a linear classifier on the output features of the frozen encoder.
Fine-tuning trains the whole model (non-linear adaptation) for classification or other tasks (sketched below).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
Experiments done with ViT-L/16 as the encoder on ImageNet-1K data.
Masked Autoencoders Are Scalable Vision Learners
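A minimal sketch contrasting the two protocols, using a ResNet-50 stand-in rather than the ViT-L/16 used in the experiments: linear probing freezes the encoder and trains only a linear head, while fine-tuning leaves all weights trainable.

import torch.nn as nn
import torchvision

def build_linear_probe(encoder, feat_dim=2048, n_classes=1000):
    # Linear probing: freeze every encoder weight, train only a linear head.
    for p in encoder.parameters():
        p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(feat_dim, n_classes))

def build_finetune(encoder, feat_dim=2048, n_classes=1000):
    # Fine-tuning: all encoder weights stay trainable (non-linear adaptation).
    return nn.Sequential(encoder, nn.Linear(feat_dim, n_classes))

# Stand-in encoder: a ResNet-50 trunk with its classifier removed.
backbone = torchvision.models.resnet50()
backbone.fc = nn.Identity()
probe = build_linear_probe(backbone, feat_dim=2048)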
MASKING STRATEGY AND PARTIAL FINE TUNING
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
Masking is done randomly over the image patches (a grid-based approach).
Sampling from a uniform distribution prevents a center bias.
Partial fine-tuning is a middle ground between linear probing and full fine-tuning:
fine-tune only the last layers and freeze the others (see the sketch below).
Experiments done with ViT-L/16 as the encoder on ImageNet-1K data.
Masked Autoencoders Are Scalable Vision Learners
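A sketch of partial fine-tuning for a generic PyTorch model; the attribute names blocks and head are assumptions (timm-style ViT naming), not the paper's code.

def partial_finetune(model, n_trainable_blocks=2):
    # Freeze all parameters first (as in linear probing).
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only the last few transformer blocks and the classifier head.
    # `model.blocks` / `model.head` are assumed attribute names, not guaranteed.
    for block in model.blocks[-n_trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True
    return model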
RESULTS
Results on COCO Validation Set
Comparisons between MAE, other self-supervised, and supervised pre-training, with fine-tuning on the ImageNet-1K dataset
Masked Autoencoders Are Scalable Vision Learners
Questions?
Contrastive Learning
Clustering
Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s
Clustering Approach
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
(SwAV)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.
Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s
The main idea is to generate the same features for different augmentations of an image.
Cluster an image and its augmented version into the same group/prototype/cluster center.
SwAV
Contrastive instance learning
SwAV
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33, 9912-9924.
SwAV
Src: https://www.youtube.com/watch?v=8L10w1KoOU8&t=1593s
Instead of assigning hard codes, the algorithm assigns soft codes to the prototypes.
The key idea is that an image and its augmented/cropped version should produce similar codes for the prototypes.
The loss is computed on the swapped prediction (see the sketch below).
SwAV
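A simplified sketch of the swapped prediction: features of each view are scored against learnable prototypes, soft codes are computed for one view with stop-gradient, and the other view is trained to predict them. Note that the actual method computes the codes with a Sinkhorn-Knopp equal-partition step; a plain softmax is used below only to keep the sketch short.

import torch
import torch.nn.functional as F

def swav_swapped_loss(z1, z2, prototypes, temp=0.1):
    # z1, z2: L2-normalized features of two views, shape (B, D).
    # prototypes: (K, D) learnable prototype / cluster-center vectors.
    p = F.normalize(prototypes, dim=1)
    scores1 = z1 @ p.t() / temp            # (B, K) similarities to prototypes
    scores2 = z2 @ p.t() / temp
    # Soft codes (targets). The real method uses Sinkhorn-Knopp here;
    # a softmax with stop-gradient is a simplified stand-in.
    q1 = F.softmax(scores1, dim=1).detach()
    q2 = F.softmax(scores2, dim=1).detach()
    # Swapped prediction: view 1 predicts view 2's code and vice versa.
    loss = -(q2 * F.log_softmax(scores1, dim=1)).sum(dim=1).mean() \
           - (q1 * F.log_softmax(scores2, dim=1)).sum(dim=1).mean()
    return loss / 2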
MULTI CROP
Increasing the number of crops/augmented views of the same image drastically increases memory requirements.
Multi-crop uses 2 standard-resolution crops along with V (a tuned parameter) low-resolution crops.
Low-resolution crops
Standard crops
The loss equation now generalizes to include all V crops (written out below).
SwAV
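Written out (my transcription of the paper's multi-crop objective; notation may differ slightly), the codes q are computed only from the two standard-resolution views, and every other view is trained to predict them:

\[
L\bigl(z_{t_1}, \dots, z_{t_{V+2}}\bigr)
= \sum_{i \in \{1,2\}} \; \sum_{\substack{v = 1 \\ v \neq i}}^{V+2} \ell\bigl(z_{t_v},\, q_{t_i}\bigr)
\]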
RESULTS
Comparison of SwAV against other self-supervised algorithms with ResNet-50 on the ImageNet dataset
SwAV
Questions?
Bootstrap Your Own Latent
A New Approach to Self-Supervised Learning
Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., ... & Valko, M. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33, 21271-21284.
BYOL
What is bootstrapping?
Starting from a trivial initial solution and improving itself using its own outputs, without external support
BYOL
encoder
projector
predictor
BYOL
Src: BYOL — Bootstrap Your Own Latent. Self-Supervised Approach To Learning | by Mayur Jain | Artificial Intelligence in Plain English
BYOL
Goal: minimize a similarity loss between q_θ(z_θ) and sg(z′_ξ), where sg(·) denotes stop-gradient.
Optimization step: update the online parameters θ by gradient descent on the loss; the target network receives no gradients.
Target weight updates: ξ ← τ·ξ + (1 − τ)·θ, an exponential moving average of the online weights (see the sketch below).
BYOL
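A compact sketch of this setup, with placeholder architectures and layer sizes: the online branch (encoder + projector + predictor) is trained by gradient descent, the target branch is a detached copy, and its weights follow the online weights by an exponential moving average with decay tau.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

def mlp(in_dim, hidden=4096, out_dim=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
                         nn.ReLU(inplace=True), nn.Linear(hidden, out_dim))

# Online network: encoder + projector + predictor.
online_encoder = torchvision.models.resnet50(num_classes=2048)
online_projector = mlp(2048)
predictor = mlp(256, out_dim=256)
# Target network: a slowly moving copy of the online encoder + projector.
target_encoder = copy.deepcopy(online_encoder)
target_projector = copy.deepcopy(online_projector)

def byol_loss(v1, v2):
    p = predictor(online_projector(online_encoder(v1)))
    with torch.no_grad():                      # stop-gradient on the target branch
        z_target = target_projector(target_encoder(v2))
    p, z_target = F.normalize(p, dim=1), F.normalize(z_target, dim=1)
    # In practice the loss is symmetrized by also swapping the two views.
    return 2 - 2 * (p * z_target).sum(dim=1).mean()

@torch.no_grad()
def update_target(tau=0.99):
    # Target weights: exponential moving average of the online weights.
    online = list(online_encoder.parameters()) + list(online_projector.parameters())
    target = list(target_encoder.parameters()) + list(target_projector.parameters())
    for q, k in zip(online, target):
        k.mul_(tau).add_((1 - tau) * q)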
Results for linear evaluation
Results for semi-supervised training
Questions?
Thank You