Self-Supervised Learning for Images
Sijie Ding, Tian Yun, Vadim Kudlay
Recall Contrastive Learning
Recall Self-Supervised
Why not do it on images?
Separate unlabeled data into clusters (which can then serve as labels)
Use unlabeled data to supervise model training.
Self-Supervised Vision Tasks Aren’t That New
Dimensionality Reduction by Learning an Invariant Mapping (LeCun 2006)
Leverage large amounts of unlabeled data
Data For Self-Supervised Models Is Everywhere
Build more robust representations than those trainable on labeled data
Claims Good Generalizability When Done Well
“In particular, one of our key empirical findings is that self-supervised learning on random internet data leads to models that are more fair, less biased and less harmful.
Second, we observe that our model is also able to leverage the diversity of concepts in the dataset to train more robust features, leading to better out-of-distribution generalization.”
Let’s Exploit This To (Try To) Make Better Models
Transferable Feature Extractors
We want filters that work as good starting points for a variety of tasks.
[Diagram: one shared feature extractor feeding multiple task heads: classification (Dataset A), segmentation (Dataset A), segmentation (Dataset B).]
Deep Clustering for Unsupervised Learning of Visual Features (Caron et al., 2018)
Cluster the learned features (e.g. with k-means) and use the cluster assignments as pseudo-labels to drive optimization.
Alleviates optimizing for dataset-specific labels.
Encourages diversity in extracted feature space.
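To make the pseudo-label idea concrete, here is a minimal sketch of one DeepCluster-style training round, assuming a PyTorch `encoder`, a `classifier` head, and scikit-learn's KMeans; the names, cluster count, and optimizer settings are illustrative rather than the paper's exact recipe.

```python
# One simplified DeepCluster-style round:
# 1) embed the unlabeled images, 2) k-means the embeddings,
# 3) train encoder + classifier on the cluster assignments as pseudo-labels.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deepcluster_round(encoder, classifier, images, num_clusters=100, steps=1):
    # Step 1: extract features for the unlabeled batch.
    encoder.eval()
    with torch.no_grad():
        feats = encoder(images)                          # (N, D)

    # Step 2: cluster the features; assignments become pseudo-labels.
    kmeans = KMeans(n_clusters=num_clusters, n_init=10)
    pseudo_labels = torch.as_tensor(
        kmeans.fit_predict(feats.cpu().numpy()), dtype=torch.long
    ).to(images.device)

    # Step 3: supervise the network with the pseudo-labels.
    encoder.train()
    params = list(encoder.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=0.01, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = classifier(encoder(images))             # (N, num_clusters)
        loss = ce(logits, pseudo_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()
```

In the full method, clustering is re-run over the whole dataset each epoch (and the classification head re-learned, since cluster ids are arbitrary across reassignments); this sketch compresses that into a single round.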
Cross-View Prediction
Multi-view Action Recognition using Cross-view Video Prediction (Vyas et al., 2020)
We want images that are of the same thing to be predicted similarly.
An augmentation gives 𝒇(x) ~ x, so we want to enforce that the model treats x and 𝒇(x) similarly. Naive attempt?
Q1: How might this be handled in a supervised learning formulation?
Q2: Are there any problems with the naive self-supervised attempt above?
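To make the naive attempt concrete (and as a hint toward Q2), here is a minimal sketch that simply pulls together the encoder's outputs for x and an augmented view; `encoder` and `augment` are generic stand-ins (with `augment` playing the role of 𝒇), not components from the paper above.

```python
# Naive cross-view consistency: make the encoder produce the same representation
# for an image and its augmentation. Caveat: with no negatives and no other
# machinery, minimizing this alone admits a degenerate "collapsed" solution where
# the encoder maps every input to the same constant vector.
import torch
import torch.nn.functional as F

def naive_cross_view_loss(encoder, augment, images):
    z_orig = encoder(images)            # representation of x
    z_aug = encoder(augment(images))    # representation of 𝒇(x) ~ x
    # Pull the two representations together (1 - cosine similarity).
    return 1.0 - F.cosine_similarity(z_orig, z_aug, dim=-1).mean()
```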
Bootstrap Your Own Latent (2020)
Apply augmentations
Apply the encoder (feature extractor) to get a feature representation
Apply projection
Apply predictor
Apply different augmentations
The target encoder/projection are slow-moving averages of the online f, g
[Image: pineapple from the AgrilPlant dataset; Data Augmentation for Plant Classification (Pawara et al., 2017).]
[Diagram: current network version vs. “lagged” average version.]
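A minimal sketch of the update pattern just described, assuming generic PyTorch modules for the online encoder/projector/predictor and their target ("lagged") copies; the EMA decay of 0.99 is illustrative, and the loss is shown for one direction only (the paper symmetrizes it over the two views).

```python
# BYOL-style update (sketch): the online branch (encoder -> projector -> predictor)
# predicts the target branch's projection of a differently augmented view; the
# target encoder/projector are slow exponential moving averages (EMA) of the
# online ones and receive no gradients.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.99):
    # target <- tau * target + (1 - tau) * online  (the "lagged" average version)
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1.0 - tau)

def byol_loss(online_encoder, online_projector, predictor,
              target_encoder, target_projector, augment, images):
    v1, v2 = augment(images), augment(images)              # two different augmentations
    p = predictor(online_projector(online_encoder(v1)))    # online prediction
    with torch.no_grad():                                   # target branch: no gradients
        z = target_projector(target_encoder(v2))
    p, z = F.normalize(p, dim=-1), F.normalize(z, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()             # 2 - 2 * cosine similarity
```

The target modules would be initialized as copies of the online ones, and `ema_update` would be called after every optimizer step.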
Deep Clustering leverages latent-space clustering techniques.
Bootstrap Your Own Latent leverages a specially designed model organization.
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Furu Wei
Supervised pre-training -> requires annotations
How to incorporate MLM into vision?
What are the challenges?
Masked Image Modeling (MIM)!
Why block-wise masking?
Can you think of one of our previous readings?
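For intuition, here is a rough sketch of block-wise masking over a grid of image patches, in the spirit of BEiT's masking; the grid size, block-size range, and 40% masking budget are illustrative, not the paper's exact algorithm.

```python
# Block-wise masking sketch: instead of masking patches independently at random,
# repeatedly mask rectangular blocks of neighboring patches until roughly a target
# fraction of patches is masked. Contiguous blocks make the prediction task harder
# than scattered single-patch masking.
import random

def blockwise_mask(grid_h=14, grid_w=14, mask_ratio=0.4, min_block=2, max_block=5):
    masked = [[False] * grid_w for _ in range(grid_h)]
    target = int(grid_h * grid_w * mask_ratio)
    count = 0
    while count < target:
        bh = random.randint(min_block, max_block)   # block height in patches
        bw = random.randint(min_block, max_block)   # block width in patches
        top = random.randint(0, grid_h - bh)
        left = random.randint(0, grid_w - bw)
        for i in range(top, top + bh):
            for j in range(left, left + bw):
                if not masked[i][j]:
                    masked[i][j] = True
                    count += 1
    return masked  # grid of booleans: True = patch is masked

mask = blockwise_mask()
print(sum(sum(row) for row in mask), "of", 14 * 14, "patches masked")
```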
BEiT & VAE
[Diagram: original image 𝒙, masked image x̃, and visual tokens 𝐳; the tokenizer maps 𝒙 to 𝐳, MIM recovers the tokens of the masked patches from x̃, and the decoder reconstructs 𝒙 from 𝐳.]
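A condensed sketch of the MIM objective in the diagram above, assuming a frozen `tokenizer` that maps an image to a grid of discrete visual-token ids and a ViT-style `mim_model` that sees the masked image and outputs per-patch logits over the token vocabulary; both names are placeholders.

```python
# BEiT-style MIM objective (sketch): the tokenizer turns the original image x into
# discrete visual tokens z; the model sees the masked image x~ and is trained with
# cross-entropy to predict the tokens of the masked patches only.
import torch
import torch.nn.functional as F

def mim_loss(tokenizer, mim_model, images, mask):
    # mask: (N, num_patches) boolean tensor, True where a patch is masked out.
    with torch.no_grad():
        target_tokens = tokenizer(images)      # (N, num_patches) long token ids, i.e. z
    logits = mim_model(images, mask)           # (N, num_patches, vocab_size), from x~
    # Only the masked positions contribute to the loss.
    return F.cross_entropy(logits[mask], target_tokens[mask])
```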
Experiments
Task Head
Image Classification
Semantic Segmentation
Ablation Studies
Qualitative Analysis
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Masked Autoencoder
Can we do the same with auto-regressive language models?
Why would such a masking strategy work for images? Can you think of other areas that could use such a strategy?
Better results and a speedup!
Pixel vs. Token
Masking Strategy
Why does such a masking approach work for MAE but not for BEiT?
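To contrast with BEiT's block-wise scheme, here is a sketch of MAE-style uniform random masking at a high ratio, where the encoder only processes the small visible subset of patches; the 75% default matches the paper's description, while the function itself is a simplified stand-in.

```python
# MAE-style random masking (sketch): randomly shuffle the patch sequence, keep only
# a small visible subset (e.g. 25%) for the encoder, and record which patches were
# dropped so a lightweight decoder can later reconstruct them in pixel space.
import torch

def random_masking(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    # patch_tokens: (N, L, D) embedded patches.
    n, length, dim = patch_tokens.shape
    len_keep = int(length * (1 - mask_ratio))

    noise = torch.rand(n, length, device=patch_tokens.device)   # per-patch noise
    ids_shuffle = torch.argsort(noise, dim=1)                    # random permutation
    ids_keep = ids_shuffle[:, :len_keep]                         # visible patch indices

    visible = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim)
    )                                                            # (N, len_keep, D)

    mask = torch.ones(n, length, device=patch_tokens.device)     # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_shuffle      # the encoder sees only `visible`
```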
BEiT Ablation
Representation Learning with Contrastive Predictive Coding
Introduction
Motivation and Intuitions
Challenges
Images may contain thousands of bits of information while the high-level latent variables such as the class label contain much less information (10 bits for 1,024 categories). This suggests that modeling p(x|c) directly may not be optimal for the purpose of extracting shared information between x and c.
When predicting future information, the paper instead encodes the target x (future) and the context c (present) into compact distributed vector representations (via non-linear learned mappings) in a way that maximally preserves the mutual information of the original signals x and c, defined as
I(x; c) = Σ_{x,c} p(x, c) log [ p(x | c) / p(x) ]
Contrastive Predictive Coding
As argued in the previous section, the paper does not predict future observations x_{t+k} directly with a generative model p_k(x_{t+k}|c_t). Instead, it models a density ratio which preserves the mutual information between x_{t+k} and c_t:
f_k(x_{t+k}, c_t) ∝ p(x_{t+k} | c_t) / p(x_{t+k})
where ∝ stands for ’proportional to’ (i.e. up to a multiplicative constant).
The paper uses a simple log-bilinear model:
f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t)
InfoNCE Loss and Mutual Information Estimation
Both the encoder and the autoregressive model are trained to jointly optimize a loss based on NCE, which the paper calls InfoNCE.
Given a set X = {x_1, . . . , x_N} of N random samples containing one positive sample from p(x_{t+k}|c_t) and N − 1 negative samples from the ’proposal’ distribution p(x_{t+k}), we optimize:
L_N = −E_X [ log ( f_k(x_{t+k}, c_t) / Σ_{x_j ∈ X} f_k(x_j, c_t) ) ]
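A compact sketch of the InfoNCE objective above, using the log-bilinear score f_k and the common simplification of treating the other futures in the batch as the N − 1 negatives; the layer names and dimensions are illustrative.

```python
# InfoNCE sketch: score each (future embedding z_{t+k}, context c_t) pair with a
# log-bilinear model and treat the other futures in the batch as negatives. The
# loss is the cross-entropy of picking the positive pair out of the batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    def __init__(self, z_dim: int, c_dim: int):
        super().__init__()
        self.W_k = nn.Linear(c_dim, z_dim, bias=False)       # the W_k of f_k

    def forward(self, z_future: torch.Tensor, c_context: torch.Tensor):
        # z_future: (N, z_dim) encodings of x_{t+k}; c_context: (N, c_dim) contexts c_t.
        pred = self.W_k(c_context)                            # (N, z_dim)
        logits = pred @ z_future.t()                          # (N, N) scores: log f_k up to exp
        labels = torch.arange(z_future.size(0), device=logits.device)  # positives on the diagonal
        return F.cross_entropy(logits, labels)

# Usage (hypothetical tensors): loss = InfoNCE(z_dim=256, c_dim=256)(z_tk, c_t)
```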
Experiments
Experiments-Vision
Experiments-Audio
Experiments-Natural Language
Experiments-Reinforcement Learning
Conclusion