A Survey on Masked Autoencoder
for Self-supervised Learning in Vision and Beyond
Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, In So Kweon
International Joint Conferences on Artificial Intelligence (IJCAI 2023)
Meeting Materials
[Figure: diagram of self-supervised learning. SSL pretraining is contrasted with supervised pretraining; discriminative SSL covers contrastive learning (CL), prediction pretext tasks, and metric learning, while generative SSL covers GAN-based and prediction-based approaches. "Ours" marks the masked-autoencoder line surveyed here.]
[Figure: timeline of SSL from 2008 to 2022, split into discriminative SSL (geometry-based prediction, joint-embedding methods) and generative SSL (denoising autoencoders, masked modeling, masked autoencoders) across language and vision.]

Generative SSL:
- Denoising Autoencoder (ICML 2008); Stacked Denoising Autoencoder (JMLR 2010)
- Inpainting (CVPR 2016); Automatic Colorization (ECCV 2016); Colorization (ECCV 2016); Colorization as proxy task (CVPR 2017); Split-Brain Autoencoders (CVPR 2017)
- Masked modeling, language: GPT-1 (2018); BERT [23] (NAACL 2019); GPT-2 (2019); GPT-3 (NeurIPS 2020)
- Masked modeling, vision: iGPT (ICML 2020); ViT ("iBERT" masked-patch experiment, ICLR 2021); BEiT (ICLR 2022); MAE (CVPR 2022)

Discriminative SSL:
- Jigsaw puzzles (ECCV 2016); Rotation prediction (ICLR 2018); MoCo (CVPR 2020); SimSiam (CVPR 2021)
[Flow chart: lineage of SSL methods, 2019-2022. Recoverable groupings:]
- InfoNCE lineage: CPC, CMC, MoCo, MoCo v2, MoCo v3, SimCLR, SimCLR v2, PIRL, InfoMin
- Clustering lineage: Deep Cluster, SwAV, SCCL, SEER, SEER v2, DINO (with ViT: DINO, MoCo v3; with larger data: SEER, RePS, DnC)
- w/o negatives: BYOL, SimSiam; channel-wise decorrelation: W-MSE, Barlow Twins, VICReg, Decorrelation
- New augmentation: InfoMin, Feature Transform, Viewmaker, IFM, MoCHi
- New invariance loss / distinct invariant attribute: OBoW, TWIST, Self-classifier, SSL-HSIC; for local representation: PixPro, CAST, SCRL; for any visual task: ReLIC
- NN positives: NNCLR, MYOW; graphic N positives: SpectralCL
- Potential/analysis of contrastive with large experiments: Demystify contrastive, Understand CL, Properties CL, ReCM; self-sup evaluation: SelfAugment, 13 SSL
- New pretext: JigsawClust; how to avoid collapse: additional losses, regularization, or a different approach
- Siamese/hybrid with MIM: iBOT, Splitmask, simMIM, CAE, BEiT, MAE, data2vec, MSN, MaskCo; TIM: AASAE
[Figure: two-stage masked-modeling pipeline with a tokenizer and a decoder. Stage 1 trains a tokenizer that maps image patches to discrete token IDs; Stage 2 trains the model to predict the token IDs of masked patches.]
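The two-stage pipelines above, and the MIM entries below, all begin by choosing which patches to hide. A minimal numpy sketch of random patch masking; `random_masking` and its parameters are illustrative, not taken from any specific paper:

```python
import numpy as np

def random_masking(num_patches: int, mask_ratio: float, seed: int = 0):
    """Uniformly sample which patch indices to mask (MAE/BEiT-style)."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    masked = np.sort(perm[:num_masked])   # indices the decoder must predict
    visible = np.sort(perm[num_masked:])  # indices the encoder sees
    return visible, masked

# e.g., a 14x14 ViT patch grid with a 75% mask ratio
visible, masked = random_masking(num_patches=196, mask_ratio=0.75)
```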
7
Paper | InfoMin: What Makes for Good Views for Contrastive Learning? | Confe, Cite | NIPS20, MIT & Google, 344 |
methodological points | - The views should share only the label information; augmented views that share other information are not optimal for contrastive learning. - They apply a learnable, invertible view generator with an unsupervised adversarial objective, which minimizes mutual information (MI) while keeping class/label information. - The generator synthesizes two invertible views from the same image x while minimizing the mutual information I between them, except for task-relevant information. | ||
develop/ different points | - Study the influence (importance) of different view choices. | ||
Category | New augmentation |
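InfoMin reasons about which views are good for the InfoNCE objective that most entries in these notes build on. As a reference point, a minimal numpy sketch of a generic InfoNCE loss (not InfoMin's exact implementation):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Minimal InfoNCE: row i of z1 is positive with row i of z2;
    all other rows of z2 act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(positive)
```

With aligned views the diagonal dominates and the loss is near zero; misaligned pairings drive it up.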
8
Paper | Feature Transform: Improving Contrastive Learning by Visualizing Feature Transformation | Confe, Cite | 7, ICCV21 |
methodological points | - Data augmentation -> feature transformation. - Observation: harder positives and negatives boost learning (positive but low-similarity pairs). - Proposal: feature transformation for harder positives and interpolation among negatives for harder negatives. As shown in Figure 6, the former decreases the similarity of a positive pair, while the latter synthesizes negatives closer to the anchor. | ||
develop/ different points | - Differing from data augmentation, they attempt feature-level manipulation. | ||
Category | New augmentation |
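The feature-level manipulation can be pictured with two hypothetical helpers: extrapolating a positive away from its anchor (lower similarity, i.e., harder) and interpolating two negatives. This is a sketch of the idea, not the paper's exact transformation:

```python
import numpy as np

def harder_positive(z_anchor, z_pos, alpha=0.5):
    """Extrapolate the positive away from the anchor (lower similarity)."""
    z = z_pos + alpha * (z_pos - z_anchor)
    return z / np.linalg.norm(z)

def harder_negative(z_neg_a, z_neg_b, lam=0.5):
    """Interpolate two negatives to synthesize a new, harder one."""
    z = lam * z_neg_a + (1 - lam) * z_neg_b
    return z / np.linalg.norm(z)
```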
9
Paper | Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases | Confe, Cite | 78, CMU&FAIR |
methodological points | - (Finding 1) MoCo and PIRL succeed at occlusion invariance but fail at viewpoint and category-instance invariance, which are crucial for object detection, i.e., the representation is sub-optimal. - (Finding 2) Despite poor viewpoint invariance, ImageNet ensures correspondence between different crops for a self-trained model. - This work proposes using video datasets as a more natural form of the instance discrimination task, which leads to higher viewpoint invariance. | ||
develop/ different points | - Previous methods learn only occlusion invariance (via cropping). | ||
Category | Potential / Analysis of contrastive learning |
10
Paper | W-MSE: Whitening for Self-Supervised Representation Learning | Confe, Cite | 37, ICML21 |
methodological points | - The Whitening-MSE method whitens (i.e., spreads out) the embedding vectors on the unit sphere and attracts each positive pair with an MSE loss. - The whitening operation has a "scattering" effect on the batch samples; see Fig. 1. | ||
develop/ different points | - Shares the decorrelation goal of VICReg. | ||
Category | New invariance loss |
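A minimal numpy sketch of the whitening-then-MSE idea, assuming ZCA whitening over the batch; details such as the paper's sub-batch grouping are omitted:

```python
import numpy as np

def whiten(z, eps=1e-5):
    """ZCA-whiten a batch: zero mean, (approximately) identity covariance."""
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / (len(z) - 1)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return z @ w

def w_mse_loss(z1, z2):
    """MSE between whitened, L2-normalized positive pairs."""
    w1, w2 = whiten(z1), whiten(z2)
    w1 = w1 / np.linalg.norm(w1, axis=1, keepdims=True)
    w2 = w2 / np.linalg.norm(w2, axis=1, keepdims=True)
    return np.mean(np.sum((w1 - w2) ** 2, axis=1))
```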
11
Paper | OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning | Confe, Cite | 12, CVPR21 |
methodological points | - To learn contextual reasoning, they encourage the model to reconstruct a bag-of-visual-words (BoW) representation. - The bag of visual words is updated online, similar to the code update in SwAV. | ||
develop/ different points | - Compared to perturbation invariance, contextual reasoning might be more favorable for reconstruction (a generative pretext). - (1) OBoW exploits a BoW prediction task while SwAV uses an image-cluster prediction task. (2) A BoW encodes all the local features (all elements) of an image, whereas an image-cluster assignment encodes only one global feature (centroid). Therefore, BoW targets are a richer representation. | ||
Category | Different Invariance / New invariance loss |
12
Paper | PixPro: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning | Confe, Cite | MS, 90, CVPR21 |
methodological points | - Pixel-level contrastive learning: features from corresponding pixels of the two views are encouraged to be consistent. - The first task, the pixel contrastive task (PixContrast), was inferior to the pixel consistency task (PixPro). However, PixPro requires a PPM module, which provides the asymmetry (like a predictor or Sinkhorn-Knopp) needed to avoid collapse. | ||
develop/ different points | - Current methods are trained only on instance-level pretext tasks (cropped regions are represented at the instance level). | ||
Category | New invariance loss for local representation |
13
Paper | ReLIC: Representation Learning via Invariant Causal Mechanisms | Confe, Cite | Deepmind, 51 |
methodological points | - An invariance objective (constraint) on pre-trained classifiers is required for effective data augmentation (Figure 1(b) summarizes the whole paper). - The principle of invariant prediction in BYOL is a sufficient condition for representation learning. | ||
develop/ different points | - Shows an analysis of generalization ability and robustness, unlike other papers. - Provides an alternative explanation (to mutual information) of the similarity objective in BYOL. - This framework (objective) can be used in any vision task, unlike the vanilla contrastive loss. | ||
Category | New invariance loss for any visual task |
14
Paper | Barlow Twins: Self-Supervised Learning via Redundancy Reduction | Confe, Cite | FAIR, 277 |
methodological points | - A new objective function that naturally avoids collapse by pushing the cross-correlation matrix toward the identity matrix. - This makes the vectors of distorted samples similar while minimizing redundancy between channels (the i-th channel feature should differ from the j-th). | ||
develop/ different points | - The method does not require large batches, asymmetry, stop-gradient, and momentum update. | ||
Category | New invariance loss |
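The cross-correlation objective is compact enough to sketch directly; this is a simplified numpy version (the real method standardizes projector outputs and uses a tuned lambda):

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation matrix of the two views toward identity."""
    n, d = z1.shape
    # standardize each embedding dimension over the batch
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / n                                    # (d, d) cross-correlation
    on_diag = np.sum((np.diag(c) - 1) ** 2)              # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy-reduction term
    return on_diag + lam * off_diag
```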
15
Paper | VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | Confe, Cite | 59, FAIR, ICCV21 |
methodological points | - Two regularization terms for avoiding collapse and stabilizing training: (1) maintaining the variance of each dimension above a threshold, and (2) decorrelating each pair of dimensions. - See Figure 2 for a conceptual comparison. | ||
develop/ different points | - Unlike most approaches, VICReg does not require techniques such as large batches, negative samples, weight sharing (both shared and separate weights work in VICReg), BN or feature normalization, output quantization (SwAV), stop-gradient, a memory bank, a momentum encoder, etc. - Without an asymmetric architecture, Barlow Twins, whitening, and VICReg aim to maximize the information content of the embedding; additionally, VICReg prevents informational collapse by decorrelating. | ||
[Figure: the covariance regularization penalizes the sum of the squared off-diagonal coefficients of the covariance matrix; the variance regularization keeps each dimension's standard deviation above a threshold.]
Category | New invariance loss |
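A simplified numpy sketch of the three terms; the coefficients (25, 25, 1) are the commonly cited VICReg defaults, but treat the exact values here as placeholders:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Invariance (MSE) + variance hinge + covariance (off-diagonal) terms."""
    n, d = z1.shape
    sim = np.mean(np.sum((z1 - z2) ** 2, axis=1))        # invariance term

    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))     # keep std above gamma

    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        return (np.sum(cov ** 2) - np.sum(np.diag(cov) ** 2)) / d

    return (sim_w * sim
            + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))
```

A collapsed batch (every row identical) is penalized by the variance hinge even though its invariance term is zero.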
16
Paper | Decorrelation: On Feature Decorrelation in Self-Supervised Learning | Confe, Cite | 20, ICCV21 |
methodological points | - (Whitening = standardizing the covariance = feature decorrelation.) - Concise framework: a simple symmetric architecture (augmented views, encoder, MSE). - Collapse can be divided into complete collapse (constant output) and dimensional collapse (the projected features collapse into a low-dimensional manifold, i.e., strong correlation between axes). - They use a decorrelated batch normalization (DBN) layer: (1) algorithm: divide features into groups and apply whitening; (2) effect: alleviates the dimensional collapse issue. | ||
develop/ different points | - First to identify dimensional collapse (as distinct from complete collapse). | ||
Category | To avoid collapse / New invariance loss |
17
18
Paper | RePS: Rethinking Pre-training and Self-training | Confe, Cite | GoogleBrain, 239 |
methodological points | - The common belief, as argued by Kaiming He, is that a model pre-trained on ImageNet has limitations compared to a model trained directly on COCO. What about a self-training model? - Results: (1) strong augmentation and more labeled data do not always help pre-training, but they do help self-training; (2) self-training improves upon pre-training. - Conclusion: self-training has far greater potential than pre-training. | ||
develop/ different points | - Compares the potential of pre-training on a large dataset against contrastive self-training for other computer vision tasks. | ||
Category | Potential of self-training |
19
Paper | 13SSL: How Well Do Self-Supervised Models Transfer? | Confe, Cite | 40, CVPR21 |
methodological points | - Evaluates the transfer performance of 13 top self-supervised models on 40 downstream tasks, including many-shot and few-shot recognition, object detection, and dense prediction. - Poses several open questions and answers four of them; the Introduction lays them out clearly. | ||
develop/ different points | - Despite significant progress, a number of important open questions remain. - Large-scale evaluation is needed. | ||
Category | Comparison of SSL methods / Potential of self-training |
20
Paper | SelfAugment: Automatic Augmentation Policies for Self-Supervised Learning | Confe, Cite | 10, CVPR21 |
methodological points | - Evaluation with a self-supervised image-rotation task is highly correlated with a standard set of supervised evaluations. - Automatic and efficient augmentation selection using this correlation. - Self-supervised evaluation to filter hundreds of augmentation policies, training settings, and network architectures. | ||
develop/ different points | - Supervised evaluation requiring labeled data is not practical (i.e., measuring a method's effectiveness by evaluating it with a labeled dataset). | ||
Category | Self-sup evaluation |
21
Paper | Jigsaw Clustering for Unsupervised Visual Representation Learning | Confe, Cite | 7, CVPR21 |
methodological points | - Batch images are split, shuffled, and stitched. The target is to recover the disrupted parts to the original images, like a puzzle game. - Previous Jigsaw SSL permutes patches within a single image, but this work does so within a batch, so the network learns both intra- and inter-image information. - It consists of a clustering branch and a location branch. | ||
develop/ different points | - An incremental extension of Jigsaw puzzles. - Contrastive methods duplicate each training batch, i.e., a dual batch; in contrast, this work constructs a single batch. | ||
Category | New pretext task (not contrastive) |
22
Paper | CAE: Context Autoencoder for Self-Supervised Representation Learning | Confe, Cite | |
methodological points | - Two new modules: (1) a latent contextual regressor and (2) an alignment loss module. - (Insight) By caring about the representations of the patches, MIM (masked image modeling) methods learn the semantics of both the center regions (1000 classes) and other regions (potential classes). This differs from contrastive methods, which tend to learn semantics mainly from the center patches (see Fig. 6). | ||
develop/ different points | - (1) Decouples the encoding role (content understanding) from the decoding role (making predictions for masked patches). - (2) Masked patches are also cast into the latent space. | ||
Category | New MAE |
23
Paper | DINO: Emerging Properties in Self-Supervised Vision Transformers (DINO: self-DIstillation with NO labels.) | Confe, Cite | FAIR, 276 |
methodological points | - The attention maps extracted from an SSL+ViT model show that the dependence on the background is very low. - Key methods: (1) cross-entropy loss with an EMA (mean) teacher model, i.e., knowledge distillation; (2) multi-crop (224x224 for the teacher, 96x96 for the student); (3) centering (a momentum-updated center is subtracted from the output features) and sharpening (a low softmax temperature) to escape collapsed solutions. | ||
develop/ different points | - They interpret the cross-entropy loss with an EMA teacher model as knowledge distillation. - To escape collapse, previous works used contrastive losses with negative samples, clustering constraints, predictors, or batch normalization; here they propose centering and sharpening instead. | ||
Category | Potential with Transformers / Centering |
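Centering and sharpening are easy to sketch: the teacher subtracts an EMA center from its logits and applies a low-temperature softmax. A minimal numpy sketch (the class name and default values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DinoTeacherHead:
    """Teacher output = softmax((logits - center) / tau).
    Centering: the center is an EMA of batch means.
    Sharpening: a small tau makes the distribution peaky."""
    def __init__(self, dim, tau=0.04, momentum=0.9):
        self.center = np.zeros(dim)
        self.tau = tau
        self.m = momentum

    def __call__(self, logits):
        probs = softmax((logits - self.center) / self.tau, axis=1)
        # EMA update of the center with the current batch mean
        self.center = self.m * self.center + (1 - self.m) * logits.mean(axis=0)
        return probs
```

Centering discourages one dimension from dominating, while sharpening discourages the uniform solution; together they keep the teacher away from collapse.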
24
Paper | SEER: Self-supervised Pretraining of Visual Features in the Wild | Confe, Cite | FAIR, 64 |
methodological points | - Large models on random, uncurated images (Instagram): a RegNetY with 1.3B parameters trained on 1B random images. - Uses the SwAV method. To save GPU memory, they use gradient checkpointing and mixed precision. | ||
develop/ different points | - Datasets originally collected for supervised and weakly-supervised learning represent a limited fraction of the general distribution. | ||
Category | With larger data / Potential of self-training |
25
Paper | SEER 10B para: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | Confe, Cite | FAIR, 1 |
methodological points | - FAIR's project-result paper: a model trained on diverse, real, unfiltered internet data without any data pre-processing. - They demonstrate that such a model has properties such as geographical diversity, fairness, robustness, multilingual hashtag embeddings, and better artistic and semantic information. - Experiments show that such a model is more robust, fairer, less harmful, and less biased than supervised models. | ||
develop/ different points | - Prior methods assume that, with the dataset they used, they produce features that are general enough to be re-used as they are in a variety of supervised tasks. | ||
Category | With much larger data / Potential of self-training |
26
Paper | MSN: Masked Siamese Networks for Label-Efficient Learning | Confe, Cite | 0 |
methodological points | - Combines the swapped assignment of SwAV, the knowledge distillation (momentum + CE loss) of DINO, and masking as a strong augmentation applied to the student. - MSN does not predict the masked patches at the input level; rather, it performs the denoising step implicitly at the representation level (self-denoised representation learning) by ensuring that the representation of the masked input matches that of the unmasked one. | ||
develop/ different points | - Especially, it has competitive performance in low-shot classification, compared to DINO and MAE. - Joint-embedding architectures (Siamese) avoid reconstruction, but apply image transformation. | ||
Category | Siamese with MIM |
27
Paper | AASAE: Augmentation-Augmented Stochastic Autoencoders | Confe, Cite | 1 |
methodological points | - Autoencoder (reconstruction task) with image transformation, not image masking. | ||
develop/ different points | - A unique, reconstruction-based approach, different from both contrastive and non-contrastive (w/o negatives) methods. - No negatives / small mini-batches. | ||
Category | TIM (transformed information modeling / reconstruction) |
28
Paper | TWIST: Self-Supervised Learning by Estimating Twin Class Distributions | Confe, Cite | 2 |
methodological points | - TWIST = Twin Class Distribution Estimation. - (1) Enforces twin (consistent) class distributions for two augmented views. - To avoid collapse, they (2, sharpness term) minimize the entropy of the distribution for each sample to make the class predictions confident, and (3, diversity term) maximize the entropy of the mean distribution to make the predictions of different samples diverse. | ||
develop/ different points | - Without asymmetric / stop-gradient / momentum encoder - They show the comparisons of TWIST to DINO, SwAV, Self-classifier, and Barlow Twins. | ||
Category | New invariance loss |
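The two entropy regularizers can be sketched in a few lines of numpy; `twist_regularizers` is an illustrative name, and the full TWIST loss also includes the consistency term between the two views:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def twist_regularizers(probs):
    """probs: (N, C) class distributions of one view.
    sharpness: mean per-sample entropy (minimized -> confident predictions)
    diversity: entropy of the mean distribution (maximized -> varied classes)"""
    sharpness = np.mean(entropy(probs, axis=1))
    diversity = entropy(probs.mean(axis=0))
    return sharpness, diversity
```

One-hot predictions spread evenly over classes are the ideal case: per-sample entropy near zero, mean-distribution entropy at its maximum log C.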
29
Paper | Self-Supervised Classification Network | Confe, Cite | 4, IBM |
methodological points | - The self-classifier (not a predictor or projector, but an actual classifier) learns labels and representations simultaneously. - To avoid degenerate solutions: (1) a variant of cross-entropy: applying Bayes' rule to the CE loss yields a new form that is mathematically equivalent but avoids collapse (proof in Sec. 4); (2) asserting a uniform prior on class predictions: p(y) = 1/C and p(x) = 1/N. | ||
develop/ different points | - Without a single label, 41.1% ImageNet accuracy. - It does not require a memory bank, a second network(momentum), external clustering, stop-gradient operation, or negative pairs. | ||
[Figure: the uniform prior makes the model avoid collapse.]
Category | new form of CE loss |
30
Paper | iBOT: Image BERT Pre-Training with Online Tokenizer | Confe, Cite | 22.ICLR22 |
methodological points | - iBOT performs masked prediction with an online tokenizer, which is the momentum-updated teacher network. - Losses: (1) self-distillation on the [CLS] token, as in DINO; (2) a MIM objective. | ||
develop/ different points | - BEiT uses a pre-trained discrete VAE as the tokenizer, which only captures low-level semantics within local details; moreover, the tokenizer must be pre-trained offline, which limits its adaptivity when performing MIM on data from different domains. - iBOT's tokenizer captures high-level visual semantics and needs no extra training stage. | ||
Category | Siamese with MIM |
31
Paper | NNCLR: With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations | Confe, Cite | 45, ICCV 2021, Deepmind |
methodological points | - NNCLR samples the nearest neighbors of each sample from the dataset in latent space and treats them as positives, which provides more semantic variation. - The proposed method is less reliant on complex data augmentations: a relative reduction of only 2.1% in ImageNet accuracy is seen when training with random crops alone. | ||
develop/ different points | - Most CL papers treat different views (by deformations) as positives, but they are interested in using positives from other instances in the dataset. - Past clustering-based methods also consider single instance positives, but assuming the entire cluster (or its prototype) to be positives could hurt performance due to early over-generalization. - In previous works, the onus of generalization lies heavily on the data augmentation pipeline, which cannot cover all the variances in a given class. | ||
Category | NN positives |
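The nearest-neighbor lookup at the heart of NNCLR is a one-liner over a support queue; a numpy sketch (in the real method the queue holds past embeddings and the retrieved neighbor replaces one view in the contrastive loss):

```python
import numpy as np

def nearest_neighbor_positive(z, queue):
    """For each embedding, return its nearest neighbor in a support queue
    (by cosine similarity); the neighbor is then used as the positive."""
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    q_n = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    idx = np.argmax(z_n @ q_n.T, axis=1)   # index of the nearest queue entry
    return queue[idx], idx
```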
32
Paper | ReCM: Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations | Confe, Cite | 3, NIPS21 |
methodological points | - The first study of how dataset biases (beyond ImageNet) affect existing methods; for example, MoCo performs surprisingly well on both object- and scene-centric datasets. - A study of additional invariances: (1) multi-scale cropping, (2) a stronger augmentation policy, and (3) nearest neighbors. -> A key component is the augmentation strategy involving random cropping (cf. masked information). -> Why? When multiple objects are present, a positive pair of non-overlapping crops can wrongfully match the feature representations of different objects. | ||
develop/ different points | - Most methods still train on ImageNet images, whose properties are (1) a single object in the center of the image and (2) a class-uniform distribution. - To deploy self-supervised learning in the wild, we need to quantify the dependence on these properties. | ||
Category | Potential / Analysis of contrastive learning |
33
Paper | MYOW: Mine Your Own vieW: Self-Supervised Learning Through Across-Sample Prediction | Confe, Cite | 11, Georgia Tech |
methodological points | - For contrastive learning without sufficient diversity in the transformations, MYOW utilizes the dataset itself to find similar samples (= mined views), i.e., using the k-nearest neighbors of the anchor representation among the target representations of a pool of candidates. | ||
develop/ different points | - Previous works are challenging to find the right balance between augmentations that not only introduce sufficient diversity but also preserve the semantics of the original data. - MYOW obtains more diversity by finding diverse views within the dataset, not limited to a single instance. | ||
[Figure: candidates are drawn from the dataset; the mined view is selected as a nearest neighbor of the anchor in feature space and passed through a projector.]
Category | NN positives |
34
Paper | SpectralCL: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss | Confe, Cite | 22.NIPS21 |
methodological points | - Positive pairs from the population augmentation graph leveraging continuity of the population data within the same class (=close neighbors within the same class). | ||
develop/ different points | - Conditional independence of positive pairs -> (past) same class label -> (current) augmented views (not practical) -> (ours) samples in 'augmentation graph'. | ||
Category | Graphic N positives |
35
Paper | DnC: Divide and Contrast: Self-supervised Learning from Uncurated Data | Confe, Cite | 12, Deepmind ICCV 2021 |
methodological points | - How to handle larger, less-curated datasets such as YFCC, unlike ImageNet, which is not a general, diverse, uncurated dataset (it involves heavy human curation, e.g., being restricted to 1000 classes). - The base model learns global consistency while the expert models capture local consistency in subsets of YFCC. - A stronger baseline, MoCLR (= SimCLR + BYOL), is suggested. | ||
develop/ different points | - Between ImageNet and YFCC, a 'curation gap' exists due to a shift in the distribution of image classes. - [Figure 1] DnC better handles the diverse, long-tailed distribution of images and improves more with longer training than BYOL and MoCLR (SimCLR + BYOL). | ||
[Figure: a pre-trained MoCLR (SimCLR + BYOL) serves as the base model; MoCLR is then re-trained on data subsets to obtain the k-th expert models.]
Category | How to handle Larger data |
36
Paper | IFM: Can contrastive learning avoid shortcut solutions? | Confe, Cite | 8, NIPS21 |
methodological points | - To learn representations without inductive biases (i.e., that generalize), implicit feature modification (IFM) is proposed, which is plug-and-play with any SSL method. - They modify features by applying transformations to encoded samples v = f(x); because the encoded samples are modified instead of the raw inputs x, the method is called implicit. - For generalized SSL representations, the embeddings of positive and negative samples are modified to remove already well-represented features. - Rather than applying extremely challenging input perturbations, it is advantageous to transform positive and negative features that are easy to discriminate. | ||
develop/ different points | - Representations learned by CL are likely to inadvertently suppress features that are predictive for downstream tasks (= shortcut solutions, inductive biases); i.e., although hard augmentation improves ImageNet accuracy, it can suppress well-represented features. | ||
[Figure: strong raw-image perturbation is not always good; IFM instead perturbs features, comparing InfoNCE vs. InfoNCE + IFM.]
Category | New augmentation |
37
Paper | SSL-HSIC: Self-Supervised Learning with Kernel Dependence Maximization | Confe, Cite | 7, NIPS21 |
methodological points | - Maximizes dependence (similarity) between the representations of two views while minimizing the kernelized variance. - [Mathematically] Dependence is measured with the Hilbert-Schmidt Independence Criterion (HSIC); the proposed SSL-HSIC loss is inspired by the HSIC bottleneck. - InfoNCE can be viewed as SSL-HSIC with a variance-based regularization. | ||
develop/ different points | - The SSL-HSIC loss itself penalizes trivial solutions, so techniques such as target networks are not needed. | ||
Category | New invariance loss |
38
Paper | MoCHi: Hard Negative Mixing for Contrastive Learning | Confe, Cite | 153, NIPS20 |
methodological points | - (Finding) Harder negatives are needed to facilitate better and faster learning; thus, hard-negative mixing strategies at the feature level are proposed. - Synthesizing hard negatives by mixing some of the hardest negative features of the contrastive loss. - The hardest negatives are the features closest to each positive. | ||
develop/ different points | - So far, large batches and large memory banks have been used to obtain more meaningful negative samples. | ||
Category | New negative feature augmentation |
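Hard-negative mixing can be sketched as: rank the negatives by similarity to the query, then mix random pairs of the hardest ones. A numpy sketch with illustrative parameter choices (MoCHi also mixes negatives with the query itself, omitted here):

```python
import numpy as np

def mochi_hard_negatives(query, negatives, n_synth=4, rng=None):
    """Synthesize hard negatives by mixing the negatives closest to the query
    (MoCHi-style sketch; mixing coefficients are sampled uniformly)."""
    if rng is None:
        rng = np.random.default_rng(0)
    q = query / np.linalg.norm(query)
    neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    order = np.argsort(-(neg @ q))            # hardest (most similar) first
    hard = neg[order[:max(2, n_synth)]]
    synth = []
    for _ in range(n_synth):
        i, j = rng.choice(len(hard), size=2, replace=False)
        lam = rng.uniform()
        v = lam * hard[i] + (1 - lam) * hard[j]
        synth.append(v / np.linalg.norm(v))   # back onto the unit sphere
    return np.stack(synth)
```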
39
Paper | Understanding CL: Understanding the Behaviour of Contrastive Loss | Confe, Cite | 63, CVPR21 |
methodological points | - This paper analyzes the importance and effect of the temperature in the contrastive loss. - Summary: the smaller the temperature, the stronger the penalty on hard negatives and the more uniform the features; a larger temperature is more tolerant to semantically similar (hard negative) samples, so features useful for downstream tasks can be learned. - Analysis: (1) the contrastive loss is a hardness-aware loss function (Sec. 3), which automatically concentrates on separating the more informative negative samples to make the embedding distribution more uniform, so the temperature matters; (2) the temperature τ controls the strength of the penalties on hard negative samples. | ||
develop/ different points | - The temperature is key to learning separable features that are tolerant (robust) to semantically similar samples and useful for downstream tasks. | ||
Category | The effect of temperature in CL |
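The hardness-aware behaviour is visible in the softmax weights that the loss assigns to negatives: a small temperature concentrates nearly all of the penalty on the hardest negative. A small numpy illustration (the similarity values are made up):

```python
import numpy as np

def negative_weights(sims, tau):
    """Relative gradient weight each negative receives in the contrastive
    loss: proportional to exp(sim / tau), i.e., a softmax over negatives."""
    e = np.exp(sims / tau)
    return e / e.sum()

sims = np.array([0.9, 0.5, 0.1])   # one hard negative, two easier ones
w_small = negative_weights(sims, tau=0.05)  # almost all weight on the hardest
w_large = negative_weights(sims, tau=1.0)   # weight spread more evenly
```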
40
Paper | Property CL: Intriguing Properties of Contrastive Losses | Confe, Cite | 34, NIPS21 |
methodological points | - This paper aims to understand the effectiveness and limitations of existing contrastive learning. - Aspect 1 (Sec. 2): a generalization of the standard contrastive loss. - Aspect 2 (Sec. 3): whether it leads to meaningful local features on images with multiple objects present -> SimCLR can learn from images with multiple objects. - Aspect 3 (Sec. 4): the feature-suppression phenomenon (e.g., "color distribution" and "object class" are often competing features, a trade-off). | ||
develop/ different points | - Existing contrastive methods critically rely on data augmentation to favor certain sets of features over others, and can suffer from learning saturation in some scenarios. | ||
Category | Analysis of CL |
41
Paper | SCRL: Spatially Consistent Representation Learning | Confe, Cite | 28, CVPR21 |
methodological points | - They propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks, in order to realize the invariant spatial representation corresponding to the same cropped region under augmentations of a given image. | ||
develop/ different points | - While previous contrastive methods mainly focus on generating invariant global representations at the image-level, they are prone to overlook spatial consistency of local representations. - Especially, previous methods aggressively crop views, which leads to minimizing representation distances between the semantically different regions of a single image, i.e., performance degradation on local representation. | ||
[Figure: built on the BYOL architecture.]
Category | New invariance for local representation |
42
Paper | Viewmaker Networks: Learning Views for Unsupervised Representation Learning | Confe, Cite | 22, ICLR21 |
methodological points | - Viewmaker networks: generative models that learn to produce useful views from a given input. - The viewmaker is trained adversarially to output a stochastic perturbation that is added to the input; this perturbation is projected onto an l_p sphere, which controls the effective strength of the view. - An image-to-image neural network is used as the viewmaker, with an architecture adapted from work on style transfer. | ||
develop/ different points | - Designing views requires considerable trial and error by human experts, hindering widespread adoption (generalization). - InfoMin is not for a broadly-applicable approach for learning views. | ||
[Figure: additional positives are generated by the viewmaker.]
Category | New augmentation |
43
Paper | CASTing Your Model: Learning to Localize Improves Self-Supervised Representations | Confe, Cite | 17, CVPR21 |
methodological points | - CAST (1) uses unsupervised saliency maps to intelligently sample crops and (2) provides grounding supervision via a Grad-CAM attention loss. - The method's details are illustrated in Fig. 3. | ||
develop/ different points | - Two issues with recent CL (1) poor grounding (2) inconsistent samples [See fig.1] - Current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. | ||
[Figure: issues with previous works -> prerequisite (saliency maps) -> main method with two losses.]
Category | New invariance for local representation |
44
Paper | SCCL: Supporting Clustering with Contrastive Learning | Confe, Cite | 21, NAACL 2021 |
methodological points | - Jointly optimizes (1) a clustering loss and (2) an instance-wise contrastive loss (InfoNCE). - The clustering loss uses feature centroids; a target distribution (p) sharpens the soft-assignment probability (q). | ||
develop/ different points | - A significant challenge of SSL clustering: different categories often overlap with each other in the representation space at the beginning of the learning | ||
Category | Clustering like SwAV |
Paper | MaskCo: Self-Supervised Visual Representations Learning by Contrastive Mask Prediction | Confe, Cite | 5, ICCV21 |
methodological points | - They propose (1) a novel contrastive mask prediction (CMP) task and (2) a mask contrast (MaskCo) framework. - Fig. 3 makes the proposed method easy to grasp. | ||
develop/ different points | - The instance discrimination task carries an implicit semantic consistency (SC) assumption, which may not hold on unconstrained datasets (e.g., COCO, with multiple objects). | ||
[Figure: issues with previous works -> the proposed new task (CMP) -> method.]
Category | Siamese, MIM with CNN |
Paper | | | |
methodological points | Several works have attempted to demystify the success of BYOL (Grill et al., 2020), a close variant of SimSiam. A technical report (Fetterman & Albrecht, 2020; BYOL1) suggested that batch normalization (BN) is critical to BYOL's success; however, a later work (Richemond et al., 2020; BYOL2) refutes this claim by showing that BYOL works without BN, as discussed in Appendix B. | ||
develop/ different points | | ||
Category | Understanding BYOL |
BYOL1: https://generallyintelligent.ai/blog/2020-08-24-understanding-self-supervised-contrastive-learning/
BYOL2: https://arxiv.org/abs/2010.10241