1 of 46

A Survey on Masked Autoencoder

for Self-supervised Learning in Vision and Beyond

Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, In So Kweon

KAIST

International Joint Conferences on Artificial Intelligence (IJCAI 2023)

[📖 arXiv]

2 of 46

Meeting Materials

3 of 46

[Diagram for self-supervised learning: pretraining splits into supervised pretraining and SSL; SSL splits into generative approaches (GAN, prediction pretext tasks) and discriminative approaches (contrastive learning (CL), metric learning, etc.); "Ours" marks the surveyed masked-autoencoder branch alongside CL.]

4 of 46

Flow chart: timeline of SSL methods (2008–2022).

- Generative SSL, denoising autoencoders: Denoising Autoencoder (ICML 2008), Stacked Denoising Autoencoder (JMLR 2010); reconstruction pretext tasks: Inpainting (CVPR 2016), Automatic Colorization (ECCV 2016), Colorization (ECCV 2016), Colorization as Proxy Task (CVPR 2017), Split-Brain Autoencoders (CVPR 2017).

- Masked modeling in language: GPT-1 (2018), BERT [23] (NAACL 2019), GPT-2 (2019), GPT-3 (NeurIPS 2020).

- Masked autoencoders in vision: iGPT (ICML 2020), ViT (iBERT, ICLR 2021), BEiT (ICLR 2022), MAE (CVPR 2022).

- Discriminative SSL, geometry-based prediction: Jigsaw Puzzles (ECCV 2016), Rotation Prediction (ICLR 2018); joint-embedding methods: MoCo (CVPR 2020), SimSiam (CVPR 2021).

5 of 46

[Chart: map of contrastive-family SSL methods (2019–2022), grouped by the categories used in the following slides. InfoNCE-based: CPC, CMC, MoCo, MoCo v2, SimCLR, SimCLR v2, PIRL, InfoMin. Clustering: DeepCluster, SwAV, SCCL. New augmentation: InfoMin, Feature Transform, Viewmaker, IFM; new negative-feature augmentation: MoCHi. New invariance loss: W-MSE, Barlow Twins, VICReg, OBoW, ReLIC, TWIST, Self-classifier, SSL-HSIC; for local representations: PixPro, SCRL, CAST. Channel-wise decorrelation: Barlow Twins, W-MSE, VICReg, "Demystify contrastive" (feature decorrelation). Avoiding collapse without negatives (additional loss, regularization, or a different approach): BYOL, SimSiam. NN positives: NNCLR, MYOW, ReCM; graph positives: SpectralCL. Potential/analysis of contrastive learning with large experiments: RePS, 13 SSL, Understanding CL, Properties of CL. Self-sup evaluation: SelfAugment. With larger data: SEER, SEER v2, DnC. With ViT: DINO, MoCo v3. New pretext: JigsawClustering. Siamese/MIM hybrids: iBOT, SplitMask, SimMIM, CAE, BEiT, MAE, data2vec, MSN, MaskCo; TIM: AASAE.]

6 of 46

[Diagram: two-stage masked image modeling. Stage 1: a tokenizer maps image patches to discrete token ids (e.g., 123, 234, 456, 567, ...). Stage 2: the decoder predicts the token ids of the masked patches (e.g., 456, 876, 765, 322) from the visible ones.]
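The stage-2 target selection above can be sketched in a few lines. This is a toy illustration (the function name `mim_targets` and the token ids are made up; a real stage-1 tokenizer would produce the ids from image patches): a random subset of patches is masked, and their token ids become the prediction targets.

```python
import numpy as np

def mim_targets(token_ids, mask_ratio, rng):
    """Pick stage-2 prediction targets for tokenizer-based masked image modeling.

    token_ids: (num_patches,) discrete ids from a stage-1 tokenizer.
    Returns the masked indices, the visible indices, and the target ids
    the decoder must predict at the masked positions.
    """
    num_patches = len(token_ids)
    n_mask = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)          # random masking
    masked, visible = perm[:n_mask], perm[n_mask:]
    targets = token_ids[masked]                  # ids to predict
    return masked, visible, targets
```

The encoder sees only the visible patches; the cross-entropy over `targets` replaces pixel regression.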

7 of 46


Paper

InfoMin: What Makes for Good Views for Contrastive Learning?

Confe, Cite

MIT Google, 344

methodological points

- The views should share only the label information. (Augmented views that share other information are not optimal for contrastive learning.)
- They apply a learnable, invertible view generator with an unsupervised adversarial objective, which minimizes mutual information (MI) while keeping class/label information.

- The generator synthesizes two invertible views from the same image x while minimizing the mutual information I, except for task-relevant information.

develop/

different points

- Study the influence (importance) of different view choices.

Category

New augmentation

8 of 46


Paper

Feature Transform: Improving Contrastive Learning by Visualizing Feature Transformation

Confe, Cite

7, ICCV21

methodological points

- Data augmentation -> feature transformation. Observation: harder positives and negatives boost learning (e.g., positive but low-similarity pairs).

- Proposal: feature transformation for harder positives, and interpolation among negatives for harder negatives. (As shown in Figure 6, the former pushes the two vectors away from each other, while the latter pulls the two vectors toward each other.)

develop/

different points

- Differing from data augmentation, they attempt feature-level data manipulation.

Category

New augmentation

9 of 46


Paper

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Confe, Cite

78, CMU&FAIR

methodological points

- (Finding 1) MoCo and PIRL succeed at occlusion invariance but fail at viewpoint and category-instance invariance, which are crucial for object detection (OD); i.e., a sub-optimal representation.

- (Finding 2) Despite poor viewpoint invariance, ImageNet ensures correspondence between different crops for a self-trained model.

- This work proposes using video datasets as a more natural form of the instance discrimination task, which leads to higher viewpoint invariance.

develop/

different points

- Previous methods can learn only occlusion invariance (via cropping).

Category

Potential / Analysis of

contrastive learning

10 of 46


Paper

W-MSE: Whitening for Self-Supervised Representation Learning

Confe, Cite

37, ICML21

methodological points

- The Whitening-MSE method whitens (i.e., normalizes and spreads out) the embedding vectors on the unit sphere, and attracts each positive pair with an MSE loss.

- The whitening operation has a "scattering" effect on the batch samples; see Fig. 1.
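The whitening step can be sketched with a ZCA-style transform over a batch of embeddings. This is a minimal numpy sketch, not the paper's exact implementation (which uses a Cholesky-based whitening over sub-batches and backpropagates through the operation):

```python
import numpy as np

def whiten(z, eps=1e-6):
    """ZCA-style whitening: zero mean and (approximately) identity covariance
    over the batch, which 'scatters' the embeddings before the MSE attraction."""
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (len(z) - 1)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T   # ZCA whitening matrix
    return zc @ w
```

After whitening, positives are pulled together with a plain MSE loss; the identity-covariance constraint on the batch is what prevents collapse.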

develop/

different points

- The goal is essentially the same as VICReg's.

Category

New invariance loss

11 of 46


Paper

OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning

Confe, Cite

12, CVPR21

methodological points

- To learn contextual reasoning, they encourage the model to reconstruct a bag-of-visual-words (BoW) representation.

- The bag of visual words is updated online, similar to the code update in SwAV.

develop/

different points

- Compared to perturbation invariance, contextual reasoning might be more favorable for reconstruction (a generative pretext).

- (1) OBoW exploits a BoW prediction task, while SwAV uses an image-cluster prediction task. (2) A BoW encodes all the local features (all elements) of an image, whereas an image-cluster assignment encodes only one global feature (centroid); therefore, BoW targets are a richer representation.

Category

Different Invariance

/ New invariance loss

12 of 46


Paper

PixPro: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning

Confe, Cite

MS, 90, CVPR21

methodological points

- Pixel-level contrastive learning: the features from corresponding pixels of the two views are encouraged to be consistent.

- The first task, the pixel contrastive task (PixContrast), was inferior to the pixel consistency task (PixPro). However, PixPro requires the PPM module, which allows an asymmetric architecture (like a predictor or Sinkhorn-Knopp) to avoid collapse.

develop/

different points

- Current methods are trained only on instance-level pretext tasks (cropped regions are represented at the instance level).

Category

New invariance loss

for local representation

13 of 46


Paper

ReLIC: Representation Learning via Invariant Causal Mechanisms

Confe, Cite

Deepmind, 51

methodological points

- An invariance objective (constraint) on pre-trained classifiers is required for effective data augmentation. (Figure 1(b) is enough to understand the whole paper.)

- The principle of invariant prediction in BYOL is a sufficient condition for representation learning.

develop/

different points

- Shows an analysis of generalization ability and robustness, unlike other papers.

- Provides an alternative explanation for the similarity objective (mutual information) in BYOL.

- This framework (objective) can be used for any vision task, unlike the vanilla contrastive loss.

Category

New invariance loss

for any visual task

14 of 46


Paper

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Confe, Cite

FAIR, 277

methodological points

- A new objective function that naturally avoids collapse (pushing the cross-correlation matrix toward the identity matrix).

- This makes the embedding vectors of distorted samples similar while minimizing redundancy between channels (channel i's feature != channel j's feature).

develop/

different points

- The method does not require large batches, asymmetry, stop-gradient, or momentum updates.
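The objective can be sketched directly from the description above (a minimal numpy version, not the official implementation; `lam` weights the redundancy term):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Cross-correlation matrix of two views' embeddings, pushed toward identity.

    On-diagonal term: invariance (corresponding dimensions should correlate).
    Off-diagonal term: redundancy reduction (distinct dimensions decorrelate).
    """
    # standardize each embedding dimension over the batch
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + 1e-8)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + 1e-8)
    n = z_a.shape[0]
    c = z_a.T @ z_b / n                            # (D, D) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag
```

Because the loss is a function of batch statistics rather than pairwise comparisons, no negatives or architectural asymmetry are needed.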

Category

New invariance loss

15 of 46


Paper

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

Confe, Cite

59, FAIR, ICCV21

methodological points

- Two regularization terms for avoiding collapse and stabilizing training: (1) maintaining the variance of each embedding dimension above a threshold; (2) decorrelating each pair of dimensions.

- See Figure 2: conceptual comparison

develop/

different points

- Unlike most approaches, VICReg does not require large batches, negative samples, weight sharing (both shared and separate weights work well in VICReg), BN or feature normalization, output quantization (SwAV), stop-gradient, a memory bank, a momentum encoder, etc.

- Without an asymmetric architecture, Barlow Twins, whitening (W-MSE), and VICReg aim to maximize the information content of the embedding. Additionally, VICReg prevents informational collapse through decorrelation.

[Formula labels: the variance regularization; the covariance regularization (the sum of the squared off-diagonal coefficients of the covariance matrix).]
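Putting the three terms together gives the following sketch (the weights `sim_w=25, var_w=25, cov_w=1` are assumed representative coefficients, and this is not the official code):

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Invariance (MSE) + variance hinge + covariance (off-diagonal) penalty."""
    n, d = z_a.shape
    inv = ((z_a - z_b) ** 2).mean()                      # invariance term

    def variance(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.maximum(0.0, gamma - std).mean()       # hinge on per-dim std

    def covariance(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return (off ** 2).sum() / d                      # squared off-diagonals

    return (sim_w * inv
            + var_w * (variance(z_a) + variance(z_b))
            + cov_w * (covariance(z_a) + covariance(z_b)))
```

The variance hinge directly penalizes collapsed (low-variance) dimensions, which is why no stop-gradient or momentum encoder is needed.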

Category

New invariance loss

16 of 46


Paper

Decorrelation: On Feature Decorrelation in Self-Supervised Learning

Confe, Cite

20, ICCV21

methodological points

- ( whitening = standardizing covariance = feature decorrelation )

- Concise framework: simple symmetric architecture (augmented views, encoder, MSE)

- Collapse can be divided into complete collapse (constant output) and dimensional collapse (the projected features collapse into a low-dimensional manifold, i.e., strong correlation between axes).

- The decorrelated batch normalization (DBN) layer: (1) algorithm: divides the features into groups and applies whitening; (2) effect: alleviates the dimensional collapse issue.

develop/

different points

- First to identify dimensional collapse (as distinct from complete collapse).

Category

To avoid collapse /

New invariance loss

17 of 46


18 of 46


Paper

RePS: Rethinking Pre-training and Self-training

Confe, Cite

GoogleBrain, 239

methodological points

- The common understanding, as argued by Kaiming He, is that a model pre-trained on ImageNet has limitations compared to a model trained directly on COCO.

- What about a self-trained model?

- Results: (1) strong augmentation and more labeled data are not always good for pre-training, but they are for self-training; (2) self-training improves upon pre-training.

- Conclusion: self-training has far greater potential than pre-training.

develop/

different points

- A comparison between the potential of pre-training on a large dataset and self-training with contrastive learning for other computer vision tasks.

Category

Potential of self-training

19 of 46


Paper

13SSL: How Well Do Self-Supervised Models Transfer?

Confe, Cite

40, CVPR21

methodological points

- They evaluate the transfer performance of 13 top self-supervised models on 40 downstream tasks, including many-shot and few-shot recognition, object detection, and dense prediction.

- They pose several open questions and answer four of them. Refer to the Introduction; it is easy to follow.

develop/

different points

- Despite significant progress, a number of important open questions remain.

- Large-scale evaluation is needed.

Category

Comparison of SSL methods

Potential of self-training

20 of 46


Paper

SelfAugment: Automatic Augmentation Policies for Self-Supervised Learning

Confe, Cite

10, CVPR21

methodological points

- Evaluation with a self-supervised image-rotation task is highly correlated with a standard set of supervised evaluations.

- Automatic and efficient augmentation selection using this correlation.

- Self-supervised evaluation for filtering hundreds of augmentations, training settings, and network architectures.

develop/

different points

- Supervised evaluation requiring labeled data is not practical (i.e., using a labeled dataset for evaluation in order to measure the effectiveness of a method).

Category

Self-sup evaluation

21 of 46


Paper

Jigsaw Clustering for Unsupervised Visual Representation Learning

Confe, Cite

7, CVPR21

methodological points

- Batch images are split, shuffled, and stitched. The target is to recover the disrupted parts back to the original images, like a puzzle game.

- Previous Jigsaw SSL permutes patches within a single image, but this work does so within a batch, so the network learns both intra- and inter-image information.

- It consists of a clustering branch and a location branch.

develop/

different points

- An incremental work on Jigsaw puzzles.

- Contrastive methods duplicate each training batch (i.e., dual batches); in contrast, this work constructs a single batch.

Category

New pretext task

(not contrastive)

22 of 46


Paper

CAE: Context Autoencoder for Self-Supervised Representation Learning

Confe, Cite

methodological points

- Two new modules: (1) Latent contextual regressor (2) Alignment loss module

- (Insight) By caring about the representations of the patches, MIM (masked image modeling) methods learn the semantics of both the center regions (the 1000 classes) and the other regions (potential classes). This differs from contrastive methods, which tend to learn semantics mainly from the center patches. (See Fig. 6.)

develop/

different points

- 1. Decoupling the encoding role (content understanding) from the decoding role (making predictions for the masked patches).

- 2. Masked patches are also cast into the latent space.

Category

New MAE

23 of 46


Paper

DINO: Emerging Properties in Self-Supervised Vision Transformers (DINO: self-DIstillation with NO labels.)

Confe, Cite

FAIR, 276

methodological points

- The attention maps extracted from an SSL + ViT model show that the dependence on the background is very low.

- Key methods: (1) cross-entropy loss with an EMA (mean) teacher model = knowledge distillation; (2) multi-crop (224x224 for the teacher, 96x96 for the student); (3) centering (a momentum-updated center is subtracted from the teacher output) and sharpening (a low softmax temperature) to escape collapsed solutions.
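Centering and sharpening on the teacher side can be sketched as follows (a numpy sketch; `tau_t=0.04` and momentum `m=0.9` are representative values, not necessarily the paper's exact settings):

```python
import numpy as np

def dino_teacher_targets(teacher_out, center, tau_t=0.04):
    """Centering then sharpening of the teacher outputs.

    Subtracting the (momentum-updated) center discourages any one dimension
    from dominating; the low temperature tau_t sharpens the softmax targets.
    """
    logits = (teacher_out - center) / tau_t
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def update_center(center, teacher_out, m=0.9):
    """EMA update of the center with the batch mean of the teacher outputs."""
    return m * center + (1 - m) * teacher_out.mean(axis=0)
```

Centering alone would push the targets toward uniform, and sharpening alone toward one-hot; combining the two is what balances away from both collapse modes.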

develop/

different points

- They interpret the cross-entropy loss with an EMA teacher model as knowledge distillation.

- To escape collapse, previous works used contrastive losses with negative samples, clustering constraints, a predictor, or batch normalization; they instead propose centering and sharpening.

Category

Potential in Transformer

Centering

24 of 46


Paper

SEER: Self-supervised Pretraining of Visual Features in the Wild

Confe, Cite

FAIR, 64

methodological points

- Large models on random, uncurated images (Instagram): a RegNetY with 1.3B parameters trained on 1B random images.

- Uses the SwAV method. To save GPU memory, they use gradient checkpointing and mixed precision.

develop/

different points

- Datasets originally collected for supervised and weakly-supervised learning represent a limited fraction of the general distribution.

Category

In Larger data,

Potential of self-training

25 of 46


Paper

SEER 10B para: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision

Confe, Cite

FAIR, 1

methodological points

- FAIR's project-result paper: a model trained on diverse, real, and unfiltered internet data without any data pre-processing.
- They demonstrate that such a model has properties like geographical diversity, fairness, robustness, multilingual hashtag embeddings, and better artistic and semantic information.
- Experiments show that such a model is more robust, fairer, less harmful, and less biased than supervised models.

develop/

different points

- Prior methods assume that, with the datasets they used, they produce features general enough to be reused as-is in a variety of supervised tasks.

https://ai.facebook.com/blog/seer-10b-better-fairer-computer-vision-through-self-supervised-learning-training-on-diverse-datasets/

Category

In much larger data,

Potential of self-training

26 of 46


Paper

MSN: Masked Siamese Networks for Label-Efficient Learning

Confe, Cite

0

methodological points

- Swapped assignment from SwAV / knowledge distillation (momentum + CE loss) from DINO / masking = strong augmentation = student.

- MSN does not predict the masked patches at the input level; rather, it performs the denoising step implicitly at the representation level (self-denoised representation learning) by ensuring that the representation of the masked input matches that of the unmasked one.

develop/

different points

- In particular, it achieves competitive performance in low-shot classification compared to DINO and MAE.

- Joint-embedding (Siamese) architectures avoid reconstruction but apply image transformations.

Category

Siamese with MIM

27 of 46


Paper

AASAE: Augmentation-Augmented Stochastic Autoencoders

Confe, Cite

1

methodological points

- An autoencoder (reconstruction task) with image transformations rather than image masking.

develop/

different points

- A unique, reconstruction-based approach, different from both contrastive and non-contrastive (negative-free) methods.

- No negatives / small mini-batches.

Category

TIM

(transformed info modeling (reconstruction))

28 of 46


Paper

TWIST: Self-Supervised Learning by Estimating Twin Class Distributions

Confe, Cite

2

methodological points

- TWIST = Twin Class Distribution Estimation.

- (1) Twin (consistent) class distributions for two augmented images.

- To avoid collapse, they (2, sharpness term) minimize the entropy of each sample's distribution to make the class prediction confident, and (3, diversity term) maximize the entropy of the mean distribution to make the predictions of different samples diverse.
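The two entropy terms can be sketched as follows (a numpy sketch of the regularizers only; the consistency term between the two views and the actual term weighting are omitted):

```python
import numpy as np

def twist_regularizers(probs, eps=1e-12):
    """probs: (N, C) per-sample class distributions of one augmented view.

    sharpness: mean per-sample entropy  -> minimized (confident predictions).
    diversity: entropy of the mean distribution -> maximized (spread classes).
    """
    sharpness = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    mean_p = probs.mean(axis=0)
    diversity = -(mean_p * np.log(mean_p + eps)).sum()
    return sharpness, diversity
```

Minimizing sharpness alone would collapse all samples onto one class; maximizing diversity counteracts this by forcing the batch-average prediction toward uniform.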

develop/

different points

- Without asymmetry / stop-gradient / a momentum encoder.

- They show the comparisons of TWIST to DINO, SwAV, Self-classifier, and Barlow Twins.

Category

New invariance loss

29 of 46


Paper

Self-Supervised Classification Network

Confe, Cite

4, IBM

methodological points

- Self-classifier (not a predictor or projector, but an actual classifier) learns labels and representations simultaneously.

- To avoid degenerate solutions: (1) a variant of cross-entropy; (2) asserting a uniform prior on class predictions.

(1) Cross-entropy -> Bayes' law -> a new form of the CE loss, mathematically equivalent to the CE loss but able to avoid collapse. (The proof is in Sec. 4.)

(2) Uniform prior: p(y) = 1/C and p(x) = 1/N.

develop/

different points

- Without a single label, 41.1% ImageNet accuracy.

- It does not require a memory bank, a second network(momentum), external clustering, stop-gradient operation, or negative pairs.

Uniform Prior

It makes the model avoid collapse.

Category

new form of CE loss

30 of 46


Paper

iBOT: Image BERT Pre-Training with Online Tokenizer

Confe, Cite

22.ICLR22

methodological points

- iBOT performs masked prediction with an online tokenizer, which is the momentum-updated teacher network.

- Losses: (1) self-distillation on the [CLS] token, as in DINO; (2) the MIM objective.

develop/

different points

- BEiT uses a pre-trained discrete VAE as the tokenizer, which captures only low-level semantics within local details. Moreover, the tokenizer must be pre-trained offline, which limits its adaptability when performing MIM on data from different domains.

- Their tokenizer captures high-level visual semantics and needs no extra training stage.


Category

Siamese with MIM

31 of 46


Paper

NNCLR: With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations

Confe, Cite

45, ICCV 2021, Deepmind

methodological points

- NNCLR samples the nearest neighbors of each view from the dataset in the latent space and treats them as positives, which provide more semantic variation.

- The proposed method is less reliant on complex data augmentations: only a 2.1% relative reduction in ImageNet accuracy is seen when training with random crops alone.
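The nearest-neighbor positive lookup can be sketched as follows (a numpy sketch; in NNCLR the support set is a FIFO queue of past embeddings, and gradients do not flow through the retrieved neighbor):

```python
import numpy as np

def nn_positive(query, support):
    """Replace each query embedding by its cosine-nearest neighbor in the
    support set; the retrieved neighbor becomes the positive for the
    contrastive loss instead of the other augmented view itself."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    idx = (q @ s.T).argmax(axis=1)   # index of the most similar support sample
    return support[idx]
```

Because the positive now comes from a different instance, the loss sees semantic variation that no hand-designed augmentation would produce.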

develop/

different points

- Most CL papers treat different views (produced by deformations) of the same instance as positives, but this work uses positives from other instances in the dataset.

- Past clustering-based methods also consider single-instance positives, but assuming the entire cluster (or its prototype) to be positive can hurt performance due to early over-generalization.

- In previous works, the onus of generalization lies heavily on the data-augmentation pipeline, which cannot cover all the variance within a given class.

Category

NN positives

32 of 46


Paper

ReCM: Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Confe, Cite

3, NIPS21

methodological points

- The first study of how dataset biases (beyond ImageNet) affect existing methods. For example, MoCo performs surprisingly well on object- and scene-centric datasets.

- A study of additional invariances: (1) multi-scale cropping, (2) a stronger augmentation policy, and (3) nearest neighbors.

-> A key component is the augmentation strategy involving random cropping (:= masked information).

-> Why? When multiple objects are present, a positive pair of non-overlapping crops could wrongfully match the feature representations of different objects.

develop/

different points

- Most methods still train on images from ImageNet, whose properties are (1) a single object centered in the image and (2) a class-uniform distribution.

- To deploy self-supervised learning in the wild, we need to quantify the dependence on these properties.

Category

Potential / Analysis of

contrastive learning

33 of 46


Paper

MYOW: Mine Your Own vieW: Self-Supervised Learning Through Across-Sample Prediction

Confe, Cite

11, Georgia Tech

methodological points

- For contrastive learning without sufficient diversity in the transformations, MYOW uses the dataset itself to find similar samples (= mined views),

i.e., finding samples via the k-nearest neighbors of the anchor representation among the target representations of the candidate pool.

develop/

different points

- Previous works struggle to find the right balance between augmentations that introduce sufficient diversity and those that preserve the semantics of the original data.

- MYOW obtains more diversity by finding diverse views within the dataset, not limited to a single instance.

[Diagram: candidate samples from the dataset are embedded in feature space; the mined view is selected as a nearest neighbor and passed through the projector.]

Category

NN positives

34 of 46


Paper

SpectralCL: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss

Confe, Cite

22.NIPS21

methodological points

- Positive pairs from the population augmentation graph leverage the continuity of the population data within the same class (= close neighbors within the same class).

develop/

different points

- Conditional independence of positive pairs: (past) assumed via the same class label -> (current) augmented views (not practical) -> (theirs) samples connected in the augmentation graph.

Category

Graphic N positives

35 of 46


Paper

DnC: Divide and Contrast: Self-supervised Learning from Uncurated Data

Confe, Cite

12, Deepmind ICCV 2021

methodological points

- How to handle larger, less-curated datasets such as YFCC, unlike ImageNet, which is not a general, diverse, uncurated dataset (it involves heavy human curation, e.g., being restricted to 1,000 classes).

- The base model learns global consistency, while the expert models capture local consistency within subsets of YFCC.

- A stronger baseline, MoCLR (= SimCLR + BYOL), is suggested.

develop/

different points

- Between ImageNet and YFCC, a "curation gap" exists due to a shift in the distribution of image classes.

- [Figure 1] DnC better handles the diverse, long-tailed distribution of images and improves more with longer training than BYOL and MoCLR (SimCLR + BYOL).

[Diagram: pre-training SimCLR & BYOL (= MoCLR) gives the base model; re-training MoCLR on data subsets gives the k-th expert model.]

Category

How to handle Larger data

36 of 46


Paper

IFM: Can contrastive learning avoid shortcut solutions?

Confe, Cite

8, NIPS21

methodological points

- To learn representations without inductive biases (i.e., that generalize), implicit feature modification (IFM) is proposed, which can be plugged into any kind of SSL method.

- They modify features by applying transformations to the encoded samples v = f(x). Since the encoded samples are modified instead of the raw inputs x, the method is described as implicit.

- For generalized SSL representations, the embeddings of positive and negative samples are modified to remove already well-represented features.

- Rather than applying extremely challenging perturbations, it is advantageous for learning generalized representations to transform the features of positives and negatives that are easy to discriminate.

develop/

different points

- Representations learned by CL are likely to inadvertently suppress features that are important and predictive for downstream tasks (= shortcut solutions, inductive biases); i.e., although hard augmentation improves ImageNet accuracy, it can suppress well-represented features.

[Figure: strong raw-image perturbation is not always good; feature-level perturbation (InfoNCE + IFM) vs. plain InfoNCE.]

Category

New augmentation

37 of 46


Paper

SSL-HSIC: Self-Supervised Learning with Kernel Dependence Maximization

Confe, Cite

7, NIPS21

methodological points

- Maximizes dependence (similarity) between the representations of two views while minimizing the kernelized variance.

- [Mathematically] The Hilbert-Schmidt Independence Criterion (HSIC) is used to measure dependence. The proposed SSL-HSIC loss is inspired by the HSIC bottleneck.

- InfoNCE can be thought of as SSL-HSIC with a variance-based regularization.

develop/

different points

- The SSL-HSIC loss itself penalizes trivial solutions, so techniques such as target networks are not needed.

Category

New invariance loss

38 of 46


Paper

MoCHi: Hard Negative Mixing for Contrastive Learning

Confe, Cite

153, NIPS20

methodological points

- (Finding) Harder negatives are needed to facilitate better and faster learning; thus, hard-negative mixing strategies at the feature level are proposed.

- Synthesizing hard negatives by mixing some of the hardest negative features of the contrastive loss.

- The hardest negatives are the features closest to each positive.
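Hard-negative mixing can be sketched as follows (a numpy sketch; MoCHi also mixes the query itself into the negatives, which is omitted here, and `k`/`n_synth` are illustrative values):

```python
import numpy as np

def mochi_hard_negatives(query, negatives, n_synth=4, k=8, rng=None):
    """Synthesize hard negatives by convex-mixing pairs of the k negatives
    closest (by cosine similarity) to the query, then re-normalizing."""
    if rng is None:
        rng = np.random.default_rng()
    q = query / np.linalg.norm(query)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    hardest = negatives[np.argsort(n @ q)[-k:]]   # top-k by similarity
    synth = []
    for _ in range(n_synth):
        i, j = rng.choice(k, size=2, replace=False)
        a = rng.uniform()
        mix = a * hardest[i] + (1 - a) * hardest[j]
        synth.append(mix / np.linalg.norm(mix))
    return np.array(synth)
```

Synthesizing hardness at the feature level sidesteps the need for ever-larger batches or memory banks to stumble upon informative negatives.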

develop/

different points

- So far, large batches and large memory banks have been used to obtain more meaningful negative samples.

Category

New negative feature

augmentation

39 of 46


Paper

Understanding CL: Understanding the Behaviour of Contrastive Loss

Confe, Cite

63, CVPR21

methodological points

- This paper analyzes the importance and effect of the temperature in the contrastive loss.

- Summary: the smaller the temperature, the more similar the learned features. Conversely, if the temperature is large, hard negatives are generated, so features that are good for downstream tasks can be learned.

- Analysis: (1) The contrastive loss is a hardness-aware loss function (Sec. 3), which automatically concentrates on separating the more informative negative samples to make the embedding distribution more uniform; hence the temperature matters.

(2) The temperature τ controls the strength of the penalties on hard negative samples.
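The role of τ is visible directly in a batch InfoNCE implementation: dividing the cosine similarities by a small τ amplifies the relative penalty on the hardest (most similar) negatives (a minimal numpy sketch):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Batch InfoNCE: z1[i] and z2[i] are embeddings of two views of sample i.

    A small tau concentrates the softmax mass (and hence the gradient) on
    the negatives most similar to the anchor; a large tau spreads the
    penalty more uniformly over all negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                             # (N, N); diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```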

develop/

different points

- The temperature is important for learning features that are separable, tolerant (robust) to semantically similar samples, and useful for downstream tasks.

Category

The effect of

temperature in CL

40 of 46


Paper

Property CL: Intriguing Properties of Contrastive Losses

Confe, Cite

34, NIPS21

methodological points

- This paper aims to understand the effectiveness and limitations of existing contrastive learning.

- Aspect 1: (Sec 2) a generalization of the standard contrastive loss.

- Aspect 2: (Sec 3) whether it leads to meaningful local features on images with multiple objects present -> SimCLR can learn from images with multiple objects.

- Aspect 3: (Sec 4) the feature-suppression phenomenon (for example, "color distribution" and "object class" are often competing features, a trade-off).

develop/

different points

- Existing contrastive learning methods rely critically on data augmentation to favor certain sets of features over others, and can suffer from learning saturation in some scenarios.

Category

Analysis of CL

41 of 46


Paper

SCRL: Spatially Consistent Representation Learning

Confe, Cite

28, CVPR21

methodological points

- They propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks,

in order to obtain invariant spatial representations of the same cropped region under augmentations of a given image.

develop/

different points

- While previous contrastive methods mainly focus on generating invariant global representations at the image level, they are prone to overlooking the spatial consistency of local representations.

- In particular, previous methods crop views aggressively, which minimizes the representation distances between semantically different regions of a single image, i.e., degrades performance on local representations.

BYOL architecture

Category

New invariance

for local representation

42 of 46


Paper

Viewmaker Networks: Learning Views for Unsupervised Representation Learning

Confe, Cite

22, ICLR21

methodological points

- Viewmaker networks: generative models that learn to produce useful views from a given input.

- The viewmaker is trained adversarially to output a stochastic perturbation that is added to the input. This perturbation is projected onto an l_p sphere, controlling the effective strength of the view.

- They use an image-to-image neural network as the viewmaker, with an architecture adapted from work on style transfer.

develop/

different points

- Designing views requires considerable trial and error by human experts, hindering widespread adoption (generalization).

- InfoMin is not a broadly applicable approach for learning views.

Additional positives from viewmaker


Category

New augmentation

43 of 46


Paper

CASTing Your Model: Learning to Localize Improves Self-Supervised Representations

Confe, Cite

17, CVPR21

methodological points

- CAST (1) uses unsupervised saliency maps to intelligently sample crops and (2) provides grounding supervision via a Grad-CAM attention loss.

- The details of the method are illustrated in Fig. 3.

develop/

different points

- Two issues with recent CL: (1) poor grounding, (2) inconsistent samples [see Fig. 1].

- Current SSL methods perform best on iconic images and struggle on complex scene images with many objects.


Category

New invariance

for local representation

44 of 46


Paper

SCCL: Supporting Clustering with Contrastive Learning

Confe, Cite

21, NAACL 2021

methodological points

- Jointly optimizing (1) a clustering loss with (2) an instance-wise contrastive loss (InfoNCE).

- The clustering loss uses feature centroids.

- The target distribution (p) sharpens the soft-assignment probability (q).
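Assuming the DEC-style sharpening rule is what the slide refers to, the target distribution can be sketched as: square the soft assignments, normalize by per-cluster frequency, then renormalize each row (a numpy sketch):

```python
import numpy as np

def sharpen_targets(q):
    """Target distribution p from soft cluster assignments q of shape (N, C).

    Squaring emphasizes confident assignments; dividing by the column sums
    down-weights large clusters; row normalization makes p a distribution.
    """
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)
```

Training then minimizes the KL divergence between p and q, pulling each sample toward its most confident cluster.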

develop/

different points

- A significant challenge of SSL clustering: different categories often overlap with each other in the representation space at the beginning of learning.


Category

Clustering like SwAV

45 of 46

Paper

MaskCo: Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

Confe, Cite

5, ICCV21

methodological points

- They propose (1) a novel contrastive mask prediction (CMP) task and (2) the mask contrast (MaskCo) framework.

- Fig. 3 makes the proposed method easy to understand.

develop/

different points

- The instance discrimination task has an implicit semantic-consistency (SC) assumption, which may not hold on unconstrained datasets (COCO, multi-object).


Category

Siamese, MIM with CNN

46 of 46

Paper

methodological points

Several works have attempted to demystify the success of BYOL (Grill et al., 2020), a close variant of SimSiam. A technical report (Fetterman & Albrecht, 2020; BYOL1) suggested that batch normalization (BN) is essential to BYOL's success; however, a later work (Richemond et al., 2020; BYOL2) refutes this claim by showing that BYOL works without BN, as discussed in Appendix B.

develop/

different points

Category

Understanding BYOL

BYOL1: https://generallyintelligent.ai/blog/2020-08-24-understanding-self-supervised-contrastive-learning/

BYOL2: https://arxiv.org/abs/2010.10241