A Survey on Masked Autoencoder
for Self-supervised Learning in Vision and Beyond
Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, In So Kweon
International Joint Conferences on Artificial Intelligence (IJCAI 2023)
Meeting Materials
[Figure: diagram of self-supervised learning. SSL pretraining is contrasted with supervised pretraining; discriminative SSL covers contrastive learning (CL), prediction pretext tasks, and metric learning, while generative SSL covers GAN-based and prediction-based approaches. "Ours" marks the masked-autoencoder line surveyed here.]
[Figure: timeline of SSL from 2008 to 2022, split into discriminative SSL (geometry-based prediction, joint-embedding methods) and generative SSL (denoising autoencoders, masked modeling, masked autoencoders) across language and vision.]

Generative SSL:
- Denoising Autoencoder (ICML 2008); Stacked Denoising Autoencoder (JMLR 2010)
- Inpainting (CVPR 2016); Automatic Colorization (ECCV 2016); Colorization (ECCV 2016); Colorization as proxy task (CVPR 2017); Split-Brain Autoencoders (CVPR 2017)
- Masked modeling, language: GPT-1 (2018); BERT [23] (NAACL 2019); GPT-2 (2019); GPT-3 (NeurIPS 2020)
- Masked modeling, vision: iGPT (ICML 2020); ViT ("iBERT" masked-patch experiment, ICLR 2021); BEiT (ICLR 2022); MAE (CVPR 2022)

Discriminative SSL:
- Jigsaw puzzles (ECCV 2016); Rotation prediction (ICLR 2018); MoCo (CVPR 2020); SimSiam (CVPR 2021)
[Flow chart: lineage of SSL methods, 2019-2022. Recoverable groupings:]
- InfoNCE lineage: CPC, CMC, MoCo, MoCo v2, MoCo v3, SimCLR, SimCLR v2, PIRL, InfoMin
- Clustering lineage: Deep Cluster, SwAV, SCCL, SEER, SEER v2, DINO (with ViT: DINO, MoCo v3; with larger data: SEER, RePS, DnC)
- w/o negatives: BYOL, SimSiam; channel-wise decorrelation: W-MSE, Barlow Twins, VICReg, Decorrelation
- New augmentation: InfoMin, Feature Transform, Viewmaker, IFM, MoCHi
- New invariance loss / distinct invariant attribute: OBoW, TWIST, Self-classifier, SSL-HSIC; for local representation: PixPro, CAST, SCRL; for any visual task: ReLIC
- NN positives: NNCLR, MYOW; graphic N positives: SpectralCL
- Potential/analysis of contrastive with large experiments: Demystify contrastive, Understand CL, Properties CL, ReCM; self-sup evaluation: SelfAugment, 13 SSL
- New pretext: JigsawClust; how to avoid collapse: additional losses, regularization, or a different approach
- Siamese/hybrid with MIM: iBOT, Splitmask, simMIM, CAE, BEiT, MAE, data2vec, MSN, MaskCo; TIM: AASAE
[Figure: two-stage masked-modeling pipeline with a tokenizer and a decoder. Stage 1 trains a tokenizer that maps image patches to discrete token IDs; Stage 2 trains the model to predict the token IDs of masked patches.]
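The two-stage pipelines above, and the MIM entries below, all begin by choosing which patches to hide. A minimal numpy sketch of random patch masking; `random_masking` and its parameters are illustrative, not taken from any specific paper:

```python
import numpy as np

def random_masking(num_patches: int, mask_ratio: float, seed: int = 0):
    """Uniformly sample which patch indices to mask (MAE/BEiT-style)."""
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    masked = np.sort(perm[:num_masked])   # indices the decoder must predict
    visible = np.sort(perm[num_masked:])  # indices the encoder sees
    return visible, masked

# e.g., a 14x14 ViT patch grid with a 75% mask ratio
visible, masked = random_masking(num_patches=196, mask_ratio=0.75)
```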
7
Paper | InfoMin: What Makes for Good Views for Contrastive Learning? | Confe, Cite | NIPS20, MIT & Google, 344 |
methodological points | - The views should share only the label information; augmented views that share other information are not optimal for contrastive learning. - They apply a learnable, invertible view generator with an unsupervised adversarial objective, which minimizes mutual information (MI) while keeping class/label information. - The generator synthesizes two invertible views from the same image x while minimizing the mutual information I between them, except for task-relevant information. | ||
develop/ different points | - Study the influence (importance) of different view choices. | ||
Category | New augmentation |
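InfoMin reasons about which views are good for the InfoNCE objective that most entries in these notes build on. As a reference point, a minimal numpy sketch of a generic InfoNCE loss (not InfoMin's exact implementation):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Minimal InfoNCE: row i of z1 is positive with row i of z2;
    all other rows of z2 act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # -log p(positive)
```

With aligned views the diagonal dominates and the loss is near zero; misaligned pairings drive it up.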
8
Paper | Feature Transform: Improving Contrastive Learning by Visualizing Feature Transformation | Confe, Cite | 7, ICCV21 |
methodological points | - Data augmentation -> feature transformation. - Observation: harder positives and negatives boost learning (positive but low-similarity pairs). - Proposal: feature transformation for harder positives and interpolation among negatives for harder negatives. As shown in Figure 6, the former decreases the similarity of a positive pair, while the latter synthesizes negatives closer to the anchor. | ||
develop/ different points | - Differing from data augmentation, they attempt feature-level manipulation. | ||
Category | New augmentation |
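The feature-level manipulation can be pictured with two hypothetical helpers: extrapolating a positive away from its anchor (lower similarity, i.e., harder) and interpolating two negatives. This is a sketch of the idea, not the paper's exact transformation:

```python
import numpy as np

def harder_positive(z_anchor, z_pos, alpha=0.5):
    """Extrapolate the positive away from the anchor (lower similarity)."""
    z = z_pos + alpha * (z_pos - z_anchor)
    return z / np.linalg.norm(z)

def harder_negative(z_neg_a, z_neg_b, lam=0.5):
    """Interpolate two negatives to synthesize a new, harder one."""
    z = lam * z_neg_a + (1 - lam) * z_neg_b
    return z / np.linalg.norm(z)
```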
9
Paper | Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases | Confe, Cite | 78, CMU&FAIR |
methodological points | - (Finding 1) MoCo and PIRL succeed at occlusion invariance but fail at viewpoint and category-instance invariance, which are crucial for object detection, i.e., the representation is sub-optimal. - (Finding 2) Despite poor viewpoint invariance, ImageNet ensures correspondence between different crops for a self-trained model. - This work proposes using video datasets as a more natural form of the instance discrimination task, which leads to higher viewpoint invariance. | ||
develop/ different points | - Previous methods learn only occlusion invariance (via cropping). | ||
Category | Potential / Analysis of contrastive learning |
10
Paper | W-MSE: Whitening for Self-Supervised Representation Learning | Confe, Cite | 37, ICML21 |
methodological points | - The Whitening-MSE method whitens (i.e., spreads out) the embedding vectors on the unit sphere and attracts each positive pair with an MSE loss. - The whitening operation has a "scattering" effect on the batch samples; see Fig. 1. | ||
develop/ different points | - Shares the decorrelation goal of VICReg. | ||
Category | New invariance loss |
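A minimal numpy sketch of the whitening-then-MSE idea, assuming ZCA whitening over the batch; details such as the paper's sub-batch grouping are omitted:

```python
import numpy as np

def whiten(z, eps=1e-5):
    """ZCA-whiten a batch: zero mean, (approximately) identity covariance."""
    z = z - z.mean(axis=0, keepdims=True)
    cov = z.T @ z / (len(z) - 1)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return z @ w

def w_mse_loss(z1, z2):
    """MSE between whitened, L2-normalized positive pairs."""
    w1, w2 = whiten(z1), whiten(z2)
    w1 = w1 / np.linalg.norm(w1, axis=1, keepdims=True)
    w2 = w2 / np.linalg.norm(w2, axis=1, keepdims=True)
    return np.mean(np.sum((w1 - w2) ** 2, axis=1))
```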
11
Paper | OBoW: Online Bag-of-Visual-Words Generation for Self-Supervised Learning | Confe, Cite | 12, CVPR21 |
methodological points | - To learn contextual reasoning, they encourage the model to reconstruct a bag-of-visual-words (BoW) representation. - The bag of visual words is updated online, similar to the code update in SwAV. | ||
develop/ different points | - Compared to perturbation invariance, contextual reasoning might be more favorable for reconstruction (a generative pretext). - (1) OBoW exploits a BoW prediction task while SwAV uses an image-cluster prediction task. (2) A BoW encodes all the local features (all elements) of an image, whereas an image-cluster assignment encodes only one global feature (centroid). Therefore, BoW targets are a richer representation. | ||
Category | Different Invariance / New invariance loss |
12
Paper | PixPro: Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning | Confe, Cite | MS, 90, CVPR21 |
methodological points | - Pixel-level contrastive learning: features from corresponding pixels of the two views are encouraged to be consistent. - The first task, the pixel contrastive task (PixContrast), was inferior to the pixel consistency task (PixPro). However, PixPro requires a PPM module, which provides the asymmetry (like a predictor or Sinkhorn-Knopp) needed to avoid collapse. | ||
develop/ different points | - Current methods are trained only on instance-level pretext tasks (cropped regions are represented at the instance level). | ||
Category | New invariance loss for local representation |
13
Paper | ReLIC: Representation Learning via Invariant Causal Mechanisms | Confe, Cite | Deepmind, 51 |
methodological points | - An invariance objective (constraint) on pre-trained classifiers is required for effective data augmentation (Figure 1(b) summarizes the whole paper). - The principle of invariant prediction in BYOL is a sufficient condition for representation learning. | ||
develop/ different points | - Shows an analysis of generalization ability and robustness, unlike other papers. - Provides an alternative explanation (to mutual information) of the similarity objective in BYOL. - This framework (objective) can be used in any vision task, unlike the vanilla contrastive loss. | ||
Category | New invariance loss for any visual task |
14
Paper | Barlow Twins: Self-Supervised Learning via Redundancy Reduction | Confe, Cite | FAIR, 277 |
methodological points | - A new objective function that naturally avoids collapse by pushing the cross-correlation matrix toward the identity matrix. - This makes the vectors of distorted samples similar while minimizing redundancy between channels (the i-th channel feature should differ from the j-th). | ||
develop/ different points | - The method does not require large batches, asymmetry, stop-gradient, and momentum update. | ||
Category | New invariance loss |
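The cross-correlation objective is compact enough to sketch directly; this is a simplified numpy version (the real method standardizes projector outputs and uses a tuned lambda):

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation matrix of the two views toward identity."""
    n, d = z1.shape
    # standardize each embedding dimension over the batch
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / n                                    # (d, d) cross-correlation
    on_diag = np.sum((np.diag(c) - 1) ** 2)              # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy-reduction term
    return on_diag + lam * off_diag
```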
15
Paper | VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning | Confe, Cite | 59, FAIR, ICCV21 |
methodological points | - Two regularization terms for avoiding collapse and stabilizing training: (1) maintaining the variance of each dimension above a threshold, and (2) decorrelating each pair of dimensions. - See Figure 2 for a conceptual comparison. | ||
develop/ different points | - Unlike most approaches, VICReg does not require techniques such as large batches, negative samples, weight sharing (both shared and separate weights work in VICReg), BN or feature normalization, output quantization (SwAV), stop-gradient, a memory bank, a momentum encoder, etc. - Without an asymmetric architecture, Barlow Twins, whitening, and VICReg aim to maximize the information content of the embedding; additionally, VICReg prevents informational collapse by decorrelating. | ||
[Figure: the covariance regularization penalizes the sum of the squared off-diagonal coefficients of the covariance matrix; the variance regularization keeps each dimension's standard deviation above a threshold.]
Category | New invariance loss |
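A simplified numpy sketch of the three terms; the coefficients (25, 25, 1) are the commonly cited VICReg defaults, but treat the exact values here as placeholders:

```python
import numpy as np

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """Invariance (MSE) + variance hinge + covariance (off-diagonal) terms."""
    n, d = z1.shape
    sim = np.mean(np.sum((z1 - z2) ** 2, axis=1))        # invariance term

    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))     # keep std above gamma

    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        return (np.sum(cov ** 2) - np.sum(np.diag(cov) ** 2)) / d

    return (sim_w * sim
            + var_w * (var_term(z1) + var_term(z2))
            + cov_w * (cov_term(z1) + cov_term(z2)))
```

A collapsed batch (every row identical) is penalized by the variance hinge even though its invariance term is zero.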
16
Paper | Decorrelation: On Feature Decorrelation in Self-Supervised Learning | Confe, Cite | 20, ICCV21 |
methodological points | - (Whitening = standardizing the covariance = feature decorrelation.) - Concise framework: a simple symmetric architecture (augmented views, encoder, MSE). - Collapse can be divided into complete collapse (constant output) and dimensional collapse (the projected features collapse into a low-dimensional manifold, i.e., strong correlation between axes). - They use a decorrelated batch normalization (DBN) layer: (1) algorithm: divide features into groups and apply whitening; (2) effect: alleviates the dimensional collapse issue. | ||
develop/ different points | - First to identify dimensional collapse (as distinct from complete collapse). | ||
Category | To avoid collapse / New invariance loss |
17
18
Paper | RePS: Rethinking Pre-training and Self-training | Confe, Cite | GoogleBrain, 239 |
methodological points | - The common belief, as argued by Kaiming He, is that a model pre-trained on ImageNet has limitations compared to a model trained directly on COCO. What about a self-training model? - Results: (1) strong augmentation and more labeled data do not always help pre-training, but they do help self-training; (2) self-training improves upon pre-training. - Conclusion: self-training has far greater potential than pre-training. | ||
develop/ different points | - Compares the potential of pre-training on a large dataset against contrastive self-training for other computer vision tasks. | ||
Category | Potential of self-training |
19
Paper | 13SSL: How Well Do Self-Supervised Models Transfer? | Confe, Cite | 40, CVPR21 |
methodological points | - Evaluates the transfer performance of 13 top self-supervised models on 40 downstream tasks, including many-shot and few-shot recognition, object detection, and dense prediction. - Poses several open questions and answers four of them; the Introduction lays them out clearly. | ||
develop/ different points | - Despite significant progress, a number of important open questions remain. - Large-scale evaluation is needed. | ||
Category | Comparison of SSL methods / Potential of self-training |
20
Paper | SelfAugment: Automatic Augmentation Policies for Self-Supervised Learning | Confe, Cite | 10, CVPR21 |
methodological points | - Evaluation with a self-supervised image-rotation task is highly correlated with a standard set of supervised evaluations. - Automatic and efficient augmentation selection using this correlation. - Self-supervised evaluation to filter hundreds of augmentation policies, training settings, and network architectures. | ||
develop/ different points | - Supervised evaluation requiring labeled data is not practical (i.e., measuring a method's effectiveness by evaluating it with a labeled dataset). | ||
Category | Self-sup evaluation |
21
Paper | Jigsaw Clustering for Unsupervised Visual Representation Learning | Confe, Cite | 7, CVPR21 |
methodological points | - Batch images are split, shuffled, and stitched. The target is to recover the disrupted parts to the original images, like a puzzle game. - Previous Jigsaw SSL permutes patches within a single image, but this work does so within a batch, so the network learns both intra- and inter-image information. - It consists of a clustering branch and a location branch. | ||
develop/ different points | - An incremental extension of Jigsaw puzzles. - Contrastive methods duplicate each training batch, i.e., a dual batch; in contrast, this work constructs a single batch. | ||
Category | New pretext task (not contrastive) |
22
Paper | CAE: Context Autoencoder for Self-Supervised Representation Learning | Confe, Cite | |
methodological points | - Two new modules: (1) a latent contextual regressor and (2) an alignment loss module. - (Insight) By caring about the representations of the patches, MIM (masked image modeling) methods learn the semantics of both the center regions (1000 classes) and other regions (potential classes). This differs from contrastive methods, which tend to learn semantics mainly from the center patches (see Fig. 6). | ||
develop/ different points | - (1) Decouples the encoding role (content understanding) from the decoding role (making predictions for masked patches). - (2) Masked patches are also cast into the latent space. | ||
Category | New MAE |
23
Paper | DINO: Emerging Properties in Self-Supervised Vision Transformers (DINO: self-DIstillation with NO labels.) | Confe, Cite | FAIR, 276 |
methodological points | - The attention maps extracted from an SSL+ViT model show that the dependence on the background is very low. - Key methods: (1) cross-entropy loss with an EMA (mean) teacher model, i.e., knowledge distillation; (2) multi-crop (224x224 for the teacher, 96x96 for the student); (3) centering (a momentum-updated center is subtracted from the output features) and sharpening (a low softmax temperature) to escape collapsed solutions. | ||
develop/ different points | - They interpret the cross-entropy loss with an EMA teacher model as knowledge distillation. - To escape collapse, previous works used contrastive losses with negative samples, clustering constraints, predictors, or batch normalization; here they propose centering and sharpening instead. | ||
Category | Potential with Transformers / Centering |
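Centering and sharpening are easy to sketch: the teacher subtracts an EMA center from its logits and applies a low-temperature softmax. A minimal numpy sketch (the class name and default values are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class DinoTeacherHead:
    """Teacher output = softmax((logits - center) / tau).
    Centering: the center is an EMA of batch means.
    Sharpening: a small tau makes the distribution peaky."""
    def __init__(self, dim, tau=0.04, momentum=0.9):
        self.center = np.zeros(dim)
        self.tau = tau
        self.m = momentum

    def __call__(self, logits):
        probs = softmax((logits - self.center) / self.tau, axis=1)
        # EMA update of the center with the current batch mean
        self.center = self.m * self.center + (1 - self.m) * logits.mean(axis=0)
        return probs
```

Centering discourages one dimension from dominating, while sharpening discourages the uniform solution; together they keep the teacher away from collapse.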
24
Paper | SEER: Self-supervised Pretraining of Visual Features in the Wild | Confe, Cite | FAIR, 64 |
methodological points | - Large models on random, uncurated images (Instagram): a RegNetY with 1.3B parameters trained on 1B random images. - Uses the SwAV method. To save GPU memory, they use gradient checkpointing and mixed precision. | ||
develop/ different points | - Datasets originally collected for supervised and weakly-supervised learning represent a limited fraction of the general distribution. | ||
Category | With larger data / Potential of self-training |
25
Paper | SEER 10B para: Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision | Confe, Cite | FAIR, 1 |
methodological points | - FAIR's project-result paper: a model trained on diverse, real, unfiltered internet data without any data pre-processing. - They demonstrate that such a model has properties such as geographical diversity, fairness, robustness, multilingual hashtag embeddings, and better artistic and semantic information. - Experiments show that such a model is more robust, fairer, less harmful, and less biased than supervised models. | ||
develop/ different points | - Prior methods assume that, with the dataset they used, they produce features that are general enough to be re-used as they are in a variety of supervised tasks. | ||
Category | With much larger data / Potential of self-training |
26
Paper | MSN: Masked Siamese Networks for Label-Efficient Learning | Confe, Cite | 0 |
methodological points | - Combines the swapped assignment of SwAV, the knowledge distillation (momentum + CE loss) of DINO, and masking as a strong augmentation applied to the student. - MSN does not predict the masked patches at the input level; rather, it performs the denoising step implicitly at the representation level (self-denoised representation learning) by ensuring that the representation of the masked input matches that of the unmasked one. | ||
develop/ different points | - Especially, it has competitive performance in low-shot classification, compared to DINO and MAE. - Joint-embedding architectures (Siamese) avoid reconstruction, but apply image transformation. | ||
Category | Siamese with MIM |
27
Paper | AASAE: Augmentation-Augmented Stochastic Autoencoders | Confe, Cite | 1 |
methodological points | - Autoencoder (reconstruction task) with image transformation, not image masking. | ||
develop/ different points | - A unique, reconstruction-based approach, different from both contrastive and non-contrastive (w/o negatives) methods. - No negatives / small mini-batches. | ||
Category | TIM (transformed information modeling / reconstruction) |
28
Paper | TWIST: Self-Supervised Learning by Estimating Twin Class Distributions | Confe, Cite | 2 |
methodological points | - TWIST = Twin Class Distribution Estimation. - (1) Enforces twin (consistent) class distributions for two augmented views. - To avoid collapse, they (2, sharpness term) minimize the entropy of the distribution for each sample to make the class predictions confident, and (3, diversity term) maximize the entropy of the mean distribution to make the predictions of different samples diverse. | ||
develop/ different points | - Without asymmetric / stop-gradient / momentum encoder - They show the comparisons of TWIST to DINO, SwAV, Self-classifier, and Barlow Twins. | ||
Category | New invariance loss |
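The two entropy regularizers can be sketched in a few lines of numpy; `twist_regularizers` is an illustrative name, and the full TWIST loss also includes the consistency term between the two views:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def twist_regularizers(probs):
    """probs: (N, C) class distributions of one view.
    sharpness: mean per-sample entropy (minimized -> confident predictions)
    diversity: entropy of the mean distribution (maximized -> varied classes)"""
    sharpness = np.mean(entropy(probs, axis=1))
    diversity = entropy(probs.mean(axis=0))
    return sharpness, diversity
```

One-hot predictions spread evenly over classes are the ideal case: per-sample entropy near zero, mean-distribution entropy at its maximum log C.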
29
Paper | Self-Supervised Classification Network | Confe, Cite | 4, IBM |
methodological points | - The self-classifier (not a predictor or projector, but an actual classifier) learns labels and representations simultaneously. - To avoid degenerate solutions: (1) a variant of cross-entropy: applying Bayes' rule to the CE loss yields a new form that is mathematically equivalent but avoids collapse (proof in Sec. 4); (2) asserting a uniform prior on class predictions: p(y) = 1/C and p(x) = 1/N. | ||
develop/ different points | - Without a single label, 41.1% ImageNet accuracy. - It does not require a memory bank, a second network(momentum), external clustering, stop-gradient operation, or negative pairs. | ||
[Figure: the uniform prior makes the model avoid collapse.]
Category | new form of CE loss |
30
Paper | iBOT: Image BERT Pre-Training with Online Tokenizer | Confe, Cite | 22.ICLR22 |
methodological points | - iBOT performs masked prediction with an online tokenizer, which is the momentum-updated teacher network. - Losses: (1) self-distillation on the [CLS] token, as in DINO; (2) a MIM objective. | ||
develop/ different points | - BEiT uses a pre-trained discrete VAE as the tokenizer, which only captures low-level semantics within local details; moreover, the tokenizer must be pre-trained offline, which limits its adaptivity when performing MIM on data from different domains. - iBOT's tokenizer captures high-level visual semantics and needs no extra training stage. | ||
Category | Siamese with MIM |
31
Paper | NNCLR: With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations | Confe, Cite | 45, ICCV 2021, Deepmind |
methodological points | - NNCLR samples the nearest neighbors of each sample from the dataset in latent space and treats them as positives, which provides more semantic variation. - The proposed method is less reliant on complex data augmentations: a relative reduction of only 2.1% in ImageNet accuracy is seen when training with random crops alone. | ||
develop/ different points | - Most CL papers treat different views (by deformations) as positives, but they are interested in using positives from other instances in the dataset. - Past clustering-based methods also consider single instance positives, but assuming the entire cluster (or its prototype) to be positives could hurt performance due to early over-generalization. - In previous works, the onus of generalization lies heavily on the data augmentation pipeline, which cannot cover all the variances in a given class. | ||
Category | NN positives |
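The nearest-neighbor lookup at the heart of NNCLR is a one-liner over a support queue; a numpy sketch (in the real method the queue holds past embeddings and the retrieved neighbor replaces one view in the contrastive loss):

```python
import numpy as np

def nearest_neighbor_positive(z, queue):
    """For each embedding, return its nearest neighbor in a support queue
    (by cosine similarity); the neighbor is then used as the positive."""
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    q_n = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    idx = np.argmax(z_n @ q_n.T, axis=1)   # index of the nearest queue entry
    return queue[idx], idx
```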
32
Paper | ReCM: Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations | Confe, Cite | 3, NIPS21 |
methodological points | - The first study of how dataset biases (beyond ImageNet) affect existing methods; for example, MoCo performs surprisingly well on both object- and scene-centric datasets. - A study of additional invariances: (1) multi-scale cropping, (2) a stronger augmentation policy, and (3) nearest neighbors. -> A key component is the augmentation strategy involving random cropping (cf. masked information). -> Why? When multiple objects are present, a positive pair of non-overlapping crops can wrongfully match the feature representations of different objects. | ||
develop/ different points | - Most methods still train on ImageNet images, whose properties are (1) a single object in the center of the image and (2) a class-uniform distribution. - To deploy self-supervised learning in the wild, we need to quantify the dependence on these properties. | ||
Category | Potential / Analysis of contrastive learning |
33
Paper | MYOW: Mine Your Own vieW: Self-Supervised Learning Through Across-Sample Prediction | Confe, Cite | 11, Georgia Tech |
methodological points | - For contrastive learning without sufficient diversity in the transformations, MYOW utilizes the dataset itself to find similar samples (= mined views), i.e., using the k-nearest neighbors of the anchor representation among the target representations of a pool of candidates. | ||
develop/ different points | - Previous works are challenging to find the right balance between augmentations that not only introduce sufficient diversity but also preserve the semantics of the original data. - MYOW obtains more diversity by finding diverse views within the dataset, not limited to a single instance. | ||
[Figure: candidates are drawn from the dataset; the mined view is selected as a nearest neighbor of the anchor in feature space and passed through a projector.]
Category | NN positives |
34
Paper | SpectralCL: Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss | Confe, Cite | 22.NIPS21 |
methodological points | - Positive pairs from the population augmentation graph leveraging continuity of the population data within the same class (=close neighbors within the same class). | ||
develop/ different points | - Conditional independence of positive pairs -> (past) same class label -> (current) augmented views (not practical) -> (ours) samples in 'augmentation graph'. | ||
Category | Graphic N positives |
35
Paper | DnC: Divide and Contrast: Self-supervised Learning from Uncurated Data | Confe, Cite | 12, Deepmind ICCV 2021 |
methodological points | - How to handle larger, less-curated datasets such as YFCC, unlike ImageNet, which is not a general, diverse, uncurated dataset (it involves heavy human curation, e.g., being restricted to 1000 classes). - The base model learns global consistency while the expert models capture local consistency in subsets of YFCC. - A stronger baseline, MoCLR (= SimCLR + BYOL), is suggested. | ||
develop/ different points | - Between ImageNet and YFCC, a 'curation gap' exists due to a shift in the distribution of image classes. - [Figure 1] DnC better handles the diverse, long-tailed distribution of images and improves more with longer training than BYOL and MoCLR (SimCLR + BYOL). | ||
[Figure: a pre-trained MoCLR (SimCLR + BYOL) serves as the base model; MoCLR is then re-trained on data subsets to obtain the k-th expert models.]
Category | How to handle Larger data |
36
Paper | IFM: Can contrastive learning avoid shortcut solutions? | Confe, Cite | 8, NIPS21 |
methodological points | - To learn representations without inductive biases (i.e., that generalize), implicit feature modification (IFM) is proposed, which is plug-and-play with any SSL method. - They modify features by applying transformations to encoded samples v = f(x); because the encoded samples are modified instead of the raw inputs x, the method is called implicit. - For generalized SSL representations, the embeddings of positive and negative samples are modified to remove already well-represented features. - Rather than applying extremely challenging input perturbations, it is advantageous to transform positive and negative features that are easy to discriminate. | ||
develop/ different points | - Representations learned by CL are likely to inadvertently suppress features that are predictive for downstream tasks (= shortcut solutions, inductive biases); i.e., although hard augmentation improves ImageNet accuracy, it can suppress well-represented features. | ||
[Figure: strong raw-image perturbation is not always good; IFM instead perturbs features, comparing InfoNCE vs. InfoNCE + IFM.]
Category | New augmentation |
37
Paper | SSL-HSIC: Self-Supervised Learning with Kernel Dependence Maximization | Confe, Cite | 7, NIPS21 |
methodological points | - Maximizes dependence (similarity) between the representations of two views while minimizing the kernelized variance. - [Mathematically] Dependence is measured with the Hilbert-Schmidt Independence Criterion (HSIC); the proposed SSL-HSIC loss is inspired by the HSIC bottleneck. - InfoNCE can be viewed as SSL-HSIC with a variance-based regularization. | ||
develop/ different points | - The SSL-HSIC loss itself penalizes trivial solutions, so techniques such as target networks are not needed. | ||
Category | New invariance loss |
38
Paper | MoCHi: Hard Negative Mixing for Contrastive Learning | Confe, Cite | 153, NIPS20 |
methodological points | - (Finding) Harder negatives are needed to facilitate better and faster learning; thus, hard-negative mixing strategies at the feature level are proposed. - Synthesizing hard negatives by mixing some of the hardest negative features of the contrastive loss. - The hardest negatives are the features closest to each positive. | ||
develop/ different points | - So far, large batches and large memory banks have been used to obtain more meaningful negative samples. | ||
Category | New negative feature augmentation |
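Hard-negative mixing can be sketched as: rank the negatives by similarity to the query, then mix random pairs of the hardest ones. A numpy sketch with illustrative parameter choices (MoCHi also mixes negatives with the query itself, omitted here):

```python
import numpy as np

def mochi_hard_negatives(query, negatives, n_synth=4, rng=None):
    """Synthesize hard negatives by mixing the negatives closest to the query
    (MoCHi-style sketch; mixing coefficients are sampled uniformly)."""
    if rng is None:
        rng = np.random.default_rng(0)
    q = query / np.linalg.norm(query)
    neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    order = np.argsort(-(neg @ q))            # hardest (most similar) first
    hard = neg[order[:max(2, n_synth)]]
    synth = []
    for _ in range(n_synth):
        i, j = rng.choice(len(hard), size=2, replace=False)
        lam = rng.uniform()
        v = lam * hard[i] + (1 - lam) * hard[j]
        synth.append(v / np.linalg.norm(v))   # back onto the unit sphere
    return np.stack(synth)
```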
39
Paper | Understanding CL: Understanding the Behaviour of Contrastive Loss | Confe, Cite | 63, CVPR21 |
methodological points | - This paper analyzes the importance and effect of the temperature in the contrastive loss. - Summary: the smaller the temperature, the stronger the penalty on hard negatives and the more uniform the features; a larger temperature is more tolerant to semantically similar (hard negative) samples, so features useful for downstream tasks can be learned. - Analysis: (1) the contrastive loss is a hardness-aware loss function (Sec. 3), which automatically concentrates on separating the more informative negative samples to make the embedding distribution more uniform, so the temperature matters; (2) the temperature τ controls the strength of the penalties on hard negative samples. | ||
develop/ different points | - The temperature is key to learning separable features that are tolerant (robust) to semantically similar samples and useful for downstream tasks. | ||
Category | The effect of temperature in CL |
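The hardness-aware behaviour is visible in the softmax weights that the loss assigns to negatives: a small temperature concentrates nearly all of the penalty on the hardest negative. A small numpy illustration (the similarity values are made up):

```python
import numpy as np

def negative_weights(sims, tau):
    """Relative gradient weight each negative receives in the contrastive
    loss: proportional to exp(sim / tau), i.e., a softmax over negatives."""
    e = np.exp(sims / tau)
    return e / e.sum()

sims = np.array([0.9, 0.5, 0.1])   # one hard negative, two easier ones
w_small = negative_weights(sims, tau=0.05)  # almost all weight on the hardest
w_large = negative_weights(sims, tau=1.0)   # weight spread more evenly
```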
40
Paper | Property CL: Intriguing Properties of Contrastive Losses | Confe, Cite | 34, NIPS21 |
methodological points | - This paper aims to understand the effectiveness and limitations of existing contrastive learning. - Aspect 1 (Sec. 2): a generalization of the standard contrastive loss. - Aspect 2 (Sec. 3): whether it leads to meaningful local features on images with multiple objects present -> SimCLR can learn from images with multiple objects. - Aspect 3 (Sec. 4): the feature-suppression phenomenon (e.g., "color distribution" and "object class" are often competing features, a trade-off). | ||
develop/ different points | - Existing contrastive methods critically rely on data augmentation to favor certain sets of features over others, and can suffer from learning saturation in some scenarios. | ||
Category | Analysis of CL |
41
Paper | SCRL: Spatially Consistent Representation Learning | Confe, Cite | 28, CVPR21 |
methodological points | - They propose a spatially consistent representation learning algorithm (SCRL) for multi-object and location-specific tasks, in order to realize the invariant spatial representation corresponding to the same cropped region under augmentations of a given image. | ||
develop/ different points | - While previous contrastive methods mainly focus on generating invariant global representations at the image-level, they are prone to overlook spatial consistency of local representations. - Especially, previous methods aggressively crop views, which leads to minimizing representation distances between the semantically different regions of a single image, i.e., performance degradation on local representation. | ||
[Figure: built on the BYOL architecture.]
Category | New invariance for local representation |
42
Paper | Viewmaker Networks: Learning Views for Unsupervised Representation Learning | Confe, Cite | 22, ICLR21 |
methodological points | - Viewmaker networks: generative models that learn to produce useful views from a given input. - The viewmaker is trained adversarially to output a stochastic perturbation that is added to the input; this perturbation is projected onto an l_p sphere, which controls the effective strength of the view. - An image-to-image neural network is used as the viewmaker, with an architecture adapted from work on style transfer. | ||
develop/ different points | - Designing views requires considerable trial and error by human experts, hindering widespread adoption (generalization). - InfoMin is not for a broadly-applicable approach for learning views. | ||
[Figure: additional positives are generated by the viewmaker.]
Category | New augmentation |
43
Paper | CASTing Your Model: Learning to Localize Improves Self-Supervised Representations | Confe, Cite | 17, CVPR21 |
methodological points | - CAST (1) uses unsupervised saliency maps to intelligently sample crops and (2) provides grounding supervision via a Grad-CAM attention loss. - The method's details are illustrated in Fig. 3. | ||
develop/ different points | - Two issues with recent CL (1) poor grounding (2) inconsistent samples [See fig.1] - Current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. | ||
[Figure: issues with previous works -> prerequisite (saliency maps) -> main method with two losses.]
Category | New invariance for local representation |
44
Paper | SCCL: Supporting Clustering with Contrastive Learning | Confe, Cite | 21, NAACL 2021 |
methodological points | - Jointly optimizes (1) a clustering loss and (2) an instance-wise contrastive loss (InfoNCE). - The clustering loss uses feature centroids; a target distribution (p) sharpens the soft-assignment probability (q). | ||
develop/ different points | - A significant challenge of SSL clustering: different categories often overlap with each other in the representation space at the beginning of the learning | ||
Category | Clustering like SwAV |
Paper | MaskCo: Self-Supervised Visual Representations Learning by Contrastive Mask Prediction | Confe, Cite | 5, ICCV21 |
methodological points | - They propose (1) a novel contrastive mask prediction (CMP) task and (2) a mask contrast (MaskCo) framework. - Fig. 3 makes the proposed method easy to grasp. | ||
develop/ different points | - The instance discrimination task carries an implicit semantic consistency (SC) assumption, which may not hold on unconstrained datasets (e.g., COCO, with multiple objects). | ||
[Figure: issues with previous works -> the proposed new task (CMP) -> method.]
Category | Siamese, MIM with CNN |
Paper | | | |
methodological points | Several works have attempted to demystify the success of BYOL (Grill et al., 2020), a close variant of SimSiam. A technical report (Fetterman & Albrecht, 2020; BYOL1) suggested that batch normalization (BN) is critical to BYOL's success; however, a later work (Richemond et al., 2020; BYOL2) refutes this claim by showing that BYOL works without BN, as discussed in Appendix B. | ||
develop/ different points | | ||
Category | Understanding BYOL |
BYOL1: https://generallyintelligent.ai/blog/2020-08-24-understanding-self-supervised-contrastive-learning/
BYOL2: https://arxiv.org/abs/2010.10241