Are all negatives created equal in contrastive instance discrimination?
Presented by: Guillaume Lagrange, Nathaniel Simard
Introduction
Contrastive Learning
Summary
Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee and Fillia Makedon. A Survey on Contrastive Self-supervised Learning. arXiv preprint arXiv:2011.00362, 2020.
Self-Supervised Contrastive Learning
Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee and Fillia Makedon. A Survey on Contrastive Self-supervised Learning. arXiv preprint arXiv:2011.00362, 2020.
Self-supervised contrastive learning
Transfer learning to downstream task
Negative Pairs
Cons
Pros
Which Negatives Are Most Important?
Which Negatives Are Most Important?
Joshua Robinson, Ching-Yao Chuang, Suvrit Sra and Stefanie Jegelka. Contrastive Learning with Hard Negative Samples. arXiv preprint arXiv:2010.06682, 2020.
Supervised Contrastive Learning
Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes.
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu and Dilip Krishnan. Supervised Contrastive Learning. In NeurIPS, 2020.
Self-supervised contrastive loss: contrasts a single positive for each anchor against a set of negatives (rest of the batch).
Supervised contrastive loss: contrasts the set of all samples from the same class as positives against the negatives (remainder of the batch).
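As a reference, a minimal sketch of the two objectives for an anchor i (notation assumed here: z are normalized embeddings, τ a temperature, j(i) the anchor's augmented view, A(i) the other samples in the batch, P(i) those sharing the anchor's label):

$$\mathcal{L}_i^{\text{self}} = -\log \frac{\exp(z_i \cdot z_{j(i)} / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}
\qquad
\mathcal{L}_i^{\text{sup}} = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}$$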
Measuring the Impacts of Negative Samples
Framework analyzed: MoCo v2
https://untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html
[Diagram: MoCo v2. An input x is augmented into two views v and v'. An encoder maps v to the query q; a momentum encoder maps v' to the key k0. The contrastive loss compares the similarity of q with k0 (positive) and with the keys k1, k2, k3, ..., kK stored in the memory queue (negatives).]
What’s necessary or sufficient?
Without the hardest negatives
Debiased Contrastive Learning
Overview
Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.
Debiased Loss
Positive Samples
(M = 1) Unsupervised Learning: draw a different view of the input image
(M > 1) Semi-supervised Learning: draw multiple positive samples from the same class using labels
Debiased Loss
Re-weight the negative term by subtracting the expected contribution of false negatives (sampled "negatives" that actually share the anchor's class), estimated with the help of the positive samples.
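In symbols (a sketch following the paper's notation, where τ⁺ is the class prior, i.e. the probability that a sampled "negative" actually shares the anchor's class, the u_i are the N negatives, the v_i the M positives, and t the temperature):

$$\tilde{\mathcal{L}} = \mathbb{E}\left[ -\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + N\, g(x, \{u_i\}, \{v_i\})} \right],
\quad
g = \max\left\{ \frac{1}{1-\tau^+}\left( \frac{1}{N}\sum_{i=1}^{N} e^{f(x)^\top f(u_i)} - \frac{\tau^+}{M}\sum_{i=1}^{M} e^{f(x)^\top f(v_i)} \right),\; e^{-1/t} \right\}$$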
Debiased
Still optimizes for uniformity, while better encouraging tight clusters.
Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.
Results
Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.
[Max Schwarzer]
ImageNet has 1000 classes, so the odds of any negative sample falling into the same class as the source image are roughly 0.1%. Since the authors found that removing negatives in the same class had an effect roughly comparable to removing these extreme hard negatives, and since harder examples tend to be more semantically similar, doesn't this suggest that the 0.1% threshold is essentially filtering out same-class images? If so, should we expect the 0.1% finding to generalize to other datasets with different downstream tasks or numbers of classes?
Results
Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.
When to Use It?
Contrastive Learning with Hard Negative Samples
Contrastive Learning with Hard Negative Samples
Principle 1. Should only sample “negatives” whose labels differ from that of the anchor.
Without labels, approximate this by debiasing: correct for sampled "negatives" that are actually positives (and should lie near the anchor).
Principle 2. The most useful negative samples are ones that the embedding currently believes to be similar to the anchor.
Re-weight the negative samples based on their distance to the anchor in embedding space, up-weighting the closer (harder) ones.
Joshua Robinson, Ching-Yao Chuang, Suvrit Sra and Stefanie Jegelka. Contrastive Learning with Hard Negative Samples. arXiv preprint arXiv:2010.06682, 2020.
Debias + Reweighted Loss
Simple Pseudo Code
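A minimal PyTorch-style sketch of the combined debiased + hard-negative-reweighted objective, loosely following the released HCL code (tensor shapes and default values are assumptions here):

```python
import math
import torch

def hcl_loss(pos, neg, tau_plus=0.1, beta=1.0, temperature=0.5):
    """Debiased + hard-negative reweighted contrastive loss (sketch).

    pos: (B,)   exp(sim(anchor, positive) / temperature)
    neg: (B, N) exp(sim(anchor, negatives) / temperature)
    tau_plus: class prior (probability that a "negative" shares the anchor's class)
    beta: concentration parameter controlling how strongly hard negatives are emphasized
    """
    N = neg.size(1)

    # Principle 2: up-weight negatives the embedding currently finds similar to the anchor.
    # neg ** beta is proportional to exp(beta * sim / temperature).
    imp = (beta * neg.log()).exp()
    reweighted_neg = (imp * neg).sum(dim=-1) / imp.mean(dim=-1)

    # Principle 1: debias by subtracting the expected false-negative contribution.
    Ng = (-tau_plus * N * pos + reweighted_neg) / (1.0 - tau_plus)
    # Constrain to the theoretical minimum, as in the debiased estimator.
    Ng = torch.clamp(Ng, min=N * math.exp(-1.0 / temperature))

    return (-torch.log(pos / (pos + Ng))).mean()
```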
Summary
Emphasize the 5% hardest negatives → Re-weighting
Remove the 0.1% hardest negatives → Debiasing
Hard Negative Mixing for Contrastive Learning
Overview
To obtain more meaningful negative samples, hard negative mixing synthesizes harder negatives by mixing hard negative samples at the feature level, in the hope of facilitating better and faster learning.
Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel and Diane Larlus. Hard Negative Mixing for Contrastive Learning. In NeurIPS, 2020.
[Diagram: s + s' synthesized negatives are added to the K samples in the memory queue, with s + s' << K.]
https://untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html
Hard Negative Mining: A Brief History
Initially introduced in face detectors to reduce training computation over millions of background examples (negatives). Training starts with a small set of background samples and then iteratively adds newly misclassified backgrounds.
Bootstrapping
Sung & Poggio, 1994.
https://nextjournal.com/tempdata73/dealing-with-class-imbalances
Face detection
Collecting "face" (positive class) instances is fairly easy, but a good representation of the "non-face" (negative class) instances would have to include every other object in the universe.
In practice, this is limited to the negatives in the dataset, which can still be enormous. The number of false positives increases when the number of negative instances present in the dataset decreases, so we must find a balance.
Enter bootstrapping (a form of hard negative mining):
Start with a small set of non-face examples
Iteratively add the hard negatives, i.e. non-faces wrongly classified as positives
Hard Negative Mining: A Brief History
Similarly, in object detection, hard negative mining is used to address the severe class imbalance between positive and negative training examples (N >> P). It usually results in faster convergence, since the hard negatives contribute more to the gradient.
Seunghyeon Kim, Jaehoon Choi, Taekyung Kim and Changick Kim. Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection. In ICCV, 2019.
[Diagram: the detector is trained with ground-truth boxes (e.g. boat, person) as positive examples and background regions as negative examples; hard negative mining keeps only the negatives with the highest loss when computing the loss function.]
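A hedged sketch of that mining step (the function name and the 3:1 negative-to-positive ratio below are illustrative assumptions, not taken from the cited paper): keep all positives and only the negatives with the highest per-example loss.

```python
import torch

def hard_negative_mining_loss(losses, is_positive, neg_pos_ratio=3):
    """losses: per-example loss (M,); is_positive: bool mask (M,) of positive examples."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))

    # Hardest negatives = negatives with the highest loss.
    neg_losses = losses.masked_fill(is_positive, float("-inf"))
    hard_neg_idx = neg_losses.topk(num_neg).indices

    keep = is_positive.clone()
    keep[hard_neg_idx] = True
    return losses[keep].mean()
```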
Hard Negative Mining: A Brief History
Also referred to in the deep metric learning community as a strategy for triplet selection (mining) with the triplet loss or N-pair loss (closely related to the contrastive loss used here).
Hong Xuan, Abby Stylianou and Robert Pless. Improved Embeddings with Easy Positive Triplet Mining. In WACV, 2020.
[Diagram: among all negatives, the hardest negative is the one closest to the anchor in embedding space.]
Hard negative examples are the most similar images which have a different label from the anchor image.
See also: semi-hard negative mining, easy negative mining, hard positive mining, easy positive mining.
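As a hedged illustration of the idea (assumed shapes, not code from the cited paper), hardest-negative selection for a batch of embeddings and labels might look like:

```python
import torch

def hardest_negative_indices(embeddings, labels):
    """For each anchor, return the index of the most similar sample with a different label."""
    emb = torch.nn.functional.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                                   # (B, B) cosine similarities
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
    sim = sim.masked_fill(same_label, float("-inf"))      # exclude same-class samples (and self)
    return sim.argmax(dim=1)
```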
Method
MoCo v2
[Diagram: the MoCo v2 pipeline as before — the encoder produces query q from view v, the momentum encoder produces the positive key k0 from view v', and the negatives k1, ..., kK come from the memory queue; all are compared by similarity in the contrastive loss.]
Ke Ding, Xuanji He and Guanglu Wan. Learning Speaker Embedding with Momentum Contrast. arXiv preprint arXiv:2001.01986, 2020.
Method
MoCHi
Mixing of Contrastive Hard negatives
[Diagram: MoCHi on top of the MoCo v2 pipeline. The memory queue is ordered by similarity to the query q and truncated to the N most similar negatives n1, n2, ..., nN. Synthesized negatives are created by mixing these hard negatives, and synthesized hardest negatives by mixing the query with them; both are fed to the contrastive loss alongside k0, ..., kK.]
Mixup
Simple regularization method that works by interpolating input representations. It encourages neural networks to favor simple linear behavior in-between training examples.
Shahine Bouabid and Vincent Delaitre. Mixup Regularization for Region Proposal based Object Detectors. arXiv preprint arXiv:2003.02065, 2020.
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In ICLR, 2018.
Sample λ from the parametrized Beta(α, α) distribution, then mix two input images and their targets linearly.
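A minimal input-mixup sketch (assumed shapes; the per-batch λ and the interpolated loss follow the mixup formulation):

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=0.2):
    """Mix a batch with a shuffled copy of itself; return mixed inputs and both target sets."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    return x_mixed, y, y[perm], lam

# Usage: the loss is interpolated with the same coefficient.
# x_mixed, y_a, y_b, lam = mixup_batch(x, y)
# logits = model(x_mixed)
# loss = lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b)
```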
Manifold Mixup
Simple regularization method that encourages neural networks to predict less confidently on interpolations of hidden representations.
Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz and Yoshua Bengio. Manifold Mixup: Better Representations by Interpolating Hidden States. In ICML, 2019.
[Diagram: manifold mixup applies mixup at a random layer k. Hidden states are mixed as λh1 + (1-λ)h2, and the one-hot targets [1, 0, 0, 0] and [0, 0, 0, 1] are interpolated into [λ, 0, 0, 1-λ]. k = 0 corresponds to input mixup.]
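A hedged sketch of the idea (splitting the network into a list of `blocks` plus a classifier `head` is an assumption for illustration): pick a random layer, mix the hidden states of two examples there, and continue the forward pass.

```python
import numpy as np
import torch

def manifold_mixup_forward(blocks, head, x, y, alpha=2.0):
    """blocks: list of nn.Modules applied in sequence; head: final classifier."""
    k = np.random.randint(len(blocks))          # layer to mix at; k = 0 is input mixup
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))

    h = x
    for i, block in enumerate(blocks):
        if i == k:
            h = lam * h + (1 - lam) * h[perm]   # mix hidden states of two examples
        h = block(h)
    return head(h), y, y[perm], lam             # interpolate the loss with lam, as in mixup
```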
Feature Space Mixing of Hard Negatives
Construct an ordered set of the hardest negative features, sorted by decreasing similarity with the query and truncated to the N most similar items (n1, ..., nN)
For each query, synthesize s hard negative features by convex combination (mixup) of pairs of these hardest negatives, with a mixing coefficient in (0, 1)
For each query, synthesize s' even harder negative features by mixing the query itself with a hard negative, with a mixing coefficient in (0, 0.5) so the query contribution is always smaller (both mixing steps are sketched below)
The computational overhead of MoCHi is essentially equivalent to increasing the size of the queue/memory by s + s'
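A hedged sketch of the two mixing steps for a single query (shapes, default sizes, and the final re-normalization are assumptions based on the description above):

```python
import torch
import torch.nn.functional as F

def mochi_synthesize(q, queue, N=1024, s=1024, s_prime=256):
    """Synthesize hard negatives for a query q (D,) from a memory queue (K, D)."""
    # Order the queue by decreasing similarity to the query, keep the N hardest negatives.
    sims = queue @ q
    hardest = queue[sims.topk(N).indices]                      # (N, D)

    # s synthetic negatives: mix random pairs of hard negatives, coefficient in (0, 1).
    i, j = torch.randint(N, (s,)), torch.randint(N, (s,))
    alpha = torch.rand(s, 1)
    synth = alpha * hardest[i] + (1 - alpha) * hardest[j]

    # s' even harder negatives: mix the query with a hard negative,
    # coefficient in (0, 0.5) so the query contribution is always smaller.
    j2 = torch.randint(N, (s_prime,))
    beta = 0.5 * torch.rand(s_prime, 1)
    synth_hard = beta * q.unsqueeze(0) + (1 - beta) * hardest[j2]

    # Re-normalize and append to the negatives used in the contrastive loss.
    return F.normalize(torch.cat([synth, synth_hard]), dim=1)
```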
Results
[Results tables: MoCHi variants parameterized by (N, s, s').]
Dataset: ImageNet-100 (1k)
Warm-up: 10 (15) epochs where we do not synthesize hard negatives
Batch size: 128 (512)
Memory bank: 16k (65k)
Results
Hard negative mixing reduces alignment and increases uniformity for the dataset that is used during training.
[Nitarshan]
This paper trains its representations on ImageNet and evaluates on ImageNet - how do you think these findings would change in a transfer setting to another domain/task?
Results
Hard negative mixing reduces alignment and increases uniformity for the dataset that is used during training.
Results
Measuring Similarity of Negative Samples
Semantic Similarity to the Query
WordNet hierarchy (toy example): Living → Animal → {Domestic Animal → {Dog, Cat}, Elephant}; Living → Plant → {Flower → Rose, Tree → Oak}

| Classes | Common Ancestor | Depth |
| Dog - Cat | Domestic Animal | 2 |
| Rose - Oak | Plant | 1 |
| Dog - Elephant | Animal | 1 |
| Rose - Elephant | Living | 0 |
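This kind of common-ancestor depth can be computed with NLTK's WordNet interface (a sketch; picking the first noun synset is a simplifying assumption, and actual WordNet depths differ from the toy example above):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def common_ancestor_depth(word_a, word_b):
    """Lowest common ancestor of two nouns in WordNet and its depth in the hierarchy."""
    a = wn.synsets(word_a, pos=wn.NOUN)[0]
    b = wn.synsets(word_b, pos=wn.NOUN)[0]
    ancestor = a.lowest_common_hypernyms(b)[0]
    return ancestor, ancestor.min_depth()

# e.g. common_ancestor_depth('dog', 'cat') returns a shared animal-like hypernym and its depth,
# which is larger (more similar) than for common_ancestor_depth('dog', 'oak').
```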
Difficulty Distribution Across Negative Samples
Hierarchical Conceptual Understanding
Anti-Correlated
http://www.reportingday.com/wp-content/uploads/2018/06/Cat-Sleeping-Pics.jpg https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.aspca.org%2Fsites%2Fdefault%2Ffiles%2Fproblems-older-dog_main.jpg&f=1&nofb=1
[Hattie Zhou]
The existence of anti-correlated, semantically similar examples is very surprising / unintuitive. Can you speculate on what might be happening with those examples?
Hierarchical Conceptual Understanding
Anti-Correlated
http://www.reportingday.com/wp-content/uploads/2018/06/Cat-Sleeping-Pics.jpg https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.aspca.org%2Fsites%2Fdefault%2Ffiles%2Fproblems-older-dog_main.jpg&f=1&nofb=1
Conclusion
Conclusion
Discussion
The Elephant in the Room
Focusing on the positive pair?
The Elephant in the Room
What could be the pros and cons to focusing on the negatives versus focusing on the positives?
Where’s my memory?
Questions