1 of 52

Are all negatives created equal in contrastive instance discrimination?

Presented by: Guillaume Lagrange, Nathaniel Simard

2 of 52

Introduction

3 of 52

Contrastive Learning

Summary

  • Similar to augmented views
  • Dissimilar to negative views
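
For reference, this objective is usually written as the InfoNCE (NT-Xent) loss; a standard formulation, with notation of our own choosing (q is the anchor/query, k+ its augmented view, k_i the K negatives, sim a dot-product or cosine similarity, τ a temperature):

```latex
\mathcal{L}_{q} = -\log
\frac{\exp\!\left(\mathrm{sim}(q, k^{+})/\tau\right)}
     {\exp\!\left(\mathrm{sim}(q, k^{+})/\tau\right) + \sum_{i=1}^{K}\exp\!\left(\mathrm{sim}(q, k_{i})/\tau\right)}
```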

Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee and Fillia Makedon. A Survey on Contrastive Self-supervised Learning. arXiv preprint arXiv:2011.00362, 2020.

4 of 52

Self-Supervised Contrastive Learning

Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee and Fillia Makedon. A Survey on Contrastive Self-supervised Learning. arXiv preprint arXiv:2011.00362, 2020.

Self-supervised contrastive learning

Transfer learning to downstream task

5 of 52

Negative Pairs

  • Maximize distance between different images.

Pros

  • Avoid trivial solution where everything is similar.

Cons

  • Maximize distance between different images that are potentially similar.
  • Increased complexity.

6 of 52

Which Negatives Are Most Important?

7 of 52

Which Negatives Are Most Important?

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra and Stefanie Jegelka. Contrastive Learning with Hard Negative Samples. arXiv preprint arXiv:2010.06682, 2020.

8 of 52

Supervised Contrastive Learning

Clusters of points belonging to the same class are pulled together in embedding space, while simultaneously pushing apart clusters of samples from different classes.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu and Dilip Krishnan. Supervised Contrastive Learning. In NeurIPS, 2020.

Self-supervised contrastive loss: contrasts a single positive for each anchor against a set of negatives (rest of the batch).

Supervised contrastive loss: contrasts the set of all samples from the same class as positives against the negatives (remainder of the batch).
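
For comparison with the self-supervised loss above, the supervised contrastive loss can be written as follows (our rendering of the paper's "L_out" formulation; P(i) is the set of samples sharing anchor i's label, A(i) all other samples in the batch, z the normalized embeddings, τ the temperature):

```latex
\mathcal{L}^{\text{sup}} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
\log \frac{\exp\!\left(z_i \cdot z_p / \tau\right)}
          {\sum_{a \in A(i)} \exp\!\left(z_i \cdot z_a / \tau\right)}
```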

9 of 52

Measuring the Impacts of Negative Samples

10 of 52

Framework analyzed: MoCo v2

https://untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html

[Figure: the MoCo v2 pipeline. An input x is augmented into two views v and v'; the encoder maps v to the query q, the momentum encoder maps v' to the key k0, and a memory queue holds past keys k1, k2, k3, ..., kK; query-key similarities feed the contrastive loss.]

11 of 52

What’s necessary or sufficient?

  • The difficulty of a negative is measured by its similarity to the query
    • More similar → harder
  • The similarity is calculated with a dot product.
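
A minimal sketch of this hardness measure (PyTorch; shapes and names are ours, assuming L2-normalized embeddings):

```python
import torch
import torch.nn.functional as F

# q: query embedding; negatives: memory-queue embeddings (both L2-normalized)
q = F.normalize(torch.randn(128), dim=0)
negatives = F.normalize(torch.randn(4096, 128), dim=1)

difficulty = negatives @ q                     # dot-product similarity to the query
ranking = difficulty.argsort(descending=True)  # most similar first, i.e. hardest first
hardest = negatives[ranking[:5]]               # e.g. the 5 hardest negatives
```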

12 of 52

Without the hardest negatives

  • Debiased
  • Approx Debiased
  • Biased + 0.1% Removed

13 of 52

Debiased Contrastive Learning

14 of 52

Overview

  • “Debiased” contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels.
  • Assumes a downstream classification task (is this good for transfer learning?)

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.

15 of 52

Debiased Loss

  • True Positive
  • Negative Samples
  • Class Probability
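
To make these pieces concrete, here is our rendering of the debiased objective from Chuang et al., with the temperature t written explicitly; τ+ is the class prior (probability that a random sample shares the anchor's class), u_i the N negatives, v_i the M positives:

```latex
\mathcal{L}_{\text{debiased}} =
\mathbb{E}\!\left[-\log
\frac{e^{f(x)^{\top} f(x^{+})/t}}
     {e^{f(x)^{\top} f(x^{+})/t} + N\, g\big(x,\{u_i\},\{v_i\}\big)}\right],
\qquad
g = \max\!\left\{
\frac{1}{1-\tau^{+}}\!\left(
\frac{1}{N}\sum_{i=1}^{N} e^{f(x)^{\top} f(u_i)/t}
- \tau^{+}\, \frac{1}{M}\sum_{i=1}^{M} e^{f(x)^{\top} f(v_i)/t}
\right),\; e^{-1/t}\right\}
```

The max with e^{-1/t} simply clamps the corrected negative term at its smallest possible value for unit-norm embeddings, so the argument of the log stays positive.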

16 of 52

Positive Samples

(M = 1) Unsupervised Learning: draw a different view of the input image

(M > 1) Semi-supervised Learning: draw multiple positive samples from the same class using labels

17 of 52

Debiased Loss

Re-weighting: subtract from the negative term the expected contribution of false negatives (same-class samples), estimated using the true positives.

18 of 52

Debiased

Still optimizes for uniformity, but does a better job of producing tight clusters.

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.

19 of 52

Results

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.

[Max Schwarzer]

ImageNet has 1000 classes, so the odds of any negative sample falling into the same class as the source image are roughly 0.1%. Since the authors found that removing negatives in the same class had an effect roughly comparable to removing these extreme hard negatives, and since harder examples tend to be more semantically similar, doesn't this suggest that the 0.1% threshold is essentially filtering out same-class images? If so, should we expect the 0.1% finding to generalize to other datasets with different downstream tasks or numbers of classes?

20 of 52

Results

Ching-Yao Chuang, Joshua Robinson, Lin Yen-Chen, Antonio Torralba and Stefanie Jegelka. Debiased Contrastive Learning. In NeurIPS, 2020.

21 of 52

When to Use It?

  • Good pretext task to use when only a fraction of the dataset is annotated
    • Can leverage multiple positive samples
  • Good when we know the downstream task is classification
  • Maybe good to use on other downstream tasks?

22 of 52

Contrastive Learning with Hard Negative Samples

23 of 52

Contrastive Learning with Hard Negative Samples

Principle 1. Should only sample “negatives” whose labels differ from that of the anchor.

Addressed by debiasing: correct for the same-class samples (which should be near the anchor) that slip into the negatives.

Principle 2. The most useful negative samples are ones that the embedding currently believes to be similar to the anchor.

Re-weight the negative samples based on their distance from the anchor in embedding space (closer, i.e. more similar, negatives get larger weights).

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra and Stefanie Jegelka. Contrastive Learning with Hard Negative Samples. arXiv preprint arXiv:2010.06682, 2020.

24 of 52

Debias + Reweighted Loss

  • True Positive
  • Negative Samples
  • Class Probability
  • Reweighting
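
Putting the pieces above together (with a single positive v): the hard-sampling objective keeps the debiased estimator but importance-weights each negative toward those currently most similar to the anchor. Our rendering, following the authors' released code; β is the concentration parameter and ω_i the resulting weights:

```latex
\omega_i = \frac{e^{\beta f(x)^{\top} f(u_i)/t}}
                {\frac{1}{N}\sum_{j=1}^{N} e^{\beta f(x)^{\top} f(u_j)/t}},
\qquad
g = \max\!\left\{
\frac{1}{1-\tau^{+}}\!\left(
\frac{1}{N}\sum_{i=1}^{N} \omega_i\, e^{f(x)^{\top} f(u_i)/t}
- \tau^{+} e^{f(x)^{\top} f(v)/t}
\right),\; e^{-1/t}\right\}
```

With β = 0 all negatives are weighted equally and the debiased loss is recovered.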

25 of 52

Simple Pseudo Code
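
The code image did not survive extraction, so here is a short PyTorch sketch in the spirit of the authors' released implementation (names and default values are illustrative, not prescriptive):

```python
import math
import torch

def hard_negative_loss(pos, neg, tau_plus=0.1, beta=1.0, temperature=0.5):
    """Debiased + hard-negative-reweighted contrastive loss (sketch).

    pos: (B,)   exp(sim(q, k+) / temperature) for each anchor
    neg: (B, N) exp(sim(q, k_i) / temperature) for N negatives per anchor
    """
    N = neg.size(1)

    # Principle 2: importance weights that concentrate on the hardest negatives.
    imp = (beta * neg.log()).exp()                        # proportional to neg ** beta
    reweighted_neg = (imp * neg).sum(dim=-1) / imp.mean(dim=-1)

    # Principle 1: debias by subtracting the expected false-negative contribution.
    Ng = (-tau_plus * N * pos + reweighted_neg) / (1.0 - tau_plus)
    Ng = torch.clamp(Ng, min=N * math.exp(-1.0 / temperature))  # theoretical lower bound

    return (-torch.log(pos / (pos + Ng))).mean()
```

Setting beta = 0 and tau_plus = 0 recovers the standard (biased) NCE objective.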

26 of 52

Summary

Emphasize the 5% hardest negatives → Re-weighting

Remove the 0.1% hardest negatives → Debiased

27 of 52

Hard Negative Mixing for Contrastive Learning

28 of 52

Overview

To get more meaningful negative samples, hard negative mixing proposes to synthesize harder negatives by mixing hard negative samples at the feature level, in the hope of facilitating better and faster learning.

Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel and Diane Larlus. Hard Negative Mixing for Contrastive Learning. In NeurIPS, 2020.

[Figure: for each query, s + s' synthesized negatives are added on top of the K samples in the memory queue, with s + s' << K.]

https://untitled-ai.github.io/understanding-self-supervised-contrastive-learning.html

29 of 52

Hard Negative Mining: A Brief History

Initially introduced in face detectors to reduce training computation over millions of background examples (negatives). Training starts with a small set of background samples and then iteratively adds newly misclassified backgrounds.

Bootstrapping

Sung & Poggio, 1994.

https://nextjournal.com/tempdata73/dealing-with-class-imbalances

Face detection

Selecting the number of "face" (positive class) instances present is fairly easy, but in order to get a good representation of the "non-face" (negative class) instances, we must include every other object in the universe.

In practice, this is limited to the negatives in the dataset, which can still be enormous. The number of false positives increases when the number of negative instances present in the dataset decreases, so we must find a balance.

Enter bootstrapping, a form of hard negative mining: start with a small set of non-face examples and iteratively add the backgrounds that are wrongly classified as positives.

30 of 52

Hard Negative Mining: A Brief History

Similarly, in object detection, hard negative mining is used to address the severe class imbalance between positive and negative training examples (N >> P). It usually results in faster convergence, since the hard negatives contribute more to the loss.

Seunghyeon Kim, Jaehoon Choi, Taekyung Kim and Changick Kim. Self-Training and Adversarial Background Regularization for Unsupervised Domain Adaptive One-Stage Object Detection. In ICCV, 2019.

[Figure: hard negative mining in an object detector. Ground-truth boxes (e.g. boat, person) provide the positive examples; among the negatives, those with the highest loss are selected for training.]

31 of 52

Hard Negative Mining: A Brief History

Also referred to in the deep metric learning community as a strategy for triplet selection (mining) with the triplet loss or n-pair loss (aka contrastive loss).

Hong Xuan, Abby Stylianou and Robert Pless. Improved Embeddings with Easy Positive Triplet Mining. In WACV, 2020.


Hard negative examples are the most similar images which have a different label from the anchor image.

See also: semi-hard negative mining, easy negative mining, hard positive mining, easy positive mining.

32 of 52

Method

MoCo v2

[Figure: the MoCo v2 pipeline. An input x is augmented into two views v and v'; the encoder maps v to the query q, the momentum encoder maps v' to the key k0, and a memory queue holds past keys k1, k2, k3, ..., kK; query-key similarities feed the contrastive loss.]

Ke Ding, Xuanji He and Guanglu Wan. Learning Speaker Embedding with Momentum Contrast. arXiv preprint arXiv:2001.01986, 2020.

33 of 52

Method

MoCHi

Mixing of Contrastive Hard negatives

[Figure: MoCHi builds on the MoCo v2 pipeline. The memory queue is ordered by similarity to the query and truncated to the N most similar negatives n1, n2, n3, ..., nN; synthesized negatives and synthesized hardest negatives are added alongside the queue samples before computing the contrastive loss.]

34 of 52

Mixup

Simple regularization method that works by interpolating input representations. It encourages neural networks to favor simple linear behavior in-between training examples.

Shahine Bouabid and Vincent Delaitre. Mixup Regularization for Region Proposal based Object Detectors. arXiv preprint arXiv:2003.02065, 2020.

Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin and David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In ICLR, 2018.

Sample λ from the parametrized Beta(α, α) distribution, then mix two input images and their targets linearly.
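
A minimal sketch of input mixup (NumPy; the function and the alpha value are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    # lambda ~ Beta(alpha, alpha); targets y1, y2 are assumed one-hot
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2   # mixed input
    y = lam * y1 + (1.0 - lam) * y2   # interpolated (soft) target
    return x, y
```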

35 of 52

Manifold Mixup

Simple regularization method that encourages neural networks to predict less confidently on interpolations of hidden representations.

Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz and Yoshua Bengio. Manifold Mixup: Better Representations by Interpolating Hidden States. In ICML, 2019.

[Figure: Manifold Mixup. Hidden states h1 and h2 are mixed at a randomly chosen layer k as λ h1 + (1-λ) h2, and the one-hot targets [1, 0, 0, 0] and [0, 0, 0, 1] are interpolated to [λ, 0, 0, 1-λ]; k = 0 recovers input mixup.]

36 of 52

Feature Space Mixing of Hard Negatives

Construct an ordered set of the hardest negative features n1, n2, n3, ..., nN, sorted by decreasing similarity with the query and truncated to the first N items.

For each query, synthesize s hard negative features by convex combination (mixup) of pairs of these hard negatives, with mixing coefficient in (0, 1).

For each query, synthesize s' even harder negative features by convex combination (mixup) of the query itself with a hard negative, with mixing coefficient in (0, 0.5) so the query contribution is always smaller.

The computational overhead of MoCHi is essentially equivalent to increasing the size of the queue/memory by s + s'.
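
A sketch of this synthesis step for a single query (PyTorch; names, shapes and defaults are our own illustration, assuming L2-normalized features):

```python
import torch
import torch.nn.functional as F

def mochi_synthesize(q, queue, n_hard=1024, s=16, s_prime=16):
    """Synthesize s hard and s' even harder negatives for query q (MoCHi sketch).

    q: (D,) L2-normalized query; queue: (K, D) L2-normalized memory-queue features.
    """
    # Order the queue by similarity to the query and keep the N hardest negatives.
    sims = queue @ q
    hard = queue[sims.argsort(descending=True)[:n_hard]]          # (N, D)

    # s synthetic hard negatives: mix random pairs of hard negatives, coeff in (0, 1).
    i, j = torch.randint(0, n_hard, (s,)), torch.randint(0, n_hard, (s,))
    lam = torch.rand(s, 1)
    h = F.normalize(lam * hard[i] + (1 - lam) * hard[j], dim=1)

    # s' even harder negatives: mix the query itself with a hard negative,
    # coeff in (0, 0.5) so the query always contributes less than the negative.
    k = torch.randint(0, n_hard, (s_prime,))
    lam_q = 0.5 * torch.rand(s_prime, 1)
    h_prime = F.normalize(lam_q * q.unsqueeze(0) + (1 - lam_q) * hard[k], dim=1)

    # The synthetic points are appended to this query's negatives only.
    return torch.cat([h, h_prime], dim=0)                         # (s + s', D)
```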

37 of 52

Results

[Results tables: MoCHi configurations are reported as (N, s, s') triples.]

Dataset: ImageNet-100 (1k)

Warm-up: 10 (15) epochs where we do not synthesize hard negatives

Batch size: 128 (512)

Memory bank: 16k (65k)

38 of 52

Results

  1. Better transfer learning performance, achieved faster thanks to the harder-negative strategy.
  2. MoCHi does not show performance gains over MoCo v2 for linear classification on ImageNet-1K (similar performance).
     → Biases induced by training with hard negatives on the same dataset as the downstream task?

Hard negative mixing reduces alignment and increases uniformity for the dataset that is used during training.

[Nitarshan]

This paper trains its representations on ImageNet and evaluates on ImageNet - how do you think these findings would change in a transfer setting to another domain/task?


40 of 52

Results

41 of 52

Measuring Similarity of Negative Samples

42 of 52

Semantic Similarity to the Query

43 of 52

WordNet

[Figure: a toy WordNet-style hierarchy. Living splits into Animal and Plant; Animal into Domestic Animal (Dog, Cat) and Elephant; Plant into Flower (Rose) and Tree (Oak).]

Classes         | Common Ancestor | Depth
Dog - Cat       | Domestic Animal | 2
Rose - Oak      | Plant           | 1
Dog - Elephant  | Animal          | 1
Rose - Elephant | Living          | 0

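One way to reproduce this depth-of-common-ancestor measure programmatically is with NLTK's WordNet interface; a small sketch (the synset choices are illustrative, and depths in the real WordNet graph are larger than in the toy hierarchy above):

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def common_ancestor_depth(name_a, name_b):
    # Lowest common hypernym of two noun synsets, and its depth in the hierarchy.
    a, b = wn.synset(f"{name_a}.n.01"), wn.synset(f"{name_b}.n.01")
    ancestor = a.lowest_common_hypernyms(b)[0]
    return ancestor, ancestor.min_depth()

for pair in [("dog", "cat"), ("rose", "oak"), ("dog", "elephant"), ("rose", "elephant")]:
    ancestor, depth = common_ancestor_depth(*pair)
    print(pair, ancestor.name(), depth)
```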

44 of 52

Difficulty Distribution Across Negative Samples

  • Correlated
  • Anti-Correlated

45 of 52

Hierarchical Conceptual Understanding

  • Does it hurt performance?
  • Is it only with contrastive loss?
  • Is it a problem with the dataset?
  • Doesn’t seem to be a problem with Word2Vec
    • King - Man + Woman ≈ Queen
  • Is context required?

Anti-Correlated

http://www.reportingday.com/wp-content/uploads/2018/06/Cat-Sleeping-Pics.jpg https://www.aspca.org/sites/default/files/problems-older-dog_main.jpg

[Hattie Zhou]

The existence of anti-correlated, semantically similar examples is very surprising / unintuitive. Can you speculate on what might be happening with those examples?


47 of 52

Conclusion

48 of 52

Conclusion

  • Not all negative samples are equal (false negatives, hardest negatives)

  • Random sampling in NCE (contrastive loss) is unlikely to yield hard negatives, and as learning progresses it becomes increasingly harder to pick a negative that contributes a sufficient gradient, causing slow convergence
  • Hard negative mining usually offers faster convergence

49 of 52

Discussion

  • How to efficiently leverage the top 5% hard negative samples?
    • Can it be used to optimize training time?
  • How to emphasize the 5% hard negative samples?
    • Hard negative mixing
    • Hard negative sampling: hard negatives are weighted to contribute more to the loss
  • Removing the 0.1% hardest negative samples (false negatives).
    • Partially resolved by debiased contrastive loss.
    • How does it perform on downstream tasks other than classification?
  • How to tackle the problem of anti-correlation with semantically similar samples?

50 of 52

The Elephant in the Room

Focusing on the positive pair?

51 of 52

The Elephant in the Room

What could be the pros and cons to focusing on the negatives versus focusing on the positives?

Where’s my memory?

52 of 52

Questions