1 of 11

A Metric Learning Reality Check

Kevin Musgrave, Serge Belongie, Ser-Nam Lim

ECCV 2020

CSG@UEF Reading Club

Xuechen Liu

2021.10.29

2 of 11

Background

  • I have started looking into embedding losses in Knowledge Distillation (KD)
  • KL-Divergence/MSE as a complement to, or an independent ingredient of, transferring representational knowledge (minimal sketch after this list)
  • In KD this is sometimes referred to as a ‘similarity metric’ or ‘embedding loss’
  • Also, this paper demonstrates some shortcomings of current metric learning research
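For reference, a minimal sketch of the usual KL-divergence-based KD objective, assuming PyTorch; the temperature T and weight alpha are illustrative values, not from the paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD: KL between softened distributions + CE on hard labels.

    T (temperature) and alpha (mixing weight) are illustrative, not from the paper.
    """
    # Soften both distributions with temperature T; kl_div expects log-probs as input.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Usage with dummy tensors: a batch of 8 examples, 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student, teacher, labels))
```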

3 of 11

Non-Deep (conventional) Metric Learning

  • Mostly formulated as an optimization problem
  • PCA & LDA
  • Others include ANMM (an improved version of LDA), LMNN, NCA (nearest-neighbor approaches), …
  • Check this paper for more details (very mathematical, I’m still reading it)
  • Toolkits exist, e.g. scikit-learn covers PCA/LDA/NCA (see the sketch below)
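A minimal sketch of the conventional route, assuming scikit-learn (which ships PCA, LDA and NCA); dedicated metric-learning toolkits typically expose LMNN and friends in a similar fit/transform style:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised linear projection: directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised projection: maximise between-class vs. within-class scatter.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# NCA: learns a linear transform optimising a (soft) nearest-neighbor objective.
X_nca = NeighborhoodComponentsAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape, X_nca.shape)  # (150, 2) each
```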

4 of 11

Deep Metric Learning

  • Main objective is to learn a similarity function that clusters similar objects together and separates dissimilar ones (for image classification)
  • Usually it is composed of a feature extractor, an embedding network, and a sampling module (not necessarily a DNN); see the sketch after this list
  • The sampling module can sometimes be something else, depending on the application (e.g. pooling / mean averaging for ASV)
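A rough sketch of that three-part setup, assuming PyTorch/torchvision; the ResNet-18 trunk, 128-d embedding and the batch-sampling comment are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EmbeddingNet(nn.Module):
    """Feature extractor (trunk) + embedding head producing L2-normalised vectors."""

    def __init__(self, embed_dim=128):
        super().__init__()
        trunk = models.resnet18()        # feature extractor (trunk)
        trunk.fc = nn.Identity()         # drop the classification head
        self.trunk = trunk
        self.embedder = nn.Linear(512, embed_dim)   # embedding network

    def forward(self, x):
        feats = self.trunk(x)                       # (batch, 512) features
        emb = self.embedder(feats)                  # (batch, embed_dim) embeddings
        return nn.functional.normalize(emb, dim=1)  # unit-norm embeddings

# The "sampling module" is typically a batch sampler that draws m classes x k
# images per class, so positives and negatives exist within every batch.
model = EmbeddingNet()
dummy = torch.randn(8, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([8, 128])
```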

5 of 11

Common Objectives

  • Metric Learning
    • MSE
    • Contrastive loss
    • Triplet loss (see the sketch after this list)
    • Prototypical loss (e.g. GE2E)
    • Many more…
  • Classification
    • Cross Entropy
    • Margin Softmax
    • CosFace
    • ArcFace (mostly used for SoTA ASV)
    • Many more…
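Minimal sketches of two of the classic metric-learning objectives above (contrastive and triplet), assuming PyTorch; the margins are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, margin=0.5):
    """Pull same-class pairs together, push different-class pairs beyond a margin.

    same_class is a float tensor of 1s (same class) and 0s (different class).
    """
    d = F.pairwise_distance(emb1, emb2)
    pos = same_class * d.pow(2)
    neg = (1 - same_class) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

# Triplet loss is available out of the box in PyTorch.
triplet = torch.nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = torch.randn(3, 16, 128).unbind(0)
print(triplet(anchor, positive, negative))

emb1, emb2 = torch.randn(2, 16, 128).unbind(0)
same_class = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(emb1, emb2, same_class))
```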

6 of 11

Looks Good? But There Are Flaws

  • Unfair comparisons
    • Network architecture: Google’s Proxy-NCA
    • Optimizer: Adam and RMSprop converge faster, SGD generalizes better
    • Batch design
  • Data manipulation
    • Image (or speech) augmentation
    • Probably a less serious issue, given time & resource limits
  • Accuracy metrics
    • Normalized mutual information (NMI)
  • Training with test set feedback
    • Probably the one I feel most uncomfortable about

7 of 11

OK, So We Need Benchmark

  • Open-source [link]
  • Consistent model architecture, optimizer, batch size, etc.
    • Training stops when recognition accuracy plateaus
  • Informative accuracy metric
    • Mean Average Precision at R (MAP@R), combining MAP and R-precision (see the sketch after this list)
    • (but of course, no perfect solutions)
  • Hyperparameter search via cross-validation
    • Split the datasets by class labels, not by samples (25% as the validation set)
    • Training halts at maximum validation accuracy
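A small sketch of MAP@R for a single query, following the paper's definition (R = number of same-class references; precision is only credited at ranks where the retrieval is correct), assuming NumPy:

```python
import numpy as np

def map_at_r(is_match):
    """MAP@R for a single query.

    is_match: boolean array over the full ranked reference set
    (True = same class as the query). R = total number of relevant references.
    """
    R = int(is_match.sum())
    if R == 0:
        return 0.0
    top_r = is_match[:R]
    ranks = np.arange(1, R + 1)
    # Precision@i, credited only at ranks where the i-th retrieval is correct.
    precisions = np.cumsum(top_r) / ranks
    return float((precisions * top_r).sum() / R)

# Example: 4 relevant references in total; ranked hits/misses for one query.
print(map_at_r(np.array([1, 0, 1, 1, 0, 1], dtype=bool)))  # ≈ 0.604
```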

8 of 11

Experiments

9 of 11

“Paper vs. Reality”

  • Conventional losses are still powerful
    • Contrastive, triplet
  • Low baselines from earlier papers
  • Implementation flaws
  • The value of hand-wavy experiments may be questionable
    • I don’t agree: it depends on the task (face recognition for the original papers, birds/vehicles for this one)

10 of 11

My Takeaways

  • The problems mentioned in this paper are quite relevant to other fields as well
    • E.g. Speech Recognition, Speaker Verification
  • Small details that get missed can result in a publishable difference
    • Which is not good
  • One potential missing point: not comparing with very conventional statistical methods
  • But meanwhile, it is also a very well-curated overview of SoTA metric learning methods
    • Anybody interested in knowledge distillation?

11 of 11

Bonus

  • A little bit about my recent SPL publication
  • Let’s see whether my experiments would be criticized on the same grounds
    • Unfair comparison?
    • Data manipulation?
    • Accuracy metrics?
    • Training with test set feedback?