1 of 11

A Metric Learning Reality Check

Kevin Musgrave, Serge Belongie, Ser-Nam Lim

ECCV 2020

CSG@UEF Reading Club

Xuechen Liu

2021.10.29

2 of 11

Background

  • I have started looking into embedding losses in Knowledge Distillation (KD)
  • KL-Divergence/MSE as a complement to, or an independent ingredient of, transferring representational knowledge (minimal sketch after this list)
  • In KD this is sometimes referred to as a ‘similarity metric’ or ‘embedding loss’
  • Also, this paper demonstrates some shortcomings of current metric learning research
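For reference, a minimal sketch of the usual KL-divergence-based KD objective, assuming PyTorch; the temperature T and weight alpha are illustrative values, not from the paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD: KL between softened distributions + CE on hard labels.

    T (temperature) and alpha (mixing weight) are illustrative, not from the paper.
    """
    # Soften both distributions with temperature T; kl_div expects log-probs as input.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # Scale by T^2 to keep gradient magnitudes comparable across temperatures.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Usage with dummy tensors: a batch of 8 examples, 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student, teacher, labels))
```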

3 of 11

Non-Deep (conventional) Metric Learning

  • Mostly formulated as an optimization problem
  • PCA & LDA
  • Others include ANMM (an improved version of LDA), LMNN, NCA (nearest-neighbor approaches), …
  • Check this paper for more details (very mathematical, I’m still reading it)
  • Toolkits exist, e.g. scikit-learn covers PCA/LDA/NCA (see the sketch below)
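A minimal sketch of the conventional route, assuming scikit-learn (which ships PCA, LDA and NCA); dedicated metric-learning toolkits typically expose LMNN and friends in a similar fit/transform style:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import NeighborhoodComponentsAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised linear projection: directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised projection: maximise between-class vs. within-class scatter.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# NCA: learns a linear transform optimising a (soft) nearest-neighbor objective.
X_nca = NeighborhoodComponentsAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape, X_nca.shape)  # (150, 2) each
```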

4 of 11

Deep Metric Learning

  • Main objective is to learn a similarity function that clusters similar objects together and separates dissimilar ones (for image classification)
  • Usually it is composed of a feature extractor, an embedding network, and a sampling module (not necessarily a DNN); see the sketch after this list
  • The sampling module can sometimes be something else, depending on the application (e.g. pooling / mean averaging for ASV)
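A rough sketch of that three-part setup, assuming PyTorch/torchvision; the ResNet-18 trunk, 128-d embedding and the batch-sampling comment are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EmbeddingNet(nn.Module):
    """Feature extractor (trunk) + embedding head producing L2-normalised vectors."""

    def __init__(self, embed_dim=128):
        super().__init__()
        trunk = models.resnet18()        # feature extractor (trunk)
        trunk.fc = nn.Identity()         # drop the classification head
        self.trunk = trunk
        self.embedder = nn.Linear(512, embed_dim)   # embedding network

    def forward(self, x):
        feats = self.trunk(x)                       # (batch, 512) features
        emb = self.embedder(feats)                  # (batch, embed_dim) embeddings
        return nn.functional.normalize(emb, dim=1)  # unit-norm embeddings

# The "sampling module" is typically a batch sampler that draws m classes x k
# images per class, so positives and negatives exist within every batch.
model = EmbeddingNet()
dummy = torch.randn(8, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([8, 128])
```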

5 of 11

Common Objectives

  • Metric Learning
    • MSE
    • Contrastive loss
    • Triplet loss (see the sketch after this list)
    • Prototypical loss (e.g. GE2E)
    • Many more…
  • Classification
    • Cross Entropy
    • Margin Softmax
    • CosFace
    • ArcFace (mostly used for SoTA ASV)
    • Many more…
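Minimal sketches of two of the classic metric-learning objectives above (contrastive and triplet), assuming PyTorch; the margins are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, same_class, margin=0.5):
    """Pull same-class pairs together, push different-class pairs beyond a margin.

    same_class is a float tensor of 1s (same class) and 0s (different class).
    """
    d = F.pairwise_distance(emb1, emb2)
    pos = same_class * d.pow(2)
    neg = (1 - same_class) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

# Triplet loss is available out of the box in PyTorch.
triplet = torch.nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = torch.randn(3, 16, 128).unbind(0)
print(triplet(anchor, positive, negative))

emb1, emb2 = torch.randn(2, 16, 128).unbind(0)
same_class = torch.randint(0, 2, (16,)).float()
print(contrastive_loss(emb1, emb2, same_class))
```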

6 of 11

Looks Good? But There Are Flaws

  • Unfair comparisons
    • Network architecture: Google’s Proxy-NCA
    • Optimizer: Adam and RMSprop converge faster, SGD generalizes better
    • Batch design
  • Data manipulation
    • Image (or speech) augmentation
    • Probably a less serious issue, given time & resource limits
  • Accuracy metrics
    • Normalized mutual information (NMI)
  • Training with test set feedback
    • Probably the one I feel most uncomfortable about

7 of 11

OK, So We Need Benchmark

  • Open-source [link]
  • Consistent model architecture, optimizer, batch size, etc.
    • Training stops when recognition accuracy plateaus
  • Informative accuracy metric
    • Mean Average Precision at R (MAP@R), combining MAP and R-precision (see the sketch after this list)
    • (but of course, no perfect solutions)
  • Hyperparameter search via cross-validation
    • Split the datasets by class labels, not by samples (25% as the validation set)
    • Training halts at maximum validation accuracy
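A small sketch of MAP@R for a single query, following the paper's definition (R = number of same-class references; precision is only credited at ranks where the retrieval is correct), assuming NumPy:

```python
import numpy as np

def map_at_r(is_match):
    """MAP@R for a single query.

    is_match: boolean array over the full ranked reference set
    (True = same class as the query). R = total number of relevant references.
    """
    R = int(is_match.sum())
    if R == 0:
        return 0.0
    top_r = is_match[:R]
    ranks = np.arange(1, R + 1)
    # Precision@i, credited only at ranks where the i-th retrieval is correct.
    precisions = np.cumsum(top_r) / ranks
    return float((precisions * top_r).sum() / R)

# Example: 4 relevant references in total; ranked hits/misses for one query.
print(map_at_r(np.array([1, 0, 1, 1, 0, 1], dtype=bool)))  # ≈ 0.604
```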

8 of 11

Experiments

9 of 11

“Paper vs. Reality”

  • Conventional losses are still powerful
    • Contrastive, triplet
  • Low baselines from earlier papers
  • Implementation flaws
  • The value of hand-wavy experiments may be questionable
    • I don’t agree: it depends on the task (face recognition for the original papers, birds/vehicles for this one)

10 of 11

My Takeaways

  • The problems mentioned in this paper are quite relevant to other fields as well
    • E.g. Speech Recognition, Speaker Verification
  • Small details that get missed can result in a publishable difference
    • Which is not good
  • One potential missing point: not comparing with very conventional statistical methods
  • But meanwhile, it is also a very well-curated overview of SoTA metric learning methods
    • Anybody interested in knowledge distillation?

11 of 11

Bonus

  • A little bit about my recent SPL publication
  • Let’s see whether my experiments would be criticized on the same grounds
    • Unfair comparison?
    • Data manipulation?
    • Accuracy metrics?
    • Training with test set feedback?