Rethinking the choice of loss functions for classification with deep learning
Viacheslav Komisarenko, Meelis Kull
February 2, 2023
We have a deep learning classification model.
We have a measure-of-interest (evaluation metric) we want to optimize.
Which loss function (training objective) should we choose?
Measures-of-interest can be of different kinds:
Ranking measures
Take into account the mutual ranking of class logits, but not their actual values.
Examples: ROC-AUC, concordance and discordance scores.
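As an illustration (not from the slides): ranking measures such as ROC-AUC are unchanged by any strictly increasing transformation of the scores. A minimal sketch with hypothetical toy labels and logits, using scikit-learn's roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Hypothetical toy labels and raw logits
    y = np.array([0, 0, 1, 1, 1, 0])
    logits = np.array([-1.2, 0.3, 0.8, 2.5, -0.1, -2.0])

    # ROC-AUC depends only on how the scores rank the instances, so a strictly
    # increasing transformation such as the sigmoid leaves it unchanged.
    probs = 1.0 / (1.0 + np.exp(-logits))
    assert np.isclose(roc_auc_score(y, logits), roc_auc_score(y, probs))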
Thresholding measures
Evaluate where the logit (or probability) falls with respect to a specific threshold.
Examples: error rate, total cost.
Common thresholds:
for logits: 0
for probabilities: 0.5
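A minimal sketch of a thresholding measure (hypothetical toy data): thresholding logits at 0 is the same decision rule as thresholding the corresponding sigmoid probabilities at 0.5, and the error rate is the fraction of misclassified instances.

    import numpy as np

    # Hypothetical toy labels and logits
    y = np.array([0, 0, 1, 1, 1, 0])
    logits = np.array([-1.2, 0.3, 0.8, 2.5, -0.1, -2.0])
    probs = 1.0 / (1.0 + np.exp(-logits))

    # Thresholding logits at 0 gives the same predictions as thresholding probs at 0.5
    pred = (logits >= 0).astype(int)
    assert np.array_equal(pred, (probs >= 0.5).astype(int))

    error_rate = np.mean(pred != y)   # thresholding measure: fraction of errors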
Probabilities that match the true label
Examples: cross-entropy (CE), mean squared error (MSE)
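For reference, the standard binary definitions (not reproduced from the slides), for a label y in {0, 1} and predicted positive-class probability p:

    \mathrm{CE}(y, p) = -\,y \log p - (1 - y)\log(1 - p), \qquad \mathrm{MSE}(y, p) = (y - p)^2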
Existing approaches to minimize the metric-of-interest:
Cross-entropy or similar losses (e.g. focal)
Fine-tuning
Stopping criteria
Cross-entropy, focal losses
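The formulas shown on this slide are not preserved in the extracted text. For reference, the focal loss of Lin et al. (2017) in its standard binary form, with p_t = p if y = 1 and p_t = 1 - p otherwise, down-weights well-classified examples relative to cross-entropy:

    \mathrm{FL}(y, p) = -(1 - p_t)^{\gamma} \log p_t, \qquad \gamma \ge 0,

which reduces to cross-entropy, CE(y, p) = -log p_t, when gamma = 0.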
Cost-sensitive learning (binary)
Confusion matrix
Cost matrix
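A minimal sketch of how the two matrices combine (standard binary setup, not taken from the slides): the total cost is the element-wise product of the confusion matrix and the cost matrix, summed over all cells. The costs below are hypothetical.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical labels and hard predictions
    y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
    y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 1])

    conf = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
    cost = np.array([[0.0, 1.0],              # TN costs 0, FP costs 1
                     [5.0, 0.0]])             # FN costs 5, TP costs 0

    total_cost = np.sum(conf * cost)          # element-wise product, then sum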
Suppose a false negative (FN) is 5 times more costly than a false positive (FP).
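A worked example (standard Bayes-optimal decision making, not shown explicitly in the extracted slides): with c_FP = 1, c_FN = 5 and zero cost for correct decisions, predicting positive is optimal whenever its expected cost, (1 - p) c_FP, is at most the expected cost of predicting negative, p c_FN, i.e.

    p \ge \frac{c_{FP}}{c_{FP} + c_{FN}} = \frac{1}{1 + 5} = \frac{1}{6} \approx 0.167,

so the decision threshold moves from 0.5 down to 1/6.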
Reminder
Costs are rarely known exactly.
Cost uncertainty
Certain metric from uncertain costs
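One way to make this concrete (a sketch under my own assumptions, not necessarily the authors' exact definition): treat the normalized cost proportion c as a random variable, let a false positive cost c and a false negative cost 1 - c, predict positive iff p >= c, and report the expected total cost over the cost distribution. A Monte Carlo version with hypothetical predictions and a Beta cost distribution (Beta distributions appear later in the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical test labels and predicted positive-class probabilities
    y = np.array([0, 1, 1, 0, 1, 0])
    p = np.array([0.2, 0.9, 0.4, 0.6, 0.8, 0.1])

    # Cost proportion c ~ Beta(a, b); FP costs c, FN costs 1 - c, predict positive iff p >= c
    a, b = 10.0, 40.0
    samples = rng.beta(a, b, size=10_000)

    expected_total_cost = np.mean([
        np.sum(np.where(p >= c, (1 - y) * c, y * (1 - c)))   # total cost at this sampled c
        for c in samples
    ])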
Measure-of-interest
Which loss function should we choose to minimize the expected total cost?
Example
Theoretical result
Consider the gradient of the expected total cost with respect to p.
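The statement of the result is not preserved in the extracted text. A hedged reconstruction of the kind of computation involved, under the assumptions above (cost proportion c with density f, FP cost c, FN cost 1 - c, predict positive iff p >= c): the per-instance expected cost is

    L(y, p) = y \int_p^1 (1 - c)\, f(c)\, dc + (1 - y) \int_0^p c\, f(c)\, dc,

and differentiating with respect to the integration limit p gives

    \frac{\partial L}{\partial p} = -\,y\,(1 - p)\, f(p) + (1 - y)\, p\, f(p),

which is well defined wherever f is continuous. With f a Beta(alpha, beta) density this yields a "Beta loss"; the improper choice f(p) proportional to 1/(p(1 - p)) reproduces the cross-entropy gradient, and a constant f reproduces the MSE gradient up to a constant factor.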
Implications
Expected total cost is differentiable (w.r.t. p).
Hence, it is suitable for gradient-based optimization.
Hypothesis: directly use the expected total cost as the training objective.
Beta distribution
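To make the connection concrete, here is a minimal sketch (my reconstruction under the assumptions above, not the authors' code) of the per-instance expected total cost when the cost proportion follows Beta(a, b), using the regularized incomplete beta function I_x(a, b) = scipy.special.betainc(a, b, x):

    import numpy as np
    from scipy.special import betainc

    def beta_loss(y, p, a, b):
        """Expected total cost per instance when the cost proportion c ~ Beta(a, b):
        a false positive costs c, a false negative costs 1 - c, and we predict
        positive iff p >= c.  Uses c*Beta(c; a, b) = (a/(a+b))*Beta(c; a+1, b) and
        (1-c)*Beta(c; a, b) = (b/(a+b))*Beta(c; a, b+1)."""
        pos = (b / (a + b)) * (1.0 - betainc(a, b + 1.0, p))   # y = 1: E[(1 - c) 1{c > p}]
        neg = (a / (a + b)) * betainc(a + 1.0, b, p)           # y = 0: E[c 1{c <= p}]
        return np.where(y == 1, pos, neg)

    # Hypothetical usage: evaluate the Beta(25, 25) loss on a few predictions
    y = np.array([1, 0, 1])
    p = np.array([0.9, 0.2, 0.4])
    print(beta_loss(y, p, 25.0, 25.0))

For training, the gradient with respect to p is available in closed form (see the sketch above), so the loss can be wrapped in a custom autograd function; this sketch only evaluates the loss.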
Could models trained with Beta losses perform well when evaluated on cost-sensitive and common measures?
Goal of experiments
Suppose we want to evaluate a deep learning model on the expected total cost with a known cost distribution.
Which loss function should we choose?
Experiments setup
Class pairs from CIFAR-10 and Fashion-MNIST; 6 datasets in total: bird vs frog, ship vs car, deer vs plane, Shirt vs Pullover, Top vs Shirt, Pullover vs Coat
ResNet-18 architecture
20 random seeds
Validation set: choose the best Beta loss
Test set: comparison with cross-entropy, focal loss, label smoothing
Post-hoc temperature scaling calibration
Test set evaluation was performed at the epoch with the best validation score
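Temperature scaling (Guo et al., 2017) fits a single scalar T > 0 on the validation set and rescales logits to z / T before the sigmoid/softmax. A minimal sketch of this post-hoc step with hypothetical validation logits and labels (the authors likely used a standard implementation):

    import torch
    import torch.nn.functional as F

    def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
        # Fit T by minimizing cross-entropy of the rescaled logits z / T (binary case).
        log_t = torch.zeros(1, requires_grad=True)   # parametrize T = exp(log_t) > 0
        opt = torch.optim.Adam([log_t], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.binary_cross_entropy_with_logits(val_logits / log_t.exp(), val_labels)
            loss.backward()
            opt.step()
        return log_t.exp().item()

    # Hypothetical usage with made-up validation logits and labels
    val_logits = torch.tensor([2.0, -1.5, 0.3, 3.2, -0.7])
    val_labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
    T = fit_temperature(val_logits, val_labels)
    calibrated_probs = torch.sigmoid(val_logits / T)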
Beta loss parameters
We select Beta(25, 25) for further experiments.
Beta(25, 25) is better (after temperature scaling calibration) than CE, FL, and LS on most cost-sensitive metrics.
Beta(25, 25) is 1% better than the best of CE, FL, and LS when evaluated on the Beta(10, 40) expected total cost.
Beta(25, 25) is better (after temperature scaling calibration) on common measures.
Cross-entropy and Beta(25, 25) losses improve performance substantially after calibration.
Observations
1. The criterion for choosing the epoch used for test evaluation is important.
2. Different model epochs are better on some measures but worse on others.
3. Different losses with the same epoch-selection criterion can perform more similarly than the same loss with different epoch choices.
Conclusions
1. There are many evaluation measures, and different measures may require different losses.
2. Cross-entropy based losses are not the only possible choice.
3. The Beta(25, 25) loss is a good loss for many measures.
4. Good post-hoc calibration and stopping criteria are crucial.
Project: ETAg PRG1604