
Rethinking the choice of loss functions for classification with deep learning

Viacheslav Komisarenko, Meelis Kull


February 2, 2023


We have a deep learning classification model

We have a measure-of-interest (evaluation metric) we want to optimize

Which loss function (training objective) should we choose?


Measures-of-interest can differ


Ranking measures


Take into account the mutual ranking of class logits, but not their actual values.

Examples: ROC-AUC, concordance and discordance scores.


Thresholding


Evaluates the location of logits (or probabilities) with respect to a specific threshold.

Examples: error rate, total cost.

Common thresholds:

For logits: 0

For probs: 0.5


Probabilities that match the true label


Examples: cross-entropy (CE), mean squared error (MSE)
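A minimal illustration of the three families of measures on the same predictions (the labels and scores below are made up; scikit-learn is assumed to be available):

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1, 0])
p_pos = np.array([0.2, 0.6, 0.7, 0.9, 0.4, 0.1])

# Ranking measure: only the ordering of the scores matters
auc = roc_auc_score(y_true, p_pos)

# Thresholding measure: error rate at the common probability threshold 0.5
error_rate = np.mean((p_pos >= 0.5).astype(int) != y_true)

# Probability measure: cross-entropy (log loss) uses the probability values themselves
ce = log_loss(y_true, p_pos)

print(f"ROC-AUC {auc:.3f}, error rate {error_rate:.3f}, cross-entropy {ce:.3f}")

A monotonic rescaling of the scores would leave the ROC-AUC unchanged, may or may not change the error rate depending on which side of 0.5 the scores end up, and would always change the cross-entropy.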

 

 

 


Existing approaches to optimizing the measure-of-interest:

Cross-entropy or similar losses (e.g. focal)

Fine-tuning

Stopping criteria


Cross-entropy, focal losses


  1. Deep learning models trained with cross-entropy have good performance on accuracy and other evaluation metrics.
  2. These models are often overconfident.
  3. Post-hoc calibration can help with this.
  4. Models trained with focal loss are much better calibrated.
  5. Their performance is generally better than that of CE-trained models.


Cost-sensitive learning (binary)


Confusion matrix

Cost matrix


Suppose FN is 5 times more costly than FP.
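A minimal sketch of what this implies, assuming only that the total cost is the element-wise product of the confusion matrix and the cost matrix summed over all cells (the counts below are made up, and the threshold formula is the standard cost-sensitive decision rule, not something taken from these slides):

import numpy as np

# Cost matrix: rows = true class (neg, pos), columns = predicted class (neg, pos).
# Correct decisions cost 0; a false positive costs 1; a false negative costs 5.
cost_matrix = np.array([[0.0, 1.0],
                        [5.0, 0.0]])

# Made-up confusion matrix with the same layout: [[TN, FP], [FN, TP]]
confusion = np.array([[80, 20],
                      [10, 90]])

# Total cost: element-wise product summed over all cells (20 * 1 + 10 * 5 = 70)
total_cost = float(np.sum(confusion * cost_matrix))

# Standard cost-sensitive decision rule: predict positive when
# p(y=1|x) * c_FN > (1 - p(y=1|x)) * c_FP, i.e. p >= c_FP / (c_FP + c_FN).
c_FP, c_FN = cost_matrix[0, 1], cost_matrix[1, 0]
threshold = c_FP / (c_FP + c_FN)

print(total_cost, threshold)

With FN five times as costly as FP, the cost-optimal probability threshold moves from 0.5 down to 1/6.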

 



Costs are rarely known exactly


Cost uncertainty


Certain metric from uncertain costs

 

 


Measure-of-interest

 

 

Which loss function should we choose to minimize the expected total cost?


Theoretical result

 

Consider the gradient of the expected total cost with respect to p.
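The slide formulas are not reproduced in this text, but one standard way to set this up is the following: normalise the costs so that a false positive costs c and a false negative costs 1 - c, draw the cost proportion c from a distribution with density f, and let the classifier predict positive exactly when its predicted probability p of the positive class satisfies p >= c. The expected total cost for one example with label y in {0, 1} is then

\mathbb{E}_{c \sim f}\bigl[\mathrm{cost}(p, y; c)\bigr] = (1 - y) \int_0^p c \, f(c) \, dc \; + \; y \int_p^1 (1 - c) \, f(c) \, dc

and differentiating under the integral sign gives

\frac{d}{dp} \, \mathbb{E}_{c \sim f}\bigl[\mathrm{cost}(p, y; c)\bigr] = \bigl((1 - y) \, p - y \, (1 - p)\bigr) \, f(p)

which is well defined wherever the density f is, so the expected total cost is differentiable in p. This is one plausible formulation; the authors' exact definitions may differ in detail.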



Implications

Expected total cost is differentiable (w.r.t. p)

Hence, suitable for gradient-based optimization

Hypothesis: directly use expected total cost as training objective
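The exact Beta loss used on the following slides is not recoverable from this text, so the code below is only a sketch under the assumptions stated above: the cost proportion c follows Beta(alpha, beta), a false positive costs c, a false negative costs 1 - c, and the prediction is positive when p >= c. Under these assumptions the expected total cost has a closed form in terms of the regularized incomplete beta function, and its gradient is exactly the expression above, which the custom autograd function uses directly. All names (BetaExpectedCost, beta_loss) are made up for illustration; PyTorch and SciPy are assumed.

import torch
from scipy.special import betainc
from scipy.stats import beta as beta_dist

# Assumed Beta parameters of the cost-proportion distribution
ALPHA, BETA = 25.0, 25.0


class BetaExpectedCost(torch.autograd.Function):
    """Expected total cost of a predicted probability p in (0, 1); the backward
    pass uses the closed-form gradient ((1 - y) * p - y * (1 - p)) * BetaPDF(p)."""

    @staticmethod
    def forward(ctx, p, y):
        ctx.save_for_backward(p, y)
        p_np = p.detach().cpu().numpy()
        y_np = y.detach().cpu().numpy()
        # E[cost] = (1-y) * (a/(a+b)) * I_p(a+1, b) + y * (b/(a+b)) * (1 - I_p(a, b+1)),
        # where I_x(a, b) is the regularized incomplete beta function.
        fp_part = ALPHA / (ALPHA + BETA) * betainc(ALPHA + 1.0, BETA, p_np)
        fn_part = BETA / (ALPHA + BETA) * (1.0 - betainc(ALPHA, BETA + 1.0, p_np))
        cost = (1.0 - y_np) * fp_part + y_np * fn_part
        return torch.as_tensor(cost, dtype=p.dtype, device=p.device)

    @staticmethod
    def backward(ctx, grad_output):
        p, y = ctx.saved_tensors
        pdf = torch.as_tensor(
            beta_dist.pdf(p.detach().cpu().numpy(), ALPHA, BETA),
            dtype=p.dtype, device=p.device,
        )
        grad_p = ((1.0 - y) * p - y * (1.0 - p)) * pdf
        return grad_output * grad_p, None


def beta_loss(logits, labels):
    """Mean expected total cost over a batch of single-logit binary predictions."""
    p = torch.sigmoid(logits)  # predicted P(y = 1)
    return BetaExpectedCost.apply(p, labels.float()).mean()

As a sanity check on the construction, with alpha = beta = 1 (a uniform cost proportion) the expected cost reduces to (p - y)^2 / 2, i.e. a Brier-score-like objective, which is one way to see how the Beta parameters interpolate between familiar losses.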


Beta distribution


Could models trained with Beta losses perform well when evaluated on cost-sensitive and common measures?


Goal of experiments

Suppose we want to evaluate a deep learning model on the expected total cost with a known cost distribution.


Which loss function to choose?


Experiments setup

Class pairs from CIFAR-10, Fashion-MNIST

6 datasets in total: bird vs frog, ship vs car, deer vs plane, Shirt vs Pullover, Top vs Shirt, Pullover vs Coat


ResNet-18 architecture

Test set: comparison with cross-entropy, focal loss, label smoothing

Validation set: choose the best Beta loss

20 random seeds


Post-hoc temperature scaling calibration.

Test set evaluation was performed on the epoch with the best validation score.
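Temperature scaling itself is a single-parameter post-hoc method: one temperature T is fitted on the validation set by minimizing the negative log-likelihood, and the logits are divided by T before the softmax at evaluation time. A minimal sketch (variable names are placeholders; the original experiments may have used a different optimizer):

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Fit a single temperature T > 0 by minimizing validation NLL."""
    log_t = torch.zeros(1, requires_grad=True)  # parameterize T = exp(log_t) so it stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time: calibrated_probs = torch.softmax(test_logits / T, dim=1)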


Beta loss parameters

 

We select Beta(25, 25) for further experiments.
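For intuition about what these parameter choices say about the cost proportion (a quick check, not taken from the slides):

from scipy.stats import beta

for a, b in [(25, 25), (10, 40)]:
    d = beta(a, b)
    lo, hi = d.interval(0.9)
    print(f"Beta({a}, {b}): mean cost proportion {d.mean():.2f}, central 90% interval [{lo:.2f}, {hi:.2f}]")

Beta(25, 25) keeps the cost proportion tightly around 0.5 (roughly symmetric costs), while Beta(10, 40) centres it around 0.2, i.e. clearly asymmetric costs.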


Beta(25, 25) is better (after temperature scaling calibration) than CE, FL, LS on most cost-sensitive metrics.


Beta(25, 25) is 1% better than the best of (CE, FL, LS) when evaluated on the Beta(10, 40) expected total cost.


Beta(25, 25) is better (after temperature scaling calibration) on common measures.


Cross-entropy and Beta(25, 25) losses improve performance a lot after calibration.



Observations

1. The criterion for choosing an epoch for test evaluation is important.

2. Different model epochs are better on some measures but worse on others.

3. The performance of different losses with the same epoch-selection criterion can be more similar than that of the same loss at different epochs.


Conclusions

1. There are many measures, each of which possibly requires a different loss.

2. Cross-entropy based losses are not the only possible choice.

3. The Beta(25, 25) loss is a good loss for many measures.

4. Good post-hoc calibration and stopping criteria are crucial.


Project: ETAg PRG1604