Rethinking the choice of loss functions for classification with deep learning
Viacheslav Komisarenko, Meelis Kull
February 2, 2023
We have a deep learning classification model.
We have a measure-of-interest (evaluation metric) we want to optimize.
Which loss function (training objective) should we choose?
Measures-of-interest can be of different kinds:
Ranking measures
Take into account the mutual ranking of class logits, but not their actual values.
Examples: ROC-AUC, concordance and discordance scores.
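As an illustration (not from the slides): ranking measures such as ROC-AUC are unchanged by any strictly increasing transformation of the scores. A minimal sketch with hypothetical toy labels and logits, using scikit-learn's roc_auc_score:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Hypothetical toy labels and raw logits
    y = np.array([0, 0, 1, 1, 1, 0])
    logits = np.array([-1.2, 0.3, 0.8, 2.5, -0.1, -2.0])

    # ROC-AUC depends only on how the scores rank the instances, so a strictly
    # increasing transformation such as the sigmoid leaves it unchanged.
    probs = 1.0 / (1.0 + np.exp(-logits))
    assert np.isclose(roc_auc_score(y, logits), roc_auc_score(y, probs))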
Thresholding measures
Evaluate where the logit (or probability) falls with respect to a specific threshold.
Examples: error rate, total cost.
Common thresholds:
for logits: 0
for probabilities: 0.5
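A minimal sketch of a thresholding measure (hypothetical toy data): thresholding logits at 0 is the same decision rule as thresholding the corresponding sigmoid probabilities at 0.5, and the error rate is the fraction of misclassified instances.

    import numpy as np

    # Hypothetical toy labels and logits
    y = np.array([0, 0, 1, 1, 1, 0])
    logits = np.array([-1.2, 0.3, 0.8, 2.5, -0.1, -2.0])
    probs = 1.0 / (1.0 + np.exp(-logits))

    # Thresholding logits at 0 gives the same predictions as thresholding probs at 0.5
    pred = (logits >= 0).astype(int)
    assert np.array_equal(pred, (probs >= 0.5).astype(int))

    error_rate = np.mean(pred != y)   # thresholding measure: fraction of errors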
Probabilities that match the true label
Examples: cross-entropy (CE), mean squared error (MSE)
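For reference, the standard binary definitions (not reproduced from the slides), for a label y in {0, 1} and predicted positive-class probability p:

    \mathrm{CE}(y, p) = -\,y \log p - (1 - y)\log(1 - p), \qquad \mathrm{MSE}(y, p) = (y - p)^2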
Existing approaches to minimize the metric-of-interest:
Cross-entropy or similar losses (e.g. focal)
Fine-tuning
Stopping criteria
Cross-entropy, focal losses
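The formulas shown on this slide are not preserved in the extracted text. For reference, the focal loss of Lin et al. (2017) in its standard binary form, with p_t = p if y = 1 and p_t = 1 - p otherwise, down-weights well-classified examples relative to cross-entropy:

    \mathrm{FL}(y, p) = -(1 - p_t)^{\gamma} \log p_t, \qquad \gamma \ge 0,

which reduces to cross-entropy, CE(y, p) = -log p_t, when gamma = 0.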
Cost-sensitive learning (binary)
Confusion matrix
Cost matrix
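A minimal sketch of how the two matrices combine (standard binary setup, not taken from the slides): the total cost is the element-wise product of the confusion matrix and the cost matrix, summed over all cells. The costs below are hypothetical.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical labels and hard predictions
    y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
    y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 1])

    conf = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
    cost = np.array([[0.0, 1.0],              # TN costs 0, FP costs 1
                     [5.0, 0.0]])             # FN costs 5, TP costs 0

    total_cost = np.sum(conf * cost)          # element-wise product, then sum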
Suppose a false negative (FN) is 5 times more costly than a false positive (FP).
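A worked example (standard Bayes-optimal decision making, not shown explicitly in the extracted slides): with c_FP = 1, c_FN = 5 and zero cost for correct decisions, predicting positive is optimal whenever its expected cost, (1 - p) c_FP, is at most the expected cost of predicting negative, p c_FN, i.e.

    p \ge \frac{c_{FP}}{c_{FP} + c_{FN}} = \frac{1}{1 + 5} = \frac{1}{6} \approx 0.167,

so the decision threshold moves from 0.5 down to 1/6.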
Reminder
Costs are rarely known exactly.
Cost uncertainty
Certain metric from uncertain costs
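One way to make this concrete (a sketch under my own assumptions, not necessarily the authors' exact definition): treat the normalized cost proportion c as a random variable, let a false positive cost c and a false negative cost 1 - c, predict positive iff p >= c, and report the expected total cost over the cost distribution. A Monte Carlo version with hypothetical predictions and a Beta cost distribution (Beta distributions appear later in the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical test labels and predicted positive-class probabilities
    y = np.array([0, 1, 1, 0, 1, 0])
    p = np.array([0.2, 0.9, 0.4, 0.6, 0.8, 0.1])

    # Cost proportion c ~ Beta(a, b); FP costs c, FN costs 1 - c, predict positive iff p >= c
    a, b = 10.0, 40.0
    samples = rng.beta(a, b, size=10_000)

    expected_total_cost = np.mean([
        np.sum(np.where(p >= c, (1 - y) * c, y * (1 - c)))   # total cost at this sampled c
        for c in samples
    ])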
Measure-of-interest
Which loss function should we choose to minimize the expected total cost?
Example
Theoretical result
Consider the gradient of the expected total cost with respect to p.
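The statement of the result is not preserved in the extracted text. A hedged reconstruction of the kind of computation involved, under the assumptions above (cost proportion c with density f, FP cost c, FN cost 1 - c, predict positive iff p >= c): the per-instance expected cost is

    L(y, p) = y \int_p^1 (1 - c)\, f(c)\, dc + (1 - y) \int_0^p c\, f(c)\, dc,

and differentiating with respect to the integration limit p gives

    \frac{\partial L}{\partial p} = -\,y\,(1 - p)\, f(p) + (1 - y)\, p\, f(p),

which is well defined wherever f is continuous. With f a Beta(alpha, beta) density this yields a "Beta loss"; the improper choice f(p) proportional to 1/(p(1 - p)) reproduces the cross-entropy gradient, and a constant f reproduces the MSE gradient up to a constant factor.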
Implications
Expected total cost is differentiable (w.r.t. p).
Hence, it is suitable for gradient-based optimization.
Hypothesis: directly use the expected total cost as the training objective.
Beta distribution
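To make the connection concrete, here is a minimal sketch (my reconstruction under the assumptions above, not the authors' code) of the per-instance expected total cost when the cost proportion follows Beta(a, b), using the regularized incomplete beta function I_x(a, b) = scipy.special.betainc(a, b, x):

    import numpy as np
    from scipy.special import betainc

    def beta_loss(y, p, a, b):
        """Expected total cost per instance when the cost proportion c ~ Beta(a, b):
        a false positive costs c, a false negative costs 1 - c, and we predict
        positive iff p >= c.  Uses c*Beta(c; a, b) = (a/(a+b))*Beta(c; a+1, b) and
        (1-c)*Beta(c; a, b) = (b/(a+b))*Beta(c; a, b+1)."""
        pos = (b / (a + b)) * (1.0 - betainc(a, b + 1.0, p))   # y = 1: E[(1 - c) 1{c > p}]
        neg = (a / (a + b)) * betainc(a + 1.0, b, p)           # y = 0: E[c 1{c <= p}]
        return np.where(y == 1, pos, neg)

    # Hypothetical usage: evaluate the Beta(25, 25) loss on a few predictions
    y = np.array([1, 0, 1])
    p = np.array([0.9, 0.2, 0.4])
    print(beta_loss(y, p, 25.0, 25.0))

For training, the gradient with respect to p is available in closed form (see the sketch above), so the loss can be wrapped in a custom autograd function; this sketch only evaluates the loss.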
Could models trained with Beta losses perform well when evaluated on cost-sensitive and common measures?
Goal of experiments
Suppose we want to evaluate a deep learning model on the expected total cost with a known cost distribution.
Which loss function should we choose?
Experiments setup
Class pairs from CIFAR-10 and Fashion-MNIST; 6 datasets in total: bird vs frog, ship vs car, deer vs plane, Shirt vs Pullover, Top vs Shirt, Pullover vs Coat
ResNet-18 architecture
20 random seeds
Validation set: choose the best Beta loss
Test set: comparison with cross-entropy, focal loss, label smoothing
Post-hoc temperature scaling calibration
Test set evaluation was performed at the epoch with the best validation score
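Temperature scaling (Guo et al., 2017) fits a single scalar T > 0 on the validation set and rescales logits to z / T before the sigmoid/softmax. A minimal sketch of this post-hoc step with hypothetical validation logits and labels (the authors likely used a standard implementation):

    import torch
    import torch.nn.functional as F

    def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
        # Fit T by minimizing cross-entropy of the rescaled logits z / T (binary case).
        log_t = torch.zeros(1, requires_grad=True)   # parametrize T = exp(log_t) > 0
        opt = torch.optim.Adam([log_t], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.binary_cross_entropy_with_logits(val_logits / log_t.exp(), val_labels)
            loss.backward()
            opt.step()
        return log_t.exp().item()

    # Hypothetical usage with made-up validation logits and labels
    val_logits = torch.tensor([2.0, -1.5, 0.3, 3.2, -0.7])
    val_labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
    T = fit_temperature(val_logits, val_labels)
    calibrated_probs = torch.sigmoid(val_logits / T)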
Beta loss parameters
We select Beta(25, 25) for further experiments.
Beta(25, 25) is better (after temperature scaling calibration) than CE, FL, and LS on most cost-sensitive metrics.
Beta(25, 25) is 1% better than the best of CE, FL, and LS when evaluated on the Beta(10, 40) expected total cost.
Beta(25, 25) is better (after temperature scaling calibration) on common measures.
Cross-entropy and Beta(25, 25) losses improve performance substantially after calibration.
Observations
1. The criterion for choosing the epoch used for test evaluation is important.
2. Different model epochs are better on some measures but worse on others.
3. Different losses with the same epoch-selection criterion can perform more similarly than the same loss with different epoch choices.
Conclusions
1. There are many evaluation measures, and different measures may require different losses.
2. Cross-entropy based losses are not the only possible choice.
3. The Beta(25, 25) loss is a good loss for many measures.
4. Good post-hoc calibration and stopping criteria are crucial.
Project: ETAg PRG1604