
On the Calibration of Deep Learning Models to Improve Trustworthy AI

College of Engineering

Computer Science

Cornelia Caragea


Human Confidence and Calibration


Which movie won Best Picture at the 2023 Oscars?

Information Retrieval Group, UIC Computer Science


Human Confidence and Calibration


Who is the Prime Minister of the UK?


Machines…


Do they know what they don’t know?

Or in other words… are they calibrated?


Deep Neural Networks

  • Deep neural networks (DNNs) have established supremacy in many pattern recognition tasks such as object detection, speech recognition, and natural language processing.
    • They are increasingly used in decision-making pipelines and in high-risk fields such as medical diagnosis, autonomous vehicle control, and the legal sector.

  • Major challenges: uncertainty and trustworthiness of a classifier.

  • The DNN must not only be accurate, but also indicate when it is likely to be wrong.
    • This allows a decision to be routed, as needed, to a human or to another more accurate but possibly more expensive classifier, on the assumption that the additional cost is far outweighed by the consequences of a wrong prediction.


DNNs Confidence and Calibration

  • In a well-calibrated classifier, predictive scores should be indicative of the actual likelihood of correctness.
  • Modern architectures, it turns out, are prone to overconfidence.


Accuracy vs. confidence on CIFAR-100 at different training epochs for a VGG-16 network (plot credit: Thulasidasan et al. [2019]).


Calibration in Pre-trained Language Models

  • Current pre-trained language models are often poorly calibrated [Kong et al., 2020], most often being overly confident.


  • E.g., reliability diagram of BERT fine-tuned on text classification using 20NG15 dataset (the first 15 categories of the 20NG dataset).
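A reliability diagram bins predictions by confidence and compares each bin's average confidence against its accuracy; the Expected Calibration Error (ECE) reported in the results is the weighted average of those gaps. A minimal sketch of the standard equal-width-binning estimate [Guo et al., 2017]:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average
    of |average confidence - accuracy| over the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

A perfectly calibrated model (80% confidence, 80% accuracy) gives ECE 0; an overconfident one gives a large positive value.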


Over-confidence

  • Most modern DNNs, when trained for classification in a supervised learning setting, are trained using one-hot encoded labels that place all the probability mass in one class.

    • The training labels are thus zero-entropy signals that admit no uncertainty about the input.
    • The DNN is thus, in some sense, trained to become overconfident.



Calibration Techniques

  • Temperature Scaling [Guo et al., 2017; Desai and Durrett, 2020]
    • A post-processing step that re-scales the logits using a single scale hyperparameter temperature T that is learned on a validation set.
      • T → ∞ yields maximum uncertainty (a uniform distribution);
      • T → 0 collapses the distribution to a point mass.

  • Label Smoothing [Müller et al., 2019; Kumar and Sarawagi, 2019; Desai and Durrett, 2020]
    • A regularization technique that prevents over-confident predictions toward one single class by using soft labels.
      • For example, the one-hot label vector [1, 0, 0] is converted to the smoothed label vector [0.9, 0.05, 0.05].
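The two techniques above can be sketched in a few lines; this is a minimal illustration, not the exact implementations used in the cited papers (the smoothing that spreads eps over the other classes matches the [1, 0, 0] → [0.9, 0.05, 0.05] example):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Post-hoc rescaling of logits by a single temperature T (fit on a
    validation set); T > 1 softens the distribution, T < 1 sharpens it."""
    return softmax(np.asarray(logits, dtype=float) / T)

def smooth_labels(one_hot, eps=0.1):
    """Keep 1 - eps on the true class and spread eps uniformly over the
    other classes, e.g. [1, 0, 0] -> [0.9, 0.05, 0.05] for eps = 0.1."""
    y = np.asarray(one_hot, dtype=float)
    k = y.shape[-1]
    return y * (1.0 - eps) + (1.0 - y) * eps / (k - 1)
```

Note that temperature scaling changes confidence without changing the argmax, so accuracy is untouched; label smoothing changes the training signal itself.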


MixUp


  • MixUp [Zhang et al., 2018]

    • A data augmentation method in which additional samples are generated during training by combining random samples of training inputs and their associated labels.
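The mixing step can be sketched as follows, assuming one-hot labels and a Beta(alpha, alpha) mixing coefficient as in Zhang et al. [2018] (the alpha value is illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Generate a virtual training example as a convex combination of two
    inputs and their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    x = lam * np.asarray(x1, dtype=float) + (1 - lam) * np.asarray(x2, dtype=float)
    y = lam * np.asarray(y1, dtype=float) + (1 - lam) * np.asarray(y2, dtype=float)
    return x, y
```

Because the mixed labels are no longer zero-entropy, the model is trained on targets that admit uncertainty, which is what connects MixUp to calibration.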


On the Calibration of Pre-trained Language Models using MixUp Guided by Area Under the Margin and Saliency


[Park and Caragea, ACL 2022; NAACL 2022]

[Hosseini and Caragea, ACL-Finding 2023; EMNLP 2022]


Proposed MixUp for Model Calibration


  • We propose a MixUp method that is targeted at improving model calibration.

  • We leverage a model’s training dynamics, via the Area Under the Margin (AUM) [Pleiss et al., 2020], to reveal samples with distinct, pronounced characteristics:
    • whether they are easy-to-learn or hard-to-learn/ambiguous for the model.

  • We generate MixUp samples by mixing easy-to-learn with hard-to-learn/ambiguous samples according to their similarity/dissimilarity provided by saliency maps [Simonyan et al., 2013].
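The AUM statistic itself is simple: for each sample, track the margin between the assigned label's logit and the largest other logit across training epochs, and average. A minimal sketch (the epoch logits here are illustrative inputs):

```python
import numpy as np

def area_under_margin(logits_per_epoch, label):
    """AUM (Pleiss et al., 2020): margin between the assigned label's logit
    and the largest other logit, averaged over training epochs.
    Large positive AUM -> easy-to-learn; small or negative -> hard/ambiguous."""
    margins = []
    for logits in np.asarray(logits_per_epoch, dtype=float):
        others = np.delete(logits, label)          # logits of all other classes
        margins.append(logits[label] - others.max())
    return float(np.mean(margins))
```

Ranking the training set by AUM is what separates the easy-to-learn pool from the hard-to-learn/ambiguous pool before mixing.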


MixUp using Saliency Signals


  • Mixing easy-to-learn samples with the most similar hard-to-learn samples improves calibration on in-domain data.

  • Mixing easy-to-learn samples with the most dissimilar hard-to-learn samples improves calibration on out-of-domain data.


Datasets


  • Tasks used for evaluation:
    • Natural Language Inference
      • In-domain: SNLI [Bowman et al., 2015]
      • Out-of-domain: MNLI [Williams et al., 2018]
    • Paraphrase Detection
      • In-domain: QQP [Iyer et al., 2017]
      • Out-of-domain: TwitterPPDB [Lan et al., 2017]
    • Commonsense Reasoning
      • In-domain: SWAG [Zellers et al., 2018]
      • Out-of-domain: HellaSWAG [Zellers et al., 2019]

  • We use models trained on in-domain data to predict out-of-domain test samples.


In-domain Data Results on BERT


Our proposed MixUp yields the best ECE values on all in-domain tasks (similar results are observed with RoBERTa).


Out-of-domain Data Results on BERT


Our proposed MixUp yields the best ECE values on all out-of-domain tasks (similar results are observed with RoBERTa).


LLMs Confidence and Calibration




[1] Sadat and Caragea, 2022: SciNLI: A Corpus for Natural Language Inference on Scientific Text.



  • We explore prompting strategies that achieve the highest overall accuracy while yielding well-calibrated responses from LLMs.

    • LLAMA-2
    • LLAMA-3
    • Phi-3

  • To obtain the models’ confidence, we explore both verbalized approaches and the models’ internal token probabilities.
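The two elicitation routes can be sketched as follows; the prompt wording is an illustrative assumption, not the exact template used in the experiments, and the internal route simply reads the predicted token's softmax probability:

```python
import numpy as np

# Hypothetical prompt template for the verbalized route (wording is an assumption).
VERBALIZED_PROMPT = (
    "Answer the question, then state your confidence as a number from 0 to 100.\n"
    "Question: {question}\n"
    "Answer and confidence:"
)

def internal_confidence(vocab_logits_at_answer):
    """Internal-probability route: softmax over the model's vocabulary logits
    at the answer position; the predicted token's probability is the confidence."""
    z = np.asarray(vocab_logits_at_answer, dtype=float)
    z = z - z.max()                      # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(p.max())
```

The verbalized route asks the model to report its own confidence in text; the internal route reads it off the output distribution, so the two can disagree and be calibrated quite differently.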



Confidence Elicitation



LLM Results



Conclusion


  • We proposed a novel MixUp guided by the Area Under the Margin (AUM) and saliency maps to mitigate the miscalibration of the pre-trained language models BERT and RoBERTa.

  • We showed that our proposed MixUp achieves the lowest Expected Calibration Errors (ECE) for both pre-trained language models on various types of NLU tasks, for both in-domain and out-of-domain data.

  • We explored several prompting strategies for LLM calibration.


Thank you!



Seo Yeon Park

Mobashir Sadat

Tiberiu Sosea

Mahshid Hosseini


Anas Jawad