1 of 44

Key Concepts of Conformal Prediction

  1. Origins
  2. What is Conformal Prediction
  3. Why looking at performance metrics is not enough
  4. How Conformal Prediction works
  5. Why the predictions were not already probabilities
  6. Non-conformity (aka Conformity) measure
  7. Validity (aka Coverage Guarantee)
  8. How to evaluate Conformal Prediction

María Moreno de Castro, July 2024

2 of 44

24 July 1998

2022

(First Ed. 2005)

  1. ORIGINS

3 of 44

Nowadays there are many packages in Python implementing Conformal Prediction methods

and in R:

4 of 44

5 of 44

Nowadays many companies apply Conformal Prediction methods

6 of 44

Nvidia Research has used Conformal Prediction since 2022

7 of 44

Adaptive Conformal Inference (ACI) by Gibbs and Candes (NeurIPS 2021)

https://proceedings.neurips.cc/paper/2021/hash/0d441de75945e5acbc865406fc9a2559-Abstract.html

8 of 44

2. WHAT IS CONFORMAL PREDICTION?

CP is a family of algorithms that compute how “unusual” a prediction is:

the stranger the prediction, the less we trust it

9 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

Performance metrics evaluate the model estimator correctness:

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

the model is correct when its predictions match the observations,

e.g. ACC = 0.9

10 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

Performance metrics evaluate the model estimator correctness:

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

the model is correct when its predictions match the observations,

e.g. ACC = 0.9

Even for a perfect model estimator,

not all the instances are equally hard to predict.

With a prediction range, we could distinguish:

  • uncertain predictions, e.g. (80±50)%
  • certain predictions, e.g. (80±5)%

11 of 44

https://robot-help.github.io ← a very revealing short video

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

12 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

13 of 44


Vincent Warmerdam’s “Mean Squared Terror” https://koaning.io/posts/mean-squared-terror/

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

14 of 44

Prediction intervals: those that we calculate with CP; with statistical guarantee they contain the ground truth, i.e., they meet validity, also called "coverage guarantee". In classification, Computer Vision, and NLP they are not continuous intervals, e.g. [5.5, 6.5], but sets of potential outcomes, e.g. {class B, class C}. If the prediction is very certain, its prediction interval/set is just a single value, like a single-point prediction.

Confidence intervals of frequentist statistics: associated with the uncertainty in estimating a population parameter (e.g. the expected value) from a sample statistic (e.g. the mean). They also indicate a range, but in this case a dispersion around the estimator (e.g. the mean), not the uncertainty of an individual prediction. The classic example: [mean − std, mean + std]. They do not fulfill validity.

Credible intervals of posteriors in Bayesian statistics: like our prediction intervals, these also include the prediction with a given probability, i.e., they treat their bounds as fixed and the prediction as a random variable (frequentist confidence intervals do the opposite: they treat their bounds as random variables and the estimated value as fixed). They require knowledge of the prior distribution, which is specific to the inputs of each situation (not at all simple). They do not fulfill validity.


Intervals types and their validity

15 of 44

4. HOW DOES CONFORMAL PREDICTION WORK?

  • The predictions are made by a statistical, ML, DL,... estimator model

  • The original CP was “Transductive” (TCP). Now the most popular variant is the “Split” aka “Inductive” (ICP), where we separate 3 sets:
      • training set: to train the estimator model
      • calibration set: to train the conformal predictor (some people call it “validation set”, just to confuse us all 😜)
      • test set: to test the predictions

  • In the ICP setup we post-process the predictions, thus ICP is:
      • Post-hoc (no need to retrain the estimator model)
      • Model-agnostic (CP does not need to know the estimator model)
      • Distribution-free (CP is non-parametric)

More info about TCP vs ICP here: https://valeman.medium.com/how-to-use-full-transductive-conformal-prediction and other CP forms here: https://mlwithouttears.com/2024/02/04/fifty-four-actually-shades-of-conformal-prediction/

16 of 44

4. HOW DOES CONFORMAL PREDICTION WORK?

Validity is mathematically guaranteed if the calibration and the test sets are exchangeable (a weaker requirement than the i.i.d. assumption of most ML models). However, a recent paper (April 2025) showed that validity can also hold without the exchangeability requirement.

17 of 44

BASIC STEPS:

  • Put the calibration predictions together and sort them by a non-conformity measure.

  • Decide a significance level (alpha) to define the “strangeness” threshold. If alpha is 0.05, we are setting that 95% of the calibration predictions will be considered less extreme than that threshold.

  • Measure how different new predictions are with respect to that threshold: check whether each potential prediction for a new instance, e.g. one of the test set, is more or less extreme than the predictions of the calibration set.

  • Append the “non-strange” predictions to a list: for each potential prediction of a new instance, if its p-value > alpha (i.e. its non-conformity score falls within the least extreme 95% of the calibration scores), that prediction for that new instance is not strange, so add it to the prediction set.

  • If the list is long, many predictions were possible. The shorter the list, the more certain the model was.
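The steps above can be sketched in a few lines of numpy. This is a minimal illustration under simplifying assumptions, not the full algorithm: it uses the hinge score 1 − p(true class), the common finite-sample quantile correction, and entirely made-up calibration probabilities.

```python
import numpy as np

alpha = 0.05  # significance level: target 95% coverage

# Mock softmax outputs for 8 calibration instances (3 classes) and their true labels
cal_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5],
                      [0.9, 0.05, 0.05], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8],
                      [0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
cal_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])

# Non-conformity scores on the calibration set: hinge = 1 - p(true class)
cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

# "Strangeness" threshold: the (1 - alpha) quantile of the calibration scores,
# with the finite-sample correction ceil((n+1)(1-alpha))/n
n = len(cal_scores)
q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
threshold = np.quantile(cal_scores, q_level)

# For a new instance, keep every class whose score is not stranger than the threshold
new_probs = np.array([0.75, 0.2, 0.05])
prediction_set = [c for c in range(3) if 1.0 - new_probs[c] <= threshold]

print(prediction_set)  # a short set means the model was certain on this instance
```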

18 of 44

BASIC STEPS for regression:

  • Put the calibration predictions together and sort them by a non-conformity measure.

  • Decide a significance level (alpha) to define the “strangeness” threshold. If alpha is 0.05, calculate the 95th quantile, Q_{1-alpha}: 95% of the calibration predictions are less extreme than that threshold.

  • The threshold defines the lower and upper bounds of the “strangeness” for new predictions: for each new instance in the test set, calculate the prediction interval as [y_pred − Q_{1-alpha}, y_pred + Q_{1-alpha}].

  • If the bounds are close to each other, the interval is small and the model was more certain.
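A minimal numpy sketch of these regression steps, assuming a fitted estimator is already available (here faked with a simple linear rule on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

# Synthetic data and a stand-in "fitted estimator": y ≈ 2x with unit noise
x_cal = rng.uniform(0, 10, 500)
y_cal = 2 * x_cal + rng.normal(0, 1, 500)
predict = lambda x: 2 * x  # pretend this is model.predict

# Non-conformity scores on the calibration set: absolute residuals
cal_scores = np.abs(y_cal - predict(x_cal))

# Q_{1-alpha}: ~95% of calibration residuals fall below this threshold
n = len(cal_scores)
q = np.quantile(cal_scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))

# Prediction interval for a new instance: [y_pred - Q_{1-alpha}, y_pred + Q_{1-alpha}]
x_new = 4.2
y_pred = predict(x_new)
interval = (y_pred - q, y_pred + q)
print(interval)
```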

19 of 44

These were just the basic general steps of vanilla inductive conformal prediction (ICP).

Every CP in the family modifies the steps in different ways to address different tasks:

20 of 44

Two Free Books to start with:

  • about ML in general but including CP
  • about Time Series in general but including CP

21 of 44

A prerequisite for trust in machine learning is uncertainty quantification. Without it, an accurate prediction and a wild guess look the same.

see the code example of this presentation in the repo: https://github.com/MMdeCastro/Uncertainty_Quantification_XAI

22 of 44

Interactive demo of Transductive CP for classification: https://cml.rhul.ac.uk/CP/index.html

23 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated by default (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p
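The condition P(y_true | prediction = p) = p can be checked empirically by binning: among all instances that received prediction ≈ p, the fraction of positives should be ≈ p. A toy numpy sketch, using a synthetic forecaster that is calibrated by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a calibrated forecaster: when it says p, it rains with probability p
preds = rng.uniform(0, 1, 100_000)
obs = rng.uniform(0, 1, 100_000) < preds  # rain (1) / no rain (0)

# Bin the predictions and compare each bin's mean prediction to its observed frequency
bins = np.linspace(0, 1, 11)
idx = np.digitize(preds, bins[1:-1])
for b in range(10):
    mask = idx == b
    print(f"predicted ~{preds[mask].mean():.2f}  observed {obs[mask].mean():.2f}")
```

For an uncalibrated model the two columns would diverge; that gap is what calibration (and a reliability diagram) measures.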

24 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the estimator are not probabilities unless the estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

25 of 44

Predictions are raw scores that passed through a sigmoid, softmax, ... that normalized them to between 0 and 1 so they add up to 1, but they are not calibrated: they are not probabilities.

26 of 44

Example: the MLP in scikit-learn uses activation = 'relu', but that is only for the hidden layers; look at the output activation function instead.
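This is easy to verify on a toy model (assuming scikit-learn is installed): `activation='relu'` applies only to the hidden layers, while the fitted attribute `out_activation_` reveals the output activation, which for binary classification is the logistic sigmoid squashing raw scores into (0, 1).

```python
from sklearn.neural_network import MLPClassifier

# Tiny toy binary problem, just to fit the model
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    max_iter=50, random_state=0)
clf.fit(X, y)

print(clf.activation)       # hidden-layer activation: 'relu'
print(clf.out_activation_)  # output activation: 'logistic' for binary classification
```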

27 of 44

example: Decision Tree for Regression

  1. find the best features and values to split the data into regions
  2. predict the mean value of the region
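A one-split numpy sketch of why a regression tree's output is not a probability: the prediction is simply the mean of the training targets in a region (the split threshold and data here are arbitrary, for illustration only).

```python
import numpy as np

# Toy training data with two clearly separated regions
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 0.9, 1.0, 4.2, 3.8, 4.0])

threshold = 4.0  # step 1: a split value separating the two regions

# step 2: the mean target of each region becomes that region's prediction
left_mean = y[x <= threshold].mean()
right_mean = y[x > threshold].mean()

print(left_mean, right_mean)  # every x in a region gets the same point prediction
```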

28 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

29 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

.predict_proba(x_i)    true label y_i
       0.8                   1
       0.5                   0
       0.4                   1

30 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

Calibration is good for performance too:

“Properties and Benefits of Calibrated Classifiers” https://link.springer.com/chapter/10.1007/978-3-540-30116-5_14

31 of 44

6. NON-CONFORMITY (aka CONFORMITY) MEASURE

  • It is the way we choose to define “strangeness” or “similarity”; the two are equivalent, just different names: “non-conformity” or “conformity”
  • It evaluates the coherence between the prediction and the actual value, where a smaller score denotes a closer match.
  • It can be any real-valued function: the better it is, the narrower the prediction intervals you get (= efficiency)

32 of 44

6. NON-CONFORMITY (aka CONFORMITY) MEASURE

Most popular non-conformity measures:

  • classification:
      • hinge loss = 1 − y_pred_true_class; it favours narrow sets
      • margin loss = y_pred_most_likely_other_class − y_pred_true_class; it favours single-label sets

  • regression:
      • absolute error (residuals): |y_true − y_pred|
      • scaled absolute error: |y_true − y_pred| / var, to take heteroscedasticity into account
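Both classification measures can be computed directly from `predict_proba`-style outputs; a small numpy sketch with made-up probabilities:

```python
import numpy as np

# Mock predicted class probabilities for 2 instances (3 classes) and the true classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.35, 0.25]])
true_class = np.array([0, 1])

p_true = probs[np.arange(len(true_class)), true_class]

# hinge: 1 - p(true class) — small when the model puts its mass on the truth
hinge = 1.0 - p_true

# margin: p(most likely other class) - p(true class) — negative when the truth wins
masked = probs.copy()
masked[np.arange(len(true_class)), true_class] = -np.inf  # exclude the true class
margin = masked.max(axis=1) - p_true

print(hinge)
print(margin)
```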

33 of 44

7. VALIDITY (aka COVERAGE GUARANTEE)

.predict_proba(x_i)    true label y_i
       0.7                   1
       0.5                   0
       0.4                   1

34 of 44

Validity is fulfilled if the prediction intervals contain the true label at the given confidence level defined by alpha.

  • Marginal coverage: on average; it is theoretically proven for the whole CP family

  • Conditional coverage: per individual; it is only empirically/heuristically approximated (e.g. Mondrian CP achieves it per class). In a distribution-free setting, conditional coverage for every point has been theoretically proven to be impossible to achieve by any means
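The gap between marginal and class-conditional (Mondrian-style) coverage is just a matter of counting; a numpy sketch on made-up prediction sets:

```python
import numpy as np

# Mock test data: true labels and CP prediction sets for 6 instances
labels = np.array([0, 0, 0, 1, 1, 1])
pred_sets = [{0}, {0}, {0, 1}, {0}, {1}, {0, 1}]

covered = np.array([y in s for y, s in zip(labels, pred_sets)])

marginal = covered.mean()                                   # averaged over everyone
per_class = {c: covered[labels == c].mean() for c in (0, 1)}  # averaged within each class

print(marginal)    # looks fine on average...
print(per_class)   # ...but one class can still be under-covered
```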

7. VALIDITY (aka COVERAGE GUARANTEE)

35 of 44

Conditional validity for individual predictions has been mathematically proven to be impossible for any distribution-free method: any distribution-free method attempting it inflates the prediction intervals to infinite size with probability 1.

36 of 44

UC Berkeley Special Series on Conformal Prediction: https://youtu.be/rvYnR0FGxM4?si=IlC8AEAJ8DRgXTyx

37 of 44

and

  • CP calibrates the predictions, thus they are truly probabilities
  • CP allows the estimator to express its confidence in its predictions with prediction intervals

UPDATE OF SECTION 2: WHAT IS CONFORMAL PREDICTION?

CP is a family of algorithms that compute the probability that a prediction is “unusual”:

the stranger the prediction, the less we trust it

NOTE: CP marginal validity is mathematically proven under the data exchangeability assumption, although there are ways to empirically work around it (for instance, for time series), and a recent paper showed CP validity without the exchangeability requirement (see section 4)

38 of 44

8. HOW TO EVALUATE CP

The calibration: whether the predictions are true probabilities:

P(y_true | prediction = p) = p

  • Classification: popular strictly proper metrics for binary and multiclass:
    • Brier Score: (y_pred_positive_class - y_true_label)^2
    • Log-Loss:
      • -log(y_pred_positive_class) for the positive class, and
      • -log(1 - y_pred_positive_class) for the negative class

and the closer to zero, the better, for both metrics

  • General: calibration plots (aka reliability diagrams) where 1:1 diagonal means perfect calibration
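Using the three-instance example from the earlier slides (predictions 0.8, 0.5, 0.4 with true labels 1, 0, 1), both metrics take a few lines of numpy:

```python
import numpy as np

y_pred = np.array([0.8, 0.5, 0.4])  # predicted probability of the positive class
y_true = np.array([1, 0, 1])

# Brier Score: mean squared difference between probability and outcome
brier = np.mean((y_pred - y_true) ** 2)

# Log-Loss: -log(p) for positives, -log(1 - p) for negatives
log_loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(brier)     # (0.04 + 0.25 + 0.36) / 3
print(log_loss)
```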

39 of 44

8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE I

40 of 44

8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE II

41 of 44

The prediction intervals/sets provide confidence in the predictions and are evaluated by calculating their:

  • coverage: whether the intervals/sets include the ground truth:

P(y_true in PI) >= 1 - alpha

  • efficiency (aka sharpness): how narrow (= informative) they are:
    • in classification: cardinality; sometimes we also care about the number of empty and single-label sets
    • in regression/time series: popular metrics are the Continuous Ranked Probability Score (CRPS) for distributions and the Winkler score (kind of an approximation of CRPS for intervals); both also account for coverage.

8. HOW TO EVALUATE CP
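Empirical coverage and efficiency for regression intervals reduce to two averages; a numpy sketch on synthetic intervals of the form y_pred ± q (q playing the role of the calibration threshold):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1

# Synthetic ground truth and stand-in point predictions
y_true = rng.normal(0, 1, 10_000)
y_pred = np.zeros_like(y_true)

# q from a separate "calibration" sample, as split CP would do
q = np.quantile(np.abs(rng.normal(0, 1, 10_000)), 1 - alpha)
lower, upper = y_pred - q, y_pred + q

# coverage: fraction of intervals containing the truth (should be >= 1 - alpha)
coverage = np.mean((y_true >= lower) & (y_true <= upper))

# efficiency: mean interval width (smaller = more informative)
width = np.mean(upper - lower)

print(coverage, width)
```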

42 of 44

8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE I

43 of 44

8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE II

44 of 44

“All other alternative Uncertainty Quantification methods do not have in-built validity guarantees. In the first independent comparative study of all four classes of uncertainty quantification methods, only Conformal Prediction satisfied the property of validity.” (aka coverage)