1 of 44

Key Concepts of Conformal Prediction

  1. Origins
  2. What is Conformal Prediction
  3. Why looking at performance metrics is not enough
  4. How Conformal Prediction works
  5. Why the predictions were not already probabilities
  6. Non-conformity (aka Conformity) measure
  7. Validity (aka Coverage Guarantee)
  8. How to evaluate Conformal Prediction

María Moreno de Castro, July 2024

2 of 44

24 July 1998

2022

(First Ed. 2005)

  1. ORIGINS

3 of 44

Nowadays there are many packages in Python implementing Conformal Prediction methods

and in R:

4 of 44

5 of 44

Nowadays many companies apply Conformal Prediction methods

6 of 44

Nvidia Research has used Conformal Prediction since 2022

7 of 44

Adaptive Conformal Inference (ACI) by Gibbs and Candes (NeurIPS 2021)

https://proceedings.neurips.cc/paper/2021/hash/0d441de75945e5acbc865406fc9a2559-Abstract.html

8 of 44

2. WHAT IS CONFORMAL PREDICTION?

CP is a family of algorithms that compute how “unusual” a prediction is:

the stranger the prediction, the less we trust it

9 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

Performance metrics evaluate the model estimator correctness:

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

the model is correct when its predictions match the observations,

e.g. ACC = 0.9

10 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

Performance metrics evaluate the model estimator correctness:

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

the model is correct when its predictions match the observations,

e.g. ACC = 0.9

Even for a perfect model estimator,

not all the instances are equally hard to predict.

With a prediction range, we could distinguish:

  • uncertain predictions, e.g. (80±50)%
  • certain predictions, e.g. (80±5)%

11 of 44

https://robot-help.github.io ← a very revealing short video

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

12 of 44

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

13 of 44


Vincent Warmerdam’s “Mean Squared Terror” https://koaning.io/posts/mean-squared-terror/

3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?

14 of 44

Prediction intervals: those that we calculate with CP; with statistical guarantee they contain the ground truth, i.e., they meet validity, also called "coverage guarantee". In classification, Computer Vision, and NLP they are not continuous intervals, e.g. [5.5, 6.5], but sets of potential outcomes, e.g. {class B, class C}. If the prediction is very certain, its prediction interval/set is just a single value, like a single-point prediction.

Confidence intervals of frequentist statistics: associated with the uncertainty in estimating a population parameter (e.g. the expected value) from a sample statistic (e.g. the mean). They also indicate a range, but in this case a dispersion around the estimator (e.g. the mean), not the uncertainty of an individual prediction. The classic example: [mean − std, mean + std]. They do not fulfill validity.

Credible intervals of posteriors in Bayesian statistics: like our prediction intervals, these also include the prediction with a given probability, i.e., they treat their bounds as fixed and the prediction as a random variable (frequentist confidence intervals do the opposite: they treat their bounds as random variables and the estimated value as fixed). They require knowledge of the prior distribution, which is specific to the inputs of each situation (not at all simple). They do not fulfill validity.


Intervals types and their validity

15 of 44

4. HOW DOES CONFORMAL PREDICTION WORK?

  • The predictions are made by a statistical, ML, DL,... estimator model

  • The original CP was “Transductive” (TCP). Now the most popular variant is the “Split” aka “Inductive” (ICP), where we separate 3 sets:
      • training set: to train the estimator model
      • calibration set: to train the conformal predictor (some people call it “validation set”, just to confuse us all 😜)
      • test set: to test the predictions

  • In the ICP setup we post-process the predictions, thus ICP is:
      • Post-hoc (no need to retrain the estimator model)
      • Model-agnostic (CP does not need to know the estimator model)
      • Distribution-free (CP is non-parametric)

More info about TCP vs ICP here: https://valeman.medium.com/how-to-use-full-transductive-conformal-prediction and other CP forms here: https://mlwithouttears.com/2024/02/04/fifty-four-actually-shades-of-conformal-prediction/

16 of 44

4. HOW DOES CONFORMAL PREDICTION WORK?

Validity is mathematically guaranteed if the calibration and the test sets are exchangeable (a weaker requirement than the i.i.d. assumption of most ML models). However, a recent paper (April 2025) showed that validity can also hold without the exchangeability requirement.

17 of 44

BASIC STEPS:

  • Put the calibration predictions together and sort them by a non-conformity measure.

  • Decide a significance level (alpha) to define the “strangeness” threshold. If alpha is 0.05, we are setting that 95% of the calibration predictions will be considered less extreme than that threshold.

  • Measure how different new predictions are with respect to that threshold: check whether each potential prediction for a new instance, e.g. one of the test set, is more or less extreme than the predictions of the calibration set.

  • Append the “non-strange” predictions to a list: for each potential prediction of a new instance, if its p-value > alpha (i.e. its non-conformity score falls within the least extreme 95% of the calibration scores), that prediction for that new instance is not strange, so add it to the prediction set.

  • If the list is long, many predictions were possible. The shorter the list, the more certain the model was.
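The steps above can be sketched in a few lines of numpy. This is a minimal illustration under simplifying assumptions, not the full algorithm: it uses the hinge score 1 − p(true class), the common finite-sample quantile correction, and entirely made-up calibration probabilities.

```python
import numpy as np

alpha = 0.05  # significance level: target 95% coverage

# Mock softmax outputs for 8 calibration instances (3 classes) and their true labels
cal_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5],
                      [0.9, 0.05, 0.05], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8],
                      [0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
cal_labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])

# Non-conformity scores on the calibration set: hinge = 1 - p(true class)
cal_scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]

# "Strangeness" threshold: the (1 - alpha) quantile of the calibration scores,
# with the finite-sample correction ceil((n+1)(1-alpha))/n
n = len(cal_scores)
q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
threshold = np.quantile(cal_scores, q_level)

# For a new instance, keep every class whose score is not stranger than the threshold
new_probs = np.array([0.75, 0.2, 0.05])
prediction_set = [c for c in range(3) if 1.0 - new_probs[c] <= threshold]

print(prediction_set)  # a short set means the model was certain on this instance
```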

18 of 44

BASIC STEPS for regression:

  • Put the calibration predictions together and sort them by a non-conformity measure.

  • Decide a significance level (alpha) to define the “strangeness” threshold. If alpha is 0.05, calculate the 95th quantile, Q_{1-alpha}: 95% of the calibration predictions are less extreme than that threshold.

  • The threshold defines the lower and upper bounds of the “strangeness” for new predictions: for each new instance in the test set, calculate the prediction interval as [y_pred − Q_{1-alpha}, y_pred + Q_{1-alpha}].

  • If the bounds are close to each other, the interval is small and the model was more certain.
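A minimal numpy sketch of these regression steps, assuming a fitted estimator is already available (here faked with a simple linear rule on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

# Synthetic data and a stand-in "fitted estimator": y ≈ 2x with unit noise
x_cal = rng.uniform(0, 10, 500)
y_cal = 2 * x_cal + rng.normal(0, 1, 500)
predict = lambda x: 2 * x  # pretend this is model.predict

# Non-conformity scores on the calibration set: absolute residuals
cal_scores = np.abs(y_cal - predict(x_cal))

# Q_{1-alpha}: ~95% of calibration residuals fall below this threshold
n = len(cal_scores)
q = np.quantile(cal_scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))

# Prediction interval for a new instance: [y_pred - Q_{1-alpha}, y_pred + Q_{1-alpha}]
x_new = 4.2
y_pred = predict(x_new)
interval = (y_pred - q, y_pred + q)
print(interval)
```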

19 of 44

These were just the basic general steps of vanilla inductive conformal prediction (ICP).

Every CP in the family modifies the steps in different ways to address different tasks:

20 of 44

Two Free Books to start with:

  • about ML in general but including CP
  • about Time Series in general but including CP

21 of 44

A prerequisite for trust in machine learning is uncertainty quantification. Without it, an accurate prediction and a wild guess look the same.

see the code example of this presentation in the repo: https://github.com/MMdeCastro/Uncertainty_Quantification_XAI

22 of 44

Interactive demo of Transductive CP for classification: https://cml.rhul.ac.uk/CP/index.html

23 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated by default (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p
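The condition P(y_true | prediction = p) = p can be checked empirically by binning: among all instances that received prediction ≈ p, the fraction of positives should be ≈ p. A toy numpy sketch, using a synthetic forecaster that is calibrated by construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a calibrated forecaster: when it says p, it rains with probability p
preds = rng.uniform(0, 1, 100_000)
obs = rng.uniform(0, 1, 100_000) < preds  # rain (1) / no rain (0)

# Bin the predictions and compare each bin's mean prediction to its observed frequency
bins = np.linspace(0, 1, 11)
idx = np.digitize(preds, bins[1:-1])
for b in range(10):
    mask = idx == b
    print(f"predicted ~{preds[mask].mean():.2f}  observed {obs[mask].mean():.2f}")
```

For an uncalibrated model the two columns would diverge; that gap is what calibration (and a reliability diagram) measures.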

24 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the estimator are not probabilities unless the estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

25 of 44

Predictions are raw scores that passed through a sigmoid, softmax, ... that normalized them to between 0 and 1 so they add up to 1, but they are not calibrated: they are not probabilities.

26 of 44

Example: the MLP in scikit-learn uses activation = 'relu', but that is only for the hidden layers; look at the output activation function instead.
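This is easy to verify on a toy model (assuming scikit-learn is installed): `activation='relu'` applies only to the hidden layers, while the fitted attribute `out_activation_` reveals the output activation, which for binary classification is the logistic sigmoid squashing raw scores into (0, 1).

```python
from sklearn.neural_network import MLPClassifier

# Tiny toy binary problem, just to fit the model
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

clf = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    max_iter=50, random_state=0)
clf.fit(X, y)

print(clf.activation)       # hidden-layer activation: 'relu'
print(clf.out_activation_)  # output activation: 'logistic' for binary classification
```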

27 of 44

example: Decision Tree for Regression

  1. find the best features and values to split the data into regions
  2. predict the mean value of the region
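A one-split numpy sketch of why a regression tree's output is not a probability: the prediction is simply the mean of the training targets in a region (the split threshold and data here are arbitrary, for illustration only).

```python
import numpy as np

# Toy training data with two clearly separated regions
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([1.1, 0.9, 1.0, 4.2, 3.8, 4.0])

threshold = 4.0  # step 1: a split value separating the two regions

# step 2: the mean target of each region becomes that region's prediction
left_mean = y[x <= threshold].mean()
right_mean = y[x > threshold].mean()

print(left_mean, right_mean)  # every x in a region gets the same point prediction
```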

28 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

29 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

.predict_proba(x_i)    true label y_i
       0.8                   1
       0.5                   0
       0.4                   1

30 of 44

5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES

The predictions of the model estimator are not probabilities unless the model estimator is calibrated (which hardly occurs):

Weather forecast: Rain (1), No rain (0)

Predictions: day 1: 80%, day 2: 20%, …

Observations: day 1: 1, day 2: 0, ...

a prediction of 0.8 should mean that 80/100 of the days with prediction 80% rained

P(y_true | prediction = p) = p

TO CALIBRATE: to match y_pred with y_true

Calibration is good for performance too:

“Properties and Benefits of Calibrated Classifiers” https://link.springer.com/chapter/10.1007/978-3-540-30116-5_14

31 of 44

6. NON-CONFORMITY (aka CONFORMITY) MEASURE

  • It is the way we choose to define “strangeness” or “similarity”; the two are equivalent, just different names: “non-conformity” or “conformity”
  • It evaluates the coherence between the prediction and the actual value, where a smaller score denotes a closer match.
  • It can be any real-valued function: the better it is, the narrower the prediction intervals you get (= efficiency)

32 of 44

6. NON-CONFORMITY (aka CONFORMITY) MEASURE

Most popular non-conformity measures:

  • classification:
      • hinge loss = 1 − y_pred_true_class; it favours narrow sets
      • margin loss = y_pred_most_likely_other_class − y_pred_true_class; it favours single-label sets

  • regression:
      • absolute error (residuals): |y_true − y_pred|
      • scaled absolute error: |y_true − y_pred| / var, to take heteroscedasticity into account
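Both classification measures can be computed directly from `predict_proba`-style outputs; a small numpy sketch with made-up probabilities:

```python
import numpy as np

# Mock predicted class probabilities for 2 instances (3 classes) and the true classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.35, 0.25]])
true_class = np.array([0, 1])

p_true = probs[np.arange(len(true_class)), true_class]

# hinge: 1 - p(true class) — small when the model puts its mass on the truth
hinge = 1.0 - p_true

# margin: p(most likely other class) - p(true class) — negative when the truth wins
masked = probs.copy()
masked[np.arange(len(true_class)), true_class] = -np.inf  # exclude the true class
margin = masked.max(axis=1) - p_true

print(hinge)
print(margin)
```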

33 of 44

7. VALIDITY (aka COVERAGE GUARANTEE)

.predict_proba(x_i)    true label y_i
       0.7                   1
       0.5                   0
       0.4                   1

34 of 44

Validity is fulfilled if the prediction intervals contain the true label at the given confidence level defined by alpha.

  • Marginal coverage: on average; it is theoretically proven for the whole CP family

  • Conditional coverage: per individual; it is only empirically/heuristically approximated (e.g. Mondrian CP achieves it per class). In a distribution-free setting, conditional coverage for every point has been theoretically proven to be impossible to achieve by any means
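The gap between marginal and class-conditional (Mondrian-style) coverage is just a matter of counting; a numpy sketch on made-up prediction sets:

```python
import numpy as np

# Mock test data: true labels and CP prediction sets for 6 instances
labels = np.array([0, 0, 0, 1, 1, 1])
pred_sets = [{0}, {0}, {0, 1}, {0}, {1}, {0, 1}]

covered = np.array([y in s for y, s in zip(labels, pred_sets)])

marginal = covered.mean()                                   # averaged over everyone
per_class = {c: covered[labels == c].mean() for c in (0, 1)}  # averaged within each class

print(marginal)    # looks fine on average...
print(per_class)   # ...but one class can still be under-covered
```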

7. VALIDITY (aka COVERAGE GUARANTEE)

35 of 44

Conditional validity for individual predictions has been mathematically proven to be impossible for any distribution-free method: any distribution-free method attempting it inflates the prediction intervals to infinite size with probability 1.

36 of 44

UC Berkeley Special Series on Conformal Prediction: https://youtu.be/rvYnR0FGxM4?si=IlC8AEAJ8DRgXTyx

37 of 44

and

  • CP calibrates the predictions, thus they are truly probabilities
  • CP allows the estimator to express its confidence in its predictions with prediction intervals

UPDATE OF SECTION 2: WHAT IS CONFORMAL PREDICTION?

CP is a family of algorithms that compute the probability that a prediction is “unusual”:

the stranger the prediction, the less we trust it

NOTE: CP marginal validity is mathematically proven under the data exchangeability assumption, although there are ways to empirically work around it (for instance, for time series), and a recent paper showed CP validity without the exchangeability requirement (see section 4)

38 of 44

8. HOW TO EVALUATE CP

The calibration: whether the predictions are true probabilities:

P(y_true | prediction = p) = p

  • Classification: popular strictly proper metrics for binary and multiclass:
    • Brier Score: (y_pred_positive_class - y_true_label)^2
    • Log-Loss:
      • -log(y_pred_positive_class) for the positive class, and
      • -log(1 - y_pred_positive_class) for the negative class

and the closer to zero, the better, for both metrics

  • General: calibration plots (aka reliability diagrams) where 1:1 diagonal means perfect calibration
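Using the three-instance example from the earlier slides (predictions 0.8, 0.5, 0.4 with true labels 1, 0, 1), both metrics take a few lines of numpy:

```python
import numpy as np

y_pred = np.array([0.8, 0.5, 0.4])  # predicted probability of the positive class
y_true = np.array([1, 0, 1])

# Brier Score: mean squared difference between probability and outcome
brier = np.mean((y_pred - y_true) ** 2)

# Log-Loss: -log(p) for positives, -log(1 - p) for negatives
log_loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(brier)     # (0.04 + 0.25 + 0.36) / 3
print(log_loss)
```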

39 of 44

8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE I

40 of 44

8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE II

41 of 44

The prediction intervals/sets provide confidence in the predictions and are evaluated by calculating their:

  • coverage: whether the intervals/sets include the ground truth:

P(y_true in PI) >= 1 - alpha

  • efficiency (aka sharpness): how narrow (= informative) they are:
    • in classification: cardinality; sometimes we also care about the number of empty and single-label sets
    • in regression/time series: popular metrics are the Continuous Ranked Probability Score (CRPS) for distributions and the Winkler score (kind of an approximation of CRPS for intervals); both also account for coverage.

8. HOW TO EVALUATE CP
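Empirical coverage and efficiency for regression intervals reduce to two averages; a numpy sketch on synthetic intervals of the form y_pred ± q (q playing the role of the calibration threshold):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.1

# Synthetic ground truth and stand-in point predictions
y_true = rng.normal(0, 1, 10_000)
y_pred = np.zeros_like(y_true)

# q from a separate "calibration" sample, as split CP would do
q = np.quantile(np.abs(rng.normal(0, 1, 10_000)), 1 - alpha)
lower, upper = y_pred - q, y_pred + q

# coverage: fraction of intervals containing the truth (should be >= 1 - alpha)
coverage = np.mean((y_true >= lower) & (y_true <= upper))

# efficiency: mean interval width (smaller = more informative)
width = np.mean(upper - lower)

print(coverage, width)
```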

42 of 44

8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE I

43 of 44

8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE II

44 of 44

“All other alternative Uncertainty Quantification methods do not have in-built validity guarantees. In the first independent comparative study of all four classes of uncertainty quantification methods, only Conformal Prediction satisfied the property of validity.” (aka coverage)