Key Concepts of Conformal Prediction
María Moreno de Castro, July 2024
The first CP works date back to 24 July 1998.
Vovk et al., Algorithmic Learning in a Random World, 2nd Ed. 2022 (First Ed. 2005).
Nowadays there are many packages implementing Conformal Prediction methods, both in Python and in R:
Nowadays many companies apply Conformal Prediction methods
Nvidia Research has been using Conformal Prediction since 2022
Adaptive Conformal Inference (ACI) by Gibbs and Candes (NeurIPS 2021)
https://proceedings.neurips.cc/paper/2021/hash/0d441de75945e5acbc865406fc9a2559-Abstract.html
2. WHAT IS CONFORMAL PREDICTION?
CP is a family of algorithms that compute how “unusual” a prediction is:
the stranger the prediction is, the less we trust that prediction
3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?
Performance metrics evaluate the model estimator correctness:
Weather forecast: Rain (1), No rain (0)
Predictions: day 1: 80%, day 2: 20%, …
Observations: day 1: 1, day 2: 0, ...
the model is correct whenever predictions match observations,
e.g. ACC = 0.9
Even for a perfect model estimator,
not all the instances are equally hard to predict.
With a prediction range, we could distinguish:
https://robot-help.github.io ← a very revealing short video
Vincent Warmerdam’s “Mean Squared Terror” https://koaning.io/posts/mean-squared-terror/
3. WHY IS LOOKING AT PERFORMANCE METRICS NOT ENOUGH?
Interval types and their validity
Prediction intervals: those that we calculate with CP; with statistical guarantee they contain the ground truth, i.e., they meet validity, also called "coverage guarantee". In classification, Computer Vision, and NLP they are not continuous intervals, e.g. [5.5, 6.5], but sets of potential outcomes, e.g. {class B, class C}. If the prediction is very certain, its prediction interval/set is just a single value, like a single-point prediction.
Confidence intervals of frequentist statistics: associated with the uncertainty in estimating a population parameter (e.g. the expected value) from a sample statistic (e.g. the mean). They also indicate a range, but in this case the dispersion around the estimator (e.g. the mean), not the uncertainty of an individual prediction. The classic example: [mean - std, mean + std]. They do not fulfill validity.
Credible intervals of posteriors in Bayesian statistics: like our prediction intervals, these also include the prediction with a given probability, i.e., they treat their bounds as fixed and the prediction as a random variable (frequentist confidence intervals treat their bounds as random variables and the prediction as a fixed value). They require knowledge of the prior distribution, which is specific to the inputs of each situation (not at all simple). They do not fulfill validity.
4. HOW DOES CONFORMAL PREDICTION WORK?
More info about TCP vs ICP here: https://valeman.medium.com/how-to-use-full-transductive-conformal-prediction and other CP forms here: https://mlwithouttears.com/2024/02/04/fifty-four-actually-shades-of-conformal-prediction/
4. HOW DOES CONFORMAL PREDICTION WORK?
April 2025
Validity is mathematically guaranteed if the calibration and test sets are exchangeable (a weaker requirement than the iid assumption of ML models). However:
BASIC STEPS for classification (on the calibration set):
1. Sort the predictions by a non-conformity measure.
2. If alpha is 0.05, we are setting that 95% of the calibration predictions will be considered less extreme than that threshold, Q_1-alpha.
3. Check whether each potential prediction for a new instance, e.g. one of the test set, is more or less extreme than the predictions of the calibration set.
4. For each potential prediction of a new instance, if its non-conformity score <= Q_1-alpha (i.e. the score falls within the least extreme 95% of the calibration scores), that prediction for that new instance is not strange, so add it to the prediction set.
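The classification steps above can be sketched in a few lines of scikit-learn; the dataset, the logistic-regression model, and the split sizes are illustrative assumptions, not part of the original slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data, split into train / calibration / test
X, y = make_classification(n_samples=1500, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
alpha = 0.05

# Non-conformity score on the calibration set: 1 - probability of the true class
probs_cal = model.predict_proba(X_cal)
cal_scores = 1.0 - probs_cal[np.arange(len(y_cal)), y_cal]

# Threshold: the (1 - alpha) quantile of the calibration scores
q_hat = np.quantile(cal_scores, 1 - alpha)

# Prediction set: every class whose score is not more extreme than the threshold
probs_test = model.predict_proba(X_test)
pred_sets = [set(np.where(1.0 - p <= q_hat)[0]) for p in probs_test]

coverage = np.mean([yt in s for yt, s in zip(y_test, pred_sets)])
print(f"empirical coverage: {coverage:.3f}")
```

On exchangeable data the empirical coverage comes out close to 1 - alpha, and uncertain instances get larger prediction sets.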
BASIC STEPS for regression (on the calibration set):
1. Sort the predictions by a non-conformity measure.
2. If alpha is 0.05, calculate the 95th quantile, Q_1-alpha; thus 95% of predictions are less extreme than that threshold.
3. For each new instance in the test set, calculate the prediction interval as [y_pred - Q_1-alpha, y_pred + Q_1-alpha]
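The regression steps above admit the same sketch; model and data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1200, n_features=5, noise=10.0, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
alpha = 0.05

# Non-conformity score: absolute residuals on the calibration set
cal_scores = np.abs(y_cal - model.predict(X_cal))
q_hat = np.quantile(cal_scores, 1 - alpha)

# Symmetric prediction intervals [y_pred - q_hat, y_pred + q_hat]
y_pred = model.predict(X_test)
lower, upper = y_pred - q_hat, y_pred + q_hat

coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}")
```

Note that vanilla ICP gives every instance the same interval width; adaptive variants replace the absolute residual with a normalized score.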
These were just the basic general steps of vanilla inductive conformal prediction (ICP).
Every method in the CP family modifies these steps in different ways to address different tasks:
Two free books to start with:
- one about ML in general but including CP
- one about Time Series in general but including CP
“A prerequisite for trust in machine learning is uncertainty quantification. Without it, an accurate prediction and a wild guess look the same.”
see the code example of this presentation in the repo: https://github.com/MMdeCastro/Uncertainty_Quantification_XAI
Interactive demo of Transductive CP for classification: https://cml.rhul.ac.uk/CP/index.html
Playlist of videos using regression: https://youtu.be/xZbuFKWV5NA?si=513LFry0Vr9Q3XV5
5. WHY THE PREDICTIONS WERE NOT ALREADY PROBABILITIES
The predictions of the model estimator are not probabilities unless the model estimator is calibrated by default (which rarely occurs):
Weather forecast: Rain (1), No rain (0)
Predictions: day 1: 80%, day 2: 20%, …
Observations: day 1: 1, day 2: 0, ...
A prediction of 0.8 should mean that it rained on 80 out of 100 of the days with prediction 80%:
P(y_true | prediction = p) = p
“Are you sure that's a probability?” https://kiwidamien.github.io/are-you-sure-thats-a-probability.html
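One way to check P(y_true | prediction = p) = p empirically is a reliability curve: bin the predicted probabilities and compare each bin's average prediction with the observed frequency. This sketch uses scikit-learn's calibration_curve; the dataset and model are illustrative assumptions:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

p = model.predict_proba(X_test)[:, 1]
# frac_pos: observed frequency of the positive class per bin
# mean_pred: average predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted ~{mp:.2f} -> observed {fp:.2f}")
```

A perfectly calibrated model lies on the diagonal (predicted equals observed in every bin).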
Predictions are raw scores passed through a sigmoid, softmax, etc., that normalizes them between 0 and 1 so they sum to 1; but they are not calibrated, so they are not probabilities.
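A minimal sketch of that point: softmax maps arbitrary raw scores into values between 0 and 1 that sum to 1, which makes them look like probabilities without calibrating them (the logits below are made-up numbers):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])  # raw, unbounded scores from a model
p = softmax(logits)
print(p, p.sum())
# Values between 0 and 1 that sum to 1, but nothing forces
# P(y_true | prediction = p_i) = p_i, so they are not calibrated
```

The normalization is purely a mathematical convenience; calibration is a separate empirical property.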
Example: an MLP in scikit-learn sets activation='relu', but that applies to the hidden layers; look at the output activation functions instead.
Example: a Decision Tree for Regression.
TO CALIBRATE: to match y_pred with y_true
.predict_proba(x_i) | true label: y_i
0.8 | 1
0.5 | 0
0.4 | 1
Calibration is good for performance too:
“Properties and Benefits of Calibrated Classifiers” https://link.springer.com/chapter/10.1007/978-3-540-30116-5_14
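As a sketch of how calibration is done in practice, scikit-learn's CalibratedClassifierCV wraps any estimator with Platt (sigmoid) scaling or isotonic regression; the dataset, the random-forest base model, and cv=5 are illustrative assumptions:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Uncalibrated model vs. the same model wrapped in a calibrator
raw = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
cal = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                             method="isotonic", cv=5).fit(X_tr, y_tr)

# Brier score: mean squared error between predicted probability and outcome;
# the closer to zero, the better
brier_raw = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, cal.predict_proba(X_test)[:, 1])
print(brier_raw, brier_cal)
```

Isotonic regression needs enough data to avoid overfitting the calibration map; with small calibration sets, method="sigmoid" is usually the safer choice.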
6. NON-CONFORMITY MEASURE (aka CONFORMITY MEASURE)
Manokhin’s Practical-Guide-Applied-Conformal-Prediction
6. NON-CONFORMITY MEASURE (aka CONFORMITY MEASURE)
Most popular non-conformity measures:
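For concreteness, the most common measures can be written down directly; the function names below are ours, but the formulas (1 minus the true-class probability for classification, and plain or normalized absolute residuals for regression) are the standard choices:

```python
import numpy as np

def hinge_score(probs, y_true):
    """Classification: 1 minus the probability assigned to the true class."""
    return 1.0 - probs[np.arange(len(y_true)), y_true]

def absolute_residual(y_true, y_pred):
    """Regression: absolute difference between truth and prediction."""
    return np.abs(y_true - y_pred)

def normalized_residual(y_true, y_pred, sigma):
    """Regression: residual scaled by a per-instance difficulty estimate sigma,
    which yields adaptive (instance-dependent) interval widths."""
    return np.abs(y_true - y_pred) / sigma

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(hinge_score(probs, np.array([0, 2])))                  # scores 0.3 and 0.9
print(absolute_residual(np.array([5.0]), np.array([6.5])))   # residual 1.5
```

The larger the score, the stranger (less conforming) the prediction; a conformity measure is simply the same idea with the sign flipped.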
7. VALIDITY (aka COVERAGE GUARANTEE)
.predict_proba(x_i) | true label: y_i
0.7 | 1
0.5 | 0
0.4 | 1
Validity is fulfilled if the prediction intervals contain the true label at the confidence level (1 - alpha) defined by alpha
Conditional validity, in contrast, is only empirically/heuristically attained (e.g. Mondrian CP achieves it per class); in a distribution-free setting, conditional coverage for every individual point has been theoretically proven impossible to achieve by any means
7. VALIDITY (aka COVERAGE GUARANTEE)
Conditional validity for individual predictions has been mathematically proven impossible for any distribution-free method. Any distribution-free method attempting it blows the prediction intervals up to infinite size with probability 1.
UC Berkeley Special Series on Conformal Prediction: https://youtu.be/rvYnR0FGxM4?si=IlC8AEAJ8DRgXTyx
UPDATE OF SLIDE 2: WHAT IS CONFORMAL PREDICTION?
CP is a family of algorithms that compute the probability that a prediction is “unusual”:
the stranger the prediction is, the less we trust that prediction
NOTE: CP marginal validity is mathematically proven under the data exchangeability assumption, although there are ways to empirically work around it (for instance, for time series), and recent papers showed CP validity without the exchangeability requirement (see slide 4)
8. HOW TO EVALUATE CP
Calibration: whether the predictions behave as true probabilities:
P(y_true | prediction = p) = p
measured with metrics such as the Brier score and the expected calibration error (ECE); the closer to zero, the better, for both metrics
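A minimal sketch of two standard calibration metrics, the Brier score and the expected calibration error (ECE), assuming binary labels; for both, the closer to zero, the better:

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared difference between predicted probability and outcome (0/1)."""
    return np.mean((p - y_true) ** 2)

def expected_calibration_error(y_true, p, n_bins=10):
    """Bin the predictions and compare each bin's average confidence
    with the observed frequency of the positive class, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y_true[mask].mean())
    return ece

y = np.array([1, 0, 1, 0])
p_perfect = np.array([1.0, 0.0, 1.0, 0.0])
print(brier_score(y, p_perfect), expected_calibration_error(y, p_perfect))
```

A perfectly calibrated (and sharp) predictor scores zero on both; miscalibration inflates them.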
8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE I
8. HOW TO EVALUATE CP: CALIBRATION EXAMPLE II
The prediction intervals/sets provide confidence in the predictions and are evaluated by calculating their empirical coverage (validity) and their width (efficiency):
P(y_true in PI) >= 1- alpha
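Both quantities can be checked directly on a test set: the fraction of intervals that contain the truth should be at least 1 - alpha, and among valid methods the narrower intervals are the more informative ones. The intervals below are made-up numbers for illustration:

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of test points whose true value falls inside its interval."""
    return np.mean((y_true >= lower) & (y_true <= upper))

def average_width(lower, upper):
    """Efficiency: narrower valid intervals are more informative."""
    return np.mean(upper - lower)

y_true = np.array([5.0, 7.2, 3.1, 9.0])
lower = np.array([4.0, 6.0, 3.5, 8.0])
upper = np.array([6.0, 8.0, 4.5, 10.0])
print(empirical_coverage(y_true, lower, upper))  # 0.75: one point falls outside
print(average_width(lower, upper))               # 1.75
```

For classification the same idea applies with prediction sets: coverage is the fraction of sets containing the true label, and efficiency is the average set size.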
8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE I
8. HOW TO EVALUATE CP: PREDICTION INTERVALS EXAMPLE II
“All other alternative Uncertainty Quantification methods do not have in-built validity guarantees. In the first independent comparative study of all four classes of uncertainty quantification methods, only Conformal Prediction satisfied the property of validity.” (aka coverage)
Dewolf et al. https://arxiv.org/abs/2107.00363