Conformal Prediction in Python: Trustworthy ML Intervals — ContentBuffer guide

Conformal Prediction in Python: Trustworthy ML Intervals

K
Kodetra Technologies··9 min read Intermediate

Summary

Wrap any scikit-learn model in MAPIE for prediction intervals with guaranteed coverage.

Your model predicts a house will sell for $420,000. Should the buyer trust that number to the dollar? Almost never. A single point prediction hides how unsure the model actually is, and in 2026 — with ML quietly driving pricing engines, clinical triage, and automated lending — “how confident are you?” has turned into a compliance question rather than a nice-to-have.

Conformal prediction answers it with a guarantee. Instead of one number, your model returns a range that contains the true value a chosen fraction of the time — say 90%. The surprising part: that guarantee holds for any model, with no assumption that your errors are Gaussian, and without retraining anything. This guide shows you how to bolt that guarantee onto an existing scikit-learn model using MAPIE (Model Agnostic Prediction Interval Estimator), whose redesigned v1 API is now the standard way to do conformal prediction in Python.

By the end you will build calibrated intervals for a regressor, prediction sets for a classifier, verify that the coverage promise actually holds on held-out data, and sidestep the handful of mistakes that silently break it.

What conformal prediction actually does

Most uncertainty methods make assumptions: linear regression assumes normally distributed residuals, Bayesian models assume a prior. Conformal prediction makes essentially one assumption — that your data is exchangeable (roughly, the order of rows does not matter, which holds for ordinary i.i.d. data). Given that, it converts any model’s raw outputs into intervals with a finite-sample coverage guarantee.

The mechanism is simple enough to picture. You hold out a fresh chunk of labeled data the model never trained on — the conformalization set (older papers call it the calibration set). You measure how wrong the model is on every point there; for regression that “wrongness” is the absolute residual. You then take the 90th percentile of those errors and call it q. To predict on a new point, you output the model’s estimate plus or minus q. Because the new point and the conformalization points are exchangeable, the true value lands inside that band about 90% of the time. That is the whole idea: rank the errors you have already seen, and let that ranking define how wide the future band must be.

A 60-second numeric version makes it concrete. Suppose your conformalization set has 1,000 points and you record the absolute residual for each. Sort them ascending. For 90% coverage, conformal theory tells you to pick the residual at rank ⌈(n+1)·0.9⌉ = ⌈900.9⌉ = the 901st value — a slightly conservative index that accounts for the new point being unseen. If that residual is 80, then every prediction gets a band of ±80. Push the target to 95% and you slide to the 951st residual, which is larger, so the band widens. Nothing about the model changed; you simply read a different position in the same sorted list. That is why conformal prediction is called distribution-free: it never fits a bell curve, it just counts.

Prerequisites

  • Comfort with scikit-learn’s fit / predict workflow.
  • Python 3.9+ with NumPy and scikit-learn 1.4 or newer.
  • A trained or trainable regressor or classifier you want to add intervals to.
  • Basic familiarity with train/test splitting — conformal prediction adds one more split.

Step 1 — Install MAPIE

pip install mapie scikit-learn numpy
# MAPIE v1 needs Python >=3.9, scikit-learn >=1.4, NumPy >=1.23

MAPIE’s only runtime dependencies are scikit-learn and NumPy, so it drops into an existing project without dragging in a deep-learning stack. It works with any estimator that exposes the scikit-learn interface, including XGBoost, LightGBM, and Keras wrappers.

Step 2 — Split into three sets, not two

This is the one structural change conformal prediction asks of you. A normal workflow has train and test. Conformal prediction needs a third, independent slice — the conformalization set — used only to measure errors. If you reuse training data here, the model has already seen it, the errors look artificially small, and your intervals come out too narrow. MAPIE ships a helper that produces all three at once:

import numpy as np
from sklearn.datasets import make_regression
from mapie.utils import train_conformalize_test_split

X, y = make_regression(n_samples=5000, n_features=8, noise=25.0, random_state=42)

# One call gives you THREE sets: train, conformalize (calibration), test.
(X_train, X_conf, X_test,
 y_train, y_conf, y_test) = train_conformalize_test_split(
    X, y,
    train_size=0.6, conformalize_size=0.2, test_size=0.2,
    random_state=42)

print(X_train.shape, X_conf.shape, X_test.shape)
# (3000, 8) (1000, 8) (1000, 8)

Step 3 — Wrap a regressor and get intervals

SplitConformalRegressor is the workhorse. With prefit=False it trains the base model for you on the training set, then conformalize learns the error quantile on the conformalization set. The payoff is predict_interval, which hands back both the point prediction and the lower/upper bounds.

from sklearn.ensemble import RandomForestRegressor
from mapie.regression import SplitConformalRegressor

scr = SplitConformalRegressor(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    confidence_level=0.9,   # we want 90% coverage
    prefit=False)           # MAPIE will fit the model for us

scr.fit(X_train, y_train)            # train the base model
scr.conformalize(X_conf, y_conf)     # learn the conformity quantile

# predict_interval returns the point prediction AND the interval
y_pred, y_int = scr.predict_interval(X_test)

print(y_pred.shape, y_int.shape)
# (1000,) (1000, 2, 1)   ->  [lower, upper] per sample, per confidence level

for i in range(3):
    lo, hi = y_int[i, 0, 0], y_int[i, 1, 0]
    print(f"pred={y_pred[i]:8.2f}   90% interval=[{lo:8.2f}, {hi:8.2f}]")
# pred= -153.13   90% interval=[ -234.01,  -72.24]
# pred=    4.46   90% interval=[  -76.42,   85.35]
# pred= -123.88   90% interval=[ -204.77,   -43.00]

The returned y_int has shape (n_samples, 2, n_levels): index 0 is the lower bound, index 1 the upper, and the last axis lets you request several confidence levels at once. Notice the intervals are not symmetric mirror images of a textbook formula — their width comes directly from the empirical error distribution of your model on your data.

Step 4 — Verify the guarantee actually holds

A coverage promise you never check is just a hope. The honest test is to compute empirical coverage on the test set: the fraction of true values that actually fell inside their intervals. It should sit right around your target.

from mapie.metrics.regression import (
    regression_coverage_score, regression_mean_width_score)

coverage = regression_coverage_score(y_test, y_int)[0]
width    = regression_mean_width_score(y_int)[0]

print(f"empirical coverage: {coverage:.3f}")   # empirical coverage: 0.910
print(f"mean interval width: {width:.2f}")      # mean interval width: 161.77

Empirical coverage of 0.91 against a 0.90 target is exactly what you want to see — the small overshoot is normal sampling noise. Always report coverage and mean width together. Coverage tells you the intervals are honest; width tells you whether they are useful. An interval from negative infinity to infinity has perfect coverage and zero value. The art is getting the width as tight as possible while holding coverage, which is mostly a matter of using a better base model.

Choosing the confidence level

confidence_level is the one knob you set deliberately, and it is a business decision more than a statistical one. Higher coverage means wider, safer intervals; lower coverage means tighter, riskier ones. The relationship is not linear — pushing from 90% to 99% can double an interval’s width because you are reaching deep into the tail of the error distribution where a few large residuals dominate. Pick the level from the cost of being wrong: a 99% interval for a medical dosage, perhaps 80% for an internal demand forecast where a human reviews outliers anyway. You can also request several levels in one call by passing a list to confidence_level, which is handy for showing a tight inner band and a cautious outer band on the same chart.

Step 5 — Prediction sets for classification

For classification the analogue of an interval is a prediction set: a set of labels guaranteed to contain the true class with your chosen probability. When the model is confident the set holds one label; when it is unsure the set grows to two or three. That is genuinely useful — a triage system can auto-resolve singletons and route ambiguous multi-label cases to a human.

Here we pass an already-trained classifier with prefit=True and use the "aps" (Adaptive Prediction Sets) score, which produces sets whose size adapts to local difficulty:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from mapie.classification import SplitConformalClassifier
from mapie.metrics.classification import classification_coverage_score

X, y = make_classification(n_samples=6000, n_features=10,
                           n_informative=6, n_classes=4, random_state=42)
(X_tr, X_conf, X_test,
 y_tr, y_conf, y_test) = train_conformalize_test_split(
    X, y, train_size=0.6, conformalize_size=0.2, test_size=0.2, random_state=42)

# Train your classifier however you like, then pass it in prefit.
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

scc = SplitConformalClassifier(
    estimator=clf,
    confidence_level=0.9,
    conformity_score="aps",   # Adaptive Prediction Sets
    prefit=True)
scc.conformalize(X_conf, y_conf)

y_pred, y_set = scc.predict_set(X_test)
# y_set is a boolean array: (n_samples, n_classes, n_levels)

for i in range(3):
    classes_in_set = np.where(y_set[i, :, 0])[0].tolist()
    print(f"sample {i}: predicted classes = {classes_in_set}")
# sample 0: predicted classes = [2, 3]
# sample 1: predicted classes = [0, 1]
# sample 2: predicted classes = [3]

coverage = classification_coverage_score(y_test, y_set)[0]
avg_size = y_set[:, :, 0].sum(axis=1).mean()
print(f"coverage={coverage:.3f}  avg set size={avg_size:.2f}")
# coverage=0.978  avg set size=1.91

Sample 2 gets a confident singleton, while samples 0 and 1 return two candidates each. The average set size of 1.91 is the classification version of interval width: smaller is more informative, as long as coverage stays at target. Here coverage is 0.978 — comfortably above 0.90, which APS can produce because it errs toward slightly larger sets to stay safe.

Common pitfalls

Conformalizing on training data. The single most common mistake. If the conformalization set overlaps the training set, residuals are too optimistic and intervals come out too tight — coverage will quietly fall below target in production. Keep the three splits strictly disjoint.

Breaking exchangeability. The guarantee assumes the conformalization and test data come from the same distribution. Time series violate this directly: tomorrow does not look like yesterday. For temporal data use TimeSeriesRegressor with a method such as EnbPI rather than the plain split estimator, or coverage will drift as the series evolves.

A conformalization set that is too small. The error quantile is itself estimated from data, so a 90% interval built from 50 points is noisy. Aim for at least a few hundred conformalization examples; a thousand is comfortable. Below that, expect coverage to wobble run to run.

Reading coverage on the wrong split. Measure coverage on the test set, never on the conformalization set — the quantile was tuned to that set, so its coverage there is meaningless by construction.

Confusing wide intervals with a MAPIE bug. If your bands are huge, MAPIE is doing its job: it is honestly reporting that your base model is uncertain. The fix is a better model or more features, not a different conformal method.

Quick reference

TaskClassKey methodOutput
Regression intervalsSplitConformalRegressorpredict_intervalpoint + [lower, upper]
Classification setsSplitConformalClassifierpredict_setpoint + boolean label set
Quantile-based intervalsConformalizedQuantileRegressorpredict_intervalasymmetric intervals
Time seriesTimeSeriesRegressorpredict_intervaladaptive intervals (EnbPI)
Make the 3-way splittrain_conformalize_test_splittrain / conf / test sets
Check regression coverageregression_coverage_scoreempirical coverage 0–1
Check classification coverageclassification_coverage_scoreempirical coverage 0–1

Next steps

You now have the core loop: split three ways, fit, conformalize, predict, and verify coverage. From here, three upgrades pay off quickly. Swap SplitConformalRegressor for ConformalizedQuantileRegressor when your noise grows with the prediction — it produces intervals that widen exactly where the data is noisier instead of a constant band. Reach for CrossConformalRegressor when data is scarce and you cannot spare a dedicated conformalization slice; it reuses every point through cross-validation. And if you serve groups with different error profiles — regions, customer tiers — look at MAPIE’s Mondrian wrappers, which guarantee coverage within each group rather than only on average. Start by adding predict_interval to one model you already ship, watch the coverage number, and let an honest measure of uncertainty replace a falsely precise point estimate.

Comments

Subscribe to join the conversation...

Be the first to comment