Model Calibration: Temperature Scaling, Platt Scaling, and ECE in Practice

A model that outputs 0.9 confidence on a prediction should be correct about 90% of the time. In practice, most neural networks are not calibrated this way — they are overconfident, assigning high probabilities to predictions that are wrong far more often than the confidence suggests. This matters for any application where downstream decisions are made based on model confidence: anomaly detection thresholds, human-in-the-loop review queues, risk scoring, and uncertainty-aware pipelines all depend on probabilities that actually reflect likelihood. This article covers how to measure calibration with Expected Calibration Error (ECE) and reliability diagrams, and how to fix poor calibration with temperature scaling and Platt scaling.

Why Neural Networks Are Overconfident

Modern deep networks trained with cross-entropy loss are incentivised to push softmax outputs toward 0 and 1, not toward the true class frequency. Cross-entropy loss rewards models that assign very high probability to the correct class — there is no explicit penalty for being overconfident on examples the model gets right. Large models with high capacity and batch normalisation are particularly prone to overconfidence because they can fit training data so well that the logit magnitudes grow large, which pushes softmax outputs toward the extremes. The problem was systematically documented by Guo et al. (2017) who showed that modern ResNets are significantly worse-calibrated than older, shallower networks, despite being more accurate.

Measuring Calibration: ECE and Reliability Diagrams

Expected Calibration Error (ECE) quantifies the gap between predicted confidence and actual accuracy. The calculation bins predictions by confidence, computes the accuracy within each bin, and averages the absolute gaps weighted by bin size.
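
Written out, the calculation described above looks like this. The notation is introduced here for illustration: N predictions are split into M confidence bins B_1, ..., B_M, acc(B_m) is the fraction of correct predictions in bin m, and conf(B_m) is the mean predicted confidence in that bin.

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{N}\,
  \bigl|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\bigr|
```

A perfectly calibrated model has acc(B_m) = conf(B_m) in every bin, so each term vanishes and ECE is 0.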

import numpy as np
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def compute_ece(probs: np.ndarray, labels: np.ndarray,
                n_bins: int = 15) -> float:
    """
    Expected Calibration Error.
    probs:  (N,) — max predicted probability (confidence)
    labels: (N,) — 1 if prediction was correct, 0 otherwise
    """
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bin_boundaries[i], bin_boundaries[i + 1]
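        # Half-open bins [lo, hi), except the last, which also includes 1.0
        upper = (probs <= hi) if i == n_bins - 1 else (probs < hi)
        mask = (probs >= lo) & upper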
        if mask.sum() == 0:
            continue
        bin_confidence = probs[mask].mean()
        bin_accuracy   = labels[mask].mean()
        bin_weight     = mask.sum() / len(probs)
        ece += bin_weight * abs(bin_accuracy - bin_confidence)
    return float(ece)

def reliability_diagram(probs: np.ndarray, labels: np.ndarray,
                         n_bins: int = 15, title: str = "Reliability Diagram"):
    """Plot confidence vs accuracy with gap shading."""
    bin_boundaries = np.linspace(0, 1, n_bins + 1)
    bin_midpoints, bin_accs, bin_confs, bin_counts = [], [], [], []

    for i in range(n_bins):
        lo, hi = bin_boundaries[i], bin_boundaries[i + 1]
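        # Half-open bins [lo, hi), except the last, which also includes 1.0
        upper = (probs <= hi) if i == n_bins - 1 else (probs < hi)
        mask = (probs >= lo) & upper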
        if mask.sum() == 0:
            continue
        bin_midpoints.append((lo + hi) / 2)
        bin_accs.append(labels[mask].mean())
        bin_confs.append(probs[mask].mean())
        bin_counts.append(mask.sum())

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax.bar(bin_midpoints, bin_accs, width=1/n_bins, alpha=0.7,
           label='Accuracy', color='steelblue', align='center')
    ax.bar(bin_midpoints, bin_confs, width=1/n_bins, alpha=0.3,
           label='Gap (overconfidence)', color='red', align='center')
    ece = compute_ece(probs, labels, n_bins)
    ax.set_title(f"{title} | ECE = {ece:.4f}")
    ax.set_xlabel("Confidence"); ax.set_ylabel("Accuracy")
    ax.legend(); plt.tight_layout()
    return ece

# Usage with a PyTorch model
def evaluate_calibration(model, dataloader, device='cuda'):
    model.eval()
    all_probs, all_correct = [], []
    with torch.no_grad():
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            probs = F.softmax(logits, dim=-1)
            confidence, predicted = probs.max(dim=-1)
            correct = (predicted == targets).float()
            all_probs.extend(confidence.cpu().numpy())
            all_correct.extend(correct.cpu().numpy())
    probs_arr   = np.array(all_probs)
    correct_arr = np.array(all_correct)
    ece = reliability_diagram(probs_arr, correct_arr)
    print(f"ECE: {ece:.4f}")
    return ece

A well-calibrated model produces a reliability diagram where the bars closely follow the diagonal. Overconfident models show bars that are shorter than the diagonal — the model says 0.9 but is right only 70% of the time in that bin. Underconfident models show bars above the diagonal — the model says 0.6 but is actually right 80% of the time.
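
To make the two failure modes concrete, here is a small NumPy-only sketch that reproduces the numbers from the paragraph above on toy data. The helper name `ece_sketch` and the toy arrays are made up for illustration; the binning uses `np.digitize` so that a confidence of exactly 1.0 falls into the last bin.

```python
import numpy as np

def ece_sketch(confidence, correct, n_bins=10):
    """Binned ECE; the last bin is closed on the right so 1.0 is counted."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Using interior edges only, bin index i covers [edges[i], edges[i+1])
    bin_ids = np.digitize(confidence, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's weight
    return float(ece)

# Toy data: ten predictions at 0.9 confidence, seven correct (overconfident)
overconfident = ece_sketch([0.9] * 10, [1] * 7 + [0] * 3)
# Toy data: ten predictions at 0.6 confidence, eight correct (underconfident)
underconfident = ece_sketch([0.6] * 10, [1] * 8 + [0] * 2)
print(f"{overconfident:.3f} {underconfident:.3f}")
```

Both toy cases give an ECE of about 0.2. ECE takes the absolute gap, so it reports how far off the confidence is but not in which direction; the reliability diagram is what tells you whether the bars sit below or above the diagonal.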

Temperature Scaling

Temperature scaling is the simplest post-hoc calibration method and, for classification networks, often the most effective. It learns a single scalar parameter T (the temperature) on a held-out validation set, then divides all logits by T before applying softmax at inference time. T > 1 softens the distribution (reduces overconfidence); T < 1 sharpens it. Because dividing every logit by the same positive scalar never changes which logit is largest, temperature scaling cannot change the model's accuracy: it rescales the confidence values, not the ranking of predictions.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class TemperatureScaling(nn.Module):
    """Post-hoc temperature scaling calibration.
    
    Wraps an existing model and learns a single temperature parameter
    on a validation set. Does not change model accuracy — only calibration.
    """
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model
        # Initialise T=1 (no change) — will be optimised on val set
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.model(x)
        return self.scaled_logits(logits)

    def scaled_logits(self, logits: torch.Tensor) -> torch.Tensor:
        return logits / self.temperature.clamp(min=0.05)  # avoid T -> 0

    def fit(self, val_loader, device='cuda', max_iter=50, lr=0.01):
        """Optimise temperature on validation set using NLL loss."""
        self.model.eval()
        self.to(device)

        # Collect all val logits and labels first (no grad needed for base model)
        logits_list, labels_list = [], []
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs = inputs.to(device)
                logits_list.append(self.model(inputs).cpu())
                labels_list.append(targets)
        all_logits = torch.cat(logits_list)
        all_labels = torch.cat(labels_list)

        # Optimise temperature on cached logits
        optimizer = optim.LBFGS([self.temperature], lr=lr, max_iter=max_iter)
        nll_criterion = nn.CrossEntropyLoss()

        def eval_fn():
            optimizer.zero_grad()
            scaled = all_logits.to(device) / self.temperature.clamp(min=0.05)
            loss = nll_criterion(scaled, all_labels.to(device))
            loss.backward()
            return loss

        optimizer.step(eval_fn)
        print(f"Optimal temperature: {self.temperature.item():.4f}")
        return self

# Usage
base_model = ...  # your trained model
calibrated = TemperatureScaling(base_model).fit(val_loader)

# At inference — just use calibrated instead of base_model
with torch.no_grad():
    logits = calibrated(inputs)
    probs = F.softmax(logits, dim=-1)

LBFGS converges faster than SGD for this single-parameter optimisation. A typical optimal temperature for a modern overconfident model is between 1.5 and 3.0. If your optimal T is below 1.0, the model is underconfident, which is less common but can happen with models trained with heavy dropout or label smoothing.
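
The effect of T on the output distribution can be checked directly without a model. The following is a minimal NumPy sketch; the logits and the three temperatures are arbitrary illustrative values, not fitted ones.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Two hypothetical examples with three classes each
logits = np.array([[4.0, 1.0, 0.5],
                   [0.2, 2.5, 2.0]])

for T in (0.5, 1.0, 2.0):  # illustrative temperatures
    probs = softmax(logits / T)
    print(f"T={T}: top-1 confidence {probs.max(axis=1).round(3)}, "
          f"predicted class {probs.argmax(axis=1)}")
```

The predicted class is the same at every temperature, while the top-1 confidence falls as T increases. This is why temperature scaling changes calibration but leaves accuracy untouched.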

Platt Scaling for Binary Classification

Platt scaling fits a logistic regression on top of the model's raw output scores to learn a calibrated probability mapping. It is more flexible than temperature scaling (it learns two parameters: a scale and a bias) but only works naturally for binary classification. For multiclass problems, one-vs-rest Platt scaling or temperature scaling is preferred.

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
import numpy as np

def platt_scaling(val_scores: np.ndarray, val_labels: np.ndarray,
                  test_scores: np.ndarray) -> np.ndarray:
    """
    Fit Platt scaling on validation scores and apply to test scores.
    val_scores:  (N,) raw model scores (logits or probabilities)
    val_labels:  (N,) binary labels {0, 1}
    test_scores: (M,) raw model scores to calibrate
    Returns:     (M,) calibrated probabilities
    """
    lr = LogisticRegression(C=1e10)  # high C = minimal regularisation
    lr.fit(val_scores.reshape(-1, 1), val_labels)
    calibrated = lr.predict_proba(test_scores.reshape(-1, 1))[:, 1]
    return calibrated

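# Compare calibration before and after.
# brier_score_loss expects probabilities in [0, 1]; if val_scores are logits,
# pass them through a sigmoid first. Scoring on the data used to fit the
# calibrator is optimistic, so prefer a separate held-out split when available.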
from sklearn.metrics import brier_score_loss

raw_brier    = brier_score_loss(val_labels, val_scores)
platt_probs  = platt_scaling(val_scores, val_labels, val_scores)
platt_brier  = brier_score_loss(val_labels, platt_probs)
print(f"Brier score — raw: {raw_brier:.4f} | Platt: {platt_brier:.4f}")
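
To show what the two learned parameters do, here is a self-contained NumPy sketch that fits p = sigmoid(a * score + b) directly by Newton's method instead of calling scikit-learn. Everything in it is an assumption made for illustration: the helper names (`fit_platt`, `brier`), the synthetic data, and the choice of optimiser. It fits on one split and evaluates on a disjoint one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_platt(scores, labels, n_iter=25):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log loss."""
    X = np.column_stack([scores, np.ones_like(scores)])  # columns: score, bias
    w = np.zeros(2)                                       # w = [a, b]
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - labels)
        # Tiny ridge term keeps the 2x2 Hessian invertible
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-8 * np.eye(2)
        w -= np.linalg.solve(hess, grad)
    return w

def brier(probs, labels):
    return float(np.mean((probs - labels) ** 2))

# Synthetic data (assumption): the true log-odds are z, but the "model"
# reports an inflated, shifted score 3z + 1, i.e. it is overconfident.
rng = np.random.default_rng(0)
z = rng.normal(size=6000)
labels = (rng.random(6000) < sigmoid(z)).astype(float)
scores = 3.0 * z + 1.0

fit_idx, eval_idx = slice(0, 3000), slice(3000, 6000)  # disjoint splits
a, b = fit_platt(scores[fit_idx], labels[fit_idx])

raw_probs = sigmoid(scores[eval_idx])
cal_probs = sigmoid(a * scores[eval_idx] + b)
print(f"a={a:.3f}, b={b:.3f}")
print(f"Brier raw={brier(raw_probs, labels[eval_idx]):.4f}, "
      f"Platt={brier(cal_probs, labels[eval_idx]):.4f}")
```

With this construction the ideal mapping is a = 1/3 and b = -1/3, so the fitted values should land near those. The scale a undoes the inflated logit magnitude and the bias b undoes the shift, which is the extra flexibility Platt scaling has over a single temperature.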

Calibration for LLMs and Generative Models

Calibration for LLMs is more complex than for classifiers because the output is a sequence rather than a single probability. Two practical approaches are used in production. The first is token-level calibration: measure whether the model's per-token probabilities are well-calibrated on a benchmark (if the model assigns 0.8 probability to a token, is it correct 80% of the time?). Adjusting the sampling temperature at decoding time applies the same logit rescaling as temperature scaling, but it only acts as calibration if the temperature is fitted on held-out data rather than chosen for output diversity. The second approach is answer-level calibration for multiple-choice or extractive QA tasks: extract the model's probability for each candidate answer and measure whether those probabilities match empirical accuracy.

def llm_answer_calibration(model, tokenizer, questions: list[str],
                            choices: list[list[str]],
                            correct_indices: list[int],
                            n_bins: int = 10):
    """
    Measure calibration of an LLM on multiple-choice questions.
    Computes probability of each choice via log-likelihood scoring.
    """
    all_confs, all_correct = [], []
    model.eval()

    for question, opts, correct_idx in zip(questions, choices, correct_indices):
        log_probs = []
        for option in opts:
            prompt = f"{question}\nAnswer: {option}"
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                outputs = model(**inputs, labels=inputs["input_ids"])
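            # outputs.loss is the mean NLL over the predicted tokens (sequence
            # length minus one, since the first token has no prediction target),
            # so multiplying by that count recovers the sequence log-probability.
            n_predicted = inputs["input_ids"].shape[1] - 1
            log_probs.append(-outputs.loss.item() * n_predicted)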

        # Softmax over log-probs to get answer probabilities
        log_probs_t = torch.tensor(log_probs)
        probs = F.softmax(log_probs_t, dim=0).numpy()
        predicted = np.argmax(probs)
        all_confs.append(probs[predicted])
        all_correct.append(float(predicted == correct_idx))

    ece = compute_ece(np.array(all_confs), np.array(all_correct), n_bins)
    print(f"LLM calibration ECE: {ece:.4f}")
    return ece

When Calibration Matters Most in Production

Calibration is not important for every ML application. If your system uses a model purely for ranking — returning the top-k items ordered by score — calibration is irrelevant because ranking is preserved under monotonic transformations of the score, and temperature scaling is monotonic. Calibration becomes critical when: (1) you threshold on confidence to decide whether to take an action (only act if model confidence > 0.85), (2) you weight predictions by confidence in a downstream calculation (weighted average of model scores), (3) you use model confidence to route examples to a human reviewer (low-confidence predictions go to humans), or (4) you combine predictions from multiple models where miscalibrated probabilities would distort the combination.

In practice, the most common production scenario where engineers discover their model is miscalibrated is when they set a confidence threshold for a human-in-the-loop system. They set a threshold of 0.9 expecting 90% precision on auto-approved items, but actually get 75% precision because the model's 0.9 confidence predictions are only correct 75% of the time. Temperature scaling takes about ten minutes to implement and run, and it typically closes most of this gap for classification models.
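
Checking for this gap before deployment is a few lines of NumPy. The sketch below is illustrative: the function name `precision_above_threshold` and the eight validation predictions are made up, chosen so that the numbers match the 0.9 threshold and 75% precision scenario described above.

```python
import numpy as np

def precision_above_threshold(confidence, correct, threshold):
    """Return (precision, coverage) for predictions with confidence >= threshold."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    approved = confidence >= threshold
    if not approved.any():
        return float("nan"), 0.0
    return float(correct[approved].mean()), float(approved.mean())

# Hypothetical validation results: the model reports >= 0.9 on four of eight items
confidence = np.array([0.97, 0.95, 0.93, 0.91, 0.80, 0.75, 0.60, 0.55])
correct    = np.array([1,    1,    0,    1,    1,    0,    1,    0])

precision, coverage = precision_above_threshold(confidence, correct, 0.9)
print(f"precision={precision:.2f}, coverage={coverage:.2f}")
# prints: precision=0.75, coverage=0.50
```

Running the same check on calibrated probabilities shows whether the threshold now delivers the precision its value implies, and how much coverage that costs.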

Isotonic Regression and Histogram Binning

Temperature scaling and Platt scaling are parametric: they fit a specific functional form to the calibration mapping. Isotonic regression and histogram binning are non-parametric alternatives. Isotonic regression can fit any monotone mapping, including shapes that a single temperature or a sigmoid cannot capture. Histogram binning does not enforce monotonicity, so it can also correct non-monotonic miscalibration.

from sklearn.isotonic import IsotonicRegression
from sklearn.calibration import CalibratedClassifierCV
import numpy as np

def isotonic_calibration(val_probs: np.ndarray, val_labels: np.ndarray,
                          test_probs: np.ndarray) -> np.ndarray:
    """
    Non-parametric isotonic regression calibration.
    Fits a monotone step function mapping raw probs to calibrated probs.
    Works well when the miscalibration is monotone but not sigmoid-shaped.
    Requires more validation data than temperature scaling (thousands of samples).
    """
    ir = IsotonicRegression(out_of_bounds='clip')
    ir.fit(val_probs, val_labels)
    return ir.transform(test_probs)

# Histogram binning: simpler, less flexible
def histogram_binning(val_probs, val_labels, test_probs, n_bins=15):
    """Map each bin's average confidence to its empirical accuracy."""
    bins = np.linspace(0, 1, n_bins + 1)
    calibrated = np.zeros_like(test_probs)
    for i in range(n_bins):
        lo, hi = bins[i], bins[i+1]
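        # Half-open bins [lo, hi), except the last, which also includes 1.0
        last = i == n_bins - 1
        mask_val = (val_probs >= lo) & ((val_probs <= hi) if last else (val_probs < hi))
        mask_test = (test_probs >= lo) & ((test_probs <= hi) if last else (test_probs < hi))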
        if mask_val.sum() > 0:
            calibrated[mask_test] = val_labels[mask_val].mean()
        else:
            calibrated[mask_test] = (lo + hi) / 2  # fallback to midpoint
    return calibrated

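Isotonic regression requires more validation data than temperature scaling to avoid overfitting the calibration mapping. As a rule of thumb, use temperature scaling with fewer than 1,000 validation examples and consider isotonic regression only with 5,000 or more. The benefit of isotonic regression is that it can fit any monotone miscalibration shape, not just the one-parameter family that temperature scaling allows. Because the fitted mapping is constrained to be non-decreasing, it cannot correct U-shaped or otherwise non-monotonic patterns, which occasionally appear in models trained with unusual loss functions or on imbalanced data. Histogram binning treats each bin independently, so it can.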

Calibration After Fine-Tuning

A commonly overlooked issue is that calibration needs to be re-evaluated after every fine-tuning step, not just after initial training. Fine-tuning a pretrained model on a new domain or task almost always changes the model's calibration, often making it worse: the model learns to be more confident on the new task's distribution, and a temperature fitted before fine-tuning no longer matches the new logits. The correct workflow is: (1) hold out a calibration set from the fine-tuning domain, (2) fine-tune the model, (3) fit temperature scaling on the held-out set, (4) evaluate ECE before and after. This calibration step adds minimal overhead, since temperature scaling on a few thousand examples takes seconds, and is worth making a standard part of every model release checklist. For models deployed with a confidence threshold, recalibrating after fine-tuning and adjusting the threshold to maintain the target precision is the difference between a system that works as intended and one that silently degrades.

Brier Score: A Calibration-Sensitive Metric Worth Tracking

ECE is useful for visualising calibration across confidence bins, but it has a known sensitivity to the number of bins and how they are defined. The Brier score is a bin-free alternative that measures calibration and resolution simultaneously. For binary classification, the Brier score is the mean squared error between predicted probabilities and true labels: a perfect model scores 0, a model that always predicts 0.5 scores 0.25, and a model that confidently predicts the wrong class scores up to 1.0. Unlike accuracy, the Brier score rewards well-calibrated confidence. Of two models that are both right 70% of the time, the one reporting 0.7 confidence scores better (0.21) than the one reporting 0.99 (about 0.29), because the squared penalty on the confidently wrong 30% outweighs the small gain on the 70% it gets right.
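
The reference values for the Brier score are easy to verify by hand. Below is a minimal NumPy sketch on a toy binary dataset in which the positive class occurs 70% of the time. The helper `brier_score` and the data are illustrative, and the constant predictions are chosen only to make the arithmetic transparent.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

labels = np.array([1] * 7 + [0] * 3)  # positive class in 7 of 10 cases

perfect      = brier_score(labels, labels)             # 0.0
always_half  = brier_score(np.full(10, 0.5), labels)   # 0.25
calibrated   = brier_score(np.full(10, 0.7), labels)   # 0.7*0.09 + 0.3*0.49 = 0.21
overconf     = brier_score(np.full(10, 0.99), labels)  # about 0.294
always_wrong = brier_score(1 - labels, labels)         # 1.0

print(perfect, always_half, round(calibrated, 4), round(overconf, 4), always_wrong)
```

The two constant predictors at 0.7 and 0.99 have identical accuracy (both always predict the positive class), yet the Brier score separates them, which is the property that makes it useful next to accuracy.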

In production monitoring, tracking Brier score alongside accuracy gives you a signal for when the model's confidence distribution is drifting even if its accuracy is stable. A model whose accuracy holds steady at 85% but whose Brier score rises from 0.10 to 0.16 over several weeks is becoming less calibrated, most often more overconfident: its probabilities are moving toward the extremes even though its predicted labels are just as accurate. This is a common pattern when a model is being served on distribution-shifted data and is worth catching before you start trusting the model's confidence values for threshold-based decisions.

Multi-Class Calibration with Classwise ECE

Standard ECE measures calibration of the max-confidence (top-label) prediction only, which can miss per-class miscalibration. Classwise ECE computes ECE separately for each class, treating each as a one-vs-rest problem, and averages the results, which gives a more complete picture when class frequencies are imbalanced or when some classes are systematically better calibrated than others.

def classwise_ece(probs: np.ndarray, labels: np.ndarray,
                  n_bins: int = 15) -> dict:
    """
    Compute ECE per class and macro-average.
    probs:  (N, C) predicted probabilities for all classes
    labels: (N,)  true class indices
    """
    n_classes = probs.shape[1]
    class_eces = {}
    for c in range(n_classes):
        # One-vs-rest: is the model calibrated for class c specifically?
        class_probs   = probs[:, c]
        class_correct = (labels == c).astype(float)
        class_eces[c] = compute_ece(class_probs, class_correct, n_bins)

    macro_ece = np.mean(list(class_eces.values()))
    return {"per_class": class_eces, "macro_ece": macro_ece}

In imbalanced datasets, standard ECE can look good because it is dominated by the majority class, while minority classes remain severely miscalibrated. Classwise ECE surfaces this — if class 0 (majority) has ECE=0.02 but class 3 (rare anomaly) has ECE=0.31, the system cannot reliably use class 3 confidence as a threshold signal even if overall ECE looks acceptable.

Calibration in the Context of Label Smoothing

Label smoothing, commonly used to reduce overconfidence during training, has a complex relationship with calibration. It works by replacing hard 0/1 targets with soft targets (1 - ε for the correct class and ε/(C-1) for each of the others, so ε = 0.1 gives 0.9 for the correct class), which prevents the model from pushing logits to infinity. In practice, label smoothing does improve calibration compared to standard cross-entropy training, but it does not eliminate the need for post-hoc calibration, and it interacts with temperature scaling in a non-obvious way. Models trained with label smoothing tend to have lower optimal temperatures (sometimes below 1.0), meaning they are sometimes underconfident rather than overconfident. This can surprise teams who apply temperature scaling and find that the optimal T is 0.7 rather than the expected >1.0. The correct response is to apply whatever T the optimisation finds, regardless of direction, since both over- and underconfidence degrade threshold-based decision systems.
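
As a concrete picture of the soft targets, here is a short NumPy sketch that builds them with the ε/(C-1) convention used in the paragraph above. The function name `smooth_labels` is hypothetical. Note that some implementations instead spread ε uniformly over all C classes, including the correct one, which gives slightly different values.

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Soft targets: 1 - eps on the true class, eps / (n_classes - 1) elsewhere."""
    labels = np.asarray(labels)
    targets = np.full((len(labels), n_classes), eps / (n_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Two examples with true classes 0 and 2, three classes in total
targets = smooth_labels(np.array([0, 2]), n_classes=3, eps=0.1)
print(targets)
```

With ε = 0.1 and three classes, the true class receives 0.9 and each other class 0.05. Since the target is never exactly 1, the loss stops rewarding ever-larger logit gaps, which is the mechanism that limits overconfidence.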

The practical takeaway is simple: regardless of whether you use label smoothing, dropout, or any other training regularisation that affects confidence, always measure ECE on a held-out validation set before deploying a model that will be used with confidence thresholds. The cost is low, the information is high-value, and the alternative — discovering post-deployment that your 90% confidence threshold is actually 70% precision — is expensive to fix.

Recommended Calibration Workflow

The workflow that covers most production cases is: train your model normally, hold out 5–10% of your training distribution as a calibration set (separate from your test set), compute ECE and a reliability diagram on that set, apply temperature scaling if ECE is above 0.05, re-evaluate ECE and the Brier score to confirm improvement, and record the optimal temperature as a model artifact alongside the weights. At inference time, divide logits by the stored temperature before applying softmax. This adds a single scalar division to the inference path, which is effectively zero overhead, and typically cuts ECE by 50–80% for overconfident classification models.

For models served via an API where you control the output layer, this calibration can be applied server-side transparently to all consumers. For embedded or edge models where you cannot modify the inference graph, temperature scaling can be applied as a post-processing step on the output probabilities: if T is the optimal temperature found during calibration, the calibrated probability for class c is proportional to p_c^(1/T), where p_c is the raw softmax output. This holds because p_c is proportional to exp(z_c), so exp(z_c / T) is proportional to p_c^(1/T); renormalising the powered probabilities so that they sum to 1 recovers exactly softmax(z / T).
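
The post-processing route can be sketched in a few lines of NumPy. The function name `rescale_probs`, the example logits, and the temperature of 2.0 are illustrative assumptions. The code raises the probabilities to the power 1/T in log space and renormalises, then checks the result against scaling the logits directly.

```python
import numpy as np

def softmax(logits, axis=-1):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def rescale_probs(probs, temperature, eps=1e-12):
    """Temperature-scale softmax outputs when the logits are not accessible.

    Computes p_c^(1/T) / sum_k p_k^(1/T) via log-probabilities.
    """
    log_p = np.log(np.clip(probs, eps, 1.0)) / temperature
    return softmax(log_p)  # exponentiate and renormalise

logits = np.array([[4.0, 1.0, 0.5],
                   [0.2, 2.5, 2.0]])
T = 2.0  # illustrative; in practice use the temperature stored with the model

from_logits = softmax(logits / T)               # usual route: scale the logits
from_probs = rescale_probs(softmax(logits), T)  # post-processing route
print(np.allclose(from_logits, from_probs))     # prints: True
```

The `eps` clip only guards against log(0). If the raw probabilities have been rounded or truncated to exactly 0 upstream, the information about those classes is lost and the two routes will no longer agree exactly.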
