Label Smoothing: When It Helps and When It Hurts

Label smoothing is one of those training tricks that appears in virtually every modern image classification and NLP paper but is poorly explained in most tutorials. The standard explanation — “it prevents overconfidence by softening hard labels” — is correct but tells you nothing about when it actually improves performance, when it actively hurts it, and what value of smoothing to use. This article covers what label smoothing does mechanically, why it works when it works, the specific situations where it backfires, and how to implement and tune it correctly in PyTorch.

What Label Smoothing Does Mechanically

In standard cross-entropy training with one-hot labels, the loss function pushes the model to assign probability 1.0 to the correct class and 0.0 to all other classes. For a 10-class problem, the target distribution for a sample with class 3 is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]. Cross-entropy loss is minimised at exactly this point, which means the model is rewarded for becoming arbitrarily confident — driving the logit for class 3 to +∞ relative to the others. In practice, this pushes the model towards large logit magnitudes and produces probability distributions that are more peaked than the true data-generating distribution, which causes miscalibrated confidence scores and sometimes overfitting.

Label smoothing replaces the one-hot target with a softer distribution. In the convention used by PyTorch and the original Inception-v3 paper, every class receives ε/K and the correct class receives (1 - ε) + ε/K, where ε is the smoothing parameter (typically 0.1) and K is the number of classes; you will also see a variant that puts ε/(K-1) on each incorrect class and exactly 1 - ε on the correct one, which differs only slightly. For the same 10-class example with ε=0.1, the target becomes [0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]. The loss is now minimised when the model assigns roughly 0.9 to the correct class rather than 1.0, which prevents the logits from growing without bound and produces more moderate, better-calibrated confidence values. The effective change to the loss is a small additive penalty: the smoothed cross-entropy decomposes into (1 - ε) times the standard cross-entropy plus ε times the cross-entropy against a uniform distribution, a term that penalises predictions far from uniform. In other words, label smoothing implicitly regularises the model towards distributional uncertainty.
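
That decomposition is easy to verify numerically. The snippet below is a standalone check with made-up tensors; it compares PyTorch's built-in smoothed loss against the two-term form described above.

import torch
import torch.nn.functional as F

K, eps = 10, 0.1
logits = torch.randn(4, K)               # illustrative batch of raw scores
targets = torch.randint(0, K, (4,))

# Built-in smoothed cross-entropy
smoothed = F.cross_entropy(logits, targets, label_smoothing=eps)

# (1 - eps) * hard-label CE  +  eps * CE against the uniform distribution
hard_ce = F.cross_entropy(logits, targets)
uniform_ce = -F.log_softmax(logits, dim=-1).mean(dim=-1).mean()
recombined = (1 - eps) * hard_ce + eps * uniform_ce
print(smoothed.item(), recombined.item())  # the two values should match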

Implementing Label Smoothing in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch's built-in label smoothing (added in 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Usage in a training loop
logits = model(inputs)          # (batch, num_classes)
loss = criterion(logits, targets)  # targets: (batch,) integer class indices

# Manual implementation if you need more control or are using older PyTorch
class LabelSmoothingLoss(nn.Module):
    def __init__(self, num_classes: int, smoothing: float = 0.1,
                 reduction: str = 'mean'):
        super().__init__()
        self.smoothing = smoothing
        self.num_classes = num_classes
        self.reduction = reduction

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        """
        logits: (batch, num_classes) — raw unnormalised scores
        targets: (batch,) — integer class indices
        """
        log_probs = F.log_softmax(logits, dim=-1)
        # Build smoothed target distribution (PyTorch convention: ε/K on every
        # class, with the remaining 1 - ε added to the correct class)
        with torch.no_grad():
            smooth_targets = torch.full_like(log_probs, self.smoothing / self.num_classes)
            smooth_targets.scatter_(1, targets.unsqueeze(1),
                                    1.0 - self.smoothing + self.smoothing / self.num_classes)
        # NLL loss against smoothed targets
        loss = -(smooth_targets * log_probs).sum(dim=-1)
        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        return loss

# Test both implementations produce equivalent results
criterion_builtin = nn.CrossEntropyLoss(label_smoothing=0.1)
criterion_manual = LabelSmoothingLoss(num_classes=10, smoothing=0.1)

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))
loss_a = criterion_builtin(logits, targets)
loss_b = criterion_manual(logits, targets)
print(f"Built-in: {loss_a.item():.6f} | Manual: {loss_b.item():.6f}")
# The two losses should match to within floating-point tolerance

The built-in nn.CrossEntropyLoss(label_smoothing=ε) is the right choice for all modern PyTorch training. The manual implementation is useful when you need to apply different smoothing values to different classes, want smoothing that is asymmetric (more smoothing on low-confidence classes), or are on a PyTorch version older than 1.10, where CrossEntropyLoss does not accept a label_smoothing argument.
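
As an illustration of the per-class case, here is a hedged sketch of a loss that uses a different ε for each target class. The class names and values are hypothetical, not part of any library API.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PerClassLabelSmoothingLoss(nn.Module):
    """Hypothetical variant: a different smoothing value per target class."""
    def __init__(self, class_smoothing: torch.Tensor):
        super().__init__()
        # class_smoothing: (num_classes,) tensor of per-class ε values
        self.register_buffer("class_smoothing", class_smoothing)

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(logits, dim=-1)
        num_classes = logits.size(-1)
        eps = self.class_smoothing[targets].unsqueeze(1)   # (batch, 1): ε chosen by each sample's label
        smooth_targets = eps.expand_as(log_probs) / num_classes
        smooth_targets = smooth_targets.scatter(
            1, targets.unsqueeze(1), 1.0 - eps + eps / num_classes)
        return -(smooth_targets * log_probs).sum(dim=-1).mean()

# Illustrative values: smooth a known-ambiguous class (here index 3) more than the rest
per_class_eps = torch.full((10,), 0.05)
per_class_eps[3] = 0.2
criterion_pc = PerClassLabelSmoothingLoss(per_class_eps)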

For Sequence-to-Sequence and Language Modelling Tasks

Label smoothing in seq2seq and language modelling requires applying the smoothed loss at every token position while ignoring padding positions. The built-in nn.CrossEntropyLoss handles this correctly with the ignore_index argument.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn

# Standard language modelling training with label smoothing
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
criterion = nn.CrossEntropyLoss(
    label_smoothing=0.1,
    ignore_index=-100,   # HuggingFace convention: -100 masks padding in labels
)

def compute_lm_loss(model, input_ids, attention_mask):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits  # (batch, seq_len, vocab_size)
    # Shift: predict token t+1 from token t
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].clone()  # clone so masking never writes back into input_ids
    # Mask padding tokens so ignore_index drops them from the loss
    shift_labels[attention_mask[:, 1:] == 0] = -100
    # Flatten to (batch*(seq_len-1), vocab_size) and (batch*(seq_len-1),)
    loss = criterion(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return loss
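
A brief usage sketch with a toy batch follows. The example strings are placeholders, and the pad-token handling follows the usual HuggingFace pattern for Llama-style tokenizers.

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # these tokenizers ship without a pad token

batch = tokenizer(
    ["label smoothing in one line", "a second, longer example sentence"],
    return_tensors="pt", padding=True,
)
loss = compute_lm_loss(model, batch["input_ids"], batch["attention_mask"])
loss.backward()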

When Label Smoothing Helps

Label smoothing reliably improves performance in three specific settings. The first is large-scale image classification with many classes: the original Inception-v3 paper that introduced label smoothing showed consistent improvements on ImageNet (1000 classes), and the effect has been replicated across ResNets, ViTs, and EfficientNets. With many classes, the model is incentivised to drive logits for incorrect classes to large negative values, and label smoothing reduces this pressure by requiring a small positive probability on every class. The second is sequence-to-sequence tasks (machine translation, summarisation, seq2seq code generation) where the output vocabulary is large and the model can confidently overfit to specific output tokens even when multiple valid outputs exist. The third is fine-tuning pre-trained models on small datasets, where label smoothing acts as a regulariser that reduces overfitting to the limited training signal.

The mechanism through which label smoothing helps calibration is now well understood: by preventing the model from driving the correct class logit to arbitrarily large values, it keeps the predicted probability distribution more moderate, which means confidence scores are more aligned with actual accuracy. A model trained with label smoothing that reports 0.85 confidence on a prediction tends to be correct roughly 85% of the time, whereas a model trained without it will often report 0.99 on the same example while being right noticeably less often than that. This calibration benefit is often more practically valuable than the accuracy improvement, particularly in applications where downstream decisions depend on the model’s confidence score rather than just the argmax prediction.

When Label Smoothing Hurts

Label smoothing causes measurable harm in two important settings that are common in modern ML. The first is knowledge distillation. A 2019 paper by Müller, Kornblith and Hinton ("When does label smoothing help?") showed that models trained with label smoothing produce worse soft targets for distillation because the smoothing interferes with the information content of the probability distributions across incorrect classes. The soft predictions from a teacher model trained without label smoothing carry meaningful information about which incorrect classes the teacher found plausible; these relative probabilities between wrong classes are what makes distillation work. Label smoothing pushes those probabilities toward uniformity, destroying this signal. If you are training a teacher model whose outputs will be used for distillation, train without label smoothing.

The second case where label smoothing hurts is when the task has genuinely ambiguous or noisy labels and you have reason to believe the hard label is sometimes wrong. In this case, label smoothing applies the same soft target to all examples regardless of whether they are cleanly labelled or mislabelled, which can actually hurt performance compared to methods specifically designed for label noise (like learning with noisy labels approaches or mixture-based losses). Label smoothing is designed for confident labels that you want to soften slightly; it is not a substitute for label noise handling and conflating the two leads to suboptimal results.

import torch
import torch.nn as nn

# Training a teacher for distillation: NO label smoothing
teacher_criterion = nn.CrossEntropyLoss()  # hard labels only

# Distillation loss: combine soft teacher targets + hard ground truth
class DistillationLoss(nn.Module):
    def __init__(self, temperature: float = 4.0, alpha: float = 0.7):
        super().__init__()
        self.T = temperature
        self.alpha = alpha
        # Hard label loss: no smoothing here either
        self.ce = nn.CrossEntropyLoss()

    def forward(self, student_logits, teacher_logits, hard_labels):
        # Soft loss: KL between student and teacher at temperature T
        soft_student = torch.log_softmax(student_logits / self.T, dim=-1)
        soft_teacher = torch.softmax(teacher_logits / self.T, dim=-1)
        soft_loss = nn.functional.kl_div(soft_student, soft_teacher,
                                          reduction='batchmean') * (self.T ** 2)
        # Hard loss
        hard_loss = self.ce(student_logits, hard_labels)
        return self.alpha * soft_loss + (1 - self.alpha) * hard_loss

distill_criterion = DistillationLoss(temperature=4.0, alpha=0.7)
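
A short usage sketch follows, assuming hypothetical teacher and student classifiers and a labelled batch (inputs, hard_labels); these names are stand-ins, not part of the code above.

# Hypothetical models and tensors; the teacher was trained WITHOUT label smoothing
teacher.eval()
student.train()
with torch.no_grad():
    teacher_logits = teacher(inputs)
student_logits = student(inputs)
loss = distill_criterion(student_logits, teacher_logits, hard_labels)
loss.backward()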

Choosing the Smoothing Value

The standard value is ε=0.1, used in the original Inception paper and subsequently adopted as a default across most frameworks and architectures. This is a reasonable starting point but not always optimal. For tasks with very large output vocabularies (machine translation with 30k+ tokens), the smoothed target spreads ε across tens of thousands of tokens that are rarely plausible, which can over-regularise; values in the range 0.05–0.1 are appropriate there (the original Transformer paper used 0.1). For small classification problems (fewer than 10 classes), ε=0.1 is generally fine. For fine-tuning scenarios where the pretrained model already has strong priors, lower values (around 0.05) are often better to avoid destabilising the existing calibration.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def find_best_smoothing(model, train_loader, val_loader, 
                        candidates=(0.0, 0.05, 0.1, 0.15, 0.2),
                        epochs=3, lr=1e-4):
    """Quick sweep over label smoothing values on a validation set."""
    results = {}
    for eps in candidates:
        # Reset model weights for fair comparison
        model_copy = type(model)()  # assumes model has a default constructor
        model_copy.load_state_dict(model.state_dict())
        optimizer = torch.optim.AdamW(model_copy.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss(label_smoothing=eps)
        # Train
        model_copy.train()
        for _ in range(epochs):
            for x, y in train_loader:
                optimizer.zero_grad()
                loss = criterion(model_copy(x), y)
                loss.backward()
                optimizer.step()
        # Evaluate
        model_copy.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                preds = model_copy(x).argmax(dim=-1)
                correct += (preds == y).sum().item()
                total += len(y)
        val_acc = correct / total
        results[eps] = val_acc
        print(f"  ε={eps:.2f}: val_acc={val_acc:.4f}")
    best_eps = max(results, key=results.get)
    print(f"
Best ε={best_eps} with val_acc={results[best_eps]:.4f}")
    return best_eps

Label Smoothing and Model Calibration

One of the most underappreciated benefits of label smoothing is its effect on calibration: the alignment between predicted confidence and actual accuracy. A well-calibrated model that predicts 0.8 confidence is correct 80% of the time on those predictions. Overconfident models trained with hard labels produce reliability curves that sag well below the diagonal (confidence on the x-axis, accuracy on the y-axis): the model says 0.99 when it is actually right only 85% of the time. Label smoothing shifts these curves back towards the diagonal, making confidence scores more reliable as estimates of actual correctness probability.

Measuring calibration improvement is straightforward with Expected Calibration Error (ECE), which bins predictions by confidence and measures the average gap between mean confidence and mean accuracy within each bin. A meaningful improvement in ECE — say, from 0.08 to 0.04 — indicates that label smoothing is having its intended regularising effect. For production systems where confidence scores drive downstream decisions (whether to surface a prediction to a user, whether to request human review, how to set decision thresholds), measuring ECE on your validation set with and without label smoothing is a concrete way to quantify the benefit beyond raw accuracy.
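
To make that measurement concrete, here is a compact ECE computation with equal-width confidence bins; the bin count and tensor names are illustrative choices, not from a specific library.

import torch

def expected_calibration_error(probs: torch.Tensor, labels: torch.Tensor,
                               n_bins: int = 15) -> float:
    """probs: (N, K) softmax outputs; labels: (N,) integer class indices."""
    confidences, predictions = probs.max(dim=-1)
    accuracies = (predictions == labels).float()
    ece = torch.zeros(1)
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Weighted gap between mean accuracy and mean confidence in this bin
            gap = (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += gap * in_bin.float().mean()
    return ece.item()

# Compare on the validation set with and without label smoothing, e.g.:
# ece_smoothed = expected_calibration_error(probs_smoothed, val_labels)
# ece_baseline = expected_calibration_error(probs_baseline, val_labels)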

The practical summary: use label smoothing at ε=0.1 as a default for classification and seq2seq tasks, skip it when training teachers for distillation, and treat it as one regularisation tool among several rather than a guaranteed improvement. If your validation accuracy or calibration does not improve with label smoothing, remove it — it is not always beneficial, and there is no benefit to carrying training complexity that does not contribute to evaluation metrics.

Label Smoothing vs Other Regularisation Techniques

Label smoothing is one of several output-space regularisation techniques, and understanding how it compares to the alternatives helps you decide what combination to use. Dropout regularises activations by randomly zeroing them during training, which operates in the network’s representation space rather than the output space. Weight decay penalises large weights, encouraging smoother decision boundaries. Mixup interpolates both inputs and labels between pairs of training examples, which is a stronger form of output regularisation than label smoothing because it creates entirely new virtual training examples rather than just softening existing labels. Data augmentation adds variance to the input distribution without touching the labels.

Label smoothing and dropout are complementary and can be used simultaneously without interference — they act on different parts of the computation graph. Label smoothing and Mixup are also complementary but their combined effect is stronger regularisation than either alone, which can be too much for small datasets. In practice, for image classification tasks the combination of Mixup (α=0.2) with label smoothing (ε=0.1) is standard in modern training recipes, while for fine-tuning large pretrained models label smoothing alone is often sufficient regularisation given that the pretrained model’s representations are already well-regularised.
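
A sketch of that Mixup-plus-smoothing combination is below, assuming a generic classifier; the Beta parameter and loss wiring follow the common recipe rather than any specific paper's code.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def mixup_step(model, x, y, alpha: float = 0.2):
    """One training step combining mixup with label-smoothed cross-entropy."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    logits = model(x_mixed)
    # Mixup targets: interpolate the two (already smoothed) losses
    return lam * criterion(logits, y) + (1 - lam) * criterion(logits, y[perm])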

Temperature scaling, a post-training calibration technique, is sometimes compared to label smoothing because both improve calibration. The key difference is that temperature scaling operates at inference time — you learn a single temperature parameter that rescales all logits uniformly after training is complete — while label smoothing operates during training and affects what the model learns. If you have access to a held-out calibration set and the ability to apply post-processing at inference time, temperature scaling can achieve better calibration improvement than label smoothing with no training cost. If calibration must be baked into the training process (for example, because inference infrastructure does not support post-processing), label smoothing is the right tool. Using both provides the strongest calibration, but the marginal benefit of adding label smoothing when you plan to apply temperature scaling is smaller than when temperature scaling is not available.
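
For completeness, here is a sketch of post-hoc temperature scaling on a held-out calibration set; the optimiser settings and variable names are illustrative, following the standard single-parameter recipe.

import torch
import torch.nn as nn

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Learn one temperature T that minimises NLL on held-out (logits, labels)."""
    log_t = torch.zeros(1, requires_grad=True)   # optimise log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)
    nll = nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At inference: probs = torch.softmax(val_logits / T, dim=-1)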

Monitoring Label Smoothing Behaviour During Training

A simple diagnostic to verify label smoothing is having its intended effect is to track the mean maximum predicted probability (mean confidence) on the training set over the course of training. Without label smoothing, this value climbs toward 0.99+ as training progresses and the model fits the training data. With label smoothing at ε=0.1, the maximum predicted probability should stabilise below approximately 0.9 on well-represented classes because the loss function no longer rewards pushing it higher. If the mean confidence is still climbing toward 1.0 with smoothing enabled, it may indicate the smoothing value is too low, the model has enough capacity to route around the regularisation, or the implementation has a bug in how the smoothed targets are constructed. Logging this metric alongside training loss and validation accuracy adds a useful sanity check that costs nothing to implement and makes the training dynamics of label smoothing visible at a glance.
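
A minimal way to log that diagnostic inside an existing training loop is sketched below; the logging call is a placeholder, and everything else uses the batch logits already in hand.

import torch

@torch.no_grad()
def mean_max_confidence(logits: torch.Tensor) -> float:
    """Mean of the highest predicted probability per example in a batch."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values.mean().item()

# Inside the training loop, after computing logits for a batch:
# confidence = mean_max_confidence(logits)
# With ε=0.1 this should plateau around or below roughly 0.9 on well-fitted classes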
