Curriculum learning organises training data from easy examples to hard ones rather than sampling uniformly at random. The intuition comes from human learning — we learn arithmetic before calculus, not because calculus is impossible first but because the ordering accelerates understanding and reduces confusion. For neural networks, training on a carefully ordered curriculum can speed convergence, improve final accuracy on hard examples, and stabilise training on noisy or imbalanced datasets. This article covers how to implement curriculum learning in PyTorch, how to design difficulty measures for different tasks, and how self-paced learning extends the idea by letting the model choose its own curriculum.
The Core Idea: Difficulty Ordering
Standard stochastic gradient descent samples each minibatch uniformly from the training set. Curriculum learning replaces the uniform sampler with one that orders examples by difficulty, starting with easy examples and gradually introducing harder ones as training progresses. The key design decisions are: (1) how to define and measure difficulty, and (2) how to schedule the transition from easy to hard. Both are task-dependent, but the implementation pattern in PyTorch is consistent across tasks.
import torch
from torch.utils.data import Dataset, DataLoader, Sampler
import numpy as np

class CurriculumSampler(Sampler):
    """
    Samples from a dataset ordered by difficulty score.
    In early epochs, samples only from the easiest fraction of the dataset.
    The competence (fraction available) increases linearly each epoch until
    all examples are accessible.
    """
    def __init__(self, difficulty_scores: np.ndarray,
                 initial_competence: float = 0.3,
                 final_competence: float = 1.0,
                 total_epochs: int = 50,
                 current_epoch: int = 0,
                 batch_size: int = 32):
        """
        difficulty_scores: shape (N,), higher = harder. Pre-computed before training.
        initial_competence: fraction of dataset accessible at epoch 0
        final_competence: fraction accessible at total_epochs
        """
        self.scores = difficulty_scores
        self.n = len(difficulty_scores)
        self.initial_c = initial_competence
        self.final_c = final_competence
        self.total_epochs = total_epochs
        self.current_epoch = current_epoch
        self.batch_size = batch_size
        # Sort indices from easiest to hardest once
        self.sorted_indices = np.argsort(difficulty_scores)  # ascending = easy first

    def set_epoch(self, epoch: int):
        self.current_epoch = epoch

    def _competence(self) -> float:
        """Linear schedule: ramp from initial_c to final_c over total_epochs."""
        t = min(self.current_epoch / max(self.total_epochs, 1), 1.0)
        return self.initial_c + t * (self.final_c - self.initial_c)

    def _n_available(self) -> int:
        """Examples currently in scope: at least one batch, never more than N."""
        return min(self.n, max(self.batch_size, int(self.n * self._competence())))

    def __iter__(self):
        n_available = self._n_available()
        # Take the n_available easiest examples, then shuffle among them
        available = self.sorted_indices[:n_available]
        shuffled = available[np.random.permutation(len(available))]
        return iter(shuffled.tolist())

    def __len__(self):
        return self._n_available()
Difficulty Measures for Common Tasks
The definition of difficulty is where curriculum learning is task-specific. Three common approaches cover most supervised learning settings.
For classification, the simplest difficulty measure is the model’s own loss on each example — examples with high cross-entropy loss are hard, examples with low loss are easy. This requires a pre-training pass or an earlier checkpoint to score examples before the curriculum sampler is used. A proxy measure that avoids this is example length or input complexity: shorter sentences are easier in NLP, lower-resolution or less cluttered images are easier in vision.
def score_by_model_loss(model, dataset, device='cuda', batch_size=256) -> np.ndarray:
    """Score each training example by the model's current loss (higher = harder)."""
    model.eval()
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    all_losses = []
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(logits, targets, reduction='none')
            all_losses.extend(loss.cpu().numpy())
    return np.array(all_losses)

def score_by_length(texts: list[str]) -> np.ndarray:
    """For NLP: score by token count (longer = harder)."""
    return np.array([len(t.split()) for t in texts], dtype=float)

def score_by_label_noise(labels: np.ndarray,
                         soft_labels: np.ndarray) -> np.ndarray:
    """
    Score by disagreement between hard labels and a soft label estimate.
    High disagreement = potentially mislabelled = treat as hard/uncertain.
    """
    one_hot = np.eye(soft_labels.shape[1])[labels]
    kl = np.sum(one_hot * np.log(one_hot / (soft_labels + 1e-9) + 1e-9), axis=1)
    return kl
Full Training Loop with Curriculum Scheduling
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_with_curriculum(model, dataset, difficulty_scores,
                          n_epochs=50, batch_size=64,
                          initial_competence=0.3, device='cuda'):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    sampler = CurriculumSampler(
        difficulty_scores=difficulty_scores,
        initial_competence=initial_competence,
        final_competence=1.0,
        total_epochs=n_epochs,
        batch_size=batch_size,
    )
    loader = DataLoader(dataset, batch_sampler=None, sampler=sampler,
                        batch_size=batch_size, drop_last=True)
    for epoch in range(n_epochs):
        sampler.set_epoch(epoch)  # advance the competence schedule
        model.train()
        total_loss = 0.0
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        n_available = len(sampler)
        print(f"Epoch {epoch+1}/{n_epochs} | "
              f"Loss: {total_loss/len(loader):.4f} | "
              f"Training on {n_available}/{len(dataset)} examples "
              f"({100*n_available/len(dataset):.0f}%)")
    return model
Self-Paced Learning: The Model Chooses Its Own Curriculum
Self-paced learning (SPL) removes the need to pre-define a difficulty measure. Instead, a weighting variable is introduced for each training example, and the model jointly optimises the task objective and the selection of which examples to include in each update. Examples with high loss are automatically down-weighted or excluded, and the threshold for inclusion relaxes over time. The result is an adaptive curriculum where the model’s own current capability determines what is easy or hard — no external difficulty oracle is needed.
class SelfPacedLoss(nn.Module):
    def __init__(self, lam=1.0, growth_rate=1.05):
        super().__init__()
        self.lam = lam                  # loss threshold for including an example
        self.growth_rate = growth_rate  # multiplicative relaxation per epoch

    def step_epoch(self):
        self.lam *= self.growth_rate

    def forward(self, per_sample_losses):
        # Hard 0/1 weights: keep only examples whose loss is below the threshold
        weights = (per_sample_losses < self.lam).float()
        n_selected = weights.sum()
        if n_selected == 0:
            # Nothing qualifies yet: fall back to the plain mean so training still moves
            return per_sample_losses.mean()
        return (weights * per_sample_losses).sum() / n_selected

# The loop below assumes model, optimizer, loader, n_epochs and device already exist
spl = SelfPacedLoss(lam=0.5, growth_rate=1.03)
base_criterion = nn.CrossEntropyLoss(reduction='none')
for epoch in range(n_epochs):
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model(inputs)
        per_sample_loss = base_criterion(logits, targets)
        loss = spl(per_sample_loss)
        loss.backward()
        optimizer.step()
    spl.step_epoch()  # relax the threshold after the epoch, so epoch 0 uses the initial lam
Choosing the initial lambda requires a rough sense of what loss values your model produces early in training. A practical heuristic: run one epoch with no curriculum, collect the per-sample loss distribution, and set initial lambda to the 30th to 40th percentile. This ensures roughly 30-40% of the dataset is included initially, matching the typical initial_competence=0.3 starting point for standard curriculum learning.
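The percentile heuristic can be sketched in a few lines. This is a minimal illustration, not a prescribed recipe: the helper name initial_lambda_from_losses is hypothetical, and the synthetic exponential losses stand in for per-sample losses you would collect from your own warm-up epoch.

```python
import numpy as np

def initial_lambda_from_losses(per_sample_losses, percentile=35.0):
    """Return the loss value below which roughly `percentile`% of examples fall."""
    losses = np.asarray(per_sample_losses, dtype=float)
    return float(np.percentile(losses, percentile))

# Synthetic stand-in for losses gathered during one epoch without a curriculum.
rng = np.random.default_rng(0)
warmup_losses = rng.exponential(scale=1.0, size=10_000)

lam0 = initial_lambda_from_losses(warmup_losses, percentile=35.0)
# Fraction of examples that SelfPacedLoss would include with this threshold.
included_fraction = float((warmup_losses < lam0).mean())
print(f"initial lambda = {lam0:.3f}, fraction included = {included_fraction:.2f}")
```

The resulting value would be passed as lam when constructing SelfPacedLoss. Because the loss distribution shifts as the model trains, the included fraction will drift from this starting point, which is what growth_rate is meant to compensate for.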
Curriculum Learning for LLM Fine-Tuning
For instruction fine-tuning of LLMs, difficulty can be operationalised as the complexity of the instruction-response pair. Shorter, simpler instructions with factual responses are easy; multi-step reasoning tasks, code generation, and long-form structured outputs are hard. A practical curriculum orders examples by response length as a proxy for complexity, domain specialisation (general knowledge before technical domains), and format complexity (plain text before JSON or code).
def score_instruction_difficulty(examples):
    scores = []
    for ex in examples:
        instruction = ex['instruction']
        response = ex['response']
        length_score = len(response.split()) / 500.0
        instruction_score = len(instruction.split()) / 100.0
        code_score = 2.0 if '```' in response else 0.0
        json_score = 1.5 if response.strip().startswith('{') else 0.0
        reasoning_words = ['step by step', 'explain why', 'compare', 'analyse']
        reasoning_score = 1.0 if any(w in instruction.lower() for w in reasoning_words) else 0.0
        scores.append(length_score + instruction_score + code_score + json_score + reasoning_score)
    return np.array(scores)

difficulty = score_instruction_difficulty(train_examples)
sampler = CurriculumSampler(difficulty, initial_competence=0.25,
                            final_competence=1.0, total_epochs=3)
When Curriculum Learning Helps and When It Does Not
Curriculum learning has the most consistent benefit in three settings. The first is training on noisy labels: starting with low-loss examples and only gradually admitting high-loss ones reduces the influence of mislabelled examples early in training when the model is most susceptible to fitting noise. The second is class-imbalanced datasets where the minority class is inherently harder — a curriculum that initially over-represents easy majority-class examples before introducing the harder minority-class examples tends to produce a better-initialised model. The third is multi-task learning where tasks vary substantially in difficulty — ordering by task difficulty avoids the loss being dominated by the hardest task throughout training, which can prevent convergence on easier tasks.
Curriculum learning rarely helps when the training data is already clean and balanced, when difficulty is roughly uniform across examples, or when training converges quickly regardless. For small datasets with a few thousand examples, the curriculum effect is usually negligible because the model sees the full dataset many times anyway. For tasks where all examples have roughly equal difficulty, a curriculum sampler adds complexity and overhead without measurable benefit. The technique is worth reaching for when you have training instability, noisy labels, or extreme difficulty variation in your dataset, not as a default addition to every training run.
Competence Schedules: Linear, Root, and Step
The linear competence schedule used in the sampler above is the most common choice, but two alternatives are worth knowing. A square-root schedule ramps up competence faster initially and more slowly later, which works well when the easy examples are very numerous and you want to move past them quickly: competence = initial_c + (1 - initial_c) * sqrt(t) where t is the fraction of training elapsed. A step schedule uses discrete jumps — for example, training on the easiest 30% for the first 20% of epochs, then adding the next 30% for the following 20%, and so on. Step schedules are simpler to reason about and debug because you know exactly which examples are in scope at each phase, at the cost of abrupt transitions that can occasionally cause brief loss spikes when a new tranche of harder examples enters the training pool.
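The three schedules can be written as small functions of t, the fraction of training elapsed. This is a sketch under stated assumptions: the function names are illustrative, final competence is fixed at 1.0, and the step schedule uses the 30%-per-phase, 20%-of-training-per-phase example from the paragraph above.

```python
import math

def _clamp01(t):
    """Clamp the training-progress fraction to [0, 1]."""
    return min(max(t, 0.0), 1.0)

def linear_competence(t, initial_c=0.3):
    """Ramp from initial_c to 1.0 at a constant rate."""
    return initial_c + (1.0 - initial_c) * _clamp01(t)

def root_competence(t, initial_c=0.3):
    """Ramp quickly at first, then more slowly: initial_c + (1 - initial_c) * sqrt(t)."""
    return initial_c + (1.0 - initial_c) * math.sqrt(_clamp01(t))

def step_competence(t, initial_c=0.3, increment=0.3, phase_length=0.2):
    """Add `increment` of the dataset every `phase_length` of training, capped at 1.0."""
    # The small epsilon guards against floating-point error at phase boundaries.
    phase = int(_clamp01(t) / phase_length + 1e-9)
    return min(1.0, initial_c + increment * phase)

for t in (0.0, 0.25, 0.5, 1.0):
    print(f"t={t:.2f}  linear={linear_competence(t):.2f}  "
          f"root={root_competence(t):.2f}  step={step_competence(t):.2f}")
```

To use one of these with the sampler above, _competence would call the chosen function with t = current_epoch / total_epochs in place of the linear formula.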
The best schedule depends on the shape of the difficulty distribution in your dataset. If the distribution is clearly bimodal, with a large body of easy examples and a smaller tail of hard ones, a step schedule with an explicit hard-example phase at the end tends to work well. If difficulty is more continuously distributed, the linear or root schedule adapts more smoothly. In practice, the schedule choice matters less than the initial competence and the total ramp duration: starting at a competence of 0.25–0.35 and reaching full competence 30–50% of the way through training is a robust default that works across a wide range of tasks without much tuning.
Curriculum Learning vs Data Augmentation and Resampling
Curriculum learning is sometimes confused with oversampling minority classes or applying stronger augmentation to hard examples. The distinction matters for understanding when to use each. Resampling changes the data distribution the model sees throughout training — the minority class is seen more frequently at every epoch. Curriculum learning changes the temporal ordering of what the model sees — the minority class may be seen at the same total frequency, but concentrated in later epochs rather than uniformly. The two approaches address different problems and can be combined: resample to fix class balance, then apply a curriculum to order examples by difficulty within the balanced dataset.
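One way to combine the two is sketched below, assuming simple random oversampling to the majority-class count. The function name balance_then_order is hypothetical, and other resampling strategies would slot in the same way.

```python
import numpy as np

def balance_then_order(labels, difficulty_scores, rng=None):
    """Oversample every class to the majority count, then sort easy to hard.

    Returns dataset indices (with repeats for oversampled classes).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    difficulty_scores = np.asarray(difficulty_scores, dtype=float)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()

    resampled = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(labels == cls)
        resampled.append(cls_idx)
        if count < target:
            # Draw the shortfall with replacement from this class's own examples.
            resampled.append(rng.choice(cls_idx, size=target - count, replace=True))
    resampled = np.concatenate(resampled)

    # Step 2: order the balanced index list by difficulty (ascending = easy first).
    order = np.argsort(difficulty_scores[resampled], kind="stable")
    return resampled[order]

# Toy example: 8 majority-class examples, 2 minority-class examples.
toy_labels = np.array([0] * 8 + [1] * 2)
toy_difficulty = np.arange(10, dtype=float)
balanced_order = balance_then_order(toy_labels, toy_difficulty,
                                    rng=np.random.default_rng(0))
print(balanced_order)
```

The returned indices can wrap the dataset in torch.utils.data.Subset, with difficulty_scores[balanced_order] passed to the curriculum sampler as the scores for that subset.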
Data augmentation for hard examples is closest in spirit to online hard example mining (OHEM): it concentrates compute on difficult regions of the input space. Curriculum learning is distinct in that it gates access to difficult examples by training time rather than weighting them by loss. The practical implication: if your hard examples need more augmentation to be learnable, apply augmentation. If your hard examples are simply too far from the model's current capability to learn from usefully, curriculum scheduling is the right tool. When in doubt, profile your training loss by difficulty quintile: if the loss on the hardest 20% of examples stays flat or increases while easier examples converge, that is a signal that curriculum ordering would help.
Tracking Curriculum Progress in Practice
Logging which examples are in scope at each epoch, and tracking the loss separately for easy and hard examples, gives you the observability needed to tune the curriculum schedule. Without this instrumentation, it is difficult to tell whether the curriculum is having the intended effect or whether hard examples are simply not converging regardless of when they are introduced.
def evaluate_by_difficulty_quartile(model, dataset, difficulty_scores, device='cuda'):
    """Compute mean per-example loss separately for each difficulty quartile.

    difficulty_scores must be aligned with `dataset` (one score per example).
    """
    model.eval()
    quartile_boundaries = np.percentile(difficulty_scores, [25, 50, 75])
    quartile_indices = [
        np.where(difficulty_scores <= quartile_boundaries[0])[0],
        np.where((difficulty_scores > quartile_boundaries[0]) & (difficulty_scores <= quartile_boundaries[1]))[0],
        np.where((difficulty_scores > quartile_boundaries[1]) & (difficulty_scores <= quartile_boundaries[2]))[0],
        np.where(difficulty_scores > quartile_boundaries[2])[0],
    ]
    # Sum losses and divide by example count, so a short final batch is not over-weighted
    criterion = nn.CrossEntropyLoss(reduction='sum')
    results = {}
    for q, indices in enumerate(quartile_indices):
        subset = torch.utils.data.Subset(dataset, indices.tolist())
        loader = DataLoader(subset, batch_size=256, shuffle=False)
        total_loss = 0.0
        with torch.no_grad():
            for inputs, targets in loader:
                inputs, targets = inputs.to(device), targets.to(device)
                loss = criterion(model(inputs), targets)
                total_loss += loss.item()
        results[f'Q{q+1}_loss'] = total_loss / max(len(indices), 1)
    return results

# Log quartile losses at each epoch.
# val_difficulty_scores is assumed to hold scores computed for val_dataset itself;
# scores computed on the training set do not line up with validation indices.
for epoch in range(n_epochs):
    # ... training ...
    quartile_losses = evaluate_by_difficulty_quartile(model, val_dataset, val_difficulty_scores)
    print(f"Epoch {epoch}: {quartile_losses}")
    # Q1 (easiest) should converge first; Q4 (hardest) later
This quartile breakdown lets you verify that the curriculum is working as intended: Q1 and Q2 loss should decrease first, with Q3 and Q4 following as those examples enter the training pool. If Q4 loss never decreases, the examples may be too hard for the model regardless of curriculum ordering — which is useful information about data quality rather than a curriculum failure.
Curriculum Learning in Contrastive and Embedding Training
Curriculum learning is particularly effective for contrastive training of embedding models, where the concept of difficulty maps directly to the hardness of negative examples. In standard contrastive training, negatives are sampled at random from the batch. Easy negatives are far from the anchor in embedding space and contribute negligible gradient signal once the model is past the earliest training stages. Hard negatives — examples that are semantically different from the anchor but close in embedding space — provide the most useful gradient but can destabilise training if introduced before the embeddings have any meaningful structure.
A curriculum for contrastive training therefore starts with in-batch random negatives (easy), then introduces mined hard negatives from a nearest-neighbour index after a warm-up period. The warm-up duration is the key hyperparameter: too short and hard negatives collapse training before useful structure forms; too long and easy negatives waste training budget on uninformative gradients. A reasonable default is to use random negatives for the first 5-10% of training steps, then switch to a mix of random and hard negatives with the hard negative fraction ramping from 0 to 0.5 over the next 20% of training. Warm-up followed by mined hard negatives is a common pattern in text embedding training pipelines, though the exact fractions vary between systems.
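The warm-up and ramp described above reduce to a small function of the training step. The sketch below assumes a 10% warm-up, a linear ramp over the following 20% of training, and a cap of 0.5. The function name and parameter names are illustrative and not taken from any particular library.

```python
def hard_negative_fraction(step, total_steps,
                           warmup_frac=0.1, ramp_frac=0.2, max_fraction=0.5):
    """Fraction of negatives that should be mined hard negatives at this step.

    Returns 0.0 during warm-up, then ramps linearly to max_fraction over
    ramp_frac of training, then stays at max_fraction. Assumes ramp_frac > 0.
    """
    progress = step / max(total_steps, 1)
    if progress < warmup_frac:
        return 0.0
    ramp_progress = (progress - warmup_frac) / ramp_frac
    return max_fraction * min(ramp_progress, 1.0)

total_steps = 1000
for step in (0, 50, 100, 200, 300, 1000):
    frac = hard_negative_fraction(step, total_steps)
    print(f"step {step:4d}: hard-negative fraction = {frac:.2f}")
```

With n negatives per anchor, round(frac * n) of them would come from the nearest-neighbour index and the rest from the batch.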
Combining Curriculum with Gradient Accumulation
One practical consideration when using curriculum learning with gradient accumulation is that the effective batch diversity changes as competence increases. Early in training with low competence, all examples in an accumulated macro-batch come from the easy subset of the dataset. This reduces the gradient variance within each macro-batch, which generally accelerates convergence on easy examples but can cause the optimizer's momentum estimates to be poorly initialised for the transition to harder examples. A simple fix is to maintain a minimum diversity requirement: even at low competence, sample at least one hard example per accumulation step to keep the optimizer's statistics calibrated across the difficulty range. This does not significantly slow early convergence but prevents the sharp loss spike that can occur when hard examples are suddenly introduced after a long easy-only warm-up.
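The minimum-diversity idea can be sketched as a micro-batch index sampler. This is one possible reading of the fix, under stated assumptions: the function name sample_micro_batch and the min_hard parameter are hypothetical, sorted_indices is ordered easy to hard as in the sampler earlier, and the dataset has at least micro_batch_size examples.

```python
import numpy as np

def sample_micro_batch(sorted_indices, competence, micro_batch_size,
                       min_hard=1, rng=None):
    """Draw one micro-batch: mostly in-scope easy examples plus a few hard ones.

    sorted_indices: dataset indices ordered easy -> hard.
    competence: fraction of the dataset currently in scope.
    min_hard: examples to draw from outside the in-scope pool (if any exist).
    """
    rng = np.random.default_rng() if rng is None else rng
    sorted_indices = np.asarray(sorted_indices)
    n = len(sorted_indices)
    n_available = min(n, max(micro_batch_size, int(n * competence)))
    easy_pool = sorted_indices[:n_available]
    hard_pool = sorted_indices[n_available:]   # not yet in scope

    n_hard = min(min_hard, len(hard_pool), micro_batch_size)
    n_easy = micro_batch_size - n_hard
    parts = []
    if n_easy > 0:
        parts.append(rng.choice(easy_pool, size=n_easy, replace=False))
    if n_hard > 0:
        parts.append(rng.choice(hard_pool, size=n_hard, replace=False))
    return np.concatenate(parts)

# Toy example: 100 examples already sorted by difficulty, 30% competence.
toy_sorted = np.arange(100)
batch = sample_micro_batch(toy_sorted, competence=0.3, micro_batch_size=8,
                           rng=np.random.default_rng(0))
print(sorted(batch.tolist()))
```

Once competence reaches 1.0 the hard pool is empty and the function falls back to ordinary sampling from the full dataset.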
Curriculum learning is one of the lower-risk training improvements available because it does not change model architecture, loss function, or optimizer — it only changes the data ordering. If it does not help for your task, rolling it back is as simple as switching the DataLoader back to a standard random sampler. That low cost of experimentation makes it worth trying whenever you are dealing with heterogeneous training data, long training runs where convergence speed matters, or datasets with a known mix of clean and noisy labels.