Sequence Packing for LLM Training: Eliminating Padding Waste

When you train an LLM with a fixed context length, most batches contain sequences shorter than that maximum, and the padding tokens added to fill each sequence to the maximum length are wasted compute. On typical instruction-tuning datasets where sequences average 200–400 tokens against a 2,048-token context, padding waste can exceed 70% of total token capacity per batch. Sequence packing eliminates this waste by concatenating multiple short sequences end-to-end until they fill the context window, replacing padding with actual training signal. The throughput improvement scales with the inverse of the real-token fraction: a dataset with 75% padding fits roughly 4x the effective tokens per batch after packing (1 / (1 − 0.75) = 4x), cutting training time and cost by roughly the same factor.

The Padding Problem

Fixed-length batching pads every sequence to the maximum context length, so a batch of mostly short sequences wastes most of its compute on padding. Dynamic padding, which pads only to the longest sequence in each batch rather than the global maximum, helps, but highly skewed length distributions still produce significant waste: a single long sequence forces every other sequence in its batch up to that length. The problem compounds at scale: training a 7B model for a day with 75% padding buys the same number of real tokens as training for a quarter of a day with none.

import numpy as np
from datasets import Dataset
from transformers import AutoTokenizer

def analyse_padding_waste(
    dataset: Dataset,
    tokenizer,
    max_length: int = 2048,
    sample_size: int = 1000,
) -> dict:
    """Estimate padding waste for a dataset at a given max_length."""
    lengths = []
    for i, example in enumerate(dataset):
        if i >= sample_size:
            break
        tokens = tokenizer(example["text"], truncation=True, max_length=max_length)
        lengths.append(len(tokens["input_ids"]))

    lengths = np.array(lengths)
    total_capacity = len(lengths) * max_length
    total_real_tokens = lengths.sum()
    padding_waste = 1.0 - total_real_tokens / total_capacity

    print(f"Mean sequence length:   {lengths.mean():.0f} tokens")
    print(f"Median sequence length: {np.median(lengths):.0f} tokens")
    print(f"Max sequence length:    {lengths.max():.0f} tokens")
    print(f"Padding waste:          {padding_waste:.1%}")
    print(f"Effective token ratio:  {1 - padding_waste:.1%}")
    print(f"Packing speedup (est):  {1 / (1 - padding_waste):.2f}x")
    return {"mean": lengths.mean(), "waste": padding_waste, "lengths": lengths}

# Example: analyse an instruction dataset
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# results = analyse_padding_waste(your_dataset, tokenizer, max_length=2048)

Greedy Sequence Packing

The simplest packing implementation concatenates sequences one by one, inserting an EOS token between them, until the running length would exceed the maximum context length, then starts a new packed sequence. This greedy approach does not produce perfectly optimal packing (that would require solving a bin-packing problem) but achieves 90–95% of the theoretical maximum utilisation in practice and runs in linear time.

from typing import List, Dict
import torch
from torch.utils.data import Dataset

def greedy_pack_sequences(
    token_ids_list: List[List[int]],
    max_length: int = 2048,
    eos_token_id: int = 2,
    pad_token_id: int = 0,
) -> List[Dict[str, List[int]]]:
    """Pack variable-length sequences into fixed-length chunks.
    
    Returns list of dicts with:
      - input_ids: packed token ids, padded to max_length
      - attention_mask: 1 for real tokens, 0 for padding
      - labels: same as input_ids but -100 for padding (ignored in loss)
      - sequence_ids: which original sequence each position belongs to
    """
    packed_examples = []
    current_ids = []
    current_seq_ids = []
    seq_id = 0

    for token_ids in token_ids_list:
        # Append EOS between sequences; truncate to max_length so a single
        # overlong sequence cannot overflow a pack with negative padding
        seq_with_eos = (token_ids + [eos_token_id])[:max_length]
        if len(current_ids) + len(seq_with_eos) > max_length:
            if current_ids:
                # Finalise current pack with padding
                pad_len = max_length - len(current_ids)
                packed_examples.append({
                    "input_ids": current_ids + [pad_token_id] * pad_len,
                    "attention_mask": [1] * len(current_ids) + [0] * pad_len,
                    "labels": current_ids + [-100] * pad_len,
                    "sequence_ids": current_seq_ids + [-1] * pad_len,
                })
            current_ids = []
            current_seq_ids = []
        current_ids.extend(seq_with_eos)
        current_seq_ids.extend([seq_id] * len(seq_with_eos))
        seq_id += 1

    # Don't forget the last partial pack
    if current_ids:
        pad_len = max_length - len(current_ids)
        packed_examples.append({
            "input_ids": current_ids + [pad_token_id] * pad_len,
            "attention_mask": [1] * len(current_ids) + [0] * pad_len,
            "labels": current_ids + [-100] * pad_len,
            "sequence_ids": current_seq_ids + [-1] * pad_len,
        })
    return packed_examples

class PackedDataset(Dataset):
    def __init__(self, packed_examples: List[Dict]):
        self.examples = packed_examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        # Drop sequence_ids when feeding the batch straight to a model;
        # keep it if you build document-aware attention masks (next section)
        return {k: torch.tensor(v) for k, v in ex.items() if k != "sequence_ids"}
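
Usage then mirrors any map-style dataset. A minimal sketch, with token_ids_list standing in for your tokenised examples and the real special-token ids taken from your tokenizer rather than the defaults:

from torch.utils.data import DataLoader

# packed = greedy_pack_sequences(
#     token_ids_list,
#     max_length=2048,
#     eos_token_id=tokenizer.eos_token_id,
#     pad_token_id=tokenizer.pad_token_id,
# )
# loader = DataLoader(PackedDataset(packed), batch_size=4, shuffle=True)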

The Attention Leakage Problem and How to Fix It

Naive packing has a critical correctness issue: standard causal attention lets each token attend to all preceding tokens in the context window, regardless of which original sequence they came from. When two sequences are packed end-to-end, tokens in the second sequence can attend to tokens from the first, so the model learns spurious cross-sequence dependencies that never exist at inference time. This attention leakage is subtle but measurable: models trained with naive packing typically show slightly worse held-out perplexity than models trained on the same data without packing, because they have learned to exploit cross-sequence context that inference never provides.

The fix is document-aware attention masking, also called intra-document masking or sample-level attention masking: each token’s attention is restricted to tokens from the same original sequence. This requires modifying the attention mask to be a 2D matrix rather than a 1D sequence mask, marking cross-sequence positions as masked.

import torch

def build_document_attention_mask(sequence_ids: torch.Tensor) -> torch.Tensor:
    """Build a causal attention mask that prevents cross-sequence attention.
    
    sequence_ids: (seq_len,) tensor where same-sequence tokens share an ID,
                  padding positions have ID -1.
    Returns: (seq_len, seq_len) bool mask, True = attend, False = mask out.
    """
    seq_len = sequence_ids.shape[0]
    # Standard causal mask: can only attend to current and previous positions
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Document mask: can only attend within the same sequence
    doc_mask = sequence_ids.unsqueeze(0) == sequence_ids.unsqueeze(1)
    # Padding tokens (id=-1) should not attend or be attended to
    pad_mask = (sequence_ids != -1).unsqueeze(0) & (sequence_ids != -1).unsqueeze(1)
    return causal & doc_mask & pad_mask

# Using with Hugging Face models via a custom attention mask. Note that
# support for per-example 2D masks varies by model and transformers version:
# many causal LM implementations expect a 4D additive mask of shape
# (bs, 1, seq_len, seq_len), with 0.0 at attended positions and a large
# negative value at masked ones, so convert before the forward pass.
def pack_batch_with_document_mask(packed_batch: dict) -> dict:
    """Add document-aware 2D attention masks to a packed batch."""
    bs = packed_batch["input_ids"].shape[0]
    masks_2d = []
    for i in range(bs):
        seq_ids = packed_batch["sequence_ids"][i]
        masks_2d.append(build_document_attention_mask(seq_ids))
    # (bs, seq_len, seq_len) boolean mask; adapt to your model's expected format
    packed_batch["attention_mask"] = torch.stack(masks_2d)
    return packed_batch
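
A quick sanity check on a toy packed example confirms that the mask blocks cross-document attention; the assertions below are illustrative:

# Two packed documents (ids 0 and 1) followed by one padding position
seq_ids = torch.tensor([0, 0, 0, 1, 1, -1])
mask = build_document_attention_mask(seq_ids)
assert mask[2, 0]         # within document 0: later tokens see earlier ones
assert not mask[3, 2]     # document 1 cannot attend into document 0
assert not mask[5].any()  # padding attends to nothing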

Flash Attention natively supports document-aware masking via the cu_seqlens (cumulative sequence lengths) argument to its variable-length kernels, which marks the boundaries between packed sequences and restricts attention to within-document positions. When using the Hugging Face flash_attention_2 implementation, packing along these lines is both faster and correct, with no manual 2D mask construction required. TRL's SFTTrainer with packing enabled uses this approach, and it is the recommended way to implement packing when training with FlashAttention.
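
For reference, the cumulative-lengths layout is straightforward to derive from the sequence_ids produced by greedy_pack_sequences above. A minimal sketch (build_cu_seqlens is a hypothetical helper, shown for one packed example with padding marked as -1):

def build_cu_seqlens(sequence_ids: torch.Tensor) -> torch.Tensor:
    """Cumulative lengths of packed documents in the varlen-kernel layout.

    For sequence_ids [0, 0, 0, 1, 1, -1] this returns [0, 3, 5]: document
    boundaries as running offsets, starting at zero.
    """
    real = sequence_ids[sequence_ids != -1]  # strip padding positions
    _, counts = torch.unique_consecutive(real, return_counts=True)
    return torch.cat([torch.zeros(1, dtype=torch.int32),
                      counts.cumsum(0).to(torch.int32)])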

Packing with the TRL and HuggingFace Ecosystem

The simplest production-ready packing implementation for LLM fine-tuning is TRL’s built-in support, which handles packing, EOS insertion, and document masking automatically.

from trl import SFTConfig, SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",  # required for correct packing
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("tatsu-lab/alpaca", split="train")

training_args = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,
    packing=True,               # enable sequence packing
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()

With packing=True and attn_implementation="flash_attention_2", TRL automatically uses cu_seqlens-based document masking, so cross-sequence attention is correctly blocked. Without flash_attention_2, packing still works but uses naive causal masking — the attention leakage issue described above applies, and you should benchmark perplexity against an unpacked baseline to quantify the quality impact before deciding whether to accept it.

Measuring Packing Efficiency and Quality

After implementing packing, verify two things: first, that packing efficiency (proportion of non-padding tokens across the packed dataset) is close to your theoretical estimate from the padding analysis; second, that model quality on a held-out evaluation set is not degraded relative to training without packing. Packing with correct document masking should produce identical or marginally better model quality than training without packing on the same number of gradient steps, because it processes more real tokens per step. If you see a quality regression, the most likely cause is incorrect document masking — tokens attending across sequence boundaries — which you can diagnose by checking whether the regression disappears when you disable packing.

Track tokens-per-second (real tokens, not including padding) as your training throughput metric rather than sequences-per-second or steps-per-second. This gives a hardware-agnostic measure of how efficiently you are using your GPU compute, and makes it straightforward to compare packing versus no-packing runs, different context lengths, and different batch size configurations on equal footing.
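
A minimal sketch of the measurement, assuming a manual training loop (model, batch, and optimizer are placeholders) and a per-sequence 1D attention_mask; with 2D document masks, count sequence_ids != -1 instead:

import time
import torch

def timed_train_step(model, batch, optimizer):
    """Run one training step; return (loss, real tokens per second)."""
    t0 = time.perf_counter()
    out = model(**batch)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure the step has actually finished
    real_tokens = int(batch["attention_mask"].sum())  # excludes padding
    return out.loss.item(), real_tokens / (time.perf_counter() - t0)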

Sorting and Bucketing for Maximum Packing Efficiency

Greedy packing on a randomly shuffled dataset leaves packing efficiency on the table because sequences of very different lengths rarely fill a context window cleanly: a 1,900-token sequence can only share a 2,048-token window with sequences of about 148 tokens or fewer, so it usually ships with significant padding, whereas two 1,024-token sequences fill the window almost exactly. Sorting sequences by length before packing significantly improves bin utilisation: similar-length sequences are packed together, reducing the tail waste at the end of each bin. The tradeoff is that sorting eliminates the randomisation of the training order within each epoch, which can slow convergence on datasets where the length distribution correlates with content (long sequences often come from specific domains or task types).

The standard solution is bucket-based packing: group sequences into length buckets (for example, 0–512, 512–1024, 1024–2048 tokens), shuffle within each bucket, and pack within each bucket. This preserves within-bucket randomisation while dramatically improving packing efficiency compared to fully random packing. A bucket width of 256–512 tokens typically achieves 95%+ packing efficiency on most instruction-tuning datasets while maintaining sufficient shuffle diversity to avoid systematic ordering effects during training.

from typing import List, Dict
import random

def bucket_pack_sequences(
    token_ids_list: List[List[int]],
    max_length: int = 2048,
    bucket_width: int = 512,
    eos_token_id: int = 2,
    pad_token_id: int = 0,
    seed: int = 42,
) -> List[Dict]:
    """Pack sequences using length bucketing for higher efficiency.
    
    Groups sequences into buckets of similar length, shuffles within buckets,
    then packs within each bucket to minimise padding waste.
    """
    random.seed(seed)
    # Sort into length buckets
    n_buckets = (max_length // bucket_width) + 1
    buckets: Dict[int, List] = {i: [] for i in range(n_buckets)}
    for token_ids in token_ids_list:
        bucket_idx = min(len(token_ids) // bucket_width, n_buckets - 1)
        buckets[bucket_idx].append(token_ids)
    # Shuffle within each bucket
    for bucket in buckets.values():
        random.shuffle(bucket)
    # Flatten buckets back to a list (or interleave for more diversity)
    sorted_sequences = []
    for bucket in buckets.values():
        sorted_sequences.extend(bucket)
    # Now greedy-pack the sorted sequences
    return greedy_pack_sequences(sorted_sequences, max_length, eos_token_id, pad_token_id)

# Compare efficiency: random packing vs bucketed packing
def measure_packing_efficiency(packed: List[Dict]) -> float:
    real_tokens = sum(sum(1 for t in ex["attention_mask"] if t == 1) for ex in packed)
    total_tokens = sum(len(ex["attention_mask"]) for ex in packed)
    return real_tokens / total_tokens
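
Comparing strategies is then a one-liner per approach; as elsewhere, token_ids_list stands in for your tokenised dataset:

# random_eff = measure_packing_efficiency(
#     greedy_pack_sequences(token_ids_list, max_length=2048))
# bucket_eff = measure_packing_efficiency(
#     bucket_pack_sequences(token_ids_list, max_length=2048))
# print(f"random: {random_eff:.1%}  bucketed: {bucket_eff:.1%}")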

Packing Completion-Only Datasets

For instruction-following fine-tuning, you typically want to compute loss only on the completion (response) tokens, not the instruction tokens. Sequence packing interacts with this in a non-obvious way: after packing, you need to track which token positions in each packed sequence correspond to completions versus instructions, and set the label to -100 for all instruction token positions. TRL's DataCollatorForCompletionOnlyLM handles the -100 masking for unpacked training, but it is not compatible with packing=True, so when you combine packing with completion-only loss you must build the labels yourself: construct labels with -100 for instruction tokens before packing, then concatenate labels alongside input_ids during the packing step so that label assignments survive the packing transformation.
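
A minimal sketch of labels-aware packing, assuming each example arrives as an (input_ids, labels) pair with -100 already set on instruction tokens (pack_with_labels is a hypothetical helper that mirrors greedy_pack_sequences above):

from typing import Dict, List, Tuple

def pack_with_labels(
    examples: List[Tuple[List[int], List[int]]],  # (input_ids, labels) pairs
    max_length: int = 2048,
    eos_token_id: int = 2,
    pad_token_id: int = 0,
) -> List[Dict[str, List[int]]]:
    packed, cur_ids, cur_labels = [], [], []
    for input_ids, labels in examples:
        ids = (input_ids + [eos_token_id])[:max_length]
        # The EOS closes the completion, so it stays in the loss
        labs = (labels + [eos_token_id])[:max_length]
        if cur_ids and len(cur_ids) + len(ids) > max_length:
            pad = max_length - len(cur_ids)
            packed.append({
                "input_ids": cur_ids + [pad_token_id] * pad,
                "attention_mask": [1] * len(cur_ids) + [0] * pad,
                "labels": cur_labels + [-100] * pad,
            })
            cur_ids, cur_labels = [], []
        cur_ids += ids
        cur_labels += labs
    if cur_ids:  # final partial pack
        pad = max_length - len(cur_ids)
        packed.append({
            "input_ids": cur_ids + [pad_token_id] * pad,
            "attention_mask": [1] * len(cur_ids) + [0] * pad,
            "labels": cur_labels + [-100] * pad,
        })
    return packed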

When Packing Is Not Worth It

Packing adds implementation complexity and has a correctness risk (attention leakage) that vanilla padding training does not have. It is clearly worth the effort when padding waste exceeds 40–50% — at that level, the throughput gain is large enough to justify the complexity. Below 20–30% padding waste, the engineering cost often exceeds the training cost savings, particularly for short training runs or when using cloud spot instances where the marginal GPU hour cost is low. Datasets that are already close to the maximum context length — pretraining on long documents, or fine-tuning with very long instruction-response pairs — have negligible padding waste and gain almost nothing from packing. Datasets with extremely variable length distributions (some sequences 50 tokens, others 1,900 tokens) benefit most from packing and also benefit the most from bucket-based sorting before packing. When in doubt, run the padding waste analysis first — if waste is below 30%, spend the engineering time elsewhere.

Packing and Gradient Accumulation: Getting the Math Right

Sequence packing changes the effective batch size in a way that interacts with gradient accumulation in a subtle but important manner. With standard padding, each training step processes batch_size × max_length token positions, but only a fraction are real tokens contributing to the loss. With packing, nearly all positions are real tokens, so each gradient step averages over far more real tokens and represents a larger, less noisy batch. If you were previously using gradient accumulation to reach an effective batch of N sequences, switching to packing gives you a much larger effective token batch at the same step count, and hyperparameters tuned for the smaller token batch may no longer be optimal. In practice, monitor your loss curve carefully after enabling packing and be prepared to adjust the learning rate or gradient accumulation steps if you see instability or faster-than-expected convergence that leads to overfitting.
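
A quick worked example, using the batch settings from the TRL configuration above and the 75% waste figure from the introduction:

batch_size, max_length, grad_accum = 4, 2048, 4
positions = batch_size * max_length * grad_accum   # 32,768 positions per optimiser step

padding_waste = 0.75                               # from the padding analysis
real_padded = positions * (1 - padding_waste)      # ~8,192 real tokens per step
real_packed = positions * 0.95                     # ~31,130 at 95% packing efficiency
print(f"{real_packed / real_padded:.1f}x more real tokens per step")  # ~3.8x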

The interaction also affects how you should think about epoch counts. With packing, one epoch over the dataset still visits every token roughly once, just as without packing, but each gradient step covers more real tokens, so the same epoch takes fewer, larger steps. A training run that needed 3 epochs without packing may converge in fewer with packing at the same learning rate and batch size, because the larger token batches produce less noisy gradients. This is generally desirable, but it means that hyperparameters tuned without packing do not transfer directly to a packed training run and benefit from at least a brief validation sweep.

Multi-Node Packing Considerations

In distributed training across multiple GPUs or nodes, packing must be applied consistently across all ranks. If you pack the dataset on a single process before distributing it, each rank receives pre-packed sequences and the distributed training proceeds normally. If you pack on-the-fly in a DistributedSampler setup, ensure that the packing logic is deterministic and seeded so all ranks pack the same sequences in the same way — inconsistent packing across ranks produces batches with different token counts per rank, which causes gradient synchronisation errors with FSDP or DDP when loss scaling assumes uniform batch sizes. The safest approach is to pre-pack the entire dataset once, save it to disk as a Hugging Face Dataset, and use standard distributed sampling on the pre-packed dataset. This adds a one-time preprocessing cost but eliminates all distributed packing edge cases during training.
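
A minimal sketch of that workflow, assuming token_ids_list is your tokenised corpus and reusing bucket_pack_sequences from above:

from datasets import Dataset, load_from_disk

# One-time preprocessing, run on a single process:
# packed = bucket_pack_sequences(token_ids_list, max_length=2048)
# Dataset.from_list(packed).save_to_disk("./packed_dataset")

# At training time, every rank loads the identical pre-packed data and
# relies on standard distributed sampling:
# packed_ds = load_from_disk("./packed_dataset")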
