How to Filter and Deduplicate Pretraining Data for LLMs

The quality of pretraining data often matters as much as its quantity. A model trained on a smaller, carefully filtered corpus can match or outperform one trained on a much larger raw web crawl on many downstream benchmarks. This is the recurring lesson from the data ablation work behind LLaMA, Falcon, and Dolma. For practitioners building domain-specific pretraining or continued pretraining pipelines, the data pipeline (filtering, deduplication, quality scoring) is where most of the leverage is. This article covers the key filtering steps, how to implement them, and how to build a reproducible pipeline with the datatrove library.

The Pretraining Data Pipeline: Overview

A standard pretraining data pipeline has four stages: acquisition (downloading web crawls or domain-specific corpora), language filtering (keeping only target-language documents), quality filtering (removing low-quality, boilerplate, or toxic content), and deduplication (removing near-duplicate documents). Each stage removes a significant fraction of the raw data — a typical pipeline reduces a raw Common Crawl dump by 70-90% before it is usable for pretraining. The order matters: deduplication after quality filtering is more efficient than before, because low-quality documents tend to be heavily duplicated (SEO spam, boilerplate text) and filtering them first shrinks the deduplication problem.

Language Identification

For most pretraining pipelines targeting English, the first filter is language identification. Common Crawl contains text in hundreds of languages; filtering to a single language removes the noise that multilingual text introduces for a monolingual model. FastText's language identification model (lid.176.bin) is the standard tool: it processes thousands of documents per second on a single CPU core, parallelises trivially across cores, and achieves high accuracy on web text.

import fasttext
import os

class LanguageFilter:
    """
    Filter documents by detected language using FastText lid model.
    Download: https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
    """
    def __init__(self, model_path: str, target_lang: str = 'en',
                 min_confidence: float = 0.65):
        self.model = fasttext.load_model(model_path)
        self.target_lang = target_lang
        self.min_confidence = min_confidence

    def predict(self, text: str) -> tuple[str, float]:
        # FastText expects single-line input
        clean = text.replace('\n', ' ').strip()
        labels, scores = self.model.predict(clean[:512], k=1)
        lang = labels[0].replace('__label__', '')
        return lang, float(scores[0])

    def keep(self, text: str) -> bool:
        lang, conf = self.predict(text)
        return lang == self.target_lang and conf >= self.min_confidence

# Usage
lang_filter = LanguageFilter('lid.176.bin', target_lang='en', min_confidence=0.65)
docs = [d for d in raw_docs if lang_filter.keep(d['text'])]

Heuristic Quality Filtering

After language filtering, heuristic quality filters remove documents that are technically in the right language but are not useful training data: very short documents, documents with high symbol-to-word ratios (navigation menus, code-heavy pages with no prose), documents with abnormally high or low character-to-word ratios (indicating CJK text that passed the language filter, or garbled encoding), and documents that are mostly repeated lines (boilerplate, footers, disclaimers copied many times).

import re
from collections import Counter

def heuristic_quality_score(text: str) -> dict:
    """Compute heuristic quality signals for a document."""
    lines  = text.splitlines()
    words  = text.split()
    chars  = len(text)
    n_words = len(words)

    if n_words == 0:
        return {'keep': False, 'reason': 'empty'}

    # Basic length filter
    if n_words < 50:
        return {'keep': False, 'reason': 'too_short'}
    if n_words > 100_000:
        return {'keep': False, 'reason': 'too_long'}

    # Character-to-word ratio (detects non-Latin scripts, garbled text)
    char_word_ratio = chars / n_words
    if char_word_ratio < 3 or char_word_ratio > 15:
        return {'keep': False, 'reason': 'bad_char_word_ratio'}

    # Symbol ratio: fraction of non-alphanumeric, non-space chars
    n_symbols = sum(1 for c in text if not c.isalnum() and not c.isspace())
    symbol_ratio = n_symbols / chars
    if symbol_ratio > 0.15:
        return {'keep': False, 'reason': 'high_symbol_ratio'}

    # Repeated line ratio: fraction of lines that are duplicates
    line_counts = Counter(l.strip() for l in lines if l.strip())
    n_lines = sum(line_counts.values())
    n_unique = len(line_counts)
    repeat_ratio = 1 - (n_unique / max(n_lines, 1))
    if repeat_ratio > 0.3:
        return {'keep': False, 'reason': 'high_repeat_ratio'}

    # Bullet/list ratio: too many short lines suggests navigation/menu
    short_line_ratio = sum(1 for l in lines if 0 < len(l.strip()) < 30) / max(len(lines), 1)
    if short_line_ratio > 0.6:
        return {'keep': False, 'reason': 'mostly_short_lines'}

    return {'keep': True, 'char_word_ratio': char_word_ratio,
            'symbol_ratio': symbol_ratio, 'repeat_ratio': repeat_ratio}

Perplexity-Based Quality Filtering

Heuristic filters catch the obvious junk, but they miss subtler quality problems: grammatically correct but semantically empty text, machine-translated content, SEO-optimised keyword stuffing. Perplexity filtering — scoring each document with a small reference language model and removing documents with very high or very low perplexity — catches these cases. High perplexity indicates text that is far from natural language (garbled, non-English, or highly technical without context). Unusually low perplexity indicates templated or highly repetitive text that looks fluent but carries little information. KenLM trained on Wikipedia is the standard reference model for this filter.

import kenlm
import math

class PerplexityFilter:
    """
    Filter documents by perplexity under a KenLM language model.
    Reference model typically trained on Wikipedia or a clean corpus.
    Install: pip install https://github.com/kpu/kenlm/archive/master.zip
    """
    def __init__(self, model_path: str,
                 min_perplexity: float = 10,
                 max_perplexity: float = 1000):
        self.model = kenlm.Model(model_path)
        self.min_pp = min_perplexity
        self.max_pp = max_perplexity

    def perplexity(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return float('inf')
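        log_prob = self.model.score(' '.join(words), bos=True, eos=True)
        # KenLM returns a log10 probability. With eos=True the score also
        # covers the </s> token, so normalise by len(words) + 1, which is
        # what kenlm's own Model.perplexity() does.
        n = len(words) + 1
        return 10 ** (-log_prob / n)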

    def keep(self, text: str) -> bool:
        pp = self.perplexity(text)
        return self.min_pp <= pp <= self.max_pp

# Calibrate thresholds on a sample before applying to full corpus
def calibrate_perplexity_thresholds(ppl_filter, sample_docs, low_pct=5, high_pct=95):
    import numpy as np
    scores = [ppl_filter.perplexity(d) for d in sample_docs]
    scores = [s for s in scores if s != float('inf')]
    lo = np.percentile(scores, low_pct)
    hi = np.percentile(scores, high_pct)
    print(f"Suggested thresholds: min={lo:.1f}, max={hi:.1f}")
    return lo, hi
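
To make the perplexity formula concrete without a KenLM binary, here is a minimal sketch that scores text under a toy add-one-smoothed unigram model. The UnigramLM class and the tiny reference corpus are illustrative assumptions, not part of KenLM. The point is only that perplexity is the exponentiated average negative log-probability per token, so text resembling the reference corpus scores lower than garbled text.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (illustrative stand-in for KenLM)."""

    def __init__(self, reference_text: str):
        tokens = reference_text.lower().split()
        self.counts = Counter(tokens)
        self.total = len(tokens)
        # The extra +1 reserves one vocabulary slot for unseen words
        self.vocab_size = len(self.counts) + 1

    def log10_prob(self, word: str) -> float:
        p = (self.counts.get(word, 0) + 1) / (self.total + self.vocab_size)
        return math.log10(p)

    def perplexity(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return float("inf")
        total_log10 = sum(self.log10_prob(w) for w in words)
        # Perplexity = 10 ** (-average log10 probability per token)
        return 10 ** (-total_log10 / len(words))


reference = "the cat sat on the mat and the dog sat on the rug"
lm = UnigramLM(reference)
in_domain = lm.perplexity("the cat sat on the rug")
garbled = lm.perplexity("zxqv wplk qqrt mmnb")
print(f"in-domain: {in_domain:.1f}, garbled: {garbled:.1f}")
```

A real n-gram model conditions on preceding words, which is what lets it also flag fluent-looking but templated text at the low end of the range.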

MinHash Deduplication

Near-duplicate deduplication removes documents that are very similar to each other — not byte-for-byte identical, but sharing large overlapping n-grams. The standard approach is MinHash LSH (Locality-Sensitive Hashing): each document is represented as a MinHash signature computed from its character n-grams, and documents whose signatures are sufficiently similar are grouped as near-duplicates. Only one document per near-duplicate cluster is kept.

from datasketch import MinHash, MinHashLSH
import re

def get_ngrams(text: str, n: int = 5) -> set:
    """Character n-grams for MinHash computation."""
    text = re.sub(r'\s+', ' ', text.lower()).strip()
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for ngram in get_ngrams(text):
        m.update(ngram.encode('utf-8'))
    return m

def deduplicate_corpus(docs: list[dict],
                        text_key: str = 'text',
                        threshold: float = 0.7,
                        num_perm: int = 128) -> list[dict]:
    """
    Remove near-duplicate documents using MinHash LSH.
    threshold: Jaccard similarity above which docs are considered duplicates.
    Returns deduplicated list.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []

    for i, doc in enumerate(docs):
        mh = build_minhash(doc[text_key], num_perm)
        result = lsh.query(mh)
        if not result:  # no near-duplicates found
            lsh.insert(str(i), mh)
            kept.append(doc)
        # else: near-duplicate of an already-kept document — skip

    pct = 100 * len(kept) / max(len(docs), 1)  # guard against an empty input list
    print(f"Kept {len(kept)}/{len(docs)} after dedup ({pct:.1f}%)")
    return kept

MinHash LSH scales to billions of documents when implemented in a distributed framework like Spark or with the datatrove library's built-in deduplication step. The datasketch implementation above keeps the whole index in memory on a single machine, which makes it a good fit for corpora up to a few million documents. Memory grows linearly with corpus size: with 128 permutations of 64-bit hashes, the band keys alone take roughly 1 KB per document before Python object overhead, so an index over 100M documents needs well over 100 GB of RAM.
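
To show why comparing signatures works, the following is a minimal from-scratch sketch of MinHash using only the standard library. The helper names (shingles, minhash_signature, estimate_jaccard) and the use of salted BLAKE2b as the hash family are assumptions made for illustration, not how datasketch implements it. The fraction of signature positions on which two documents agree is an estimate of the Jaccard similarity of their n-gram sets.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of whitespace-normalised, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of item, salted with seed."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each seeded hash function, keep the minimum hash over the set.

    Assumes items is non-empty (text of at least n characters).
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)


doc_a = "The quick brown fox jumps over the lazy dog near the river bank."
doc_b = "The quick brown fox jumps over the lazy dog near the river bend."
doc_c = "Completely unrelated text about pretraining data pipelines."
set_a, set_b, set_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (set_a, set_b, set_c))

print("a vs b:", estimate_jaccard(sig_a, sig_b), "exact:", exact_jaccard(set_a, set_b))
print("a vs c:", estimate_jaccard(sig_a, sig_c), "exact:", exact_jaccard(set_a, set_c))
```

The estimate's error shrinks as the number of hash functions grows, which is the trade-off behind choosing num_perm.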

Exact Deduplication with Suffix Arrays

MinHash catches near-duplicates at the document level. Exact substring deduplication, which finds and removes repeated passages that appear verbatim across many documents, requires a different approach. The standard tool is suffix array construction on the concatenated corpus, which identifies all repeated substrings of length above a threshold. The paper "Deduplicating Training Data Makes Language Models Better" (Lee et al.) showed that web corpora contain many long passages repeated verbatim across documents, and that removing them reduces how often a model emits memorised text and reduces train-test overlap without hurting perplexity. For most practical pipelines, MinHash deduplication is sufficient; exact substring deduplication is worth the additional complexity for very large corpora where data quality is critical.
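
As a toy illustration of the suffix array idea, the sketch below sorts all suffixes of a small concatenated corpus and compares neighbours: any two adjacent suffixes that share a long common prefix reveal a repeated passage. The function names and the naive sort-based construction are assumptions for demonstration only. Production tools use linear-time construction over tokenised or byte-level corpora, which this does not attempt.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive suffix array: sort suffix start positions by the suffix text.

    Quadratic-or-worse in time and memory; fine for a demo, not for a corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = 0
    while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
        n += 1
    return n


def repeated_substrings(text: str, min_len: int) -> set[str]:
    """Repeated substrings of at least min_len characters.

    Adjacent entries in the suffix array share the longest prefixes, so one
    pass over neighbouring pairs is enough to surface every long repeat.
    """
    sa = build_suffix_array(text)
    found = set()
    for a, b in zip(sa, sa[1:]):
        lcp = common_prefix_len(text, a, b)
        if lcp >= min_len:
            found.add(text[a:a + lcp])
    return found


docs = [
    "Intro A. Subscribe to our newsletter for updates.",
    "Other text B. Subscribe to our newsletter for updates.",
]
corpus = "\n".join(docs)
repeats = repeated_substrings(corpus, min_len=20)
longest = max(repeats, key=len)
print(repr(longest))
```

The result also contains shorter tails of the same passage; a real pipeline would merge overlapping spans and then cut all but one occurrence.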

Building a Pipeline with datatrove

datatrove is Hugging Face's library for large-scale data processing pipelines, designed specifically for LLM pretraining data preparation. It handles distributed execution, checkpointing, and the standard filtering steps out of the box.

from datatrove.pipeline.readers import WarcReader, JsonlReader
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import (
    LanguageFilter, GopherQualityFilter, GopherRepetitionFilter,
    C4QualityFilter, URLFilter
)
from datatrove.pipeline.dedup import MinhashDedupSignature, MinhashDedupBuckets, MinhashDedupCluster, MinhashDedupFilter
from datatrove.pipeline.dedup.minhash import MinhashConfig
from datatrove.pipeline.writers import JsonlWriter
from datatrove.executor import LocalPipelineExecutor

# Parameter names follow recent datatrove releases; check them against your installed version.

# Stage 1: Filter raw WARC files
filter_pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader(data_folder='s3://my-bucket/raw-warc/', glob_pattern='*.warc.gz'),
        URLFilter(exclusion_writer=JsonlWriter('data/removed/urls')),
        Trafilatura(favour_precision=True),  # WARC records are raw HTML; extract text first
        LanguageFilter(language_threshold=0.65, languages=['en']),
        GopherQualityFilter(min_doc_words=50, max_doc_words=100_000),
        GopherRepetitionFilter(),
        C4QualityFilter(filter_no_terminal_punct=True),
        JsonlWriter('data/filtered/'),
    ],
    tasks=64,
    workers=8,
    logging_dir='logs/filter/',
)
filter_pipeline.run()

# Stage 2: MinHash deduplication (three-stage process)
# 14 buckets (LSH bands) x 8 hashes per bucket = 112 hashes per signature
minhash_config = MinhashConfig(num_buckets=14, hashes_per_bucket=8)

dedup_sig = LocalPipelineExecutor(
    pipeline=[
        JsonlReader('data/filtered/'),
        MinhashDedupSignature(output_folder='data/dedup_sigs/', config=minhash_config),
    ],
    tasks=64,
)
dedup_sig.run()

dedup_buckets = LocalPipelineExecutor(
    pipeline=[MinhashDedupBuckets(input_folder='data/dedup_sigs/',
                                   output_folder='data/dedup_buckets/',
                                   config=minhash_config)],
    tasks=14,  # one per LSH band
)
dedup_buckets.run()

dedup_cluster = LocalPipelineExecutor(
    pipeline=[MinhashDedupCluster(input_folder='data/dedup_buckets/',
                                   output_folder='data/dedup_clusters/',
                                   config=minhash_config)],
    tasks=1,
)
dedup_cluster.run()

# Stage 3: Apply dedup filter and write final dataset
# Use the same reader and the same number of tasks as the signature stage,
# so that document ids line up with the computed duplicate ids.
final_pipeline = LocalPipelineExecutor(
    pipeline=[
        JsonlReader('data/filtered/'),
        MinhashDedupFilter(input_folder='data/dedup_clusters/'),
        JsonlWriter('data/final/'),
    ],
    tasks=64,
)
final_pipeline.run()
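
The choice of 14 bands over 112 hashes (8 hashes per band) fixes which similarity levels the LSH step treats as duplicates. As a rough guide, the standard banding analysis gives the probability that two documents with Jaccard similarity s collide in at least one band as 1 - (1 - s^r)^b, for b bands of r hashes each. The helper below is a hypothetical name introduced for illustration, and the formula assumes ideal, independent hash functions.

```python
def lsh_match_probability(similarity: float, bands: int, rows: int) -> float:
    """P(at least one band collides) = 1 - (1 - s^r)^b for Jaccard similarity s."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Tabulate the curve for the 14-band, 8-hashes-per-band setting used above
for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    p = lsh_match_probability(s, bands=14, rows=8)
    print(f"similarity={s:.1f} -> match probability={p:.3f}")
```

The curve is S-shaped: pairs well below the knee almost never collide and pairs well above it almost always do. More hashes per band push the knee towards higher similarity, and more bands pull it lower.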

datatrove's built-in filters implement the Gopher quality heuristics from the DeepMind paper and the C4 filters from the T5 paper, which together cover most of the standard quality filtering steps. For custom filters, subclass BaseFilter and implement the filter method — it receives a Document object and returns True to keep or False to remove.

Classifier-Based Quality Filtering

Beyond heuristics, a classifier trained on high-quality versus low-quality text can catch subtler quality issues. The standard approach trains a fastText classifier on positive examples (Wikipedia, books, curated web content) versus negative examples (random web text) and uses the classifier score as a quality threshold. The GPT-3 data pipeline, DCLM, and FineWeb-Edu all use variants of this approach.

import fasttext

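def train_quality_classifier(positive_file: str, negative_file: str,
                              output_model: str = 'quality_classifier.bin') -> None:
    """
    Train a fastText binary classifier for document quality.
    positive_file: one high-quality document per line (raw text, no label)
    negative_file: one low-quality document per line (raw text, no label)
    Labels are added here, so the input files must not already contain them.
    """
    import os, random, tempfile
    # Label each document and truncate it to 512 characters.
    # Everything is held in memory so it can be shuffled; subsample first if the files are large.
    examples = []
    with open(positive_file, encoding='utf-8') as pf:
        examples += [f'__label__positive {line.strip()[:512]}' for line in pf if line.strip()]
    with open(negative_file, encoding='utf-8') as nf:
        examples += [f'__label__negative {line.strip()[:512]}' for line in nf if line.strip()]
    # Shuffle so positives and negatives are interleaved during training
    random.Random(0).shuffle(examples)
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False,
                                     encoding='utf-8') as f:
        train_file = f.name
        f.write('\n'.join(examples) + '\n')
    try:
        model = fasttext.train_supervised(
            input=train_file, epoch=3, lr=0.1,
            wordNgrams=2, verbose=2, minCount=1
        )
        model.save_model(output_model)
    finally:
        os.unlink(train_file)  # remove the temporary file even if training fails
    print(f"Saved quality classifier to {output_model}")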

def apply_quality_classifier(model_path: str, docs: list[dict],
                              threshold: float = 0.7,
                              text_key: str = 'text') -> list[dict]:
    model = fasttext.load_model(model_path)
    kept = []
    for doc in docs:
        text = doc[text_key].replace('\n', ' ').strip()[:512]
        labels, scores = model.predict(text, k=1)
        if labels[0] == '__label__positive' and scores[0] >= threshold:
            kept.append(doc)
    return kept

Data Mix and Domain Balance

Filtering and deduplication determine what data is available; the data mix determines what fraction of training tokens comes from each source. Most pretraining pipelines over-represent high-quality sources relative to their raw size: web crawl text might constitute 70-80% of the available filtered data but only 45-55% of training tokens, with the remainder coming from books, code, scientific papers, and curated web sources. The Llama 3 and Falcon papers both provide their data mix ratios, which serve as useful baselines — the details vary, but the common theme is systematic upweighting of sources that contain dense, high-quality prose and code over raw web text. For domain-specific continued pretraining, the same principle applies: over-represent the target domain relative to its size in the general corpus, typically by repeating domain documents two to three times while mixing in general web text to prevent catastrophic forgetting of broad capabilities.
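
One way to sanity-check a proposed mix is to convert it into the number of passes each source would receive. The sketch below does that arithmetic; the function name, the source names, the token counts, and the mix fractions are all hypothetical values chosen for illustration, not figures from any published model.

```python
def epochs_per_source(available_tokens: dict[str, float],
                      target_mix: dict[str, float],
                      total_training_tokens: float) -> dict[str, float]:
    """Number of passes over each source implied by a target training mix."""
    if abs(sum(target_mix.values()) - 1.0) > 1e-6:
        raise ValueError("target_mix fractions must sum to 1")
    return {
        name: target_mix[name] * total_training_tokens / available_tokens[name]
        for name in target_mix
    }


# Hypothetical corpus sizes in billions of tokens: web is 80% of what is available
available = {"web": 800.0, "books": 50.0, "code": 100.0, "papers": 50.0}
# Hypothetical target mix: web supplies only 50% of training tokens
mix = {"web": 0.50, "books": 0.15, "code": 0.20, "papers": 0.15}

epochs = epochs_per_source(available, mix, total_training_tokens=1000.0)
for name, n in epochs.items():
    print(f"{name}: {n:.2f} epochs")
```

Values above 1 mean the source is repeated and values below 1 mean it is subsampled, which makes it easy to spot a mix that would repeat a small source more often than intended.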

Tokeniser-Aware Filtering

A less discussed but practically important filtering step is tokeniser-aware quality checking. Some documents look clean by heuristic and perplexity measures but produce very long tokenisations relative to their text length — this happens with documents that contain unusual Unicode characters, emoji-heavy text, or language patterns the tokeniser handles inefficiently. Documents with a high token-to-character ratio inflate the effective training cost per document without adding proportional information. A simple check computes the ratio of token count to character count and removes documents in the top few percentiles of this ratio.

from transformers import AutoTokenizer

def compute_token_char_ratio(text: str, tokenizer) -> float:
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    n_chars  = len(text)
    return n_tokens / max(n_chars, 1)

def filter_by_token_ratio(docs: list[dict],
                           tokenizer_name: str = 'meta-llama/Llama-3.2-3B',
                           max_ratio: float = 0.4,
                           text_key: str = 'text') -> list[dict]:
    """
    Remove documents where token count is unusually high relative to character count.
    max_ratio=0.4 means at most 0.4 tokens per character (English averages ~0.25).
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    kept = []
    for doc in docs:
        ratio = compute_token_char_ratio(doc[text_key], tokenizer)
        if ratio <= max_ratio:
            kept.append(doc)
    print(f"Token ratio filter: kept {len(kept)}/{len(docs)}")
    return kept
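
The function above uses a fixed max_ratio. A percentile-based cutoff, as described in the text, can be sketched as follows. The names percentile_threshold, filter_top_ratio, and toy_count_tokens are hypothetical, and the toy token counter is only a stand-in so the example runs without downloading a tokenizer. In a real pipeline you would pass a callable that wraps tokenizer.encode.

```python
import math
from typing import Callable


def percentile_threshold(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_top_ratio(docs: list[str],
                     count_tokens: Callable[[str], int],
                     keep_pct: float = 95.0) -> list[str]:
    """Drop documents whose token-to-character ratio exceeds the keep_pct percentile."""
    ratios = [count_tokens(d) / max(len(d), 1) for d in docs]
    cutoff = percentile_threshold(ratios, keep_pct)
    return [d for d, r in zip(docs, ratios) if r <= cutoff]


def toy_count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per word plus one per non-ASCII character."""
    non_ascii = sum(1 for c in text if ord(c) > 127)
    return len(text.split()) + non_ascii


normal_docs = [f"This is ordinary English sentence number {i} in the sample."
               for i in range(19)]
emoji_doc = "\U0001F389" * 10  # ten emoji, each counted as its own token
kept = filter_top_ratio(normal_docs + [emoji_doc], toy_count_tokens, keep_pct=95.0)
print(f"kept {len(kept)} of {len(normal_docs) + 1} documents")
```

Computing the cutoff from a sample makes the filter adapt to the tokenizer and corpus, at the cost of always removing a fixed share of documents even when none is pathological.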

PII and Safety Filtering

Pretraining data from the web inevitably contains personally identifiable information (PII) — email addresses, phone numbers, physical addresses, and in some cases social security numbers or medical information embedded in forum posts and leaked documents. Training on PII increases the risk of the model regurgitating it verbatim, which creates legal and privacy exposure. The standard approach is regex-based detection and redaction: identify PII patterns, replace them with placeholder tokens, and keep the document with redacted content rather than removing it entirely.

import re

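PII_PATTERNS = [
    # Email addresses
    (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '<EMAIL>'),
    # North American phone numbers, with optional +1 prefix
    (r'\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',  '<PHONE>'),
    # US social security numbers
    (r'\b\d{3}-\d{2}-\d{4}\b',                                       '<SSN>'),
    # Visa, Mastercard, and Amex card numbers written without separators
    (r'\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b', '<CREDIT_CARD>'),
]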

def redact_pii(text: str) -> tuple[str, int]:
    n_redactions = 0
    for pattern, replacement in PII_PATTERNS:
        new_text, n = re.subn(pattern, replacement, text)
        n_redactions += n
        text = new_text
    return text, n_redactions

Tracking Filtering Statistics

A filtering pipeline without statistics is a black box. For each filtering step, tracking the number of documents removed and the reason allows you to diagnose problems quickly — if language filtering removes 40% of what you expected to be an English corpus, something upstream is wrong. Logging per-step removal rates and sampling removed documents for manual inspection is the minimum viable observability for a pretraining data pipeline.

The key metrics to track per pipeline run are: raw document count, documents surviving each filter stage, fraction of tokens surviving versus fraction of documents (these differ significantly when long documents are preferentially removed or kept), and the deduplication rate. A deduplication rate below 5% suggests the corpus is already fairly unique; above 30% suggests significant web spam or template-heavy content that may warrant stricter heuristic filters before deduplication to reduce the dedup job size. Storing these statistics alongside the dataset version in a data card makes it straightforward to reproduce or audit dataset quality decisions later.
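
A minimal sketch of this kind of bookkeeping is shown below. The run_with_stats function and the (keep, reason) filter signature are assumptions made for illustration, and word counts stand in for token counts to keep the example free of tokenizer dependencies.

```python
from collections import Counter
from typing import Callable, Iterable

# A filter takes a document's text and returns (keep, reason)
FilterFn = Callable[[str], tuple[bool, str]]


def run_with_stats(docs: Iterable[str],
                   filters: list[tuple[str, FilterFn]]) -> tuple[list[str], dict]:
    """Apply filters in order, recording documents and words removed per stage."""
    stats = {"raw_docs": 0, "raw_words": 0,
             "removed_docs": Counter(), "removed_words": Counter(),
             "reasons": Counter()}
    kept = []
    for text in docs:
        n_words = len(text.split())
        stats["raw_docs"] += 1
        stats["raw_words"] += n_words
        for stage, fn in filters:
            keep, reason = fn(text)
            if not keep:
                # Attribute the removal to the first stage that rejects the document
                stats["removed_docs"][stage] += 1
                stats["removed_words"][stage] += n_words
                stats["reasons"][f"{stage}:{reason}"] += 1
                break
        else:
            kept.append(text)
    stats["kept_docs"] = len(kept)
    stats["kept_words"] = stats["raw_words"] - sum(stats["removed_words"].values())
    return kept, stats


sample = [
    "short",
    "a long enough document with plenty of words in it",
    "another long document 12345 with digits inside of it",
]
demo_filters = [
    ("length", lambda t: (len(t.split()) >= 5, "too_short")),
    ("digits", lambda t: (not any(c.isdigit() for c in t), "has_digits")),
]
kept_docs, stats = run_with_stats(sample, demo_filters)
print(stats)
```

Comparing kept_docs / raw_docs with kept_words / raw_words shows at a glance whether a stage is preferentially removing short or long documents.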

Continued Pretraining on Domain Data: Practical Considerations

Most practitioners encounter pretraining data pipelines not in the context of training a model from scratch, but in continued pretraining — taking an existing pretrained model and training it further on a domain-specific corpus to improve its performance on domain tasks before fine-tuning. The data pipeline for continued pretraining is the same as for pretraining from scratch, but with two important differences in how you think about filtering and mixing.

First, the quality bar for domain data can be lower than for general pretraining data, because the model already has strong language priors from its initial training. Documents that are useful for teaching domain knowledge (internal wikis, technical documentation, research papers with dense notation) may score poorly on general heuristic quality filters: high symbol ratio from LaTeX, short lines from code, and high perplexity on domain jargon that the reference KenLM model has not seen. Calibrating your filters on a sample of domain data rather than applying general-web thresholds directly prevents you from accidentally filtering out the most valuable domain content.

Second, the mixing ratio between domain data and general data matters significantly for preventing catastrophic forgetting. Training exclusively on domain data tends to erode general capabilities, and the loss grows with the share of training tokens that are domain-specific. A common starting point is to mix 70-80% domain data with 20-30% general high-quality text (Wikipedia, books, or a curated web sample), which preserves most general capabilities while achieving meaningful domain adaptation; treat the ratio as something to validate on held-out general benchmarks. If you have a fixed domain corpus that is small relative to your compute budget, repeating the domain data two to three times is preferable to increasing the domain fraction beyond 80%, since the repetition penalty on generalisation is less severe than extreme domain concentration.
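
The mixing heuristic translates directly into a token budget. The sketch below works through the arithmetic for a hypothetical 2B-token domain corpus repeated three times at a 75% domain fraction; the function name and all numbers are illustrative assumptions.

```python
def continued_pretraining_budget(domain_tokens: float,
                                 domain_repeats: float,
                                 domain_fraction: float) -> dict[str, float]:
    """Token budget implied by repeating a domain corpus at a target mix fraction."""
    if not 0 < domain_fraction <= 1:
        raise ValueError("domain_fraction must be in (0, 1]")
    domain_seen = domain_tokens * domain_repeats   # domain tokens the model trains on
    total = domain_seen / domain_fraction          # total tokens at the target fraction
    return {
        "domain_seen": domain_seen,
        "general_needed": total - domain_seen,     # general tokens to mix in
        "total": total,
    }


budget = continued_pretraining_budget(domain_tokens=2e9,
                                      domain_repeats=3,
                                      domain_fraction=0.75)
for key, value in budget.items():
    print(f"{key}: {value / 1e9:.1f}B tokens")
```

Running the numbers before training shows whether the general-data share you need is available after filtering, and whether the total fits the compute budget.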

Where to Start

For a new pretraining data pipeline, the practical order of operations is: implement language filtering first (highest ROI, fastest to run), then heuristic quality filters, then perplexity filtering calibrated on a sample of your domain, then MinHash deduplication. Start with datatrove's built-in Gopher and C4 filters rather than implementing your own heuristics — they are well-tested and cover the most common quality signals. Add custom filters only after inspecting a sample of what the default filters accept and reject. Reserve classifier-based quality filtering and suffix array deduplication for a second pass once you have a baseline pipeline running and have profiled where the remaining noise is coming from. The biggest quality gains in practice come from getting the basics right — length filtering, deduplication, and language identification — rather than from sophisticated classifiers applied to a corpus that still contains obvious junk.

Data quality work is iterative rather than one-shot. A reproducible, well-instrumented pipeline gives you the foundation to diagnose problems systematically and improve with each training run.
