How to Filter and Deduplicate Pretraining Data for LLMs
A practical guide to LLM pretraining data pipelines. It covers language identification with fastText; heuristic quality filtering using character-to-word ratios, symbol ratios, and repeated-line detection; perplexity-based filtering with KenLM to catch templated and garbled text; MinHash LSH deduplication with datasketch; exact substring deduplication with suffix arrays; building a full pipeline with Hugging Face datatrove, including Gopher and C4 quality filters; training a fastText classifier for quality scoring; and balancing the data mix across web, books, and code sources.
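
To make the heuristic filtering step concrete, here is a minimal pure-Python sketch of the three signals named above: character-to-word ratio, symbol ratio, and repeated-line detection. The function names, the symbol list, and every threshold are illustrative assumptions rather than values taken from this guide; real pipelines tune them per language and per source.

```python
from collections import Counter

# Assumption: symbols counted for the symbol ratio (hash marks and ellipses).
SYMBOLS = ("#", "...", "\u2026")


def char_to_word_ratio(text: str) -> float:
    """Mean number of characters per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def symbol_ratio(text: str) -> float:
    """Occurrences of SYMBOLS divided by the number of words."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(text.count(s) for s in SYMBOLS)
    return hits / len(words)


def repeated_line_fraction(text: str) -> float:
    """Fraction of non-empty lines whose content appears more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(lines)


def passes_heuristics(
    text: str,
    min_ratio: float = 3.0,      # assumed lower bound on chars per word
    max_ratio: float = 10.0,     # assumed upper bound on chars per word
    max_symbol: float = 0.1,     # assumed max symbols per word
    max_repeated: float = 0.3,   # assumed max fraction of repeated lines
) -> bool:
    """Return True if the document clears all three heuristic checks."""
    if not text.split():
        return False  # empty or whitespace-only documents are dropped
    ratio = char_to_word_ratio(text)
    return (
        min_ratio <= ratio <= max_ratio
        and symbol_ratio(text) <= max_symbol
        and repeated_line_fraction(text) <= max_repeated
    )
```

In this sketch each check is a plain function so it can be logged or thresholded separately; a document is kept only if all three pass, which makes it easy to see which rule rejected a given sample.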