Text Data Augmentation for LLM Training: Techniques That Actually Work
A practical guide to text data augmentation for ML engineers: why text augmentation is harder than image augmentation, word-level perturbations with EDA, back-translation with MarianMT, LLM-based paraphrasing for instruction datasets, embedding-space Mixup for classification, and how to verify empirically that augmentation improves model quality rather than degrading it.