AdamW vs Adafactor vs Lion: Choosing an Optimizer for LLM Training
A practical guide to optimizers for LLM training: how AdamW works and why decoupled weight decay matters, the memory cost problem at 7B to 70B scale, Adafactor's factored second moments for pretraining, 8-bit Adam as a drop-in memory reduction, Lion's sign-based updates and their hyperparameter tradeoffs, and a decision framework for matching optimizer to training scale and budget.
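As a quick orientation before the detailed sections, here is a minimal sketch of a single AdamW update in PyTorch. It is an illustration under stated assumptions, not a library implementation: the function name, signature, and default hyperparameters are invented for this example, and the state is kept in fp32. It highlights the two things the rest of the guide turns on: decoupled weight decay applied directly to the weights, and the two per-parameter state tensors that drive optimizer memory cost at scale.

```python
import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-4, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    """One AdamW update on a single tensor (illustrative sketch).

    The two fp32 state tensors (exp_avg, exp_avg_sq) are what make Adam-family
    optimizers expensive at 7B+ scale: roughly 8 extra bytes per parameter.
    """
    # Decoupled weight decay: shrink the weights directly, scaled only by lr,
    # instead of folding weight_decay * param into the gradient (classic L2).
    param.mul_(1 - lr * weight_decay)

    # Exponential moving averages of the gradient and its square.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction, then the usual Adam step.
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / bias_c1)
    return param

# Toy usage: one weight matrix with freshly initialized optimizer state.
p = torch.randn(4096, 4096)
g = torch.randn_like(p)
m, v = torch.zeros_like(p), torch.zeros_like(p)
adamw_step(p, g, m, v, step=1)
```

The later sections on Adafactor, 8-bit Adam, and Lion can be read as different answers to the same question: how much of `exp_avg` and `exp_avg_sq` can be factored, quantized, or dropped without hurting training.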