Weight Initialization in Deep Learning: Xavier, Kaiming, and Why It Matters
A practical guide to weight initialization for ML engineers, covering: why poor initialization causes vanishing and exploding gradients; Xavier initialization for tanh and linear activations; Kaiming initialization for ReLU networks; GPT-2-style scaled residual initialization for LLMs; embedding initialization; and a concrete checklist for initializing custom architectures correctly.
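As a minimal sketch of the two classic schemes named above, the snippet below implements Xavier (Glorot) uniform and Kaiming (He) normal initialization in plain NumPy; the function names and the 2D weight shape are illustrative assumptions, not part of any particular library's API.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    # Xavier/Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
    # Balances forward and backward variance for tanh/linear activations.
    rng = rng or np.random.default_rng(0)
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

def kaiming_normal(fan_in, fan_out, rng=None):
    # Kaiming/He normal: N(0, 2 / fan_in). The factor of 2 compensates for
    # ReLU zeroing roughly half of the pre-activations on average.
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# The sampled standard deviation should sit near the target for each scheme.
W_x = xavier_uniform(1024, 1024)
W_k = kaiming_normal(1024, 1024)
print(W_x.std(), W_k.std())
```

In practice you would reach for a framework's built-ins (e.g. PyTorch's `torch.nn.init.xavier_uniform_` and `torch.nn.init.kaiming_normal_`) rather than hand-rolling these, but the formulas are exactly this simple.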