ReLU vs GELU vs SiLU: Activation Functions for Deep Learning and LLMs
A practical comparison of activation functions for ML engineers: ReLU and the dying-neuron problem, GELU and why transformers switched away from ReLU, SiLU (Swish) and its efficiency trade-offs, the gated SwiGLU and GeGLU variants used in Llama and PaLM, why gated feed-forward layers use a 2/3 expansion factor, and a decision guide for choosing activations in CNNs versus transformer architectures.
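To make the comparison concrete, here is a minimal sketch (assuming PyTorch) of the three activations and a SwiGLU-style gated feed-forward block with the 2/3 hidden-size scaling; the SwiGLUFeedForward class and its dimension choices are illustrative, not taken from any particular library or model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

relu = F.relu(x)   # max(0, x): cheap, but zero gradient for x < 0 (dying neurons)
gelu = F.gelu(x)   # x * Phi(x): smooth, the default in BERT/GPT-style FFNs
silu = F.silu(x)   # x * sigmoid(x): equivalent to Swish with beta = 1


class SwiGLUFeedForward(nn.Module):
    """Illustrative gated FFN in the style used by Llama and PaLM.

    Computes down(SiLU(x @ W_gate) * (x @ W_up)). Because it has three weight
    matrices instead of two, the hidden size is scaled by 2/3 to keep the
    parameter count comparable to a plain 4*d_model FFN.
    """

    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        hidden = int(2 * expansion * d_model / 3)   # the "2/3" factor
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


ffn = SwiGLUFeedForward(d_model=512)
out = ffn(torch.randn(2, 16, 512))   # (batch, seq, d_model) -> same shape
```

Swapping `F.silu` for `F.gelu` in the forward pass gives the GeGLU variant; everything else stays the same.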