Attention Mechanisms Explained: From Scaled Dot-Product to GQA
A practical guide to how attention actually works, covering scaled dot-product, multi-head, MQA, GQA, Flash Attention, and RoPE, and what each means for memory, throughput, and context length in production deployments.