Attention Mechanisms Explained: From Scaled Dot-Product to GQA

The attention mechanism is the core operation in every transformer-based model. Understanding how it works — not just that “the model attends to relevant tokens” — is necessary for reasoning about model behavior, memory requirements, inference speed, and the architectural trade-offs in modern LLMs. This article traces the evolution from scaled dot-product attention through multi-head attention to the grouped-query variants that dominate production deployments today.

Scaled Dot-Product Attention

The fundamental attention operation takes three inputs: queries (Q), keys (K), and values (V). The output is a weighted sum of the values, where the weights come from the similarity between each query and all keys. In matrix form: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V, where d_k is the key vector dimension.
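
To make the shapes and the scaling concrete, here is a minimal reference implementation in PyTorch. This is an illustrative sketch, not an optimized kernel; the optional mask argument is an assumption added for the causal case discussed later.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Reference attention: q, k, v have shape (..., seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query against every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (..., N_q, N_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # e.g. a causal mask
    weights = torch.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v                                          # (..., N_q, d_v)
```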

The scaling by sqrt(d_k) matters in practice. Without it, dot products grow large in magnitude as d_k increases, pushing the softmax into saturation regions with near-zero gradients — the attention becomes effectively hard, concentrating almost all weight on a single token. Dividing by sqrt(d_k) keeps dot products in a range where softmax gradients flow well during training.
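
A quick numerical illustration of that saturation effect, using hypothetical dimensions and random vectors purely for intuition:

```python
import torch

torch.manual_seed(0)
for d_k in (16, 256, 4096):
    q = torch.randn(d_k)
    k = torch.randn(8, d_k)                 # 8 candidate keys
    raw = k @ q                             # unscaled scores: spread grows like sqrt(d_k)
    scaled = raw / d_k ** 0.5               # scaled scores: spread stays around 1
    print(d_k,
          torch.softmax(raw, dim=0).max().item(),      # drifts toward 1.0 (near one-hot)
          torch.softmax(scaled, dim=0).max().item())   # stays spread across keys
```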

For a sequence of length N, the attention matrix QK^T has shape N×N. Computing it requires O(N^2) operations and O(N^2) memory. This quadratic scaling with sequence length is the fundamental cost of standard attention. For N=4,096 it’s manageable; for N=128,000 (Llama 3’s context length), materializing the full attention matrix requires special treatment — this is exactly the problem Flash Attention solves.
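
Some back-of-the-envelope arithmetic, assuming fp16 and counting a single head's attention matrix, shows where materialization stops being practical:

```python
def attn_matrix_bytes(n, bytes_per_el=2):
    """Memory to materialize one N x N attention matrix for a single head (fp16 by default)."""
    return n * n * bytes_per_el

print(attn_matrix_bytes(4_096) / 2**20)     # ~32 MiB per head: manageable
print(attn_matrix_bytes(128_000) / 2**30)   # ~30.5 GiB per head: cannot be materialized
```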

Multi-Head Attention

Multi-head attention runs scaled dot-product attention in parallel across H independent heads, each with its own learned projections. For embedding dimension d_model and H heads, each head operates on a d_model/H dimensional subspace. The head outputs are concatenated and projected back to d_model.
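
A compact sketch of that project-split-attend-concatenate structure, reusing the reference scaled_dot_product_attention above. Class and variable names are illustrative, not any particular library's API.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, n, _ = x.shape
        # Project, then split d_model into H heads of size d_head.
        def split(t):
            return t.view(b, n, self.h, self.d_head).transpose(1, 2)   # (b, H, n, d_head)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        out = scaled_dot_product_attention(q, k, v, mask)              # attention per head
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_head)  # concatenate heads
        return self.o_proj(out)
```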

The motivation is representational diversity. A single head learns one type of relationship between tokens. Multiple heads can simultaneously capture different relationship types — syntactic dependencies, coreference, semantic similarity — and the model learns to combine them. Probing studies confirm that different heads do specialize, though the specialization is often not as clean as the intuition suggests.

The memory cost of multi-head attention at inference is dominated by the KV cache. Storing keys and values for each layer and head requires 2 * L * H * d_head * N elements per request (times the bytes per element, so 2x for fp16), where L is layer count, H is head count, d_head is head dimension, and N is context length. With full MHA at Llama 2 70B's dimensions (80 layers, 64 heads, head dimension 128), a 4,096-token context in fp16 costs roughly 10.7 GB per concurrent request; at 10 concurrent requests that is over 100 GB, more than an entire A100 80GB just for KV cache. (Llama 2 70B itself avoids this by using grouped-query attention, covered below.) This is why KV cache size is a first-class architectural concern.
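
A small helper makes the formula concrete; the number of KV heads is a separate argument so the same calculation covers the GQA variants discussed below (parameter names are illustrative):

```python
def kv_cache_bytes(layers, kv_heads, d_head, context_len, bytes_per_el=2):
    """KV cache per request: 2 (K and V) * L * H_kv * d_head * N * bytes per element."""
    return 2 * layers * kv_heads * d_head * context_len * bytes_per_el

# Full MHA at the 70B-class dimensions above: 80 layers, 64 KV heads, d_head 128, 4K context, fp16.
print(kv_cache_bytes(80, 64, 128, 4_096) / 1e9)   # ~10.7 GB per request
# The same model with 8 KV heads (the GQA configuration below) needs 8x less.
print(kv_cache_bytes(80, 8, 128, 4_096) / 1e9)    # ~1.3 GB per request
```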

Multi-Query Attention (MQA)

Multi-query attention, introduced by Shazeer in 2019, reduces the KV cache by having all query heads share a single set of key and value projections. The KV cache shrinks by a factor of H. Quality degrades modestly — heads can still learn different query patterns to select different information from the shared KV pool — but the memory reduction is substantial. MQA was adopted in PaLM and Falcon where inference cost was the primary constraint.

Grouped-Query Attention (GQA)

GQA, introduced by Ainslie et al. in 2023, is the current standard. Rather than one KV head per query head (MHA) or one shared KV head (MQA), GQA assigns one KV head per group of G query heads. With H=32 query heads and G=8, there are 4 KV heads and the KV cache is 8x smaller than MHA. The empirical finding: with appropriate uptraining, GQA matches MHA quality almost exactly while achieving KV cache close to MQA.
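
In implementation terms, the usual trick is to store only the KV heads and repeat each one across its group of query heads before the attention product; MQA is the special case of a single shared KV head. A minimal sketch using the 32-query/8-KV split mentioned below, assuming a (batch, heads, sequence, head dimension) layout:

```python
import torch

def repeat_kv(kv, n_rep):
    """Expand (b, H_kv, n, d) to (b, H_kv * n_rep, n, d) so each KV head serves a group of query heads."""
    if n_rep == 1:
        return kv
    b, h_kv, n, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, n, d).reshape(b, h_kv * n_rep, n, d)

q = torch.randn(1, 32, 128, 64)   # 32 query heads
k = torch.randn(1, 8, 128, 64)    # 8 KV heads -> groups of 4 query heads per KV head
v = torch.randn(1, 8, 128, 64)
k, v = repeat_kv(k, 4), repeat_kv(v, 4)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)  # (1, 32, 128, 64)
```

Only the 8-head keys and values ever live in the KV cache; the repetition happens at compute time.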

Llama 3 uses GQA with 8 KV heads and 32 query heads — a 4x KV cache reduction versus MHA. Mistral and Qwen2.5 also use GQA. For production deployment, checking a model’s KV head count versus query head count is worth doing — the difference in serving economics at scale is substantial. A 4x smaller KV cache means 4x more concurrent requests at the same GPU memory, or 4x longer contexts for the same memory budget.

Flash Attention

Flash Attention computes the same mathematical result as standard scaled dot-product attention but avoids materializing the full N×N attention matrix in HBM. Instead, it tiles the computation into blocks that fit in the GPU’s fast on-chip SRAM, computes softmax incrementally using the online softmax algorithm, and accumulates the weighted value sum without ever writing the attention matrix to slow HBM. The result: 2–4x faster attention, and O(N) memory instead of O(N^2).
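
The core algorithmic ingredient is the online (streaming) softmax: scores are processed block by block while a running max and normalizer are maintained, and the accumulated output is rescaled whenever the max changes. A single-query PyTorch sketch of the idea, not the tiled GPU kernel itself:

```python
import math
import torch

def online_softmax_attention(q, K, V, block=64):
    """Streaming attention for one query vector: never materializes all N scores at once."""
    d = q.shape[-1]
    running_max = torch.tensor(float("-inf"))
    denom = torch.tensor(0.0)               # running softmax normalizer
    acc = torch.zeros(V.shape[-1])          # running weighted sum of values
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / math.sqrt(d)        # this block's scores
        new_max = torch.maximum(running_max, s.max())
        corr = torch.exp(running_max - new_max)              # rescale earlier accumulations
        p = torch.exp(s - new_max)
        acc = acc * corr + p @ V[start:start + block]
        denom = denom * corr + p.sum()
        running_max = new_max
    return acc / denom

q, K, V = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax(K @ q / math.sqrt(64), dim=0) @ V        # direct computation
assert torch.allclose(online_softmax_attention(q, K, V), ref, atol=1e-5)
```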

Flash Attention 2 improved parallelization across the sequence dimension, getting closer to peak GPU utilization. Flash Attention 3 targets H100 GPUs with FP8 support and asynchronous WGMMA instructions. For any model training or serving on sequences longer than 1,024 tokens, Flash Attention should be enabled. Enable it with attn_implementation="flash_attention_2" in HuggingFace Transformers, or use torch.nn.functional.scaled_dot_product_attention which dispatches to Flash Attention automatically on CUDA.
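
For reference, both paths look roughly like this. The model id is only an example, and the flash_attention_2 path additionally requires the flash-attn package and supported hardware:

```python
import torch
from transformers import AutoModelForCausalLM

# Hugging Face Transformers: request the Flash Attention 2 kernels at load time.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",            # example model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Plain PyTorch: scaled_dot_product_attention dispatches to a fused kernel
# (Flash Attention on supported CUDA hardware) automatically.
q = k = v = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.bfloat16)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```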

Sliding Window and Sparse Attention

For very long contexts, even O(N) KV cache can be expensive. Sliding window attention limits each token to attending only to the W most recent tokens, reducing KV cache from O(N) to O(W). Mistral 7B applies a 4,096-token sliding window in every layer; because each additional layer can look one further window back, stacking layers still lets information propagate well beyond the window. Gemma 2 uses the related pattern of alternating sliding-window layers with full-attention layers. This is an elegant compromise — bounded memory cost while preserving global context propagation through depth.
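
Expressed as a mask, sliding window attention is just a banded causal matrix; a small sketch (the window size here is arbitrary):

```python
import torch

def sliding_window_causal_mask(n, window):
    """True where attention is allowed: causal, and at most `window` tokens back."""
    i = torch.arange(n).unsqueeze(1)   # query positions
    j = torch.arange(n).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(8, window=3)
# Row i attends only to positions i-2, i-1, i; anything older can be evicted from the KV cache.
```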

Sparse attention patterns more broadly — where tokens attend to a learned or fixed sparse subset of positions — remain an active research area. The challenge is that irregular sparse patterns don’t map cleanly to the dense matrix operations GPU tensor cores optimize for, requiring custom CUDA kernels that are difficult to write and maintain. Production adoption has been limited compared to the theoretical appeal.

Rotary Position Embeddings (RoPE)

Modern LLMs use RoPE rather than the learned absolute position embeddings from the original transformer. RoPE encodes position by rotating query and key vectors in 2D subspaces by an angle proportional to position, before the dot product. This gives attention a natural way to represent relative position — the dot product between a query at position m and a key at position n depends on their relative offset m-n, not their absolute positions. This relative inductive bias generalizes better to contexts longer than those seen during training.
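
A minimal sketch of the rotation, using the common half-split layout; the base argument is the frequency parameter whose adjustment is discussed below:

```python
import torch

def apply_rope(x, positions, base=10_000.0):
    """Rotate pairs of dimensions of x (shape (..., seq, d)) by position-dependent angles."""
    d = x.shape[-1]
    # One frequency per 2D subspace: base ** (-2i / d) for i = 0 .. d/2 - 1.
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = positions[:, None].float() * inv_freq[None, :]         # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]                     # half-split layout
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)                    # 16 positions, head dimension 64
q_rot = apply_rope(q, torch.arange(16))    # applied to queries and keys, never to values
```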

The practical consequence is context length extension via RoPE scaling. By adjusting the base frequency or applying a scaling factor to rotation angles at inference time, models can generalize to contexts longer than their training length. Llama 3’s 128K context window is achieved via RoPE scaling from an 8K training context. Understanding that position encoding is embedded in the attention computation — not an additive term applied to embeddings — is important for reasoning about why context extension sometimes degrades quality at very long contexts: the model encounters rotation angles outside the range it saw during training, and extrapolation quality depends on how smoothly the rotation frequencies were designed to scale.

What to Know for Production

For serving: GQA KV head count determines your memory-per-request. Check it before selecting a model for high-concurrency deployment. Enable Flash Attention regardless of sequence length — the speed improvement is free and the memory reduction matters at scale. For models with sliding window attention, be aware that very long contexts may have degraded quality on content outside the window, even if the model technically accepts the full context length.

For training: Flash Attention is effectively mandatory for sequences over 2K tokens. Gradient checkpointing through the transformer layers is high-leverage because activation memory grows with both depth and sequence length, and without Flash Attention the stored attention matrices alone scale quadratically in sequence length. If training on very long sequences (32K+), consider RoPE base frequency adjustments to improve position generalization — the default base frequency of 10,000 used in the original RoPE paper was not designed for these context lengths, and extended base frequencies (500,000 in Llama 3) produce better long-context quality.

Cross-Attention and Encoder-Decoder Models

The attention variants covered above are all self-attention — each position attends to other positions in the same sequence. Cross-attention, used in encoder-decoder architectures like T5 and the original transformer, has a different structure: queries come from the decoder sequence while keys and values come from the encoder output. This allows the decoder to selectively attend to any part of the encoded input at each generation step.
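
In code, the only change from self-attention is where the queries and the keys/values come from; a sketch with assumed encoder and decoder hidden states, with the learned projection layers omitted for brevity:

```python
import torch

encoder_out = torch.randn(1, 8, 512, 64)      # keys/values: (batch, heads, encoder length, head dim)
decoder_hidden = torch.randn(1, 8, 32, 64)    # queries: (batch, heads, decoder length, head dim)

# Exactly the same operation as self-attention; only the provenance of Q vs K/V changes.
cross = torch.nn.functional.scaled_dot_product_attention(
    decoder_hidden, encoder_out, encoder_out
)                                             # (1, 8, 32, 64): one output per decoder position
```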

For most modern LLM use cases in 2026, decoder-only models with self-attention have largely displaced encoder-decoder architectures. But cross-attention remains relevant for specific applications: image-to-text models where the encoder processes visual tokens and the decoder generates text, audio models where the encoder processes spectrogram features, and structured prediction tasks where the input and output have different sequence lengths and explicit alignment is beneficial. Understanding that cross-attention is the same mathematical operation as self-attention — just with queries from one sequence and keys/values from another — makes it straightforward to reason about these architectures once self-attention is clear.

The KV cache applies to cross-attention as well. In encoder-decoder models, the encoder keys and values can be cached after the first forward pass and reused for every decoding step, since the encoder output doesn’t change during generation. This is a significant efficiency gain for long inputs: a 1,000-token document encoded once and cached is much cheaper than re-encoding it at every decoding step. Frameworks that implement encoder-decoder serving (like TGI with T5 support) handle this automatically.

KV Cache Eviction and Memory Management

At inference time, the KV cache grows with every generated token. For a model serving many concurrent requests with varying context lengths, KV cache memory management becomes a first-class operational concern. Without active management, long requests can consume disproportionate KV cache memory and reduce the number of concurrent requests the server can handle — a few requests with 100K token contexts can starve dozens of short-context requests that would otherwise fit.

vLLM addresses this with PagedAttention, which manages KV cache memory in fixed-size pages (analogous to virtual memory paging in operating systems) rather than pre-allocating contiguous blocks per request. Pages are allocated as needed and freed when a request completes, allowing the KV cache to be shared efficiently across concurrent requests with different context lengths. The practical result is significantly higher throughput for mixed-length workloads compared to naive contiguous allocation. Understanding that PagedAttention is fundamentally a KV cache memory allocator — not a change to the attention computation itself — clarifies why it can be combined with GQA, Flash Attention, and quantization without affecting model correctness.
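
A toy sketch of the allocator idea, with invented class and method names rather than vLLM's actual API: pages come from a shared free pool, and a request's logical cache is a page table instead of a contiguous buffer.

```python
class PagedKVPool:
    """Toy paged KV allocator: fixed-size pages shared across concurrent requests."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size                  # tokens per page
        self.free_pages = list(range(num_pages))
        self.page_tables = {}                       # request id -> list of physical pages
        self.lengths = {}                           # request id -> tokens written

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.page_size == 0:                 # last page is full: grab a new one
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; preempt or queue the request")
            self.page_tables.setdefault(req, []).append(self.free_pages.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        self.free_pages.extend(self.page_tables.pop(req, []))
        self.lengths.pop(req, None)

pool = PagedKVPool(num_pages=4, page_size=16)
for _ in range(40):
    pool.append_token("req-A")      # 40 tokens occupy 3 pages, allocated only as needed
pool.release("req-A")               # pages return to the shared pool immediately
```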

Speculative decoding is another inference technique that interacts with the KV cache in non-obvious ways. A small draft model generates multiple candidate tokens quickly, which a larger verifier model then accepts or rejects in a single forward pass. The KV cache must be updated to reflect only the accepted tokens, requiring careful bookkeeping. Systems that implement speculative decoding need to manage KV cache rollback when tokens are rejected — an implementation detail that adds complexity but doesn’t change the underlying attention math.
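
The rollback itself amounts to truncating the cache along the sequence dimension back to the verified prefix; a sketch assuming the older Hugging Face-style cache layout of one (keys, values) pair per layer:

```python
def rollback_kv(past_key_values, accepted_len):
    """Keep only KV entries for tokens the verifier accepted.

    Assumes a tuple per layer of (keys, values), each shaped
    (batch, num_kv_heads, seq_len, head_dim); entries for rejected
    draft tokens are simply dropped.
    """
    return tuple(
        (k[:, :, :accepted_len, :], v[:, :, :accepted_len, :])
        for k, v in past_key_values
    )
```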

Attention in Vision and Multimodal Models

The attention mechanism described here applies directly to text transformers, but the same operation appears throughout modern vision and multimodal architectures. Vision Transformers (ViT) divide images into patches and apply self-attention across patch embeddings — the same scaled dot-product attention with the same QKV structure, just operating on spatial patches rather than token sequences. The quadratic scaling with sequence length becomes quadratic scaling with the number of patches, which is why high-resolution ViT inference is expensive and why efficient attention variants are actively researched for vision.

Multimodal models like LLaVA and the GPT-4V family encode image patches as visual tokens that are prepended to the text context before the language model’s attention layers. From the attention mechanism’s perspective, visual tokens are indistinguishable from text tokens — they participate in the same QKV attention computation and are attended to by text tokens and vice versa. The cross-modal understanding emerges from training, not from any special attention structure. This architectural simplicity is part of why the “image tokens + text tokens fed to a standard language model” paradigm has become dominant — it leverages existing attention infrastructure without modification.

For multimodal serving, the key implication is that visual tokens consume KV cache memory identically to text tokens. A high-resolution image encoded as 1,024 visual tokens occupies the same KV cache space as 1,024 text tokens, which is substantial for long conversations with multiple images. Production multimodal systems either cache visual token KV states aggressively across turns (since the image doesn’t change) or use learned visual compressors (like Q-Former in BLIP-2) to reduce the number of visual tokens before feeding them to the language model’s attention layers.
