Tensor Parallelism vs Pipeline Parallelism for LLM Inference

When a single GPU can’t fit a model for inference, the model must be split across multiple GPUs. Tensor parallelism and pipeline parallelism are the two fundamental strategies, making different trade-offs between communication overhead, latency, and implementation complexity. Understanding both — not just that they split the model, but how and with what consequences — is necessary for good serving infrastructure decisions at the 30B+ parameter scale where single-GPU inference becomes impractical.

Why Single-GPU Inference Breaks Down

A 70B parameter model in bf16 requires 140GB just for weights, more than an A100 80GB can hold. Even if memory weren't the constraint, a single A100's ~2TB/s HBM bandwidth caps batch-size-1 decode at roughly 14 tokens per second for a 70B model, because every weight must be streamed from memory for each generated token. Multi-GPU inference addresses both constraints: distributing weights across GPUs solves the capacity problem, and the aggregate memory bandwidth of several GPUs working on the same request raises that decode ceiling.
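A quick sanity check of that ceiling, assuming the simplest possible model of decode (every weight byte read once per generated token, nothing overlapped):

# Decode ceiling under the simplest model: every weight byte read once per token.
params = 70e9
bytes_per_param = 2                  # bf16
hbm_bandwidth = 2e12                 # ~2 TB/s for one A100 80GB

weight_bytes = params * bytes_per_param              # 140 GB
decode_ceiling = hbm_bandwidth / weight_bytes        # ~14 tokens/s
print(f"weights: {weight_bytes / 1e9:.0f} GB, decode ceiling: {decode_ceiling:.1f} tok/s")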

Tensor Parallelism

Tensor parallelism splits individual weight matrices across GPUs. Each GPU holds a shard of every layer rather than complete layers. For a linear layer with weight matrix W of shape [d_model, d_ff], column parallelism splits W into N shards of [d_model, d_ff/N], so each GPU computes its own slice of the output with no communication needed. Row parallelism splits the following matrix along its input dimension into [d_ff/N, d_model] shards; each GPU produces a partial sum of the full output, and the partials are combined with an AllReduce. Megatron-LM's design applies column and row parallelism alternately through transformer layers, requiring exactly two AllReduce calls per attention+FFN block in the forward pass regardless of parallelism degree: 64 AllReduces per forward pass for a 32-layer model.
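A minimal single-process sketch of that column-then-row sharding, assuming the dimensions divide evenly by the parallel degree; the shards are simulated as a Python list here, whereas in a real deployment each shard lives on its own GPU and the final sum is the AllReduce:

import torch

# Illustrative only: all N shards live in one process here; in a real TP setup each
# shard sits on its own GPU and the final sum below is a torch.distributed AllReduce.
def tp_ffn(x, w1, w2, n):
    w1_shards = w1.chunk(n, dim=1)        # column parallel: split the output dim of w1
    w2_shards = w2.chunk(n, dim=0)        # row parallel: split the input dim of w2
    partials = []
    for w1_i, w2_i in zip(w1_shards, w2_shards):
        h_i = torch.relu(x @ w1_i)        # each GPU's slice of the hidden activation
        partials.append(h_i @ w2_i)       # each GPU's partial sum of the final output
    return sum(partials)                  # the AllReduce step

x = torch.randn(2, 16, 64)                # [batch, seq, d_model]
w1, w2 = torch.randn(64, 256), torch.randn(256, 64)
reference = torch.relu(x @ w1) @ w2
assert torch.allclose(tp_ffn(x, w1, w2, n=4), reference, atol=1e-3)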

The critical requirement is fast interconnect. NVLink (600 GB/s for A100, 900 GB/s for H100) makes AllReduce across 8 GPUs on one node take microseconds — negligible overhead. InfiniBand at 25 GB/s makes the same AllReduce take milliseconds, dominating latency. Tensor parallelism is therefore an intra-node strategy: efficient across 2–8 NVLink-connected GPUs, poor across nodes. The practical limit is tensor parallel degree 8 (one full DGX node) — beyond that, communication overhead erodes efficiency faster than the additional bandwidth helps.

Pipeline Parallelism

Pipeline parallelism assigns complete layers to each GPU. With 4 GPUs and 32 layers, GPU 0 holds layers 1–8, GPU 1 holds 9–16, and so on. The forward pass flows sequentially — each GPU processes its layers and passes activations to the next. Communication per pipeline stage boundary is a single point-to-point activation transfer of shape [batch_size, seq_len, d_model], rather than the AllReduce of tensor parallelism. Point-to-point transfers work efficiently over InfiniBand, making pipeline parallelism the natural choice for multi-node inference.
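A small sketch of the stage assignment, assuming an even split of a layer count that divides cleanly by the stage count (real frameworks also account for the embedding and LM head):

def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    # Contiguous, even split of layers across pipeline stages (0-indexed).
    per_stage = num_layers // num_stages
    return [range(s * per_stage, (s + 1) * per_stage) for s in range(num_stages)]

# 32 layers over 4 GPUs: GPU 0 gets layers 0-7, GPU 1 gets 8-15, and so on.
for stage, layers in enumerate(partition_layers(32, 4)):
    print(f"GPU {stage}: layers {layers.start}-{layers.stop - 1}")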

The latency implication of pipeline parallelism depends critically on batch size. For a single request (batch size 1), only one GPU is active at a time while the others sit idle as activations flow through the pipeline: per-request latency gets no benefit from the extra GPUs, and the idle stages (the pipeline bubble) waste most of the cluster's capacity. For batched inference with many concurrent requests, earlier requests occupy later stages while new requests enter earlier stages, so utilization approaches 100% as the batch deepens. Pipeline parallelism suits throughput-optimized batch serving, not low-latency single-request serving.
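The fill-and-drain behavior can be reasoned about with the standard bubble formula from the pipeline parallelism literature: with S stages and M micro-batches in flight, the idle fraction is roughly (S - 1) / (M + S - 1). A quick illustration under that simple model:

def bubble_fraction(stages: int, microbatches: int) -> float:
    # Fraction of GPU time spent idle while the pipeline fills and drains.
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 1))     # 0.75 -> a lone request leaves 3 of 4 GPUs idle on average
print(bubble_fraction(4, 32))    # ~0.09 -> deep batching keeps every stage busy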

Combining Both: 2D Parallelism

Large-scale inference (70B+ across multiple nodes) uses both strategies simultaneously. Tensor parallelism handles intra-node splitting over NVLink; pipeline parallelism handles inter-node splitting over InfiniBand. For a 70B model with tensor parallel degree 8 (one node of 8 A100s) and pipeline parallel degree 4 (4 nodes total), each node holds layers for one pipeline stage split across 8 GPUs. The 32 A100s provide 2.56TB aggregate memory and 64TB/s aggregate bandwidth — easily fitting the model with ample KV cache headroom at high batch throughput.
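One way to picture the layout is a mapping from global rank to pipeline stage and tensor-parallel rank. The sketch below assumes each node's 8 NVLink-connected GPUs form one tensor parallel group and uses Llama-3-70B's 80 transformer layers:

TP, PP, LAYERS = 8, 4, 80   # Llama-3-70B has 80 transformer layers

def placement(rank: int):
    # Map a global rank (0-31) to its pipeline stage, TP rank, and owned layers.
    pp_stage, tp_rank = divmod(rank, TP)
    layers_per_stage = LAYERS // PP
    layers = range(pp_stage * layers_per_stage, (pp_stage + 1) * layers_per_stage)
    return pp_stage, tp_rank, layers

stage, tp_rank, layers = placement(17)
print(f"rank 17 -> node/stage {stage}, TP rank {tp_rank}, layers {layers.start}-{layers.stop - 1}")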

vLLM Implementation

vLLM exposes both strategies via command-line arguments. Tensor parallelism is the default for single-node multi-GPU serving; pipeline parallelism requires Ray for multi-node coordination.

# Tensor parallelism across 4 GPUs, single node
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90

# 2D parallelism: tensor across 8 GPUs per node, 2 pipeline stages
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.90

Memory vs Bandwidth Trade-offs

Tensor parallelism distributes both the memory storage and the bandwidth burden: with degree N, each GPU holds 1/N of the weights and collectively the GPUs provide N times the memory bandwidth, giving near-linear throughput scaling. Pipeline parallelism distributes memory storage but not bandwidth — each GPU holds 1/N of the weights but only uses its own bandwidth during its pipeline stage. For bandwidth-bound decode (which LLM decode is), tensor parallelism therefore provides better throughput scaling than pipeline parallelism at the same number of GPUs, assuming the interconnect is fast enough to keep AllReduce overhead low.
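Continuing the earlier back-of-the-envelope decode model (weights streamed once per token, bandwidth-bound, batch size 1), the scaling difference falls straight out of the arithmetic:

weights_gb, bw_per_gpu_gbs, n = 140, 2000, 4

# Tensor parallelism: all n GPUs stream their 1/n of the weights concurrently.
tp_tokens_per_s = bw_per_gpu_gbs / (weights_gb / n)       # ~57 tok/s
# Pipeline parallelism: stages run one after another for a single request,
# so only one GPU's bandwidth is in use at any instant.
pp_tokens_per_s = bw_per_gpu_gbs / weights_gb             # ~14 tok/s
print(f"TP x4: {tp_tokens_per_s:.0f} tok/s, PP x4: {pp_tokens_per_s:.0f} tok/s")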

This is why NVLink-connected 8-GPU nodes with tensor parallelism outperform slower-interconnect setups with pipeline parallelism at the same GPU count for interactive serving. The recommendation: maximize tensor parallel degree within a single NVLink node first, then add pipeline stages across nodes when the model doesn’t fit in one node’s aggregate memory. Don’t reach for multi-node pipeline parallelism until you’ve exhausted single-node tensor parallelism options, as the communication overhead of crossing a node boundary carries a real latency cost.

Decision Framework

Use tensor parallelism when the model fits in a single node, latency is a priority, and you have NVLink-connected GPUs. Each request’s computation is distributed across all GPUs simultaneously, minimizing per-request latency — right for interactive API endpoints serving 30–70B models across 4–8 A100s or H100s.

Use pipeline parallelism when the model spans multiple nodes, batch throughput matters more than individual request latency, or InfiniBand is the primary interconnect. Point-to-point communication is efficient over InfiniBand, and throughput scales with batch size — right for throughput-optimized offline batch processing of very large models.

Use both when serving 70B+ models across multiple nodes with combined latency and throughput requirements. Tensor parallelism for intra-node efficiency over NVLink, pipeline parallelism for inter-node scaling over InfiniBand — the 2D combination is the production standard at frontier model scales.

Expert Parallelism

Mixture-of-Experts (MoE) models like Mixtral and DeepSeek introduce a third parallelism dimension: expert parallelism. An MoE model replaces each FFN with a set of expert FFNs, and each token is routed to only a small subset of those experts per layer. Expert parallelism places different experts on different GPUs: each GPU holds a subset of the full expert pool and only processes tokens routed to its experts. This allows the total parameter count of the model to scale far beyond what fits on any single GPU, while keeping the active compute per token constant (only the activated experts are computed).

Expert parallelism requires all-to-all communication — tokens must be routed to whatever GPU holds their assigned expert, processed, and routed back. Unlike tensor parallelism’s AllReduce (which sums results) or pipeline parallelism’s point-to-point (which passes activations sequentially), all-to-all has communication volume proportional to batch size times sequence length, and its efficiency depends heavily on whether the expert routing is balanced across GPUs. Load imbalance — some GPUs processing many more tokens than others because the router assigns most tokens to a few popular experts — is a common performance bottleneck in MoE serving. vLLM and other serving frameworks handle expert parallelism automatically for supported MoE architectures.
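A small sketch of how routing imbalance might be measured from router outputs, assuming experts are placed round-robin across GPUs; the helper below is illustrative, not a vLLM API:

import torch

def tokens_per_gpu(expert_ids: torch.Tensor, num_experts: int, num_gpus: int) -> torch.Tensor:
    # Count routed tokens per GPU, assuming expert e is placed on GPU e % num_gpus.
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts)
    gpu_load = torch.zeros(num_gpus, dtype=counts.dtype)
    for e in range(num_experts):
        gpu_load[e % num_gpus] += counts[e]
    return gpu_load

# 4096 tokens, top-2 routing over 8 experts, experts spread across 4 GPUs.
expert_ids = torch.randint(0, 8, (4096, 2))
load = tokens_per_gpu(expert_ids, num_experts=8, num_gpus=4)
print(load, "max/mean imbalance:", (load.max() / load.float().mean()).item())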

Profiling Parallelism Overhead

Before committing to a specific parallelism configuration in production, profile the actual overhead in your environment. Communication overhead is highly dependent on hardware topology — NVLink bandwidth within a node, InfiniBand bandwidth between nodes, and PCIe bandwidth (much slower than NVLink, relevant for consumer GPU setups without NVLink) all produce very different overhead profiles. Use PyTorch’s distributed profiler or NVIDIA Nsight to measure AllReduce and point-to-point communication time as a fraction of total forward pass time for your specific model and hardware. If communication is more than 15–20% of total time with tensor parallelism, either reduce the tensor parallel degree, switch to pipeline parallelism, or investigate whether your NVLink connections are functioning correctly.
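A minimal timing harness for measuring AllReduce cost on your own hardware, using only standard torch.distributed calls (launch one process per GPU with torchrun; the tensor size is an arbitrary decode-scale example):

# Launch with: torchrun --nproc_per_node=4 bench_allreduce.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# A decode-scale activation: batch 8, d_model 8192, bf16.
x = torch.zeros(8, 8192, dtype=torch.bfloat16, device="cuda")

for _ in range(10):                              # warm up NCCL
    dist.all_reduce(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    dist.all_reduce(x)
end.record()
torch.cuda.synchronize()

if rank == 0:
    print(f"mean all_reduce: {start.elapsed_time(end) / 100 * 1000:.1f} us")
dist.destroy_process_group()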

Serving Quantized Models Across Multiple GPUs

Quantization and parallelism compose cleanly: a model can be both quantized and tensor-parallel or pipeline-parallel at the same time. vLLM supports loading AWQ quantized models with tensor parallelism; the quantized weights are distributed across GPUs in the same way as full-precision weights. The memory savings from quantization and the memory distribution from parallelism are additive: a 70B model in AWQ 4-bit across 4 A100s requires roughly 9GB per GPU for weights (about 35GB total at 4 bits per parameter, plus a small overhead for scales, split across 4 GPUs), leaving the vast majority of each GPU's 80GB for KV cache. This combination, a quantized model with a high tensor parallelism degree, is the practical configuration for maximizing concurrency on expensive GPU hardware.
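In vLLM the composition is just the quantization option plus the tensor parallel degree. A sketch using the offline Python API; the model name is a placeholder for whichever AWQ checkpoint you actually serve:

from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Llama-3-70B-AWQ",   # placeholder: any AWQ-quantized 70B checkpoint
    quantization="awq",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
)
out = llm.generate(["Explain tensor parallelism in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)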

One consideration when combining quantization with tensor parallelism is that the quantization group boundaries must align with the tensor parallel sharding. AWQ and GPTQ with standard group sizes (128) align correctly with tensor parallel sharding for the standard linear layer sizes in most transformer architectures. Non-standard architectures with unusual hidden dimensions may require checking that the group boundaries divide evenly across the tensor parallel shards — misaligned groups can cause subtle accuracy degradation that’s hard to diagnose without careful attention to the quantization-parallelism interaction.
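A quick divisibility check along these lines, assuming standard row-parallel sharding (the input dimension is what group quantization slices along); this is a sketch, not a framework API:

def row_parallel_groups_align(d_in: int, tp: int, group_size: int = 128) -> bool:
    # For a row-parallel layer the input dimension is split across TP ranks;
    # each shard's slice of input channels must be a whole number of groups.
    return d_in % tp == 0 and (d_in // tp) % group_size == 0

# Llama-3-70B: down_proj has d_in = 28672 (d_ff), o_proj has d_in = 8192 (d_model).
print(row_parallel_groups_align(28672, tp=4))   # True
print(row_parallel_groups_align(8192, tp=4))    # True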

Cost Per Token at Scale

Choosing a parallelism configuration is ultimately a cost optimization problem. The relevant metric is cost per million output tokens at your target latency SLA. A single A100 80GB serving a 70B model isn’t possible (model doesn’t fit), so the comparison is between different multi-GPU configurations: 2×A100 with pipeline parallelism vs 4×A100 with tensor parallelism vs 8×A100 with tensor parallelism, each at different batch sizes and latency profiles. Pipeline parallelism with 2 GPUs requires less hardware but adds pipeline bubble latency at small batch sizes; tensor parallelism with 4 or 8 GPUs costs more hardware but delivers lower per-request latency and higher single-request throughput. For interactive serving where p50 latency under 500ms matters, tensor parallelism on NVLink hardware is usually worth the extra GPU cost. For batch processing pipelines where latency is flexible and throughput per dollar is the objective, pipeline parallelism on cheaper interconnects with large batches is often more economical. Calculate tokens per dollar per GPU-hour for each configuration at your target concurrency, and the answer becomes clear for your specific workload.
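A back-of-the-envelope version of that calculation; the hourly rates and throughput figures below are placeholders to show the shape of the comparison, not measurements:

def cost_per_million_tokens(gpus: int, price_per_gpu_hour: float, tokens_per_s: float) -> float:
    hourly_cost = gpus * price_per_gpu_hour
    tokens_per_hour = tokens_per_s * 3600
    return hourly_cost / tokens_per_hour * 1e6

# Hypothetical configurations for the same 70B model at your target concurrency.
configs = {
    "2xA100 pipeline parallel, large batches": (2, 2.00, 300.0),
    "4xA100 tensor parallel, interactive": (4, 2.00, 450.0),
    "8xA100 tensor parallel, interactive": (8, 2.00, 800.0),
}
for name, (gpus, price, tps) in configs.items():
    print(f"{name}: ${cost_per_million_tokens(gpus, price, tps):.2f} per 1M output tokens")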
