When Do You Need Multiple GPUs?
A single GPU handles most LLM workloads up to 34B parameters at Q4. You need multiple GPUs when: the model weights do not fit in a single GPU’s VRAM, you need higher throughput than one GPU delivers for concurrent users, or your training job would take impractically long on a single card. For LLM inference specifically, the most common reason to go multi-GPU is serving 70B models at Q4 or larger — these require 40+ GB of VRAM that no single consumer GPU provides, and only the highest-end data centre GPUs (A100 80GB, H100 80GB) cover it on a single card.
Two Parallelism Strategies
Tensor parallelism splits individual weight matrices across GPUs. Each GPU holds 1/N of the weights and they communicate via all-reduce operations at each transformer layer. This reduces per-GPU memory proportionally with the number of GPUs and delivers near-linear throughput scaling for inference, at the cost of requiring high-bandwidth interconnects (NVLink preferred). It is the standard approach for multi-GPU LLM inference and the default in vLLM.
Pipeline parallelism assigns different layers to different GPUs, passing activations sequentially. Lower communication overhead than tensor parallelism but introduces pipeline bubble latency. Better for training throughput than interactive inference latency. Less commonly used for inference serving.
For most home and small-team multi-GPU setups, tensor parallelism via vLLM is the right choice.
Hardware Requirements for Multi-GPU LLM Setups
Multi-GPU setups have specific hardware requirements beyond just having two GPUs:
Motherboard: Needs PCIe slots with sufficient bandwidth. For dual GPUs, an ATX or EATX board with two PCIe x16 slots (running at x8 each in dual-GPU mode) is the minimum. For 4-GPU setups, a server board or HEDT (High-End Desktop) platform like AMD TRX50 or Intel W790 is needed.
CPU: For inference serving, the CPU is not the bottleneck — any modern desktop CPU suffices. For training with data loading bottlenecks, more CPU cores and memory bandwidth help. A Ryzen 9 7950X or equivalent is a good balance for multi-GPU workloads.
PCIe bandwidth: Consumer dual-GPU setups run at PCIe x8/x8 rather than x16/x16. For inference, this is acceptable — GPU-to-GPU communication goes through NVLink (if available), not PCIe. For data-loading-heavy training, PCIe bandwidth can become a bottleneck with fast NVMe storage and high batch sizes.
NVLink: Consumer GeForce RTX cards do not support NVLink — they communicate only via PCIe. Data centre cards (A100, H100) and Quadro/RTX professional cards support NVLink. For consumer dual-GPU inference, PCIe communication adds measurable overhead — tensor parallel efficiency on two RTX 4090s is typically 75–85% rather than the near-linear scaling NVLink enables on data centre hardware.
Power supply: Two RTX 4090s (450W TDP each) require a 1,200W or higher PSU with sufficient PCIe power connectors. Two RTX 5090s (575W TDP each) require a 1,400–1,600W PSU. Use an 80+ Platinum or Titanium rated unit at this wattage — efficiency matters when running continuously.
Setting Up Dual Consumer GPUs with vLLM
Install both GPUs physically, verify they are both detected, then use vLLM’s tensor parallel flag:
# Verify both GPUs are visible
nvidia-smi
# Expected output showing both GPUs:
# GPU 0: NVIDIA GeForce RTX 4090 (24GB)
# GPU 1: NVIDIA GeForce RTX 4090 (24GB)
# Start vLLM with tensor parallelism across both GPUs
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --max-model-len 4096 --port 8000
# The model is automatically split across both GPUs
# Each GPU holds half the weight matrices
With two RTX 4090s (48 GB total), Llama 3.3 70B at Q4 loads with comfortable headroom at 4K context. Throughput is typically 35–45 tok/s — lower than ideal linear scaling from a single card’s performance because PCIe communication overhead limits efficiency versus NVLink.
Ollama Multi-GPU Configuration
Ollama automatically uses all available GPUs for a single model, distributing layers across them:
# Ollama detects both GPUs automatically
ollama serve
# Pull and run a 70B model — Ollama will distribute across available GPUs
ollama pull llama3.3:70b-instruct-q4_K_M
ollama run llama3.3:70b-instruct-q4_K_M
# Verify both GPUs are being used during inference
nvidia-smi dmon -s u # watch utilisation on both GPUs
If only one GPU is being used, check that both are visible to the system with nvidia-smi and that Ollama is running as a user with access to both. The CUDA_VISIBLE_DEVICES environment variable can sometimes limit GPU visibility — ensure it is not set to a single GPU value in your environment.
Four-GPU Setups for 405B Models
Running 405B models requires approximately 200+ GB of VRAM at Q4. This necessitates either 8× A100/H100 80GB (data centre configuration) or creative use of consumer hardware. Four RTX 4090s at 96 GB total still fall short of 200 GB for 405B Q4. Practical options at the consumer level are limited: the most accessible path is cloud GPU rental (AWS p4d.24xlarge or p5.48xlarge) for occasional 405B inference rather than dedicated on-premises hardware. For teams that genuinely need regular 405B inference, a pair of H100 80GB instances on CoreWeave or Lambda Labs at approximately $6/hr combined is the most cost-effective approach in 2026.
Training on Multi-GPU Consumer Hardware
For LoRA fine-tuning across multiple GPUs with Hugging Face Accelerate:
# Install accelerate
pip install accelerate
# Configure for multi-GPU
accelerate config
# Select: multi-GPU, number of GPUs: 2, mixed precision: bf16
# Launch training across both GPUs
accelerate launch train.py --model_name meta-llama/Llama-3.1-8B-Instruct --dataset_path ./training_data.jsonl --output_dir ./output --num_train_epochs 3 --per_device_train_batch_size 4 --gradient_accumulation_steps 4
With two RTX 4090s, an 8B LoRA fine-tuning run that takes 4 hours on one GPU completes in roughly 2.2 hours — about 1.8x speedup accounting for data loading overhead. The speedup improves with larger models where the GPU compute fraction of total training time is higher relative to data loading and communication overhead.
NVLink Bridging for Professional GPUs
If your multi-GPU setup uses NVLink-capable professional cards (RTX 6000 Ada, A40, A100), connecting them with an NVLink bridge dramatically improves tensor parallel efficiency. NVLink delivers 600–900 GB/s bidirectional bandwidth versus PCIe x16’s 32 GB/s bidirectional — a 20–28x improvement in inter-GPU bandwidth. This translates to near-linear scaling of inference throughput with GPU count, versus the 75–85% efficiency typical on PCIe consumer setups. NVLink bridges are standard with server-grade cards but typically not available for consumer GeForce GPUs, which is one of the key reasons data centre hardware delivers better multi-GPU scaling than consumer hardware for LLM workloads.
Monitoring a Multi-GPU Deployment
Use nvidia-smi dmon to monitor all GPUs simultaneously during inference. Watch for unbalanced utilisation — if one GPU is at 95% and the other at 60%, the tensor parallel split is inefficient, possibly due to imbalanced layer assignment or communication overhead. vLLM’s Prometheus metrics endpoint reports per-GPU utilisation and memory usage, making it straightforward to build Grafana dashboards that track each GPU independently and alert on imbalance or memory pressure. For training runs, tools like Weights and Biases and TensorBoard both support multi-GPU run tracking, logging per-GPU metrics and aggregate throughput automatically when used with Accelerate or PyTorch’s DistributedDataParallel.
Ollama Multi-GPU Configuration
Ollama automatically distributes a model across all available GPUs when a model is too large for one. After physically installing both GPUs and confirming they appear in nvidia-smi, simply start Ollama and run a model that exceeds single-GPU VRAM — it handles the distribution automatically. To verify both GPUs are active during inference, watch nvidia-smi dmon -s u while a generation is running. Both GPUs should show non-trivial utilisation. If only one is being used, check that CUDA_VISIBLE_DEVICES is not restricting visibility in your environment.
Training Multi-GPU with Accelerate
For LoRA fine-tuning across multiple GPUs using Hugging Face Accelerate:
pip install accelerate
accelerate config # Select: multi-GPU, num_gpus: 2, mixed_precision: bf16
accelerate launch train.py --model_name meta-llama/Llama-3.1-8B-Instruct --dataset_path ./training_data.jsonl --output_dir ./output --num_train_epochs 3 --per_device_train_batch_size 4 --gradient_accumulation_steps 4
With two RTX 4090s, an 8B LoRA run that takes 4 hours on one GPU completes in roughly 2.2 hours — about 1.8x speedup accounting for data loading and communication overhead. The speedup improves with larger models where the GPU compute fraction of total training time is higher.
NVLink vs PCIe: Why It Matters
Consumer GeForce RTX cards communicate only via PCIe — they do not support NVLink. PCIe x8 delivers about 16 GB/s unidirectional bandwidth. NVLink on data centre cards delivers 600–900 GB/s bidirectional — roughly 40x more bandwidth. This difference explains why tensor parallel efficiency on dual consumer GPUs is typically 75–85% of ideal linear scaling, while NVLink-connected A100 or H100 pairs achieve 90–95% efficiency. For home setups, PCIe-connected dual consumer GPUs are perfectly workable for 70B inference. For production deployments where scaling efficiency matters, NVLink-capable data centre hardware is worth the premium.
Power, Cooling, and Case Selection
Two RTX 4090s at 450W TDP each require a 1,200W PSU at minimum, with a 1,350W or 1,600W unit providing better headroom. The case must accommodate two full-length, triple-slot GPUs side by side — this rules out many mid-tower cases. Full-tower cases like the Fractal Design Define 7 XL, Lian Li O11D XL, or Corsair 7000D provide adequate space with good airflow. With two high-TDP GPUs, thermal management is critical. Positive pressure airflow — more intake than exhaust — keeps hot air from recirculating. At minimum, install 2–3 front intake fans and 1–2 rear exhaust fans. Monitor both GPU temperatures under sustained inference load and ensure neither exceeds 85°C consistently.
Recommended Multi-GPU Configurations
For 70B Q4 inference on consumer hardware: 2× RTX 4090 — 48 GB total, 35–45 tok/s, ~$3,600 for the GPU pair, proven and well-supported. For 70B Q4 on a tighter budget: 2× RTX 3090 — 48 GB total, 25–35 tok/s, ~$1,400 for the GPU pair used, the best value for 70B inference. For 70B inference with professional reliability: 1× A100 80GB — single GPU, 25 tok/s, simpler setup, no PCIe overhead, available on cloud or $10K+ for ownership. For maximum consumer performance: 2× RTX 5090 — 64 GB total, 60+ tok/s on 70B Q4, ~$4,000 for the pair, handles larger contexts comfortably. The dual 3090 or dual 4090 configurations represent the most practical choices for most home and small-team multi-GPU LLM deployments in 2026.
Monitoring a Multi-GPU Deployment
Use nvidia-smi dmon to watch all GPUs simultaneously during inference. Watch for unbalanced utilisation — if one GPU consistently runs at 95% while the other sits at 60%, the tensor parallel split is inefficient. Common causes include uneven layer distribution or communication overhead dominating on slower PCIe interconnects. vLLM’s Prometheus metrics endpoint reports per-GPU memory usage and utilisation automatically, making it easy to build a Grafana dashboard tracking each GPU independently. For training runs, Weights and Biases and TensorBoard both support multi-GPU logging via Accelerate, capturing per-device throughput and aggregate training speed without custom instrumentation.
When a Single Large GPU Beats Two Smaller Ones
It is worth noting that a single A100 80GB (25 tok/s on 70B Q4) often outperforms two RTX 3090s (35 tok/s) when you account for setup complexity, tensor parallel communication overhead, and the reliability of a single well-tested GPU over a dual-consumer setup. Two consumer GPUs introduces twice the failure points, requires NVLink or PCIe communication between them, and demands more complex scheduling logic in your serving infrastructure. For production deployments where reliability matters more than squeezing maximum throughput from consumer hardware, a single data-centre GPU is often the more practical choice. For cost-conscious home or research setups where maximum capability per dollar matters, dual consumer GPUs remain compelling — just go in with clear expectations about the operational complexity involved.