How to Serve LLMs in Production with vLLM: Setup, Configuration, and Scaling

What Is vLLM and Why Does It Matter?

vLLM is an open-source inference engine built specifically for serving large language models at high throughput. Developed at UC Berkeley and released in 2023, it introduced PagedAttention, a memory management technique that dramatically improves GPU utilisation during inference by handling the KV cache the way an operating system handles virtual memory. At release, the vLLM team reported up to 24x higher throughput than Hugging Face Transformers and up to 3.5x higher than Text Generation Inference, with the same model producing the same quality output.

For teams deploying LLMs in production, vLLM has become the de facto standard. It is OpenAI API-compatible out of the box, supports virtually every popular open-source model, handles concurrent requests efficiently, and runs on standard NVIDIA GPUs. If you are moving an LLM from a notebook experiment into a production API, vLLM is almost certainly the right tool.

How PagedAttention Works

To understand why vLLM is fast, it helps to understand the problem it solves. During inference, the attention mechanism needs to store key and value tensors for every token in the context — the KV cache. In naive implementations, memory is allocated contiguously for each sequence’s maximum possible length upfront. This wastes GPU memory (sequences are often much shorter than the maximum), limits the number of requests that can be processed simultaneously, and causes fragmentation as sequences of different lengths are processed.

PagedAttention borrows from the OS virtual memory concept. KV cache memory is divided into fixed-size blocks (pages), and each sequence’s cache is stored in non-contiguous pages that are allocated on demand as the sequence grows. This eliminates wasted allocation, allows the GPU memory to be fully utilised across many concurrent sequences, and enables memory sharing between sequences with the same prefix — a significant optimisation for scenarios like few-shot prompting where many requests share a common context.

The practical impact is that vLLM can process many more concurrent requests on the same hardware compared to frameworks that allocate KV cache naively, translating directly to higher throughput and lower cost per request.
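The block-table idea can be sketched in a few lines of Python. This is a toy allocator written for illustration, not vLLM's implementation: the class name PagedKVCache, the block size of 16 tokens, and the method names are all assumptions, and it only tracks which blocks a sequence owns rather than storing any tensors.

```python
class PagedKVCache:
    """Toy block allocator: tracks which physical blocks each sequence owns."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}                      # sequence id -> list of block ids
        self.lengths = {}                           # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks (a real scheduler would preempt)")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):          # 17 tokens need two 16-token blocks
    cache.append_token("a")
for _ in range(3):           # 3 tokens need one block
    cache.append_token("b")
print(cache.block_tables, "free:", len(cache.free_blocks))
```

A sequence only ever wastes the unused tail of its last block, which is the property that lets many sequences of different lengths share one pool of GPU memory.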

Installation and Basic Setup

vLLM requires a CUDA-capable NVIDIA GPU. Install it with pip:

pip install vllm

The minimum GPU memory depends on the model you want to serve. As a rule of thumb, you need roughly 2 bytes per parameter for FP16 models (a 7B model needs ~14GB), plus additional memory for the KV cache. An RTX 3090 (24GB) handles 7B models comfortably, an A100 40GB handles 13B models, and an A100 80GB handles models up to roughly 30B. A 70B model needs ~140GB for its FP16 weights alone, so it requires multiple GPUs or quantisation.
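The 2-bytes-per-parameter rule is easy to turn into a quick check. The sketch below is only a back-of-the-envelope estimate: the function names are made up for this example, and the 4GB reserved for the KV cache is an arbitrary assumption rather than a vLLM requirement.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bytes_per_param


def fits_on_gpu(params_billion: float, gpu_memory_gb: float,
                bytes_per_param: float = 2.0, kv_reserve_gb: float = 4.0) -> bool:
    """True if weights plus an assumed KV cache reserve fit on a single GPU."""
    needed = weight_memory_gb(params_billion, bytes_per_param) + kv_reserve_gb
    return needed <= gpu_memory_gb


for label, params, gpu in [("7B on 24GB", 7, 24), ("13B on 40GB", 13, 40), ("70B on 80GB", 70, 80)]:
    print(label, "->", weight_memory_gb(params), "GB of weights, fits:", fits_on_gpu(params, gpu))
```

Passing bytes_per_param=0.5 approximates a 4-bit quantised model, which is how a 70B model can be brought down to about 35GB of weights.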

Start a basic inference server in one line:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000

This starts an OpenAI-compatible API server. You can immediately point any OpenAI SDK at it by changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain transformers in one paragraph."}]
)
print(response.choices[0].message.content)

Key Configuration Options

vLLM exposes a rich set of configuration flags that significantly affect performance, memory usage, and throughput. Understanding the most important ones lets you tune the server for your specific workload.

--tensor-parallel-size splits the model across multiple GPUs. If you have 4 GPUs and a 70B model, set this to 4 to distribute the model weights evenly. Each GPU handles a slice of the computation and they communicate via NVLink or PCIe. This is the primary way to serve models that exceed the memory of a single GPU.

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000

--gpu-memory-utilization sets the fraction of each GPU’s memory that vLLM may use in total, covering model weights, activations, and the KV cache. The default is 0.9 (90%). Whatever remains of that budget after the weights are loaded goes to the KV cache, so lowering the value leaves headroom for other processes sharing the GPU, while raising it enlarges the KV cache and therefore the maximum concurrent request capacity.

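--max-model-len sets the maximum sequence length (input + output tokens combined). vLLM must be able to fit at least one sequence of this length in the KV cache, and any request may grow up to it, so lowering the value from the model’s default caps the memory a single request can claim and lets more sequences run concurrently. If your application never needs 128K-token contexts, setting this to 8192 or 16384 can raise achievable concurrency substantially; the actual gain depends on your request length distribution, so measure it rather than assuming a fixed multiplier.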

--quantization enables quantised model loading. vLLM supports AWQ, GPTQ, and FP8 quantisation, which reduce memory requirements significantly with modest quality trade-offs:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --port 8000

--max-num-seqs limits the maximum number of sequences processed in a single batch. The default is generous; lowering it reduces latency for interactive applications at the cost of throughput.
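To see how --gpu-memory-utilization and --max-model-len interact, it helps to estimate how many tokens of KV cache fit in the remaining memory. The sketch below is a rough estimate under stated assumptions: the architecture numbers (32 layers, 8 KV heads, head dimension 128) are those published for Llama-3.1-8B and should be checked against your model’s config.json, the 2GB overhead for activations is a guess, and the function names are invented for this example.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token; the factor of 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_tokens(gpu_memory_gb: float, utilization: float, weights_gb: float,
                    bytes_per_token: int, overhead_gb: float = 2.0) -> int:
    """Tokens of KV cache that fit in the budget left after weights and overhead."""
    budget_bytes = (gpu_memory_gb * utilization - weights_gb - overhead_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
tokens = kv_cache_tokens(gpu_memory_gb=24, utilization=0.9, weights_gb=16,
                         bytes_per_token=per_token)
print("KV bytes per token:", per_token)
print("Tokens of KV cache available:", tokens)
for max_len in (4096, 8192, 131072):
    # Worst case: every sequence grows to the full --max-model-len.
    print("max-model-len", max_len, "-> full-length sequences that fit:", tokens // max_len)
```

Because blocks are allocated on demand, real concurrency is usually higher than this worst case, but the calculation shows why a very large maximum length can prevent even one full-length sequence from fitting on a small GPU.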

Continuous Batching: How vLLM Handles Concurrency

Traditional batching in LLM inference groups requests together and processes them as a fixed batch — all sequences start and finish together. This is inefficient because sequences have different lengths: short sequences finish early but their GPU slots sit idle waiting for longer ones to complete before the batch is released.

vLLM uses continuous batching (also called iteration-level scheduling). At each token generation step, the scheduler checks for completed sequences and immediately fills those GPU slots with new requests from the queue. This keeps GPU utilisation high regardless of sequence length variation, which is the normal case in production where some requests are one-sentence and others are multi-paragraph.
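The difference between the two scheduling strategies can be illustrated with a small simulation that counts decode steps. This is a simplified model, not vLLM’s scheduler: it ignores prefill, assumes every running sequence produces one token per step, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: freed slots are refilled before every step."""
    queue = deque(lengths)
    running = []  # remaining tokens for each sequence currently in a slot
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each sequence emits one token; those with one token left are done.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Illustrative workload: short and long outputs alternate, two slots available.
lengths = [2, 10, 2, 10, 2, 10, 2, 10]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

For this workload the static schedule needs 40 steps and the continuous one 26, close to the 24-step lower bound for 48 tokens on two slots.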

The practical effect is that vLLM’s throughput degrades gracefully under load rather than cliff-edging. As request rate increases, latency increases gradually rather than spiking as batch slots fill up. You can monitor the scheduler’s behaviour through the metrics endpoint:

curl http://localhost:8000/metrics

Key metrics to watch: vllm:num_requests_running (currently being processed), vllm:num_requests_waiting (queued), and vllm:gpu_cache_usage_perc (KV cache utilisation, reported as a value between 0 and 1). If the waiting queue is consistently non-empty or KV cache utilisation is near 100%, you are at capacity and need more GPU resources or a tighter --max-model-len.
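A small script can poll these gauges and flag saturation. The sketch below assumes the endpoint returns standard Prometheus text format without timestamps; the helper names, the 0.95 cache threshold, and the default URL are choices made for this example, and metric names can differ between vLLM versions.

```python
import urllib.request

WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)


def parse_metrics(text, names=WATCHED):
    """Extract the watched gauges from Prometheus text-format output."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        sample, _, value = line.rpartition(" ")
        name = sample.split("{", 1)[0]        # drop the {label="..."} part
        if name in names:
            values[name] = float(value)
    return values


def fetch_metrics(url="http://localhost:8000/metrics"):
    """Fetch and parse metrics from a running server (performs a network call)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


def at_capacity(metrics, cache_threshold=0.95):
    """True if requests are queued or the KV cache is nearly full."""
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    return waiting > 0 or cache >= cache_threshold
```

Calling at_capacity(fetch_metrics()) on a schedule gives a crude saturation signal; a single positive reading is normal under bursts, so alert only when it stays true across several polls.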

Deploying vLLM with Docker

The official vLLM Docker image makes deployment straightforward and reproducible:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.9

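The --ipc=host flag (or a sufficiently large --shm-size) gives the container access to host shared memory, which PyTorch needs to share data between processes, particularly for tensor parallelism across multiple GPUs. Mount the Hugging Face cache to avoid re-downloading model weights on every container restart. For production, build a custom image with your specific model pre-downloaded to avoid cold-start download times. Because the Llama repositories are gated, the download step needs a Hugging Face token available at build time: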

FROM vllm/vllm-openai:latest
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('meta-llama/Llama-3.1-8B-Instruct')"
CMD ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]

Deploying vLLM on Kubernetes

For production at scale, Kubernetes with GPU node pools is the standard deployment target. A minimal deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-3.1-8B-Instruct
        - --gpu-memory-utilization
        - "0.9"
        - --port
        - "8000"
        resources:
          limits:
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Use a Kubernetes HorizontalPodAutoscaler to scale replicas based on request queue depth or GPU utilisation. Neither is a built-in HPA metric, so both have to be exposed through a custom metrics pipeline such as the Prometheus Adapter. For models that fit on a single GPU, horizontal scaling (more replicas) is simpler and more cost-effective than tensor parallelism. For 70B+ models that require multiple GPUs per replica, set nvidia.com/gpu to the required count and pass a matching --tensor-parallel-size.

Speculative Decoding for Lower Latency

vLLM supports speculative decoding, which uses a small draft model to speculatively generate several tokens ahead, then verifies them with the target model in a single forward pass. When the draft model’s predictions are correct (which they often are for common phrases and continuations), you get multiple tokens for the cost of roughly one large model forward pass, significantly reducing inter-token latency and total generation time (time-to-first-token is governed by prompt prefill and does not benefit):

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4

Speculative decoding is most beneficial for interactive applications where latency matters more than raw throughput. It is less beneficial for batch processing workloads where you are already saturating the GPU with many concurrent sequences.
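The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant for illustration only: production engines typically verify with rejection sampling over probability distributions and score all proposed positions in one batched forward pass, whereas here both models are plain Python functions (hypothetical stand-ins) that return the next token for a context.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round: draft proposes k tokens, target keeps the longest correct prefix."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # 2. The target model checks each position (one batched pass in a real engine).
    accepted = []
    for token in proposed:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong token and stop
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the same pass also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target_next(context):
    """Toy target model: always continues the fixed sentence."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft_next(context):
    """Toy draft model: agrees with the target except at position 3."""
    return "cat" if len(context) == 3 else target_next(context)


first = speculative_step([], draft_next, target_next, k=5)
second = speculative_step(first, draft_next, target_next, k=3)
print(first)
print(second)
```

Each call stands in for one target-model pass, and both rounds here yield four tokens: the first stops at the mismatch and substitutes the target’s token, the second accepts all three proposals and gains a bonus token. The output is identical to what the target model alone would have produced.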

Benchmarking Your Deployment

vLLM ships with a benchmarking tool that simulates concurrent requests and measures throughput and latency percentiles:

python benchmarks/benchmark_serving.py \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate 10

Run this benchmark at different request rates to find your throughput ceiling and latency inflection point — the request rate at which p99 latency starts growing rapidly. That inflection point is your practical capacity limit for a given SLA. Size your deployment to operate comfortably below it under peak load, not at it.

For production sizing, a useful rule of thumb is to target 60–70% of the throughput ceiling as your maximum expected load, leaving headroom for traffic spikes. vLLM’s metrics endpoint integrates with Prometheus and Grafana, making it straightforward to build dashboards that track queue depth, latency percentiles, and GPU utilisation in real time.
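Once you have p99 latency at several request rates, the sizing rule reduces to a short calculation. The sketch below uses invented measurements and an invented 2.5-second SLA purely to show the arithmetic; the function names and the 0.65 headroom factor (the midpoint of 60–70%) are assumptions for this example.

```python
def capacity_limit(measurements, p99_sla_s):
    """Highest tested request rate before p99 latency first exceeds the SLA."""
    limit = None
    for rate, p99 in sorted(measurements):
        if p99 > p99_sla_s:
            break
        limit = rate
    return limit


def target_peak_load(limit, headroom=0.65):
    """Peak load to plan for, leaving room for traffic spikes."""
    return limit * headroom


# Made-up benchmark results: (requests per second, p99 latency in seconds).
results = [(2, 0.8), (5, 0.9), (10, 1.2), (15, 2.0), (20, 6.5)]

limit = capacity_limit(results, p99_sla_s=2.5)
print("capacity limit (req/s):", limit)
print("planned peak per replica (req/s):", target_peak_load(limit))
```

Dividing your expected peak traffic by the planned per-replica figure gives a first estimate of how many replicas to run.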

Prefix Caching for Shared Prompts

When many requests share the same long system prompt or few-shot examples — a common pattern in production deployments — vLLM’s prefix caching feature avoids recomputing the KV cache for that shared prefix on every request. Enable it with a single flag:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching

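With prefix caching enabled, the first request that includes a given system prompt pays the full computation cost. Subsequent requests with the same prefix reuse the cached KV blocks instead of recomputing them, provided those blocks have not been evicted under memory pressure, which reduces time-to-first-token significantly for long shared prefixes. The saving applies to prompt processing (prefill) only; the cost of generating output tokens is unchanged. For applications with a 2,000-token system prompt across 10,000 daily requests, this can reduce total compute by something like 30–50%, depending on how large the shared prefix is relative to the unique part of each prompt and the generated output, while also cutting latency for all requests after the first.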
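The prefill share of the saving can be estimated directly from prompt sizes. This is a rough sketch rather than a measurement: the function name is made up, the 200-token unique prompt is an illustrative assumption, and hit_rate models cache evictions in the simplest possible way.

```python
def prefill_tokens_saved_fraction(shared_prefix_tokens, unique_prompt_tokens,
                                  num_requests, hit_rate=1.0):
    """Fraction of prompt (prefill) tokens that are served from the prefix cache."""
    total = num_requests * (shared_prefix_tokens + unique_prompt_tokens)
    # The first request always computes the prefix; later ones skip it on a hit.
    saved = (num_requests - 1) * shared_prefix_tokens * hit_rate
    return saved / total


fraction = prefill_tokens_saved_fraction(
    shared_prefix_tokens=2000, unique_prompt_tokens=200, num_requests=10000)
print(f"prefill tokens avoided: {fraction:.1%}")
```

The result covers prefill only. Because decoding cost is unaffected, the reduction in total compute is smaller and shrinks further as outputs get longer.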

Choosing Between vLLM, TGI, and Triton

vLLM is not the only production inference option. Text Generation Inference (TGI) from Hugging Face and NVIDIA’s Triton Inference Server are the main alternatives. vLLM’s throughput advantage from PagedAttention is most pronounced under concurrent load — if you are serving many users simultaneously, vLLM wins clearly. TGI has a smaller footprint and simpler ops story, making it a reasonable choice for lower-traffic deployments or teams already deep in the Hugging Face ecosystem. Triton is the right choice when you need tight integration with NVIDIA’s ecosystem, multi-model serving on shared GPU infrastructure, or hardware-specific optimisations that vLLM does not yet expose. For most teams starting a new production LLM deployment in 2026, vLLM is the default choice — it is the most actively developed, has the broadest model support, and delivers the best throughput for concurrent workloads.

Model Support and Loading from Different Sources

vLLM supports all major model architectures including Llama, Mistral, Qwen, Gemma, Falcon, Mixtral (MoE), and multimodal models like LLaVA. Models can be loaded from Hugging Face Hub, local directories, or S3-compatible storage. For gated models that require authentication, set your Hugging Face token:

export HUGGING_FACE_HUB_TOKEN=hf_your_token_here
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct

For loading from a local directory (important in air-gapped or regulated environments where you cannot pull from the internet at runtime):

python -m vllm.entrypoints.openai.api_server \
  --model /mnt/models/llama-3.1-8b-instruct \
  --tokenizer /mnt/models/llama-3.1-8b-instruct

Pre-downloading models to local storage before deployment avoids the multi-minute cold start that downloading a 16GB model at container launch would cause. In Kubernetes, this is typically handled by an init container or a persistent volume pre-populated with model weights, ensuring the serving container starts in seconds rather than minutes.

Production Hardening Checklist

Before routing production traffic to a vLLM deployment, a few operational items are worth addressing. Add an authentication layer in front of the API. The only built-in protection is a single static key set with --api-key; without it, any client that can reach port 8000 can make requests. For anything beyond one shared key, a simple reverse proxy like Nginx or a cloud load balancer with an API key header check is sufficient. Cap output length per request with the max_tokens parameter and enforce a ceiling at the application layer to prevent runaway generations that monopolise GPU slots; --max-model-len bounds only the combined prompt and output length. Configure liveness and readiness probes in Kubernetes against the server’s /health endpoint. Model loading can take anywhere from tens of seconds to several minutes depending on model size and storage speed, and the readiness probe prevents traffic from reaching the pod until the server is actually ready to serve. Enable structured logging and route logs to your observability stack so you can correlate request IDs with latency, errors, and GPU metrics. Finally, test your deployment’s behaviour under overload: at some request rate, the queue will grow faster than it drains. Know what happens (graceful queuing, timeout, or error), and make sure your clients handle it correctly rather than retrying aggressively and amplifying the overload.
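On the client side, the usual way to avoid amplifying an overload is exponential backoff with jitter. The sketch below is one possible shape for such a helper, not part of vLLM or the OpenAI SDK: the function name, the default delays, and the choice of which exceptions count as retryable are all assumptions to adapt to your client library.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay_s=0.5, max_delay_s=8.0,
                      retryable=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call send(), retrying retryable failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay cap doubles each attempt; full jitter spreads retries out
            # so many clients do not all retry at the same instant.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Here send would be a zero-argument function that wraps your request, for example a lambda around client.chat.completions.create(...), with retryable set to the timeout and connection error classes your client raises. The sleep parameter exists so the helper can be tested without waiting.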