AWS EC2 GPU Instance Families for LLMs
AWS offers several GPU instance families suited to LLM workloads, each targeting a different use case and budget point. Understanding the differences prevents paying for capabilities you do not need or under-provisioning for your actual workload.
p3 instances use NVIDIA V100 GPUs (16 GB VRAM each). The p3.2xlarge gives you one V100; the p3.8xlarge gives four; the p3.16xlarge gives eight. V100 is the oldest generation in active EC2 service and the cheapest entry point for managed GPU inference. Suitable for 7B–13B model inference and experimentation, but limited VRAM makes larger models impractical.
p4d instances use NVIDIA A100 40GB GPUs connected by NVLink. The p4d.24xlarge provides eight A100 40GB GPUs with 320 GB total VRAM — the standard configuration for 70B and 405B model work on AWS. NVLink interconnects deliver near-linear multi-GPU scaling efficiency. This is the most widely used instance type for LLM training and large model inference on AWS.
p4de instances use NVIDIA A100 80GB GPUs, doubling VRAM per card to 640 GB total on the p4de.24xlarge. Suitable when p4d’s 320 GB is insufficient — 405B models at Q8, or large-batch inference workloads that benefit from more KV cache space.
p5 instances use NVIDIA H100 SXM5 80GB GPUs on 3,200 GB/s NVSwitch interconnects. The p5.48xlarge provides eight H100s with 640 GB total VRAM and the highest available throughput on AWS. The H100’s 2.5x throughput advantage over A100 translates directly: LLM training jobs that take days on p4d complete in roughly half the time on p5.
g5 instances use NVIDIA A10G 24GB GPUs and target cost-sensitive inference. The g5.xlarge (one A10G) is the cheapest GPU inference option on AWS with modern hardware, suitable for 7B–13B model serving at production quality and low cost.
Instance Pricing Reference (Mid-2026)
Instance | GPU Config | On-Demand $/hr | Spot $/hr
------------------|-------------------|----------------|----------
g5.xlarge | 1x A10G 24GB | $1.006 | ~$0.30
g5.12xlarge | 4x A10G 96GB | $5.672 | ~$1.70
p3.2xlarge | 1x V100 16GB | $3.060 | ~$0.92
p3.8xlarge | 4x V100 64GB | $12.240 | ~$3.70
p4d.24xlarge | 8x A100 40GB | $32.773 | ~$9.80
p4de.24xlarge | 8x A100 80GB | $40.969 | ~$12.30
p5.48xlarge | 8x H100 80GB | $98.320 | ~$29.50
Spot instances provide 60–70% discounts but can be interrupted with 2-minute notice. For training jobs with checkpointing enabled, spot instances on p4d or g5 offer the best cost-performance ratio on AWS.
Setting Up an EC2 GPU Instance for LLM Inference
The fastest path from zero to a running vLLM server on AWS:
# 1. Launch instance via AWS CLI
aws ec2 run-instances --image-id ami-0abcdef1234567890 # Deep Learning AMI (Ubuntu 22.04)
--instance-type g5.xlarge --key-name your-key-pair --security-group-ids sg-xxxxxxxxxx --subnet-id subnet-xxxxxxxxxx --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200}}]' --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=llm-server}]'
# 2. SSH into the instance
ssh -i your-key.pem ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
# 3. Verify GPU
nvidia-smi
# 4. Install vLLM (CUDA already installed on Deep Learning AMI)
pip install vllm
# 5. Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=hf_your_token
# 6. Start serving
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000 --host 0.0.0.0
The AWS Deep Learning AMI pre-installs CUDA, cuDNN, PyTorch, and other ML dependencies, saving 30–60 minutes of setup compared to a blank Ubuntu instance. Always use it as your base for LLM workloads on EC2.
Storing Model Weights: S3 vs EBS vs EFS
Model weights need to be accessible to your EC2 instance quickly. Three storage options have different trade-offs. EBS (Elastic Block Store) attached directly to the instance is the fastest option — NVMe SSDs deliver 3–7 GB/s sequential read throughput. A 200–500 GB gp3 volume attached to your instance stores all your models locally and loads them in 20–60 seconds. The downside is cost ($0.08/GB/month) and the fact that EBS volumes are tied to a specific Availability Zone. S3 stores models cheaply ($0.023/GB/month) but download speeds vary — typically 500 MB/s to 2 GB/s from EC2 to S3 in the same region, meaning a 40 GB model takes 20–80 seconds to download. Use S3 for model storage and copy to a local EBS volume on first use. EFS (Elastic File System) is shared filesystem storage accessible across multiple instances simultaneously — useful if several EC2 instances need to serve the same model without each maintaining their own copy, at higher latency than EBS.
# Copy model from S3 to local EBS on instance startup
aws s3 cp s3://your-bucket/models/llama-3.1-8b/ /home/ubuntu/models/llama-3.1-8b/ --recursive
# Takes ~30 seconds for an 8B model from same-region S3
Security Groups and VPC Configuration
Never expose your vLLM server directly to the internet. Configure your security group to allow inbound traffic on port 8000 only from your specific IP addresses or from within your VPC:
# Allow access only from your IP
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxxxx --protocol tcp --port 8000 --cidr YOUR_IP/32
# Or allow from within VPC only (for internal services)
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxxxx --protocol tcp --port 8000 --cidr 10.0.0.0/8
For production deployments, place your EC2 instances in a private subnet and expose the LLM API through an Application Load Balancer (ALB) in a public subnet. The ALB handles TLS termination, authentication via Cognito or header-based API keys, and load balancing across multiple inference instances.
Auto Scaling for Variable Workloads
GPU instances are expensive when idle. For workloads with variable traffic — heavy during business hours, low overnight — auto scaling groups reduce cost by terminating excess instances when demand drops. Set up a target tracking policy based on custom CloudWatch metrics from your vLLM Prometheus endpoint:
# Create CloudWatch metric from vLLM queue depth
import boto3
cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
def push_queue_metric(queue_depth: int):
cloudwatch.put_metric_data(
Namespace='LLMInference',
MetricData=[{
'MetricName': 'QueueDepth',
'Value': queue_depth,
'Unit': 'Count'
}]
)
An auto scaling policy that scales out when average queue depth exceeds 5 and scales in when it drops below 1 keeps inference latency stable under variable load while minimising idle GPU costs. GPU instances take 3–5 minutes to start and load models, so configure your scale-out policy to trigger early — before the queue becomes unacceptably long.
Spot Instances with Fault Tolerance
For training jobs on AWS, spot instances on p4d or p5 reduce costs by 60–70%. The key to making spot instances work for training is robust checkpointing and automatic resumption. Use SageMaker’s managed spot training feature, which handles interruption detection and automatic restart:
import sagemaker
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
entry_point='train.py',
role=sagemaker.get_execution_role(),
instance_type='ml.p4d.24xlarge',
instance_count=1,
use_spot_instances=True,
max_wait=86400, # Maximum time including interruptions (24 hours)
max_run=72000, # Maximum actual training time (20 hours)
checkpoint_s3_uri='s3://your-bucket/checkpoints/',
checkpoint_local_path='/opt/ml/checkpoints/'
)
SageMaker saves checkpoints to S3 automatically on interruption and resumes from the latest checkpoint on restart. With this pattern, a training job that runs on spot can experience multiple interruptions without losing progress, typically achieving 50–70% cost reduction versus on-demand with minimal additional engineering overhead.
AWS Inferentia and Trainium: The Cheaper Alternative
For sustained production inference at scale, AWS’s custom silicon chips offer compelling economics. Inferentia2 (inf2 instances) and Trainium2 (trn2 instances) are purpose-built for transformer inference and training respectively, at significantly lower cost than equivalent GPU instances. The inf2.8xlarge provides 32 GB of Inferentia2 accelerator memory at roughly $0.76/hr — substantially cheaper than a g5.xlarge at $1.00/hr with better sustained inference throughput for supported models. The trade-off is that models must be compiled to the Neuron SDK format, which adds a one-time compilation step and limits flexibility to supported architectures. For stable production workloads running continuously on a small set of models, Inferentia2 is worth evaluating as a cost optimisation. For research, experimentation, and workloads requiring frequent model changes, GPU instances remain more flexible.
Security Groups and VPC Configuration
Never expose your vLLM API directly to the internet. Configure security groups to restrict inbound access on port 8000 to your specific IP range or your VPC CIDR only. For production deployments, place inference instances in a private subnet and expose the API through an Application Load Balancer in a public subnet. The ALB handles TLS termination, API key authentication via header inspection, and load balancing across multiple inference instances. This pattern isolates your model server from direct internet access while remaining accessible to your application tier.
Auto Scaling for Cost Efficiency
GPU instances cost money even when idle. For variable workloads — heavy during business hours, quiet overnight — auto scaling groups terminate excess instances when demand drops. Push vLLM queue depth as a custom CloudWatch metric and create a target tracking policy: scale out when queue depth exceeds a threshold, scale in when it returns to zero. GPU instances take 3–5 minutes to start and load models, so configure scale-out to trigger before the queue becomes unacceptably deep. For inference workloads with predictable patterns, scheduled scaling — adding capacity at 8am and removing it at 8pm — is simpler and more cost-effective than reactive scaling.
Spot Instances for Training
For LLM training jobs on AWS, spot instances on p4d or p5 reduce costs by 60–70%. Use SageMaker’s managed spot training to handle interruption and automatic restart from S3-backed checkpoints:
estimator = PyTorch(
entry_point='train.py',
instance_type='ml.p4d.24xlarge',
use_spot_instances=True,
max_wait=86400,
checkpoint_s3_uri='s3://your-bucket/checkpoints/',
checkpoint_local_path='/opt/ml/checkpoints/'
)
SageMaker saves checkpoints on interruption and resumes automatically. With robust checkpointing, spot training delivers 50–70% cost reduction versus on-demand with minimal additional complexity.
AWS Inferentia2: Cheaper Production Inference
For sustained, high-volume inference on a stable set of models, AWS Inferentia2 (inf2 instances) offers substantially lower cost than equivalent GPU instances. The inf2.8xlarge provides 32 GB of accelerator memory at roughly $0.76/hr versus $1.00/hr for a g5.xlarge. The trade-off is that models must be compiled to AWS Neuron SDK format — a one-time step that limits flexibility. For research and frequent model changes, GPU instances are more practical. For production workloads running the same model continuously at high volume, Inferentia2 is worth evaluating as a significant cost reduction after your model choice is stable.
Recommended Instance Selection
For getting started and development: g5.xlarge — one A10G 24GB, $1/hr on-demand, runs 7B–13B models well, cheapest modern GPU option on AWS. For production 70B inference: p4d.24xlarge spot — eight A100 40GB, ~$10/hr spot, runs 70B at Q4 with excellent throughput. For the fastest training: p5.48xlarge spot — eight H100 80GB, ~$30/hr spot, 2.5x faster than p4d for LLM training at roughly 3x the cost. For cost-sensitive 7B–13B serving at scale: inf2.8xlarge — Inferentia2, $0.76/hr, excellent sustained throughput for stable production models. Starting with g5 instances for development and prototyping, then graduating to p4d or p5 spot for training and high-traffic inference, is the most cost-effective AWS path for most LLM teams.
Deploying with Docker on EC2
For reproducible deployments, containerising your LLM server with Docker ensures consistent behaviour across development, staging, and production instances. The official vLLM Docker image handles all CUDA dependencies:
# Install Docker and NVIDIA Container Toolkit on EC2
sudo apt update && sudo apt install -y docker.io
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed "s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
# Run vLLM container
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.90
With Docker in place, you can use AWS ECR to store custom images with pre-downloaded models, dramatically reducing cold-start time on new instances. Build an image with the model baked in, push to ECR, and your launch template pulls it on instance start — turning a 5-minute model download into a 30-second image pull from ECR’s fast regional network.
Cost Estimation for Common Workloads
A few reference calculations for planning AWS EC2 LLM spend. A development setup running a g5.xlarge 8 hours per day on-demand costs roughly $240/month. A production inference cluster running two p4d.24xlarge instances continuously on spot costs roughly $14,000/month — significantly less than the $47,000/month on-demand equivalent. A weekly training run using a p5.48xlarge for 12 hours on spot costs roughly $350 per run. These numbers shift with spot price fluctuations, but they give a useful order-of-magnitude sense for budgeting. Always set AWS Budget alerts at both warning (80%) and critical (100%) thresholds before starting any new GPU workload — unexpected costs from misconfigured auto-scaling or forgotten running instances are the most common source of AWS billing surprises for LLM teams.
EC2 vs SageMaker for LLM Workloads
EC2 gives you maximum flexibility — you control the entire software stack, deployment process, and infrastructure configuration. SageMaker adds managed services on top of EC2: automatic instance provisioning, built-in experiment tracking, managed spot training with automatic checkpointing, and a model registry. The trade-off is higher cost (SageMaker adds roughly 20–30% on top of underlying EC2 instance costs) and more complex configuration. For teams building a robust MLOps pipeline with multiple stakeholders, SageMaker’s managed features justify the overhead. For individual developers and small teams comfortable managing their own infrastructure, raw EC2 with a simple deployment script is simpler, cheaper, and gives you more direct control over exactly what runs where.