Why GPU Cloud Costs Vary So Much
GPU cloud infrastructure is sold through dramatically different business models, and understanding them is key to finding the right cost-performance fit. The major hyperscalers (AWS, GCP, Azure) charge premium prices for the ecosystem value they provide — managed services, compliance certifications, SLAs, and deep integration with the rest of their platforms. Specialised GPU cloud providers (Lambda Labs, CoreWeave, Vast.ai, RunPod) offer bare GPU capacity at significantly lower prices by stripping away the managed services overhead and focusing on a single value proposition: give researchers and developers GPU access at the lowest possible price.
For LLM inference and training workloads that do not require deep integration with AWS or Azure services, the specialised providers can deliver 2–5x lower cost for equivalent GPU capacity. The trade-off is less managed infrastructure, fewer compliance certifications, and more operational self-reliance. For teams comfortable managing their own deployment — which increasingly means “can run vLLM on a server with a GPU” — the cost savings are substantial.
Provider Comparison: Price and GPU Availability (Mid-2026)
Provider | A100 80GB $/hr | H100 80GB $/hr | RTX 4090 $/hr | Notes
---------------|----------------|----------------|---------------|------------------
AWS (on-demand)| $4.10 | $12.29 | N/A | Enterprise SLAs
GCP (on-demand)| $3.67 | $12.34 | N/A | Vertex AI integration
Azure (on-dem.)| $3.40 | $12.29 | N/A | OpenAI compliance
Lambda Labs | $1.99 | $3.29 | $0.80 | Dedicated instances
CoreWeave | $2.06 | $2.98 | $0.76 | Kubernetes-native
RunPod (secure)| $1.99 | $2.79 | $0.74 | Community + secure
Vast.ai | $0.90-1.50 | $1.50-2.50 | $0.30-0.60 | Market pricing
Together AI | N/A (API) | N/A (API) | N/A | $0.59-0.88/1M tokens
Groq | N/A (API) | N/A (API) | N/A | $0.59-0.79/1M tokens
The difference between hyperscaler and specialised provider pricing is stark: A100 at $4.10/hr on AWS versus $0.90–$2.06/hr on specialised providers is a 2–4x cost difference for equivalent hardware. At scale — 10 GPUs running continuously for a month — this translates to tens of thousands of dollars in monthly savings.
Lambda Labs: Reliable Reserved Instances
Lambda Labs positions itself as the most reliable specialised GPU provider, with a focus on dedicated instances (not shared or spot) and straightforward pricing. They offer on-demand A100 and H100 instances with no preemption risk, making them suitable for long-running training jobs that cannot tolerate interruption. Lambda also offers reserved instance pricing (1-year and 3-year terms) at further discounts of 30–50% off on-demand rates.
Lambda’s deployment experience is similar to a basic cloud provider — you spin up a VM with CUDA pre-installed, SSH in, and you are running. No managed container orchestration, no auto-scaling, no managed databases. For teams that want a clean GPU VM at low cost without the hyperscaler overhead, Lambda is the most straightforward option. Their filesystem service (persistent Lambda storage) and team management features have matured significantly and make it viable for small-to-medium team production deployments.
CoreWeave: Kubernetes-Native GPU Cloud
CoreWeave is designed specifically for GPU workloads at scale. Their differentiator is a Kubernetes-native platform built from the ground up for GPU scheduling — deploying containerised LLM inference on CoreWeave feels much closer to a managed Kubernetes service than the bare-VM experience of Lambda. They offer the full Kubernetes API, GPU-aware scheduling, autoscaling, and persistent volumes, making it possible to deploy production vLLM clusters on CoreWeave with similar operational patterns to AWS EKS but at lower GPU costs.
# Example CoreWeave vLLM deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 2
template:
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--port", "8000"]
resources:
limits:
nvidia.com/gpu: "1"
memory: "40Gi"
CoreWeave is the right choice for teams that want low-cost GPU capacity with production-grade infrastructure — Kubernetes, autoscaling, persistent storage — without the hyperscaler price premium. Their H100 pricing at roughly $2.98/hr versus AWS at $12.29/hr represents a 4x cost reduction for teams whose workloads do not require AWS-specific services.
RunPod: Flexible Secure and Community Instances
RunPod offers two tiers: Secure Cloud (dedicated hardware, higher availability guarantees) and Community Cloud (consumer hardware contributed by third parties, lower prices but variable reliability). For production LLM inference, the Secure Cloud tier is appropriate. For development, experimentation, and interruptible training jobs, Community Cloud can deliver extremely low prices — RTX 4090 at $0.30–$0.60/hr in Community Cloud versus $0.74/hr in Secure Cloud.
RunPod’s pod interface is GPU-VM-based but with a cleaner user experience than a raw VM. They provide pre-built pod templates with common LLM stacks (Ollama, vLLM, Jupyter with PyTorch) that launch in minutes. The network storage system allows data to persist across pod restarts. For individual developers and small teams who want flexibility — spin up a pod for a training run, terminate it when done, pay only for active use — RunPod’s on-demand model is cost-effective.
Vast.ai: The GPU Marketplace
Vast.ai is a marketplace where GPU owners (data centres, individuals with high-end hardware) list their capacity for rent at prices they set. This creates a genuinely competitive market where GPU prices are driven down by supply and demand rather than set by a single provider. The result is often the lowest available GPU prices — RTX 4090s at $0.30–$0.50/hr, A100 80GB at $0.90–$1.50/hr — but with highly variable hardware quality, reliability, and network performance.
For workloads that can tolerate variability — pre-training experiments, hyperparameter searches, batch processing jobs where a failure just means re-running — Vast.ai can dramatically reduce compute costs. For latency-sensitive production inference or long-running training jobs that cannot be interrupted, the reliability risk makes it unsuitable. The search and filtering interface lets you select instances by GPU model, VRAM, network bandwidth, reliability score (based on historical uptime), and location, giving reasonable control over the quality spectrum.
Hosted Inference APIs: Together AI and Groq
For teams that want open-source model quality without managing GPU infrastructure, hosted inference APIs are an alternative to raw GPU rental. Together AI and Groq both offer per-token pricing for popular open-source models at rates far below equivalent GPU rental costs for typical inference workloads.
Together AI offers Llama 3.3 70B at $0.88/million tokens input and output — equivalent to running your own A100 server at roughly 50 million tokens per hour of utilisation to break even. For workloads well below that rate, Together AI is cheaper than provisioning your own GPU. Groq uses custom LPU (Language Processing Unit) hardware that delivers exceptionally fast token generation — Llama 3.3 70B at hundreds of tokens per second — at $0.59/$0.79 per million tokens input/output. For latency-sensitive applications where speed matters and data residency requirements allow third-party APIs, Groq’s throughput is hard to match at any price for interactive use cases.
Spot and Interruptible Instances
All major hyperscalers and most specialised providers offer spot or preemptible GPU instances at 60–80% discounts off on-demand prices. AWS spot A100 instances run roughly $10/hr versus $33/hr on-demand. GCP preemptible A100 instances run roughly $9/hr versus $29/hr. These instances can be terminated with short notice (2 minutes on AWS, 30 seconds on GCP), making them appropriate for training jobs with regular checkpointing but not for production inference.
For training workloads, spot instances with robust checkpointing are often the most cost-effective path on any provider. Implement checkpoint-and-resume logic at the framework level (PyTorch Lightning, Hugging Face Trainer, and most major training frameworks support this natively), set your checkpoint interval to match the expected spot preemption frequency, and use spot instances for the majority of your training compute. The savings are significant enough that the operational overhead of handling preemptions is almost always worth it for non-trivial training runs.
Choosing the Right Provider
A practical decision framework: if you need enterprise compliance, SLAs, or deep AWS/GCP/Azure service integration, pay the hyperscaler premium — the infrastructure value justifies it. If you need Kubernetes-native GPU infrastructure at lower cost for production LLM serving, CoreWeave is the strongest option. If you need simple GPU VMs for training at low cost with reliable uptime, Lambda Labs is the straightforward choice. If you need maximum flexibility and low cost for development and experimentation, RunPod or Vast.ai deliver the lowest prices with acceptable trade-offs for non-production workloads. If you need open-source model inference without managing GPUs at all, Together AI or Groq provide excellent per-token economics for moderate-volume use cases. Most teams end up using multiple providers simultaneously — hyperscalers for production serving with compliance requirements, specialised providers for training and development where cost matters more than managed services.
Getting Started on Each Platform
On Lambda Labs, create an account, add an SSH key, and launch an on-demand instance from the dashboard. Lambda pre-installs CUDA, PyTorch, and common ML libraries on their instances. You can be running vLLM within ten minutes of account creation. The Lambda Hub marketplace provides one-click deployment templates for common LLM serving setups.
On CoreWeave, you provision resources through their Kubernetes API using standard manifests. If you already know Kubernetes, the learning curve is minimal — CoreWeave behaves like a well-run managed Kubernetes service with GPU-aware scheduling built in. Their documentation includes ready-to-use manifests for vLLM, Triton, and other inference servers. Access is by application — most legitimate requests are approved within a few days.
On RunPod, the console provides a template marketplace with pre-configured pods for Ollama, vLLM, Jupyter, and other common setups. Select your GPU type, choose a template, and a configured environment is ready in about 60 seconds. The persistent network volume system ensures your model weights and data survive pod restarts without re-downloading.
On Vast.ai, the search interface lets you filter by GPU model, VRAM, reliability score, network speed, and price. Sort by cost and filter by reliability score above 95% to find cheap but dependable instances. Use their CLI for automating instance management in scripts:
pip install vastai
vastai set api-key your_api_key
# Search for RTX 4090 instances under $0.50/hr with high reliability
vastai search offers "reliability > 0.95 gpu_name=RTX_4090 dph_total < 0.50"
# Create an instance
vastai create instance OFFER_ID --image pytorch/pytorch:latest --disk 50
Cost Optimisation Strategies Across Providers
Several strategies apply regardless of which provider you choose. Use spot/preemptible instances for training — the 60–80% discount almost always justifies the checkpoint-and-resume overhead for non-trivial training jobs. Right-size your instances — running a 7B model on an 80GB A100 wastes most of the VRAM you are paying for. Use the smallest GPU that fits your model with adequate KV cache headroom. Terminate idle instances — GPU instances billed hourly accumulate cost even when idle. Implement auto-shutdown scripts that terminate instances after a period of inactivity, and use instance scheduling to avoid paying for overnight idle capacity. Batch your workloads — instead of running a small GPU instance 24/7, accumulate work and process it in a burst on a larger, cheaper-per-FLOP instance. Many LLM processing pipelines can tolerate hourly or daily batch runs rather than continuous serving. Use reserved instances for predictable baseline load — if you have a steady minimum GPU requirement, a 1-year or 3-year reserved instance commitment typically saves 30–50% over on-demand pricing across all major providers. Combine reserved instances for the base with on-demand or spot for peak handling.
Security and Data Considerations
Specialised GPU cloud providers vary significantly in their security posture and compliance certifications. Lambda Labs, CoreWeave, and RunPod Secure Cloud all offer SOC 2 Type II compliance and standard enterprise security controls. Vast.ai Community Cloud instances run on third-party hardware with no formal compliance guarantees — appropriate for public datasets and open-source model work, not for proprietary data or regulated workloads. Before sending sensitive training data or private model weights to any provider, verify their data handling policies, encryption at rest and in transit, and whether the provider's employees have access to workload data. For production inference handling user data, treat specialised GPU clouds with the same scrutiny you would apply to any third-party infrastructure provider.