LLM API Cost Comparison 2026: OpenAI vs Anthropic vs Google vs Open Source

Why LLM API Costs Matter More Than You Think

When teams first start using LLM APIs, cost is rarely the primary concern — getting things working takes priority. But as applications mature and traffic grows, API costs can become a significant operational expense surprisingly quickly. A chatbot handling 10,000 conversations per day at an average of 2,000 tokens per conversation generates 20 million tokens daily — about 600 million tokens per month. At frontier model pricing, that is hundreds to thousands of dollars per month just for one application. Understanding where each dollar goes and whether cheaper alternatives deliver acceptable quality is essential operational work.

Pricing Reference: Major Providers (Mid-2026)

All prices are per million tokens. Output tokens cost more because generation is computationally more expensive than input processing. Prices change frequently — verify at the provider’s pricing page before making decisions.

Provider / Model              | Input $/1M | Output $/1M | Context
------------------------------|------------|-------------|--------
OpenAI GPT-4o                 |    $2.50   |    $10.00   | 128K
OpenAI GPT-4o mini            |    $0.15   |     $0.60   | 128K
OpenAI o3                     |   $10.00   |    $40.00   | 200K
OpenAI o4-mini                |    $1.10   |     $4.40   | 200K
Anthropic Claude Opus 4       |   $15.00   |    $75.00   | 200K
Anthropic Claude Sonnet 4.6   |    $3.00   |    $15.00   | 200K
Anthropic Claude Haiku 4.5    |    $0.80   |     $4.00   | 200K
Google Gemini 1.5 Pro         |    $1.25   |     $5.00   | 2M
Google Gemini 1.5 Flash       |    $0.075  |     $0.30   | 1M
Google Gemini 2.0 Flash       |    $0.10   |     $0.40   | 1M
Llama 3.3 70B (Together AI)   |    $0.88   |     $0.88   | 128K
Llama 3.3 70B (Groq)          |    $0.59   |     $0.79   | 128K
Mistral Large                 |    $2.00   |     $6.00   | 128K
Mistral Small                 |    $0.10   |     $0.30   | 128K

Note: Batch API pricing (async processing) is typically 50% off for providers that support it. Prompt caching reduces input costs by 75–90% for repeated prefixes on Anthropic and OpenAI.

The Real Cost Drivers

Output-to-input ratio. Output tokens cost 3–5x more than input tokens on most providers. Applications with long generated responses have much higher effective costs than those with short responses. Prompting for conciseness and capping max_tokens usually cuts the bill more than shopping for cheaper input rates, because it shrinks the most expensive part of each request.

Context length. Every token in the context window costs money — system prompt, conversation history, retrieved documents. A 10,000-token system prompt across 100,000 daily requests adds 1 billion input tokens per day. Prompt caching eliminates most of this cost for shared prefixes — often the highest-leverage cost optimisation available.

Batching. Making one synchronous API call per document in a 1,000-document job costs roughly twice as much as submitting the same work through the batch API. Most providers offer 50% discounts for async batch workloads where results are not needed immediately.
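To see which of these drivers dominates a given workload, the cost of one request can be split into its parts. The following is a minimal sketch: the function name cost_breakdown and the three-way split are illustrative assumptions, and the example request is hypothetical, priced at the GPT-4o list rates from the table above.

```python
def cost_breakdown(system_tokens, other_input_tokens, output_tokens,
                   input_price_per_million, output_price_per_million):
    """Return the dollar cost of one request, split by cost driver."""
    system = system_tokens * input_price_per_million / 1_000_000
    other_input = other_input_tokens * input_price_per_million / 1_000_000
    output = output_tokens * output_price_per_million / 1_000_000
    return {
        "system_prompt": system,
        "other_input": other_input,
        "output": output,
        "total": system + other_input + output,
    }

# Hypothetical request: 10,000-token system prompt, 500 tokens of user input,
# 300 output tokens, at GPT-4o list prices ($2.50 in / $10.00 out).
parts = cost_breakdown(10_000, 500, 300, 2.50, 10.00)
for name, dollars in parts.items():
    print(f"{name:14s} ${dollars:.5f}")
```

In this example the system prompt accounts for most of the cost even though output tokens are four times more expensive per token, which is the situation where prompt caching pays off most.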

Cost Per Task: A Practical Comparison

Per-token pricing is hard to reason about abstractly. Here are cost estimates for common tasks at different model tiers (typical input/output lengths):

Task                    | Tokens (in/out) | GPT-4o   | Haiku 4.5 | Gemini 2.0 Flash
------------------------|-----------------|----------|-----------|-----------------
Simple Q&A              | 200 / 100       | $0.0015  | $0.0006   | $0.00006
Document summary        | 2000 / 500      | $0.010   | $0.0036   | $0.00040
Code generation         | 500 / 1000      | $0.011   | $0.0044   | $0.00045
Data extraction (JSON)  | 1000 / 200      | $0.0045  | $0.0016   | $0.00018
Long doc analysis       | 10000 / 1000    | $0.035   | $0.012    | $0.0014

At 100,000 requests per month for document summarisation:

GPT-4o:             $1,000/month
Haiku 4.5:            $360/month
Gemini 2.0 Flash:      $40/month
Groq Llama 70B:       $158/month

Gemini Flash’s pricing is exceptionally aggressive, making it compelling for high-volume workloads where its quality is sufficient. For many classification, extraction, and summarisation tasks, Flash output is hard to distinguish from GPT-4o in practice, at about 25x lower list price (Gemini 2.0 Flash versus GPT-4o). Verify this on a sample of your own requests before relying on it.
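Per-task and monthly figures of this kind follow directly from the list prices. The sketch below shows the arithmetic for the document summary case; the PRICES dictionary and the cost_per_request helper are illustrative names, and the prices are copied from the pricing table, so they should be re-checked against current provider pages.

```python
# List prices in dollars per million tokens (input, output), from the pricing table.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Claude Haiku 4.5": (0.80, 4.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "Llama 3.3 70B (Groq)": (0.59, 0.79),
}

def cost_per_request(model, input_tokens, output_tokens):
    """Dollar cost of a single request at list price."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Document summary: 2,000 input / 500 output tokens, 100,000 requests per month.
for model in PRICES:
    per_request = cost_per_request(model, 2_000, 500)
    print(f"{model:22s} ${per_request:.5f}/request  ${per_request * 100_000:,.2f}/month")
```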

Choosing the Right Tier

Tier 1, Nano/Flash (roughly $0.30–$4 per million output tokens): GPT-4o mini, Claude Haiku, Gemini Flash, Mistral Small, Groq Llama 70B. Use for classification, sentiment analysis, simple Q&A, keyword extraction, format conversion, basic summarisation. These handle the vast majority of enterprise NLP tasks at 10–50x lower cost than frontier models.

Tier 2, Standard ($5–$15 per million output tokens): GPT-4o, Claude Sonnet, Gemini Pro, Mistral Large. Use for complex summarisation, multi-step reasoning, code generation, nuanced analysis, customer-facing chat. The right default for most production applications.

Tier 3, Frontier ($40–$75+ per million output tokens): Claude Opus, o3. Reserve for the hardest reasoning tasks, complex code architecture, and decisions where a wrong answer has significant consequences. Should represent a small fraction of total request volume.

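Routing requests to the appropriate tier is the most impactful cost optimisation for applications serving varied query complexity. At the list prices above, a system routing 70% of requests to Tier 1 (Gemini 2.0 Flash), 25% to Tier 2 (Claude Sonnet), and 5% to Tier 3 (Claude Opus) costs roughly half as much as sending everything to Tier 2, with minimal quality impact for most users. The saving is capped by the traffic that stays on the expensive tiers: the 5% routed to Tier 3 costs about as much as the 25% routed to Tier 2.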
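The effect of a routing split can be estimated by weighting each tier's per-request cost by its share of traffic. This is a minimal sketch under stated assumptions: one representative model per tier at the list prices from the pricing table, and a request of 1,000 input and 500 output tokens. The helper names blended_cost and request_cost are illustrative.

```python
def request_cost(input_price, output_price, input_tokens=1_000, output_tokens=500):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def blended_cost(traffic_share, cost_per_tier):
    """Average cost per request given each tier's share of traffic."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_tier[tier] for tier, share in traffic_share.items())

# Assumed representative model per tier, at list prices.
tier_cost = {
    "tier1": request_cost(0.10, 0.40),    # Gemini 2.0 Flash
    "tier2": request_cost(3.00, 15.00),   # Claude Sonnet
    "tier3": request_cost(15.00, 75.00),  # Claude Opus
}
routed = blended_cost({"tier1": 0.70, "tier2": 0.25, "tier3": 0.05}, tier_cost)
baseline = tier_cost["tier2"]
print(f"routed ${routed:.5f}  all Tier 2 ${baseline:.5f}  saving {1 - routed / baseline:.0%}")
```

Changing the shares or swapping the representative models shows how sensitive the result is to the small fraction of traffic sent to the most expensive tier.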

Embeddings Cost Comparison

Embedding costs are often overlooked but can be significant for large corpora:

Model                         | Price per 1M tokens
------------------------------|--------------------
OpenAI text-embedding-3-small |    $0.020
OpenAI text-embedding-3-large |    $0.130
Google text-embedding-004     |    $0.025
Voyage voyage-3               |    $0.060
Cohere embed-v3               |    $0.100
BGE-M3 (self-hosted)          |    ~$0.001 (compute only)

For a corpus of 10 million chunks at 512 tokens each (5 billion tokens), embedding costs range from $100 (text-embedding-3-small) to $650 (text-embedding-3-large). This is a one-time cost if you cache embeddings persistently — but if you re-embed frequently, it compounds. Self-hosted BGE-M3 at roughly $0.001/million tokens is 20–130x cheaper than commercial APIs for large-scale workloads.
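The corpus arithmetic can be checked, and extended to repeated re-embedding, with a few lines. This sketch uses prices from the table above; the embedding_cost helper and the figure of four passes per year are illustrative assumptions. It uses the exact token count (5.12 billion) rather than the rounded 5 billion, so the results come out slightly higher.

```python
EMBEDDING_PRICE_PER_MILLION = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "BGE-M3 (self-hosted, compute only)": 0.001,
}

def embedding_cost(num_chunks, tokens_per_chunk, price_per_million, passes=1):
    """Dollar cost of embedding a corpus `passes` times."""
    total_tokens = num_chunks * tokens_per_chunk * passes
    return total_tokens / 1_000_000 * price_per_million

for model, price in EMBEDDING_PRICE_PER_MILLION.items():
    once = embedding_cost(10_000_000, 512, price)
    yearly = embedding_cost(10_000_000, 512, price, passes=4)
    print(f"{model:36s} one pass ${once:,.2f}   four passes ${yearly:,.2f}")
```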

A Cost Estimation Framework

def estimate_monthly_cost(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    cache_hit_rate: float = 0.0,
    cached_prefix_tokens: int = 0,
    cached_price_ratio: float = 0.1,
) -> dict:
    """Estimate monthly spend in dollars, assuming a 30-day month."""
    monthly = daily_requests * 30
    # Prefix tokens served from cache are billed at a reduced rate, not for free.
    # cached_price_ratio=0.1 assumes cached reads cost 10% of the input price.
    cached_tokens = cached_prefix_tokens * cache_hit_rate
    effective_input = avg_input_tokens - cached_tokens * (1 - cached_price_ratio)
    input_cost = (effective_input * monthly / 1_000_000) * input_price_per_million
    output_cost = (avg_output_tokens * monthly / 1_000_000) * output_price_per_million
    return {"total": round(input_cost + output_cost, 2)}

# 10K daily requests, 1000 input / 500 output: Claude Sonnet vs Gemini 2.0 Flash
print(estimate_monthly_cost(10_000, 1_000, 500, 3.0, 15.0))   # $3,150
print(estimate_monthly_cost(10_000, 1_000, 500, 0.10, 0.40))  # $90

Running this calculation with accurate token estimates from real request samples before committing to a model is one of the most valuable pre-launch exercises. The difference between tiers is often 10–100x, and choosing correctly is worth spending an hour on.

Open Source on Hosted Inference vs. Self-Hosted

Hosted open-source inference (Together AI, Groq, Fireworks) offers a middle path: the cost advantages of open-source models without managing GPU infrastructure. Llama 3.3 70B on Groq at $0.59/$0.79 per million tokens is roughly 10x cheaper than Claude Sonnet on a typical request mix (about 5x cheaper on input and 19x on output), for comparable quality on many tasks. The trade-off: data still goes to a third party, which can rule it out under strict data residency requirements, and quality on edge cases and complex reasoning is generally lower than frontier models.

Self-hosted open-source (vLLM on your own GPU hardware) is cheaper still at scale. Break-even against managed APIs typically occurs at 300–500 million output tokens per month for cloud GPU instances, faster for on-premises hardware.
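A rough break-even point can be estimated by dividing the fixed monthly infrastructure bill by the API price it replaces. The sketch below is deliberately simplified, and every number in it is hypothetical: a $4,000 monthly GPU bill and a $10 per million output token API price are placeholders to be replaced with your own quotes. It ignores input tokens, engineering time, and whether the hardware can actually serve the volume.

```python
def break_even_output_tokens(monthly_infra_cost, api_output_price_per_million):
    """Monthly output tokens at which a fixed infrastructure bill equals API spend.

    Simplification: counts output tokens only and treats self-hosted
    marginal cost per token as zero.
    """
    return monthly_infra_cost / api_output_price_per_million * 1_000_000

# Hypothetical figures for illustration only.
tokens = break_even_output_tokens(monthly_infra_cost=4_000,
                                  api_output_price_per_million=10.00)
print(f"break-even at about {tokens / 1_000_000:,.0f}M output tokens per month")
```

Below that volume the managed API is cheaper under these assumptions; above it, the fixed bill is spread over enough tokens to win.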

Hidden Costs Beyond Token Pricing

Retry overhead. Rate limit errors trigger retries. Without proper exponential backoff, a brief rate limit event can generate 5–10x the intended request volume.

Development and testing. Testing against the production API easily generates millions of tokens before launch. Use cheaper models for development.

Evaluation. Running LLM-as-judge evaluations adds a second layer of model costs. Use a cheaper model for evaluation where possible.

Observability tooling. LangSmith, Langfuse, and similar tools have their own pricing tiers that add to total cost of ownership.
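For the retry problem in particular, capped exponential backoff with jitter keeps a rate limit event from multiplying request volume. The following is a generic sketch rather than any provider's client: RateLimitError is a stand-in for whatever exception your SDK raises, and call_with_backoff and its default delays are illustrative assumptions.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the rate-limit exception raised by your client library."""

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call request_fn, retrying on RateLimitError with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up instead of retrying forever
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, delay] spreads out simultaneous retries.
            sleep(random.uniform(0, delay))
```

Passing the sleep function as a parameter makes the retry schedule easy to test without waiting in real time.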

Negotiating Enterprise Pricing

All major providers offer volume discounts not published on their public pricing pages. Once your monthly spend exceeds roughly $5,000–$10,000, contact your provider’s sales team. Discounts of 20–50% are common for committed volume agreements. Anthropic, OpenAI, and Google all offer annual commitment discounts and custom rate cards for large customers. Even without formal commitments, some providers offer automatic discounts at volume thresholds — check the pricing page carefully for tiered pricing structures before assuming the published rate is what you pay at scale. Combined with prompt caching, smart model routing, and batch API usage, enterprise-scale organisations routinely reduce their effective per-token cost by 60–80% compared to naive on-demand usage at published list prices.

Prompt Caching: The Biggest Lever

For applications with a substantial shared prefix — a long system prompt, a product knowledge base injected on every call, few-shot examples — prompt caching is consistently the highest-leverage cost optimisation. Both Anthropic and OpenAI offer it, and the economics are dramatic: cached tokens cost 10–25% of the normal input token price.

A concrete example: a customer support bot with a 5,000-token system prompt (product documentation, policies, tone guidelines) making 50,000 calls per day. Without caching, that is 250 million input tokens per day from the system prompt alone; at Claude Sonnet pricing, roughly $750 per day, or $22,500 per month, just for the system prompt. With prompt caching, a 90% hit rate, and cached tokens billed at 10% of the normal input price, that drops to roughly $4,275 per month (ignoring any surcharge for cache writes). A single caching configuration change saves about $18,000 per month. At scale, prompt caching often delivers more savings than any model switch or architecture change, and it requires almost no code, just adding a cache_control breakpoint to your system prompt.
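The savings depend on two inputs: the cache hit rate and the price of a cached read relative to a normal input token. The sketch below makes both explicit for the support bot example. The function name prefix_cost_per_month is illustrative, the default cached_price_ratio of 0.10 is an assumption taken from the low end of the 10–25% range quoted above, and cache-write surcharges are ignored.

```python
def prefix_cost_per_month(prefix_tokens, calls_per_day, input_price_per_million,
                          hit_rate=0.0, cached_price_ratio=0.10, days=30):
    """Monthly cost of a shared prompt prefix, with cache hits billed at a reduced rate.

    Assumption: cache misses are billed at the normal input price; any
    provider surcharge for writing to the cache is not modelled.
    """
    tokens = prefix_tokens * calls_per_day * days
    effective_rate = hit_rate * cached_price_ratio + (1 - hit_rate)
    return tokens / 1_000_000 * input_price_per_million * effective_rate

# 5,000-token system prompt, 50,000 calls per day, Claude Sonnet input price.
uncached = prefix_cost_per_month(5_000, 50_000, 3.00)
cached = prefix_cost_per_month(5_000, 50_000, 3.00, hit_rate=0.90)
print(f"uncached ${uncached:,.0f}  cached ${cached:,.0f}  saved ${uncached - cached:,.0f}")
```

Setting cached_price_ratio to 0 treats cache hits as free, which gives the most optimistic bound; raising it to 0.25 gives the conservative end of the quoted range.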

Measuring What You Actually Spend

Many teams lack visibility into their actual token spend breakdown until a surprise bill arrives. A few practices prevent this. Log token usage from every API response — the usage object in every response contains exact input and output token counts. Aggregate these by model, endpoint, and user segment. Set budget alerts in your cloud billing console and at the provider level where available. Build a simple cost dashboard that shows daily spend trending against your monthly budget, broken down by application component. The investment in this instrumentation is a few hours of engineering time and saves far more when it catches a runaway cost anomaly — a prompt that grew unexpectedly long, a caching configuration that stopped working, or a traffic spike that inflated costs before anyone noticed.
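The logging and aggregation described above does not need much machinery to start with. This is a minimal in-memory sketch, assuming you extract the input and output token counts from each response yourself (the field names in the usage object differ between providers). UsageLedger, its methods, and the pro-rated budget check are illustrative, not part of any SDK.

```python
from collections import defaultdict

class UsageLedger:
    """Accumulates token counts and dollar cost per (model, endpoint) pair."""

    def __init__(self, prices_per_million):
        self.prices = prices_per_million  # {model: (input_price, output_price)}
        self.totals = defaultdict(
            lambda: {"input_tokens": 0, "output_tokens": 0, "cost": 0.0}
        )

    def record(self, model, endpoint, input_tokens, output_tokens):
        input_price, output_price = self.prices[model]
        row = self.totals[(model, endpoint)]
        row["input_tokens"] += input_tokens
        row["output_tokens"] += output_tokens
        row["cost"] += (input_tokens * input_price
                        + output_tokens * output_price) / 1_000_000

    def over_budget(self, monthly_budget, day_of_month, days_in_month=30):
        """True if spend so far exceeds the budget pro-rated to today."""
        spent = sum(row["cost"] for row in self.totals.values())
        return spent > monthly_budget * day_of_month / days_in_month

ledger = UsageLedger({"gpt-4o": (2.50, 10.00)})
ledger.record("gpt-4o", "/summarise", input_tokens=2_000, output_tokens=500)
ledger.record("gpt-4o", "/summarise", input_tokens=1_800, output_tokens=450)
print(dict(ledger.totals))
```

In production the same records would go to a database or metrics system rather than a dictionary, with the budget check wired to an alert.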

When to Switch Providers vs. When to Optimise

Switching providers is often the last lever to pull, not the first. Before switching, exhaust the optimisations available within your current provider: enable prompt caching, implement model routing, use the batch API for async workloads, tighten max_tokens, and audit your prompts for unnecessary verbosity. These changes are lower-risk than switching providers — they do not affect model quality and do not require migrating application code. In aggregate they can reduce costs by 50–70% without changing the model at all.

Switch providers when you have exhausted in-provider optimisations and the quality-adjusted cost comparison genuinely favours a different provider for your specific workload. Run the comparison empirically on a sample of real queries — not just on benchmark scores — before migrating. Switching providers introduces integration risk, requires re-evaluating quality on your specific task distribution, and may require changes to your application’s retry logic, error handling, and monitoring. These costs are real and should be factored into the comparison alongside the per-token savings.

The Cost Optimisation Priority Order

When facing LLM cost pressure, address items in this order for the best return on engineering effort. First, enable prompt caching for any shared prefix over a few hundred tokens — usually a one-line change with potentially massive savings. Second, implement model routing to send simple queries to cheaper models while reserving frontier models for genuinely complex tasks. Third, use the batch API for any workload that does not need real-time responses — documents, nightly jobs, analysis pipelines. Fourth, audit and tighten max_tokens limits across all endpoints. Fifth, cache application-level responses for repeated identical queries. Sixth, evaluate whether open-source hosted inference (Groq, Together AI) matches quality at lower cost for your specific task. Seventh, consider self-hosted deployment if volume is high enough for the economics to work. Following this order systematically, most teams find they can reduce LLM API costs by 60–80% before reaching the point where infrastructure decisions like self-hosting become necessary.
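The fifth step, application-level response caching, can be sketched as a lookup keyed on a hash of the exact request. This is an illustrative in-memory version: ResponseCache and get_or_call are hypothetical names, call_fn stands in for your own API wrapper, and a real deployment would add expiry and a shared store. It is most appropriate for deterministic settings, since a cached answer is returned verbatim.

```python
import hashlib
import json

class ResponseCache:
    """In-memory cache keyed on the exact model, prompt, and parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # sort_keys makes the key independent of parameter ordering.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, params, call_fn):
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_fn(model, prompt, params)  # the API is only called on a miss
        self._store[key] = result
        return result
```

Tracking hits and misses alongside the cache makes it straightforward to report how many paid calls were avoided.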

LLM pricing is changing rapidly — new models enter the market at lower price points regularly, and existing providers cut prices as they achieve greater efficiency. The competitive dynamics between OpenAI, Anthropic, and Google have driven frontier model prices down by 90%+ in three years. Whatever your current cost structure looks like, revisit the comparison every six months. The cheapest capable model for your workload today may not be the cheapest six months from now, and staying current on the pricing landscape is worth the occasional hour it takes to update your analysis.

Multimodal and Specialised Model Pricing

Text completions are not the only cost to track. Vision inputs, audio transcription, and image generation each have their own pricing structures that can surprise teams building multimodal applications. Image inputs to vision models are priced per image or per tile (OpenAI charges per 512×512 tile for high-resolution images), typically equivalent to 85–1,700 input tokens depending on image size and detail level. At high image volumes, image input costs can exceed text token costs. Audio transcription (Whisper API) is priced per minute of audio — a call centre application transcribing thousands of hours of calls per month can have transcription costs that rival the LLM inference costs for processing those transcripts. Image generation (DALL-E, Imagen) is priced per image. Factor all modalities into your cost model for any application that processes more than text.
