How to Count Tokens and Estimate LLM Costs Before You Ship

LLM API costs are easy to underestimate before shipping. A prompt that costs a third of a cent in development becomes significant at scale: 10,000 requests per day at 500 input tokens and 200 output tokens works out to roughly 1,000–1,400 USD per month at GPT-4o or Sonnet pricing, and longer prompts or higher volumes push that into five figures quickly. Getting this wrong is a common and painful surprise. The tools and techniques for counting tokens accurately, estimating costs across providers, and building pre-flight checks into your pipeline are not complex — they just require knowing which libraries to use and what to measure. This article covers everything you need to go from a rough cost estimate to a production-ready token budget system.

Counting Tokens with tiktoken

tiktoken is OpenAI’s tokeniser library, released as open source, that implements the BPE tokenisation used by GPT-3.5, GPT-4, GPT-4o, and o-series models. It is the authoritative tool for counting tokens for these models — not approximate word counts or character-divided-by-4 heuristics.

import tiktoken

# Load the encoding for a specific model
enc_gpt4o = tiktoken.encoding_for_model("gpt-4o")
enc_gpt4 = tiktoken.encoding_for_model("gpt-4")
enc_35 = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Count tokens in a string
text = "Gradient checkpointing reduces memory by recomputing activations during the backward pass."
tokens = enc_gpt4o.encode(text)
print(f"Token count: {len(tokens)}")  # 16
print(f"Tokens: {tokens}")

# Count tokens for a full chat message list (includes special tokens for roles)
def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Count tokens for an OpenAI-format chat messages list.
    
    Accounts for the per-message overhead (role tokens, separator tokens).
    Formula from OpenAI's tiktoken documentation.
    """
    enc = tiktoken.encoding_for_model(model)
    # Token overhead per message varies slightly by model
    if "gpt-4o" in model or "gpt-4" in model:
        tokens_per_message = 3  # <|im_start|>{role}
{content}<|im_end|>
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        tokens_per_message = 4
        tokens_per_name = -1
    else:
        tokens_per_message = 3
        tokens_per_name = 1

    total = 0
    for msg in messages:
        total += tokens_per_message
        for key, val in msg.items():
            total += len(enc.encode(str(val)))
            if key == "name":
                total += tokens_per_name
    total += 3  # reply priming tokens
    return total

messages = [
    {"role": "system", "content": "You are a helpful ML engineering assistant."},
    {"role": "user", "content": "Explain gradient checkpointing in two sentences."},
]
input_tokens = count_chat_tokens(messages, model="gpt-4o")
print(f"Input token count: {input_tokens}")

One important caveat: tiktoken only covers OpenAI models. For other providers, you need different tools. Anthropic models use a different tokeniser entirely — Claude’s vocabulary is not the same as GPT’s cl100k or o200k encodings, so the same text will produce different token counts. For rough estimation purposes, the counts are usually within 10–15% of each other for English text, but for precise cost calculation you should use the provider-appropriate tokeniser.

Counting Tokens for Anthropic Models

import anthropic

client = anthropic.Anthropic()

def count_tokens_anthropic(messages: list[dict], system: str = "",
                            model: str = "claude-sonnet-4-20250514") -> dict:
    """Use Anthropic's token counting API endpoint for exact counts."""
    response = client.messages.count_tokens(
        model=model,
        system=system,
        messages=messages,
    )
    return {"input_tokens": response.input_tokens}

messages = [{"role": "user", "content": "Explain gradient checkpointing in two sentences."}]
system = "You are a helpful ML engineering assistant."
counts = count_tokens_anthropic(messages, system=system)
print(f"Input tokens (Anthropic): {counts['input_tokens']}")

# For HuggingFace / open models: use the model's own tokenizer
from transformers import AutoTokenizer

def count_tokens_hf(text: str, model_name: str) -> int:
    # encode() includes special tokens (e.g. Llama's BOS) by default;
    # pass add_special_tokens=False if you only want content tokens
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return len(tokenizer.encode(text))

# Llama tokenizer is different from GPT tokenizer
llama_tokens = count_tokens_hf(
    "Gradient checkpointing reduces memory by recomputing activations.",
    "meta-llama/Llama-3.2-1B"
)
print(f"Llama token count: {llama_tokens}")

Building a Cost Estimator

With accurate token counts, you can build a cost estimator that runs against a representative sample of your production requests before you ship — catching expensive edge cases before they reach your billing dashboard.

from dataclasses import dataclass
from typing import Optional
import tiktoken

@dataclass
class ModelPricing:
    """Per-million-token pricing in USD."""
    input_per_mtok: float
    output_per_mtok: float
    cache_write_per_mtok: Optional[float] = None   # prompt caching write cost
    cache_read_per_mtok: Optional[float] = None    # prompt caching read cost (much cheaper)

# Current pricing as of early 2026 (always verify at provider docs)
PRICING = {
    # Anthropic
    "claude-sonnet-4-20250514":  ModelPricing(input_per_mtok=3.0,  output_per_mtok=15.0,
                                               cache_write_per_mtok=3.75, cache_read_per_mtok=0.30),
    "claude-haiku-4-5-20251001": ModelPricing(input_per_mtok=0.80, output_per_mtok=4.0,
                                               cache_write_per_mtok=1.00, cache_read_per_mtok=0.08),
    # OpenAI
    "gpt-4o":                    ModelPricing(input_per_mtok=2.50, output_per_mtok=10.0,
                                               cache_read_per_mtok=1.25),
    "gpt-4o-mini":               ModelPricing(input_per_mtok=0.15, output_per_mtok=0.60,
                                               cache_read_per_mtok=0.075),
}

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str,
    cached_input_tokens: int = 0,
) -> dict:
    """Estimate API cost for a single request."""
    if model not in PRICING:
        raise ValueError(f"No pricing data for model: {model}")
    p = PRICING[model]
    # Non-cached input tokens
    uncached_input = input_tokens - cached_input_tokens
    input_cost  = (uncached_input    / 1_000_000) * p.input_per_mtok
    output_cost = (output_tokens     / 1_000_000) * p.output_per_mtok
    cache_cost  = 0.0
    if cached_input_tokens > 0 and p.cache_read_per_mtok:
        cache_cost = (cached_input_tokens / 1_000_000) * p.cache_read_per_mtok
    total = input_cost + output_cost + cache_cost
    return {
        "input_cost_usd":  round(input_cost,  6),
        "output_cost_usd": round(output_cost, 6),
        "cache_cost_usd":  round(cache_cost,  6),
        "total_cost_usd":  round(total,       6),
    }

# Estimate cost for a batch of requests
def estimate_batch_cost(requests: list[dict], model: str) -> dict:
    """
    requests: list of {"input_tokens": int, "output_tokens": int, "cached_input_tokens": int}
    """
    totals = {"input_cost_usd": 0, "output_cost_usd": 0, "cache_cost_usd": 0, "total_cost_usd": 0}
    for req in requests:
        costs = estimate_cost(
            req["input_tokens"], req["output_tokens"], model,
            req.get("cached_input_tokens", 0)
        )
        for k in totals:
            totals[k] += costs[k]
    totals["request_count"] = len(requests)
    totals["avg_cost_per_request_usd"] = round(totals["total_cost_usd"] / len(requests), 6)
    return totals

# Example: estimate daily cost from a sample of 100 logged requests
sample_requests = [
    {"input_tokens": 450, "output_tokens": 180, "cached_input_tokens": 200},
    {"input_tokens": 820, "output_tokens": 350, "cached_input_tokens": 0},
    # ... more sampled requests
]
daily_cost = estimate_batch_cost(sample_requests, "claude-sonnet-4-20250514")
print(f"Avg cost per request: {daily_cost['avg_cost_per_request_usd']:.4f} USD")
print(f"Projected daily cost (10k req): {daily_cost['avg_cost_per_request_usd'] * 10000:.2f} USD")

Pre-Flight Token Budget Checks

Once you can count tokens accurately, it is worth adding a pre-flight check to your LLM call wrapper that validates the request against the model’s context window and your per-request cost budget before making the API call. This catches two classes of expensive mistakes: requests that will fail because they exceed the context window (resulting in a paid API error that contributes no value), and requests that are unusually large due to a prompt construction bug or unexpectedly long user input.

import tiktoken
from dataclasses import dataclass

@dataclass
class TokenBudget:
    max_input_tokens: int
    max_output_tokens: int
    max_cost_usd: float

MODEL_CONTEXT_WINDOWS = {
    "claude-sonnet-4-20250514": 200_000,
    "claude-haiku-4-5-20251001": 200_000,
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
}

def preflight_check(
    messages: list[dict],
    model: str,
    budget: TokenBudget,
    expected_output_tokens: int = 500,
) -> dict:
    """Run token and cost checks before making an API call.
    
    Returns {"ok": bool, "warnings": list[str], "errors": list[str], "input_tokens": int}
    """
    warnings, errors = [], []
    # Count input tokens
    if "claude" in model:
        import anthropic
        client = anthropic.Anthropic()
        input_tokens = client.messages.count_tokens(model=model, messages=messages).input_tokens
    else:
        input_tokens = count_chat_tokens(messages, model=model)
    # Check context window
    context_limit = MODEL_CONTEXT_WINDOWS.get(model, 128_000)
    total_tokens = input_tokens + expected_output_tokens
    if total_tokens > context_limit:
        errors.append(f"Total tokens ({total_tokens:,}) exceed context window ({context_limit:,})")
    elif total_tokens > context_limit * 0.9:
        warnings.append(f"Approaching context limit: {total_tokens:,}/{context_limit:,} tokens")
    # Check input budget
    if input_tokens > budget.max_input_tokens:
        errors.append(f"Input tokens ({input_tokens:,}) exceed budget ({budget.max_input_tokens:,})")
    # Check output budget
    if expected_output_tokens > budget.max_output_tokens:
        errors.append(f"Expected output tokens ({expected_output_tokens:,}) exceed budget ({budget.max_output_tokens:,})")
    # Check cost budget
    cost = estimate_cost(input_tokens, expected_output_tokens, model)
    if cost["total_cost_usd"] > budget.max_cost_usd:
        errors.append(f"Estimated cost ({cost['total_cost_usd']:.4f} USD) exceeds budget ({budget.max_cost_usd:.4f} USD)")
    return {
        "ok": len(errors) == 0,
        "warnings": warnings,
        "errors": errors,
        "input_tokens": input_tokens,
        "estimated_cost_usd": cost["total_cost_usd"],
    }

# Use in your LLM wrapper
budget = TokenBudget(max_input_tokens=50_000, max_output_tokens=4_000, max_cost_usd=0.05)
check = preflight_check(messages, "claude-sonnet-4-20250514", budget)
if not check["ok"]:
    raise ValueError(f"Pre-flight check failed: {check['errors']}")
if check["warnings"]:
    print(f"Warnings: {check['warnings']}")

Tracking Token Usage in Production

All major LLM APIs return actual token counts in the response metadata — use these rather than pre-computed estimates for billing reconciliation, per-user cost attribution, and identifying expensive outlier requests. Logging this data consistently from day one is much easier than reconstructing it later.

import anthropic, json
from datetime import datetime, timezone

client = anthropic.Anthropic()

def tracked_completion(messages: list[dict], system: str = "",
                       model: str = "claude-sonnet-4-20250514",
                       user_id: str = "anonymous", **kwargs) -> dict:
    """Wrapper that logs token usage and cost for every API call."""
    response = client.messages.create(
        model=model, system=system, messages=messages,
        max_tokens=kwargs.get("max_tokens", 1024),
    )
    usage = response.usage
    # Anthropic reports cache reads separately from input_tokens, so add them back
    # here; estimate_cost subtracts cached tokens from its first argument.
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cost = estimate_cost(usage.input_tokens + cache_read, usage.output_tokens, model,
                         cache_read)
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "user_id": user_id,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "cache_read_tokens": cache_read,
        "total_cost_usd": cost["total_cost_usd"],
    }
    # In production: write to your observability store (Datadog, BigQuery, etc.)
    print(json.dumps(log_entry))
    return {"content": response.content[0].text, "usage": log_entry}

result = tracked_completion(messages, system=system, user_id="user_123")

Aggregating this data by user, by prompt template, and by day gives you the information you need to set per-user spending caps, identify which features drive the majority of your LLM costs, and make informed decisions about model selection. Moving from Sonnet to Haiku for tasks where Haiku quality is sufficient cuts costs by roughly two-thirds to three-quarters at current list prices — but you can only make that call confidently if you have the token and cost data to benchmark the quality tradeoff on your actual traffic distribution.
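As a minimal illustration of the per-user spending cap idea, the sketch below keeps a running total of each user's daily spend and refuses new calls once a cap is exceeded. The cap value, the in-memory store, and the check_user_budget / record_user_spend helpers are illustrative assumptions; in production the running totals would live in Redis or a database rather than process memory.

from collections import defaultdict
from datetime import date

# Hypothetical in-memory store keyed by (user_id, ISO date): spend in USD.
user_daily_spend: dict[tuple[str, str], float] = defaultdict(float)

DAILY_CAP_USD = 1.00  # example cap; set per pricing tier in practice

def check_user_budget(user_id: str, estimated_cost_usd: float) -> None:
    """Raise before the API call if this request would exceed the user's daily cap."""
    key = (user_id, date.today().isoformat())
    if user_daily_spend[key] + estimated_cost_usd > DAILY_CAP_USD:
        raise RuntimeError(f"User {user_id} has exhausted today's LLM budget")

def record_user_spend(user_id: str, actual_cost_usd: float) -> None:
    """Record the real cost after the call, using the API's returned usage data."""
    user_daily_spend[(user_id, date.today().isoformat())] += actual_cost_usd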

Why Token Counts Differ Between Providers

A common source of cost estimation errors is assuming that token counts are consistent across providers. They are not. The same sentence tokenises differently in GPT-4o’s o200k encoding, Claude’s vocabulary, and Llama’s tokeniser because each was trained on different data with different vocabulary sizes and merge priorities. For typical English prose, the differences are small — usually within 10% — but for code, non-English text, and structured data like JSON or CSV, the differences can be 20–40%. A Python function with many special characters will tokenise quite differently depending on whether the tokeniser was trained to treat common code patterns as single tokens or split them more aggressively.

The practical implication is that you should always use the provider-specific token counting tool for cost estimates that inform budget decisions. Using tiktoken to estimate costs for Claude calls will underestimate or overestimate depending on the content type. For high-volume applications where you are trying to project monthly costs before deploying, using the Anthropic token counting API on a representative sample of your real prompts gives you a much more accurate number than any cross-provider approximation. The Anthropic token counting endpoint is free — it does not consume any compute beyond tokenisation — so there is no cost reason to avoid using it for pre-deployment estimation.
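To see the divergence concretely, the sketch below counts the same strings with two OpenAI encodings and a Llama tokeniser (Claude counts would come from the count_tokens endpoint shown earlier). The sample strings are arbitrary examples, and the Llama repository requires accepting the model licence on HuggingFace; the point is simply that code and JSON diverge more than plain prose.

import tiktoken
from transformers import AutoTokenizer

samples = {
    "prose": "Gradient checkpointing reduces memory by recomputing activations.",
    "json":  '{"user_id": 12345, "scores": [0.91, 0.83, 0.77], "active": true}',
    "code":  "def f(xs): return [x**2 for x in xs if x % 2 == 0]",
}

o200k = tiktoken.get_encoding("o200k_base")    # gpt-4o and o-series models
cl100k = tiktoken.get_encoding("cl100k_base")  # gpt-4 and gpt-3.5-turbo
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

for name, text in samples.items():
    print(name, {
        "o200k": len(o200k.encode(text)),
        "cl100k": len(cl100k.encode(text)),
        "llama": len(llama.encode(text, add_special_tokens=False)),
    })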

The Impact of Prompt Caching on Cost

Prompt caching is one of the most impactful cost reduction mechanisms for production LLM applications, and it requires understanding token counting to use correctly. When you structure prompts so that a long, stable prefix (system prompt, document context, conversation history) comes before the variable portion (the current user query), the provider can cache the token representations of the stable prefix and reuse them on subsequent requests without re-processing. At Sonnet pricing, Anthropic charges 3.75 USD per million tokens to write a cache entry and 0.30 USD per million tokens to read from cache, compared to 3.00 USD per million for standard input tokens; other models scale proportionally. The first request pays slightly more for the cache write, but every subsequent cache hit costs 10x less than a fresh input token.

The break-even point for prompt caching is one additional request that reuses the same prefix while the cache is still warm (the ephemeral cache expires after a few minutes of inactivity and is refreshed on each hit). If your system prompt and document context are 10,000 tokens and you handle 1,000 requests per day all using that same prefix, caching reduces your daily input token cost from 30 USD (10M tokens at 3.00/M) to roughly 3 USD (the initial cache write plus 999 cache reads at 0.30/M) — a reduction of about 90%. This is only possible if you order your prompt correctly: the cacheable prefix must come first, before the variable user input. Getting the ordering right in your prompt template is a one-time change with a permanent cost impact that compounds at every scale.

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are an expert ML engineering assistant with deep knowledge of
PyTorch, HuggingFace, and production ML systems. [... long system prompt ...]"""

DOCUMENT_CONTEXT = """[... long document that is the same across many requests ...]"""

def cached_completion(user_query: str, model: str = "claude-sonnet-4-20250514") -> str:
    """Structure prompt to maximise cache hits: stable prefix first."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "

" + DOCUMENT_CONTEXT,
                "cache_control": {"type": "ephemeral"},  # mark prefix for caching
            }
        ],
        messages=[{"role": "user", "content": user_query}],
    )
    usage = response.usage
    cache_read = getattr(usage, "cache_read_input_tokens", 0)
    cache_write = getattr(usage, "cache_creation_input_tokens", 0)
    print(f"Input: {usage.input_tokens} | Cache read: {cache_read} | Cache write: {cache_write} | Output: {usage.output_tokens}")
    return response.content[0].text

Output Token Estimation

Input token counts are deterministic — the same prompt always produces the same token count. Output token counts are not: they depend on what the model generates, which varies with the prompt, temperature, and random sampling. For cost estimation, you need an empirical estimate of expected output length based on your actual use case. The right approach is to run a representative sample of 50–200 real requests (or requests from your evaluation set), measure the actual output token counts, and use the mean and 95th percentile as your planning numbers. Mean output tokens gives you expected cost; the 95th percentile, with some headroom, gives you a sensible max_tokens for production — high enough that legitimate responses are not truncated, but not so high that runaway generation causes surprise costs.
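A minimal sketch of that measurement step, assuming you have already collected output token counts from a sample of real responses (the eight values below are made up): compute the mean for cost planning and a high percentile as the starting point for max_tokens.

import statistics

def output_token_stats(output_token_counts: list[int]) -> dict:
    """Summarise observed output lengths from a sample of real requests."""
    counts = sorted(output_token_counts)
    p95_index = max(0, int(0.95 * len(counts)) - 1)  # crude p95, fine for planning
    return {
        "mean": round(statistics.mean(counts), 1),
        "p50": statistics.median(counts),
        "p95": counts[p95_index],
        "max": counts[-1],
    }

stats = output_token_stats([140, 185, 210, 230, 260, 310, 420, 760])
print(stats)
# Use the mean for cost projections, and a value comfortably above the p95
# (rounded up with headroom) as the production max_tokens setting.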

Different task types have very different output distributions. Structured extraction tasks (extract JSON from a document) have predictable, bounded output lengths. Open-ended generation tasks (explain this concept) have high variance. Multi-step reasoning with chain-of-thought has much longer outputs than direct answer generation. If your application supports multiple task types, measure output length distributions separately for each and apply different max_tokens limits and cost estimates per task type rather than using a single universal value that is either too restrictive for verbose tasks or wastefully large for concise ones.

Cost Optimisation: Choosing the Right Model per Task

The single highest-leverage cost decision in most LLM applications is model selection per task. For a pipeline that mixes complex reasoning, simple extraction, and conversational responses, routing each task type to the cheapest model that meets the quality bar can reduce total API costs by 60–80% compared to using a single premium model for everything. The workflow is: define quality metrics for each task type, run Haiku (or GPT-4o-mini) on your evaluation set, measure quality, and promote to Sonnet or GPT-4o only for tasks where the smaller model fails to meet the quality threshold. Most simple extraction, classification, and formatting tasks are well within the capability of Haiku or GPT-4o-mini, while genuinely hard reasoning, code generation, and multi-step planning benefit from the larger models. Running this analysis upfront before committing to a model in production pays for itself immediately at any meaningful request volume.
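A sketch of what that routing decision can look like in code, assuming each request already carries a task-type label and the quality evaluation has been done offline. The task names, model assignments, and max_tokens values below are illustrative assumptions, not recommendations.

# Hypothetical routing table produced by an offline quality evaluation:
# each task type maps to the cheapest model that met its quality bar, plus
# a max_tokens cap derived from that task's measured output length distribution.
TASK_ROUTES = {
    "extract_fields":  {"model": "claude-haiku-4-5-20251001", "max_tokens": 512},
    "classify_intent": {"model": "claude-haiku-4-5-20251001", "max_tokens": 64},
    "summarise_doc":   {"model": "claude-sonnet-4-20250514",  "max_tokens": 1024},
    "plan_multi_step": {"model": "claude-sonnet-4-20250514",  "max_tokens": 2048},
}

def route_request(task_type: str) -> dict:
    """Return the model and output cap for a task, defaulting to the premium model."""
    return TASK_ROUTES.get(
        task_type,
        {"model": "claude-sonnet-4-20250514", "max_tokens": 2048},
    )

print(route_request("classify_intent"))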

Building a Token Usage Dashboard

Once you have per-request token logging in place, aggregating it into a lightweight cost dashboard helps you spot problems before they become expensive surprises. The key metrics to track are: daily token volume broken down by model and task type, average input and output tokens per request (to detect prompt bloat from bugs or regressions), cost per user or per feature for chargeback and quota enforcement, and the 95th and 99th percentile request costs to identify runaway requests that need max_tokens capping. You do not need a dedicated analytics platform for this at early scale — a simple daily job that reads your request logs, aggregates by the dimensions above, and writes a summary to a shared spreadsheet or Slack channel provides most of the value. The goal is to make token economics visible to the team before they become invisible line items on a cloud bill. Engineers who can see the token cost of a feature in near-real-time make different design decisions than engineers who get a monthly bill with no granular breakdown. Closing that feedback loop is the most important thing you can do to keep LLM operating costs predictable as you scale.
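A minimal sketch of that daily job, assuming the per-request entries emitted by tracked_completion above are appended as JSON lines to a file (the llm_usage.jsonl path is an assumption):

import json
from collections import defaultdict

def summarise_daily_usage(log_path: str = "llm_usage.jsonl") -> dict:
    """Aggregate per-request usage logs into per-model daily totals."""
    totals = defaultdict(lambda: {"requests": 0, "input_tokens": 0,
                                  "output_tokens": 0, "cost_usd": 0.0})
    request_costs = []
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            t = totals[entry["model"]]
            t["requests"] += 1
            t["input_tokens"] += entry["input_tokens"]
            t["output_tokens"] += entry["output_tokens"]
            t["cost_usd"] += entry["total_cost_usd"]
            request_costs.append(entry["total_cost_usd"])
    request_costs.sort()
    p95 = request_costs[max(0, int(0.95 * len(request_costs)) - 1)] if request_costs else 0.0
    return {"by_model": dict(totals), "p95_request_cost_usd": p95}

# Run once a day and post the summary to Slack, a spreadsheet, or your dashboard
print(json.dumps(summarise_daily_usage(), indent=2))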

Common Token Counting Mistakes

Several recurring mistakes inflate token counts and costs in production. First, repeatedly including long static content — instructions, examples, schema definitions — in every message rather than using the system prompt or prompt caching. A 2,000-token instruction block sent with every user turn adds 2,000 tokens to every request; moving it to a cached system prompt converts those from 3.00/M tokens to 0.30/M on cache hits, reducing that portion of costs by 90%. Second, not truncating conversation history: many chat applications append the full conversation to every API call, so the input cost of each request grows with every turn and the cumulative cost of a long conversation grows quadratically with its length. Truncating history to the last N turns or summarising old turns into a compact representation keeps context window usage bounded. Third, using verbose prompt formats: JSON payloads with long key names, verbose XML tags, and redundant whitespace all contribute tokens that carry no information value. Compact prompt formatting — short key names, minimal whitespace, concise instructions — can reduce input token counts by 10–20% with no change in output quality, and at scale that reduction compounds directly into cost savings.
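As an illustration of the history-truncation point, the sketch below keeps the system message plus the most recent turns that fit within a token budget, counting with tiktoken as an approximation (swap in the provider's own counter for exact numbers). The 4,000-token budget and the per-message overhead constant are arbitrary example values.

import tiktoken

def truncate_history(messages: list[dict], max_history_tokens: int = 4_000,
                     model: str = "gpt-4o") -> list[dict]:
    """Keep the system message plus the newest turns that fit the token budget."""
    enc = tiktoken.encoding_for_model(model)
    system_msgs = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kept, used = [], 0
    # Walk backwards from the newest turn so the most recent context survives
    for msg in reversed(turns):
        n = len(enc.encode(str(msg["content"]))) + 4  # rough per-message overhead
        if used + n > max_history_tokens:
            break
        kept.append(msg)
        used += n
    return system_msgs + list(reversed(kept))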

Token counting is not glamorous infrastructure, but it is the foundation of predictable LLM economics. Accurate counts before deployment, cost logging in production, and periodic review of the aggregated data are the three practices that separate teams that understand their LLM costs from teams that are perpetually surprised by them.
