LLM Context Window Explained: Tokens, Limits, and How to Work Within Them

What Is a Context Window?

The context window is the total amount of text — measured in tokens — that a language model can see and process at one time. Everything the model uses to generate a response must fit inside it: the system prompt, the conversation history, any documents you have injected, the user’s message, and the response itself. Tokens that fall outside the window are invisible to the model, as if they never existed.

This is not a storage limit or a memory constraint in the traditional computing sense. The model does not gradually forget things as the conversation grows — it simply cannot see anything beyond the boundary of its current context window. The moment a message or document is pushed out of the window by newer content, it is gone from the model’s perspective entirely.

Understanding context windows is one of the most practically important concepts for anyone building with LLMs. It determines what you can fit in a single call, how you need to design multi-turn conversations, how much it costs to run the model, and how reliably the model will use information you provide.

What Is a Token?

Models do not process text character by character or word by word — they process tokens. A token is a chunk of text produced by a tokeniser, typically corresponding to a common word, a word fragment, or a punctuation mark. The exact mapping depends on the tokeniser the model uses, but as a rough rule of thumb: one token is approximately four characters of English text, or about three-quarters of a word. A page of text is roughly 500–600 tokens. A short novel is roughly 100,000 tokens.
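The rule of thumb above can be turned into a quick estimator for back-of-envelope budgeting. This is only a rough sketch: the name `estimate_tokens` is illustrative, and the four-characters-per-token ratio is the approximation stated above, so it will be off for non-English text and code.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 characters per token heuristic."""
    if not text:
        return 0
    # Round up so that a short fragment still counts as at least one token
    return math.ceil(len(text) / chars_per_token)


sample = "The context window is the total amount of text a model can process at once."
print(f"about {estimate_tokens(sample)} tokens")
```

Use an estimate like this for planning only; when the exact count matters, use a real tokeniser.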

Different languages tokenise differently. English tends to be efficient because common words map to single tokens. Languages with complex morphology, non-Latin scripts, or less representation in training data often tokenise less efficiently, sometimes using two to five times as many tokens for equivalent content. Code tokenises somewhere in between: keywords and common identifiers are usually single tokens, but long or unusual identifier names are split into several tokens, so verbose code can be less efficient.

You can check the token count of any text using a tokeniser library. For OpenAI models, the tiktoken library is the standard tool:

import tiktoken

# Load the tokeniser that matches the target model
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The context window is the total amount of text a model can process at once."
tokens = enc.encode(text)  # list of integer token IDs
print(f"{len(tokens)} tokens")

For Anthropic models, the client library exposes a token counting method:

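import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()
# Counts the input tokens for this request without generating a response
response = client.messages.count_tokens(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Your text here"}]
)
print(f"{response.input_tokens} tokens")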

Context Window Sizes in 2026

Context windows have grown dramatically over the past few years and continue to expand. As of 2026, the major models offer the following approximate context sizes. GPT-4o supports 128,000 tokens. Claude Opus 4 and Claude Sonnet 4 support 200,000 tokens. Gemini 1.5 Pro supports up to 2,000,000 tokens, and Gemini 1.5 Flash supports 1,000,000 tokens. Most capable open-source models support between 8,000 and 128,000 tokens depending on the variant and how they were fine-tuned. Limits change often, so check each provider’s documentation before relying on these figures.

These numbers sound large, and for many tasks they are. A 200,000-token context window can comfortably hold a small to medium-sized codebase, a book-length document, or hundreds of conversation turns. But context windows still impose real constraints in practice, for reasons that go beyond the raw token count.

The Lost-in-the-Middle Problem

A larger context window does not mean the model pays equal attention to everything in it. Research has consistently shown that models are better at using information near the beginning and end of the context window than information buried in the middle. This is known as the “lost-in-the-middle” problem, and it has real implications for how you structure your prompts.

If you are injecting a set of documents for the model to reason over, placing the most important documents first or last — rather than somewhere in the middle of a long list — tends to produce more reliable results. Similarly, if you have a critical instruction that must be followed, repeating it at the end of the prompt (near where the model begins generating) reinforces it more effectively than stating it once in the middle of a long system prompt.
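One way to apply this placement advice is to reorder retrieved documents so the strongest ones sit at the edges of the prompt. The sketch below assumes the input list is already sorted most relevant first; the function name `order_for_long_context` is hypothetical, and this is one possible heuristic rather than an established standard.

```python
def order_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the list.

    The input is sorted most relevant first. Documents are dealt alternately
    to the front and the back, so the least relevant land in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second most relevant document comes last
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this helps depends on the model and task, so it is worth measuring against a plain relevance-ordered baseline.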

The practical implication is that a large context window is not a free pass to stuff in as much information as possible. Quality and placement of context matter as much as quantity. A well-structured 50,000-token prompt will often outperform a carelessly assembled 150,000-token one.

Context Window and Cost

Most LLM APIs charge per token, with separate rates for input and output tokens. Input tokens are usually cheaper per token than output tokens, but they add up quickly when you are injecting large documents or maintaining long conversation histories. Because input cost scales linearly with prompt length, a 200,000-token prompt costs roughly four times as much in input charges as a 50,000-token prompt, all else being equal.
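The arithmetic is simple enough to sketch. The prices below are hypothetical placeholders, not any provider's actual rates, and `call_cost` is an illustrative helper name.

```python
def call_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost in dollars of one call, with prices quoted per million tokens."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens
small = call_cost(50_000, 0, 3.0, 15.0)
large = call_cost(200_000, 0, 3.0, 15.0)
print(f"input cost: ${small:.2f} vs ${large:.2f}")
```

Plugging in your provider's current prices and your typical prompt sizes gives a quick per-call estimate that you can multiply by expected traffic.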

This creates a real economic incentive to be thoughtful about what you put in the context window. A few patterns help keep costs manageable. Truncating conversation history to the most recent N turns rather than keeping the full history prevents unbounded growth in multi-turn applications. Chunking and retrieving only the relevant portions of large documents — via RAG — is far more cost-effective than injecting an entire knowledge base. Caching frequently-used system prompts using prompt caching APIs (available from both OpenAI and Anthropic) can significantly reduce costs when the same long prefix is reused across many calls.

Techniques for Working Within Context Limits

Even with large context windows, there are situations where your content exceeds what fits — or where you want to be more efficient. Several techniques help.

Sliding window / truncation. For conversational applications, keep only the most recent N tokens of conversation history, discarding older turns. This is the simplest approach and works well when each turn is relatively self-contained. The risk is losing important context established early in the conversation — a user’s stated preference or a key piece of information they provided at the start.
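A minimal sketch of this truncation, assuming messages are dictionaries with a "content" key. The names `truncate_history` and `rough_count` are illustrative, and the default counter uses the four-characters-per-token heuristic rather than a real tokeniser.

```python
from typing import Callable

Message = dict[str, str]


def rough_count(text: str) -> int:
    """Approximate token count; swap in a real tokeniser for accuracy."""
    return max(1, len(text) // 4)


def truncate_history(
    messages: list[Message],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_count,
) -> list[Message]:
    """Keep the most recent messages whose combined size fits max_tokens."""
    kept: list[Message] = []
    used = 0
    # Walk backwards from the newest message and stop at the first that does not fit
    for message in reversed(messages):
        cost = count_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

This drops whole messages rather than cutting one in half, which keeps each remaining turn intact.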

Summarisation. When the conversation grows long, periodically summarise older turns into a compact summary and replace them with it. The model sees the summary rather than the raw history. This preserves the semantic content of earlier turns while dramatically reducing token count. The tradeoff is that details are lost in the summarisation — fine for most conversations, but not for tasks where exact wording from earlier turns matters.
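The pattern can be sketched as follows. The `summarise` argument is a hypothetical callable standing in for whatever produces the summary, typically another model call; the "user" role and the wording of the summary message are also assumptions to adapt to your API.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    messages: list[Message],
    keep_recent: int,
    summarise: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace all but the last keep_recent messages with a single summary."""
    if len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary_message = {
        "role": "user",  # assumption: adjust to whatever role your API expects
        "content": f"Summary of earlier conversation: {summarise(older)}",
    }
    return [summary_message] + recent
```

In practice you would trigger this only when the history crosses a token threshold, so that the summarisation cost is not paid on every turn.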

Retrieval-Augmented Generation (RAG). Instead of putting an entire knowledge base in the context window, store it in a vector database and retrieve only the chunks most relevant to the current query. This scales to arbitrarily large knowledge bases because you only ever inject a small, relevant subset. The tradeoff is retrieval latency and the risk of retrieving the wrong chunks.
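The retrieval step can be illustrated without any external services. In the sketch below, word counts and cosine similarity stand in for a real embedding model and vector database, so treat it as a toy demonstration of the flow rather than a production design; all function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Refunds are processed within five business days.",
    "The device battery lasts about ten hours.",
    "Shipping is free for orders over fifty dollars.",
]
question = "How long does the battery last?"
context = "\n\n".join(retrieve(question, chunks, top_k=1))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

Only the retrieved chunk reaches the prompt, so the token cost stays roughly constant no matter how large the knowledge base grows.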

Structured context management. For long agent runs, maintain a structured memory object — key facts, decisions made, tasks completed — rather than keeping the full raw history. Inject the memory object at the start of each new context rather than the complete turn-by-turn log. This is how most production agentic systems handle context growth.
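One possible shape for such a memory object is sketched below. The class name `AgentMemory`, its three fields, and the rendered text format are all assumptions chosen to mirror the categories mentioned above; real systems vary widely.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Compact record of an agent run, injected in place of the full history."""

    key_facts: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    completed_tasks: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the non-empty sections as plain text for the prompt."""
        sections = [
            ("Key facts", self.key_facts),
            ("Decisions made", self.decisions),
            ("Tasks completed", self.completed_tasks),
        ]
        lines: list[str] = []
        for title, items in sections:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


memory = AgentMemory()
memory.key_facts.append("User prefers Python")
memory.completed_tasks.append("Parsed the input file")
print(memory.render())
```

Because the object is updated after each step and rendered fresh for each new context, its size grows with the number of facts worth keeping rather than with the number of turns.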

Input vs. Output Tokens

The context window limit applies to the total of input and output tokens combined. If a model has a 128,000-token context window and you send a 120,000-token prompt, you have only 8,000 tokens left for the response. This matters for tasks that require long outputs — detailed reports, long code files, extended analyses. Always budget for the response length when calculating whether your content fits.

Most APIs let you set a max_tokens parameter to cap the output length. Setting this too low truncates responses mid-sentence. Setting it equal to the full context window minus your input length gives the model maximum room but may produce unnecessarily long outputs, and many models also enforce a separate maximum output length that is smaller than the context window. A practical default is to set it to a reasonable upper bound for the expected response type (1,000–2,000 tokens for conversational responses, 4,000–8,000 for detailed technical outputs) and increase it only when you observe truncation.
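The budgeting described in this section amounts to a small calculation. The helper below is a sketch with an illustrative name, `output_budget`; it considers only the shared context window and ignores any separate output cap a particular model may impose.

```python
def output_budget(context_limit: int, input_tokens: int, desired_max_tokens: int) -> int:
    """Return a max_tokens value that fits in the space left after the input."""
    remaining = context_limit - input_tokens
    if remaining <= 0:
        raise ValueError("input already fills the context window")
    # Never request more output than the window has room for
    return min(desired_max_tokens, remaining)


# A 120,000-token prompt in a 128,000-token window leaves 8,000 tokens
print(output_budget(128_000, 120_000, 16_000))  # 8000
```

Raising an error when nothing is left makes an oversized prompt fail early in your own code instead of at the API.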

Prompt Caching

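Both Anthropic and OpenAI support prompt caching for long, repeated prefixes. When you send the same system prompt and document set across many API calls, the provider caches the computed key-value representations of those tokens. Subsequent calls that share the cached prefix are processed faster, and the cached portion is billed at a reduced rate. The size of the discount varies by provider and model: Anthropic bills cache reads at about 10% of the normal input price but charges a premium on the call that writes the cache, while OpenAI’s discount has ranged from 50% upward depending on the model. OpenAI applies caching automatically to sufficiently long prompts, whereas Anthropic requires you to mark what should be cached.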

To use Anthropic’s prompt caching, mark the sections you want cached with a cache-control breakpoint:

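# Placeholder inputs for illustration; substitute your own document and question
your_large_document = "<full text of the product manual>"
user_question = "How do I reset the device?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": your_large_document,
                # Breakpoint: the prompt up to and including this block is cached
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                # The question changes per call, so it stays outside the cached prefix
                "text": user_question
            }
        ]
    }
]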

The cached content expires after five minutes by default, and the timer resets each time the cache is read. For applications that process many queries against the same large document, such as a customer support bot answering from a product manual or a code assistant with a large codebase injected, prompt caching can reduce both latency and cost significantly.
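To get a feel for the savings, the sketch below compares the input cost of a shared prefix with and without caching. The multipliers (a 25% premium on the call that writes the cache, 10% of the normal price for reads) and the $3 price are assumptions for illustration, and the model assumes every call after the first arrives before the cache expires.

```python
def uncached_prefix_cost(prefix_tokens: int, calls: int, price_per_mtok: float) -> float:
    """Input cost of resending the same prefix on every call."""
    return prefix_tokens * price_per_mtok / 1_000_000 * calls


def cached_prefix_cost(
    prefix_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_multiplier: float = 1.25,  # assumed premium for the cache-writing call
    read_multiplier: float = 0.10,   # assumed discounted rate for cache hits
) -> float:
    """Input cost when the first call writes the cache and the rest hit it."""
    if calls <= 0:
        return 0.0
    base = prefix_tokens * price_per_mtok / 1_000_000
    return base * write_multiplier + base * read_multiplier * (calls - 1)


# Hypothetical: a 100,000-token manual queried 10 times at $3 per million tokens
without = uncached_prefix_cost(100_000, 10, 3.0)
with_cache = cached_prefix_cost(100_000, 10, 3.0)
print(f"uncached ${without:.2f}, cached ${with_cache:.3f}")
```

Under these assumptions the cached run pays for itself from the second call onward; with only a single call, the write premium makes caching slightly more expensive.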

Choosing the Right Context Window Size

Bigger is not always better. Larger context windows cost more, can introduce the lost-in-the-middle problem, and can tempt you into lazy prompt design — just stuffing everything in rather than thinking carefully about what the model actually needs. For most tasks, the relevant question is not “does my content fit?” but “what is the minimum context the model needs to do this task reliably?” Starting with less context and adding more when you observe failures is a more disciplined approach than defaulting to the maximum available window.

Context Windows and Multi-Modal Models

Context windows apply to all modalities, not just text. When you send an image to a multi-modal model, it is converted to tokens — typically 500 to 2,000 tokens per image depending on resolution and the model’s vision encoder. Sending ten high-resolution images in a single call can easily consume 15,000–20,000 tokens before you have written a single word of prompt. For applications that process many images, this makes context window management just as important as in text-heavy use cases.

Audio and video are handled differently by different providers. Some models transcribe audio to text first and then process the transcript; others encode audio directly into tokens. Either way, long audio clips consume substantial context. At a typical speaking pace of about 150 words per minute, a one-hour podcast transcribed to text is roughly 9,000 words, or about 12,000 tokens by the three-quarters-of-a-word rule of thumb, and direct audio encoding can use considerably more. Keep this in mind when designing pipelines that combine multiple modalities.

Monitoring Context Usage in Production

In production applications, it is valuable to track context window utilisation alongside other operational metrics. An application whose average prompt size is growing steadily over time is accumulating context debt — eventually prompts will exceed the limit, causing failures or requiring expensive architectural changes to fix. Tracking the 95th percentile prompt size, not just the average, gives you early warning before edge cases start causing failures.
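The difference between the average and the 95th percentile is easy to see with a small example. The sketch below uses the nearest-rank method, one of several common percentile definitions, and the prompt sizes are made-up sample data.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up prompt sizes in tokens, with one large outlier
prompt_sizes = [1_000, 2_000, 3_000, 4_000, 50_000]
mean = sum(prompt_sizes) / len(prompt_sizes)
print(f"mean={mean:.0f}, p95={percentile(prompt_sizes, 95)}")
```

Here the mean sits well below the outlier while the 95th percentile surfaces it directly, which is the early warning the tail metric is meant to provide.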

Most API responses include token usage in the response metadata. Log these counts for every call and set alerts when utilisation exceeds a threshold (say, 80% of the context limit) so you have time to implement context management improvements before hitting the hard ceiling.
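A threshold check of this kind might look like the sketch below. The token counts are passed in as plain integers because the names of the usage fields differ between providers; the function names and the logger name are illustrative.

```python
import logging

logger = logging.getLogger("context_usage")


def utilisation(input_tokens: int, output_tokens: int, context_limit: int) -> float:
    """Fraction of the context window consumed by one call."""
    return (input_tokens + output_tokens) / context_limit


def check_utilisation(
    input_tokens: int,
    output_tokens: int,
    context_limit: int,
    threshold: float = 0.8,
) -> bool:
    """Log a warning and return True when usage exceeds the threshold."""
    used = utilisation(input_tokens, output_tokens, context_limit)
    if used > threshold:
        logger.warning(
            "Context utilisation at %.0f%% of %d tokens", used * 100, context_limit
        )
        return True
    return False
```

In a real deployment the warning would usually feed a metrics or alerting system rather than a log line alone.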

Building context awareness into your application from the start is far easier than retrofitting it after you have a production system that silently assumes unlimited space. The context window is the most fundamental resource constraint in LLM application development — understanding it thoroughly pays dividends across every layer of your system design.