How to Get Reliable Structured Output from LLMs

Structured output generation — getting an LLM to reliably produce JSON, XML, or other parseable formats — is one of the most practically important capabilities for production LLM applications. Agentic workflows, data extraction pipelines, and tool-calling systems all depend on the model returning structured data that downstream code can parse without error handling for malformed output. The techniques for achieving reliable structured outputs span prompt engineering, constrained decoding, and fine-tuning, each with different reliability guarantees and trade-offs.

Why LLMs Struggle with Structured Output

LLMs are trained on next-token prediction over natural language text, not over formal grammars. The token distribution for natural language prose is very different from the token distribution for valid JSON — a model that has learned to write good English has not necessarily learned to close every bracket, quote every key, and avoid trailing commas. The problem compounds for nested structures: the probability of generating a syntactically valid JSON object of depth 3 with 10 fields is the product of the probabilities of each token being correct, and small per-token failure rates compound into frequent structural failures for complex schemas.

The failure modes are predictable: missing closing brackets, unquoted keys, trailing commas (invalid JSON), numeric values where strings are expected, missing required fields, and fields with the wrong type. Models also tend to add prose commentary before or after the JSON block unless explicitly instructed not to, and may wrap the output in markdown code fences that break naive JSON parsing. Addressing these failures systematically requires going beyond prompt engineering.
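When constrained decoding is unavailable and you must parse free-form model output, a defensive extraction helper papers over the most common failure modes above (code fences, surrounding prose). A minimal sketch; `extract_json` is a hypothetical helper, not a library API:

```python
import json
import re

def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from model output that may
    include prose commentary or markdown code fences. A fallback for when
    constrained decoding is not available."""
    # Drop markdown code fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Scan for the first position where a complete JSON object decodes.
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch == "{":
            try:
                obj, _ = decoder.raw_decode(text[i:])
                return obj
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON object found in model output")
```

This handles prose before/after the JSON and fenced output, but not truncated or structurally invalid JSON; those still require a retry or constrained decoding.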

Prompt Engineering Foundations

The baseline for structured output is clear prompting: specify the exact JSON schema you want, provide a complete example, and instruct the model explicitly to return only the JSON with no surrounding text. System prompt placement matters — schema and format instructions in the system prompt are more reliably followed than the same instructions in the user message, because the model has been trained to treat system prompts as persistent behavioral constraints. For complex schemas, including both a JSON Schema definition and a filled-out example is more reliable than either alone.

Few-shot examples with the exact output format are more effective than schema descriptions for complex or unusual structures. If your target format has non-obvious conventions (a specific field ordering, unusual nesting, or custom value formats), 2–3 complete input/output examples demonstrating the exact format outperform a detailed schema description. Keep few-shot examples concise — token overhead for examples is real, and 3 examples at 200 tokens each adds 600 tokens to every request, which compounds at scale.
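Putting these pieces together, a request might place the schema in the system prompt and demonstrate the exact output format with one few-shot pair. A sketch using a hypothetical sentiment-classification schema:

```python
import json

# Hypothetical target schema for this example.
SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

# Schema and format constraints go in the system prompt, where they are
# treated as persistent behavioral instructions.
system_prompt = (
    "You are a classification engine. Respond with ONLY a JSON object "
    "matching this JSON Schema, with no prose and no code fences:\n"
    + json.dumps(SCHEMA, indent=2)
)

# One few-shot pair demonstrating the exact output format.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Review: The product arrived broken."},
    {"role": "assistant", "content": '{"sentiment": "negative", "confidence": 0.95}'},
    {"role": "user", "content": "Review: Works exactly as described, very happy."},
]
```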

Constrained Decoding

Constrained decoding modifies the token sampling process to only allow tokens that are valid given the current state of the output according to a grammar or schema. At each decode step, a grammar engine computes which tokens would lead to a still-valid partial output, and the logits of all invalid tokens are masked to negative infinity, giving them zero sampling probability. This provides a hard guarantee, not a probabilistic one, that the output conforms to the schema: malformed JSON output becomes impossible rather than merely unlikely.
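The masking step itself is simple; the machinery lives in the grammar engine that computes the valid-token set. A minimal sketch in plain Python (in practice this operates on GPU logit tensors, and `valid_token_ids` would come from a library like Outlines):

```python
import math

def mask_invalid_logits(logits: list[float], valid_token_ids: set[int]) -> list[float]:
    """Set the logits of every token the grammar engine disallows to -inf,
    so softmax assigns them exactly zero probability."""
    return [l if i in valid_token_ids else -math.inf for i, l in enumerate(logits)]

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

After masking, sampling proceeds normally (greedy, temperature, top-p) but can only ever select tokens that keep the partial output valid.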

Outlines and Guidance are the main libraries for constrained decoding with local models. Outlines supports JSON Schema, Pydantic models, regular expressions, and context-free grammars as constraints, and integrates with HuggingFace Transformers and vLLM. vLLM’s guided decoding feature (--guided-decoding-backend=outlines or lm-format-enforcer) enables constrained decoding for served models via the OpenAI-compatible API by passing the schema in the guided_json parameter.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "labels": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name", "confidence", "labels"]
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Classify this document: ..."}],
    extra_body={"guided_json": schema}
)
# response.choices[0].message.content is guaranteed to parse as JSON matching the schema

For API-based models (GPT-4o, Claude, Gemini), constrained decoding isn’t exposed directly, but structured output modes approximate it. OpenAI’s response_format with json_schema and strict=True uses server-side constrained decoding. Anthropic’s tool use feature treats the output as a function call with a defined input schema, which reliably produces structured outputs through a similar mechanism. These API-level structured output features are the equivalent of constrained decoding for hosted models and should be used instead of prompt engineering alone whenever reliable structure is required.
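For OpenAI’s strict mode, the schema travels in the response_format parameter. The payload shape below follows the json_schema variant; the schema itself is illustrative. Note that strict mode requires additionalProperties: false and every property listed in required:

```python
# Example schema for OpenAI strict structured outputs.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "labels": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "labels"],
    "additionalProperties": False,  # required by strict mode
}

response_format = {
    "type": "json_schema",
    "json_schema": {"name": "classification", "strict": True, "schema": schema},
}

# Passed to the Chat Completions API as:
# client.chat.completions.create(model="gpt-4o", messages=..., response_format=response_format)
```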

Pydantic Integration

Defining output schemas with Pydantic and using instructor, a library that patches the OpenAI client (and clients for other providers) to return validated Pydantic models, is the most ergonomic approach for Python applications. instructor accepts a Pydantic model as response_model, handles the schema generation and API call, and returns a validated Pydantic object rather than raw JSON. Validation errors trigger automatic retries with the error message included, further improving reliability.

import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class ExtractionResult(BaseModel):
    company_name: str
    founded_year: int = Field(ge=1800, le=2030)  # semantic bounds, enforced at validation
    employee_count: int | None = None            # optional: many sources omit headcount
    headquarters_city: str

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ExtractionResult,
    messages=[{"role": "user", "content": "Extract company info: Anthropic was founded in 2021 in San Francisco..."}]
)
# result is a validated ExtractionResult instance

When to Fine-Tune for Structure

For applications making millions of structured output calls per month, fine-tuning a smaller model specifically for your output schema is often more cost-effective than using a large model with constrained decoding. A fine-tuned 8B model that reliably produces your specific JSON format at 95%+ accuracy may cost 10–20x less per call than GPT-4o with structured outputs, while being faster and fully under your control. Fine-tuning for structured output requires only a few hundred to low thousands of (input, correctly-formatted-output) examples — the model is learning a pattern, not new knowledge. Combine fine-tuning with constrained decoding at inference time for the highest reliability: the fine-tuned model places high probability mass on valid tokens, and constrained decoding ensures no structurally invalid token is ever emitted.
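The training data for this is straightforward: chat-format records whose assistant turn is exactly the JSON you want the model to learn to emit. A sketch of one record in OpenAI-style chat JSONL (the field values are invented):

```python
import json

# One fine-tuning record: the assistant turn is the exact target JSON.
record = {
    "messages": [
        {"role": "system", "content": "Return only JSON matching the extraction schema."},
        {"role": "user", "content": "Acme Corp was founded in 1999 in Berlin."},
        {"role": "assistant",
         "content": '{"company_name": "Acme Corp", "founded_year": 1999, "headquarters_city": "Berlin"}'},
    ]
}
line = json.dumps(record)  # one line of the training JSONL file
```

A few hundred such records, covering the schema’s edge cases (missing fields, nulls, unusual values), is typically the bulk of the dataset.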

Validation and Retry Logic

Even with constrained decoding or API-level structured outputs, validation of field values (not just syntax) requires application-level checks. A model can produce syntactically valid JSON where a confidence score is 1.7, a year is 9999, or a required field contains an empty string — all syntactically valid but semantically wrong. Pydantic validators, JSON Schema validation with jsonschema, or custom field-level checks should be applied to every structured output before it enters downstream processing. For high-stakes extractions, implement a retry loop: if validation fails, send the model the failed output and the validation error message and ask it to correct the specific issue. One retry resolves the large majority of semantic validation failures without requiring a full regeneration from scratch.
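The retry loop can be sketched as follows. Here `validate` and `call_model` are hypothetical stand-ins for your own semantic checks and LLM client, and the model is assumed to return a parsed dict:

```python
import json

def validate(payload: dict) -> list[str]:
    """Semantic checks beyond JSON syntax; these rules are illustrative."""
    errors = []
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number between 0 and 1")
    if not payload.get("name"):
        errors.append("name must be a non-empty string")
    return errors

def extract_with_retry(call_model, prompt: str, max_retries: int = 1) -> dict:
    """On semantic validation failure, feed the bad output and the error
    messages back to the model and ask for a targeted correction."""
    output = call_model(prompt)
    for _ in range(max_retries):
        errors = validate(output)
        if not errors:
            break
        output = call_model(
            f"Your previous output {json.dumps(output)} failed validation: "
            f"{'; '.join(errors)}. Return a corrected JSON object fixing only these issues."
        )
    if validate(output):
        raise ValueError("output still invalid after retries")
    return output
```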

Structured Output for Agentic Workflows

Agentic systems that chain multiple LLM calls depend critically on structured output at each step. When one LLM call’s output is parsed and fed as input to the next step, a parsing failure anywhere in the chain causes the entire workflow to fail. This makes reliability requirements much more stringent than for standalone LLM calls: a 95% JSON parse success rate sounds good but means 5% of requests fail, and in a 10-step chain the chain completion rate is 0.95^10 ≈ 60%. For reliable agent workflows, structured output success rates need to be above 99% per step, which requires constrained decoding or API-level structured outputs rather than prompt engineering alone.

Tool calling in modern LLM APIs (OpenAI function calling, Anthropic tool use, Gemini function declarations) is implemented as structured output under the hood. The model generates a structured tool call specification rather than free text, and the API validates it against the declared tool schema before returning. Using the native tool calling API is always preferable to asking the model to output a JSON blob representing a tool call — the native API provides schema validation, cleaner model behavior (the model has been fine-tuned specifically on the tool calling format), and often reduced latency because tool outputs are handled in a separate path from the main generation.
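Concretely, an OpenAI-style tool declaration is itself just a JSON Schema attached to a function name; the API validates the model’s generated arguments against it. The tool below is invented for illustration:

```python
# OpenAI-style tool declaration: the `parameters` field is a JSON Schema
# that tool-call arguments are validated against. The tool is hypothetical.
lookup_weather_tool = {
    "type": "function",
    "function": {
        "name": "lookup_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "units": {"type": "string", "enum": ["metric", "imperial"]},
            },
            "required": ["city"],
        },
    },
}

# Passed as:
# client.chat.completions.create(model=..., messages=..., tools=[lookup_weather_tool])
```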

Streaming and Partial Output Handling

Streaming structured outputs — returning tokens to the caller as they’re generated rather than waiting for the complete response — requires special handling because partial JSON is not valid JSON and cannot be parsed until the stream is complete. For latency-sensitive applications that use streaming, consider designing schemas that can be parsed incrementally: arrays of objects can be parsed as each complete object arrives, and flat objects with independent fields can partially update a UI as fields complete. Partial JSON parsers like ijson (Python) handle streaming JSON parsing without buffering the full response. For strictly latency-sensitive applications where time to first byte matters, streaming with partial parsing is worth the implementation complexity; for batch processing or background tasks where total latency matters but not streaming latency, buffering the complete response and parsing once is simpler and equally fast.
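For the common case of a streamed array of objects, incremental parsing needs only the standard library. A sketch that assumes the stream is a top-level JSON array of objects and yields each object as soon as its closing brace arrives:

```python
import json

def iter_objects(chunks):
    """Incrementally parse a streamed JSON array of objects, yielding each
    object as soon as it is complete rather than buffering the full array."""
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find("{")
            if start == -1:
                break
            try:
                obj, end = decoder.raw_decode(buf, start)
            except json.JSONDecodeError:
                break  # object not complete yet; wait for more chunks
            yield obj
            buf = buf[end:]
```

Usage: feed it token/text chunks from the stream and consume objects as they complete. This sketch does not handle "{" characters appearing inside string values before an object starts; a production parser like ijson does.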

Structured Output in Multi-Turn Conversations

Multi-turn conversational applications face an additional structured output challenge: the model must produce structured output consistently across all turns of a conversation, not just in the first response. As conversation history grows, the model may drift away from the specified format — earlier turns introduce natural language patterns that influence later token predictions. Reinforce the format constraint in the system prompt at every turn (not just the first), and consider placing a brief format reminder immediately before each assistant turn in the conversation history. For applications where structured output is critical on every turn, constrained decoding at the API level is more reliable than prompt-based format reinforcement for long conversation histories.

Conversation state that must persist across turns — extracted entities, confirmed facts, user preferences — should be stored in structured format explicitly in the system prompt or as a dedicated state block in the context, rather than relying on the model to remember and reproduce it from conversation history. Explicit state management with structured output at each turn (the model updates a JSON state object as part of its response) is more reliable than implicit state tracking through natural language, and makes the application’s behavior auditable and debuggable. This pattern — structured state updated at each turn alongside a natural language response — is the basis of most production multi-turn LLM applications beyond simple chatbots.
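One way to sketch the explicit-state pattern: inject the current structured state into the system prompt each turn and ask the model to return a reply plus an updated state object. The message format here is an assumption for illustration, not a standard:

```python
import json

def build_messages(state: dict, history: list[dict], user_msg: str) -> list[dict]:
    """Inject the current structured state into the system prompt each turn,
    so the model updates explicit state rather than reconstructing it from
    conversation history."""
    system = (
        "Maintain this JSON state across the conversation. Respond with a JSON "
        'object: {"reply": <string>, "state": <updated state object>}.\n'
        "Current state:\n" + json.dumps(state)
    )
    return [{"role": "system", "content": system}, *history,
            {"role": "user", "content": user_msg}]
```

After each response, the application parses out the returned state object and passes it to build_messages on the next turn, keeping state auditable at every step.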
