Structured Outputs with LLMs: JSON Mode, Tool Forcing, and Pydantic

Why Structured Outputs Matter

LLMs produce free-form text by default, but most production applications need machine-readable output — a JSON object to store in a database, a specific set of fields to populate a UI, a validated data structure to pass to the next step in a pipeline. When a model produces prose where you expected JSON, your application breaks. When it includes extra fields, omits required ones, or uses the wrong data types, downstream processing fails in ways that are often silent and hard to debug.

Structured outputs are the techniques and APIs that constrain LLM responses to a predictable format. Getting this right is one of the highest-leverage reliability improvements you can make to an LLM application, because it eliminates an entire class of failures — format violations — that would otherwise require fragile post-processing or manual review to catch.

The Four Approaches to Structured Output

There are four main techniques for getting structured output from an LLM, in roughly increasing order of reliability.

Prompting alone tells the model to respond in a specific format using natural language: “Respond only with valid JSON. Do not include any explanation or preamble.” This works most of the time with capable models but fails often enough in production to require fallback handling. It is the right starting point before investing in more robust approaches.

JSON mode is a provider-level API setting that instructs the model to produce valid JSON syntax, enforced at the token level. It prevents malformed JSON but does not enforce a specific schema: the model still decides what fields to include and what types to use. It is available in OpenAI’s API with "response_format": {"type": "json_object"}, and some other providers offer equivalent settings.

Structured output APIs take this further by accepting a JSON Schema definition and constraining the model’s output to match it: correct field names, correct types, no extra or missing required fields. OpenAI’s structured outputs enforce the schema directly, and Anthropic’s tool-forcing pattern achieves a similar result by forcing a tool call whose arguments follow the schema. This is the most reliable approach for API-hosted models and should be the default for any application where format correctness matters.

Constrained decoding enforces structure at the token generation level, making format violations structurally impossible. Available for locally-hosted models via libraries like Outlines. The strongest possible guarantee, but limited to models you run yourself.
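The fallback handling that prompting alone calls for can be sketched roughly as follows. This is a minimal illustration using only the standard library. The function name extract_json_object is hypothetical, and the strategy (try strict parsing first, then fall back to the outermost brace-delimited span) is one reasonable assumption rather than a standard recipe.

```python
import json
import re

def extract_json_object(text: str) -> dict:
    """Best-effort parse of a model reply that should contain one JSON object.

    Hypothetical helper: tries strict parsing first, then falls back to the
    span from the first "{" to the last "}", which tolerates a preamble or
    trailing explanation around the JSON.
    """
    candidates = [text.strip()]
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in model output")

reply = 'Sure! Here is the result: {"order_id": "12345", "refund": true} Let me know if you need more.'
print(extract_json_object(reply))
```

A helper like this only recovers syntax. It says nothing about whether the fields are the ones you asked for, which is where the later approaches come in.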

JSON Mode in Practice

JSON mode is the easiest starting point for OpenAI-compatible APIs:

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the key information from the support ticket. Respond with JSON only."},
        {"role": "user", "content": "My order #12345 hasn't arrived after 2 weeks. I need a refund urgently."}
    ]
)

# The message content is a JSON string; parse it before use.
data = json.loads(response.choices[0].message.content)
print(data)

JSON mode requires you to mention JSON in your prompt — the API will error if you enable it without doing so. It produces valid JSON syntax but makes no promises about the schema, so validate the output against your expected structure before using it.
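Because JSON mode guarantees syntax but not shape, a validation step is needed before the data is used. A minimal sketch with Pydantic (v2 API assumed) follows. The TicketFields model and the sample payloads are illustrative assumptions, not part of any provider's API.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

class TicketFields(BaseModel):
    # Hypothetical shape we expect the model to return.
    order_id: Optional[str] = None
    requires_refund: bool
    summary: str

def parse_ticket(raw_json: str) -> Optional[TicketFields]:
    """Return a validated object, or None if the payload has the wrong shape."""
    try:
        return TicketFields.model_validate_json(raw_json)
    except ValidationError as exc:
        # Valid JSON can still miss fields or carry the wrong types.
        print(f"schema mismatch: {exc.error_count()} error(s)")
        return None

good = parse_ticket('{"order_id": "12345", "requires_refund": true, "summary": "Late delivery"}')
bad = parse_ticket('{"order": 12345, "refund": "yes"}')
```

Here the second payload is valid JSON but lacks the expected fields, so it is rejected rather than passed downstream.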

OpenAI Structured Outputs with JSON Schema

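For schema enforcement, pass a schema to the response_format parameter. With the Python SDK’s .parse() helper you can pass a Pydantic model class directly, and the SDK converts it to a JSON Schema definition: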

from openai import OpenAI
from pydantic import BaseModel
from typing import Literal

client = OpenAI()

class TicketAnalysis(BaseModel):
    order_id: str | None
    issue_type: Literal["shipping", "billing", "product", "account", "other"]
    urgency: Literal["low", "medium", "high", "critical"]
    sentiment: Literal["positive", "neutral", "negative", "angry"]
    requires_refund: bool
    summary: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Analyse this support ticket and extract structured information."},
        {"role": "user", "content": "My order #12345 hasn't arrived after 2 weeks. I need a refund urgently."}
    ],
    response_format=TicketAnalysis,
)

ticket = response.choices[0].message.parsed
print(f"Issue: {ticket.issue_type}, Urgency: {ticket.urgency}, Refund: {ticket.requires_refund}")

The .parse() method automatically converts the model’s response into the Pydantic model. If the model’s output does not match the schema, the SDK raises a validation error rather than returning malformed data — a much cleaner failure mode than silently passing bad data downstream.

Tool Forcing with Anthropic Claude

Anthropic does not have a JSON mode equivalent, but tool forcing achieves the same result, and often a better one. Define a tool whose input schema matches the structure you want to extract, then set tool_choice to force the model to call it. The model’s response will then contain a call to that tool, with arguments structured according to the input schema:

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "analyse_ticket",
    "description": "Extract structured information from a support ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID if mentioned; omit this field otherwise"},
            "issue_type": {
                "type": "string",
                "enum": ["shipping", "billing", "product", "account", "other"]
            },
            "urgency": {
                "type": "string",
                "enum": ["low", "medium", "high", "critical"]
            },
            "sentiment": {
                "type": "string",
                "enum": ["positive", "neutral", "negative", "angry"]
            },
            "requires_refund": {"type": "boolean"},
            "summary": {"type": "string", "description": "One-sentence summary of the issue"}
        },
        "required": ["issue_type", "urgency", "sentiment", "requires_refund", "summary"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    tools=tools,
    tool_choice={"type": "tool", "name": "analyse_ticket"},
    messages=[{
        "role": "user",
        "content": "My order #12345 hasn't arrived after 2 weeks. I need a refund urgently."
    }]
)

# Extract the tool call arguments
tool_use = next(b for b in response.content if b.type == "tool_use")
data = tool_use.input
print(f"Issue: {data['issue_type']}, Urgency: {data['urgency']}")

Tool forcing is reliable because the model must populate every required field with a value that matches the enum or type constraint. It also works without any JSON mode flag and produces clean, parseable output even when the model would otherwise want to add explanatory prose around its answer.
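A hand-written JSON Schema and a Pydantic model describing the same structure can drift apart over time. One way to avoid that, sketched here under the assumption of Pydantic v2, is to generate the tool's input_schema from the model with model_json_schema(). The helper name tool_from_model is hypothetical, and the generated schema should be checked against what your provider accepts.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field

class TicketAnalysis(BaseModel):
    """Extract structured information from a support ticket."""
    order_id: Optional[str] = Field(None, description="Order ID if mentioned")
    issue_type: Literal["shipping", "billing", "product", "account", "other"]
    urgency: Literal["low", "medium", "high", "critical"]
    requires_refund: bool

def tool_from_model(name: str, model: type[BaseModel]) -> dict:
    """Build a tool definition dict in the same shape as the hand-written one above."""
    return {
        "name": name,
        # Reuse the class docstring as the tool description.
        "description": (model.__doc__ or "").strip(),
        "input_schema": model.model_json_schema(),
    }

tool = tool_from_model("analyse_ticket", TicketAnalysis)
print(tool["input_schema"]["required"])
```

The same model can then validate tool_use.input after the call, so the schema sent to the model and the schema used for validation cannot disagree.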

Using Instructor for Automatic Retry and Validation

The instructor library wraps both OpenAI and Anthropic clients to add automatic retry on validation failure, making structured output extraction more robust without extra boilerplate:

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field, field_validator
from typing import Literal

client = instructor.from_anthropic(Anthropic())

class TicketAnalysis(BaseModel):
    order_id: str | None = Field(None, description="Order ID if mentioned")
    issue_type: Literal["shipping", "billing", "product", "account", "other"]
    urgency: Literal["low", "medium", "high", "critical"]
    sentiment: Literal["positive", "neutral", "negative", "angry"]
    requires_refund: bool
    summary: str = Field(..., min_length=10, max_length=200)

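    @field_validator("summary")
    @classmethod
    def summary_must_be_informative(cls, v: str) -> str:
        # The error message is sent back to the model on retry,
        # so it states the requirement explicitly.
        if v.lower().startswith("the customer"):
            return v
        raise ValueError("Summary must start with 'The customer' and describe their situation")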

result = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": "My order #12345 hasn't arrived after 2 weeks. I need a refund urgently."
    }],
    response_model=TicketAnalysis
)
# result is a validated TicketAnalysis — guaranteed to pass all validators

Instructor handles the tool definition, the API call, and the parsing automatically. If the model’s response fails Pydantic validation, it retries with an error message appended, giving the model a chance to self-correct. By default it retries up to three times before raising an exception; pass max_retries explicitly if you depend on a specific retry budget.
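Conceptually, a retry-with-feedback loop looks something like the sketch below. This is not Instructor's actual implementation. Here call_model is a hypothetical stand-in for an LLM call, validate is any function that raises ValueError on bad output, and max_attempts is an illustrative parameter name.

```python
import json
from typing import Callable

def extract_with_retry(
    call_model: Callable[[list], str],
    validate: Callable[[dict], dict],
    messages: list,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and feed errors back on failure."""
    history = list(messages)
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(history)
        try:
            return validate(json.loads(raw))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError too
            last_error = exc
            # Append the failed attempt and the error so the model can self-correct.
            history.append({"role": "assistant", "content": raw})
            history.append({"role": "user", "content": f"Validation failed: {exc}. Try again."})
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")

# Fake model for illustration: wrong on the first call, right on the second.
replies = iter(['{"urgency": "very"}', '{"urgency": "high"}'])

def fake_model(history: list) -> str:
    return next(replies)

def check_urgency(data: dict) -> dict:
    if data.get("urgency") not in {"low", "medium", "high", "critical"}:
        raise ValueError("urgency must be one of low, medium, high, critical")
    return data

result = extract_with_retry(fake_model, check_urgency, [{"role": "user", "content": "ticket text"}])
print(result)
```

The important design choice is that the validation error goes back to the model verbatim, which is why specific error messages in validators matter.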

Handling Nested and Complex Schemas

Structured output APIs handle nested schemas, arrays, and optional fields reliably in modern models. For complex extractions — a list of line items from an invoice, a tree of nested categories, a document with multiple sections — define the full nested schema and let the model fill it in:

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    date: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    currency: str = "USD"

# Reuses the Instructor-wrapped client from the previous example.
# invoice_text is assumed to hold the raw text of the invoice.
invoice = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Extract this invoice:\n{invoice_text}"}],
    response_model=Invoice
)

For very deep nesting or very large schemas (more than 20–30 fields), model accuracy can degrade — the model may populate less important fields carelessly. If you observe this, break the extraction into multiple smaller calls, each targeting a subset of the schema, and merge the results in application code.
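The split-and-merge approach might look roughly like this. The extract callable is a hypothetical placeholder for a structured extraction call that returns a dict for the requested fields, and the field groups and sample values are illustrative only.

```python
from typing import Callable, Dict, List

FIELD_GROUPS: List[List[str]] = [
    ["invoice_number", "vendor", "date"],        # header fields
    ["subtotal", "tax", "total", "currency"],    # totals
]

def extract_in_parts(text: str, extract: Callable[[str, List[str]], Dict]) -> Dict:
    """Run one smaller extraction per field group and merge the results."""
    merged: Dict = {}
    for fields in FIELD_GROUPS:
        part = extract(text, fields)
        # Keep only the fields this call was asked for, so one call
        # cannot overwrite another call's results.
        merged.update({k: v for k, v in part.items() if k in fields})
    return merged

# Stand-in extractor for illustration; a real one would call the model.
def fake_extract(text: str, fields: List[str]) -> Dict:
    known = {"invoice_number": "INV-1", "vendor": "Acme", "date": "2024-01-01",
             "subtotal": 100.0, "tax": 8.0, "total": 108.0, "currency": "USD"}
    return {f: known[f] for f in fields}

invoice = extract_in_parts("raw invoice text", fake_extract)
```

Merging in application code also gives you a natural place for cross-field checks, such as confirming that subtotal plus tax equals total.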

Constrained Decoding for Local Models

For locally-hosted open-source models, constrained decoding with Outlines provides the strongest possible guarantee. Format violations are structurally impossible because tokens that would break the schema are masked out, that is, given zero probability, at each generation step:

import outlines
import outlines.models as models
from pydantic import BaseModel
from typing import Literal

model = models.transformers("mistralai/Mistral-7B-Instruct-v0.1")

class TicketAnalysis(BaseModel):
    issue_type: Literal["shipping", "billing", "product", "other"]
    urgency: Literal["low", "medium", "high", "critical"]
    requires_refund: bool

generator = outlines.generate.json(model, TicketAnalysis)
result = generator("Analyse: My order hasn't arrived after 2 weeks.")
# result is guaranteed to be a valid TicketAnalysis — no retry needed

Constrained decoding guarantees the format by construction and requires no retry logic for format errors, but it is only available for models you run yourself. The guarantee covers structure, not correctness: a value can match the schema and still be wrong. It is particularly useful for high-volume, cost-sensitive pipelines where you want to run a smaller open-source model with a hard format guarantee.
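The masking idea can be illustrated with a toy decoder. This is a simplified sketch and not how Outlines is implemented: the vocabulary, the fake_logits function, and the prefix-based rule are all assumptions made for illustration. At each step, any token that cannot lead to one of the allowed values has its logit set to negative infinity before the next token is chosen.

```python
import math
from typing import Dict

ALLOWED = ["low", "medium", "high", "critical"]   # enum values to enforce
VOCAB = ["lo", "w", "me", "dium", "hi", "gh", "cri", "tical", "urgent", "!"]

def fake_logits(prefix: str) -> Dict[str, float]:
    """Stand-in for a model: it strongly prefers the invalid token 'urgent'."""
    return {tok: (5.0 if tok == "urgent" else 1.0 + len(tok) / 10) for tok in VOCAB}

def constrained_decode() -> str:
    out = ""
    while out not in ALLOWED:
        logits = fake_logits(out)
        for tok in VOCAB:
            # Mask tokens that cannot lead to any allowed value.
            if not any(value.startswith(out + tok) for value in ALLOWED):
                logits[tok] = -math.inf
        out += max(logits, key=logits.get)  # greedy pick among surviving tokens
    return out

print(constrained_decode())
```

Even though the stand-in model's favourite token is invalid, the decoder can only ever emit one of the four allowed values, which is the property that makes retries for format errors unnecessary.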

Choosing the Right Approach

For API-hosted models in applications where reliability is critical, use structured output APIs (OpenAI’s .parse() or Anthropic’s tool forcing) with Pydantic validation and Instructor for retry handling. For applications that just need valid JSON without strict schema enforcement, JSON mode with post-hoc validation is simpler. For locally-hosted models, constrained decoding with Outlines eliminates the need for retry logic entirely. For rapid prototyping, prompting alone works well enough to validate the idea before investing in a more robust implementation. The rule of thumb is to use the simplest approach that reliably handles your actual failure rate in production — over-engineering structured output handling adds complexity without benefit if your simpler approach already achieves 99.9% reliability on your specific task.

Streaming and Structured Outputs

One practical limitation of structured outputs is that they do not stream as naturally as prose. Some providers and libraries can stream the partial JSON, but a partial JSON object is not valid until it is complete, so the client usually cannot validate or render it field by field without extra partial-parsing logic. For user-facing applications where responsiveness matters, this creates a trade-off: in the simplest setup, the user waits for the complete response before anything appears on screen.

Several patterns mitigate this. For short structured responses (under a few seconds of generation time), the delay is usually imperceptible. For longer responses, consider a two-phase approach: stream a prose response to the user in real time, then make a second non-streaming structured extraction call to parse the key fields for storage or processing in the background. The user sees a responsive interface; your application gets clean structured data. Alternatively, some applications use streaming for the user-visible portion of the response and structured extraction only for the metadata fields they need to store — routing the stream to the UI and the completed response to the extraction pipeline simultaneously.
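A rough sketch of the two-phase pattern: chunks are forwarded to the UI as they arrive, and the completed text is handed to a structured extraction step afterwards. The names stream_chunks, send_to_ui, and extract_fields are hypothetical stand-ins for a provider's streaming call, a UI channel, and a non-streaming structured call.

```python
from typing import Callable, Dict, Iterable, List

def stream_then_extract(
    stream_chunks: Iterable[str],
    send_to_ui: Callable[[str], None],
    extract_fields: Callable[[str], Dict],
) -> Dict:
    """Forward prose chunks immediately, then extract structure from the full text."""
    parts: List[str] = []
    for chunk in stream_chunks:
        send_to_ui(chunk)      # phase 1: the user sees text as it arrives
        parts.append(chunk)
    full_text = "".join(parts)
    return extract_fields(full_text)  # phase 2: structured call on the full reply

# Stand-ins for illustration only.
shown: List[str] = []
chunks = ["I'm sorry about order ", "#12345. ", "A refund is on its way."]
fields = stream_then_extract(
    chunks,
    shown.append,
    lambda text: {"order_id": "12345" if "#12345" in text else None,
                  "requires_refund": "refund" in text.lower()},
)
```

In a real application the second phase would typically run in the background, so the extraction latency never sits on the user-facing path.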

Structured Outputs for Agentic Pipelines

Structured outputs become even more important in multi-step agentic pipelines, where the output of one model call is the input to the next. A format error early in a chain can cascade — corrupting the state that later steps depend on, causing subsequent calls to fail in ways that are hard to trace back to their source. Enforcing schema correctness at every step, rather than just at the pipeline’s final output, makes agentic systems dramatically easier to debug and more reliable in production.

Define Pydantic models for every inter-step message format in your pipeline. Use Instructor with retry for each extraction step. Log the validated output of each step so you can replay and debug failures from any point in the chain without re-running earlier steps. This discipline adds modest implementation overhead but pays for itself the first time you need to diagnose a failure in a 10-step agent pipeline — without per-step structured outputs, tracking down which step produced the corrupted state that caused the final failure can take hours.
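The logging and replay discipline can be sketched with the standard library alone. The step functions and the in-memory log list below are hypothetical; a real pipeline would use validated model calls for each step and a durable log store.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List

@dataclass
class StepOutput:
    step: str
    data: Dict

def run_pipeline(steps: List[Callable[[Dict], Dict]], initial: Dict,
                 log: List[str], start_at: int = 0) -> Dict:
    """Run steps in order, logging each step's output as one JSON line."""
    state = initial
    for fn in steps[start_at:]:
        state = fn(state)
        log.append(json.dumps(asdict(StepOutput(fn.__name__, state))))
    return state

def replay_from(steps: List[Callable[[Dict], Dict]], log: List[str], step_index: int) -> Dict:
    """Resume at step_index using the logged output of the previous step."""
    if step_index < 1:
        raise ValueError("step_index must be at least 1; use run_pipeline for a full run")
    previous = json.loads(log[step_index - 1])["data"]
    return run_pipeline(steps, previous, log=[], start_at=step_index)

# Hypothetical steps for illustration.
def classify(state: Dict) -> Dict:
    return {**state, "issue_type": "shipping"}

def decide(state: Dict) -> Dict:
    return {**state, "requires_refund": state["issue_type"] == "shipping"}

log: List[str] = []
final = run_pipeline([classify, decide], {"ticket": "Order late"}, log)
replayed = replay_from([classify, decide], log, step_index=1)
```

Because each log line holds the full validated state after a step, replaying from step 1 reproduces the final result without re-running the first step.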

Testing Structured Output Pipelines

Structured output pipelines need test coverage that goes beyond checking whether the final answer is correct. Test that every schema field is populated with a valid value across a representative set of inputs. Test edge cases where the model might struggle — very short inputs, inputs with no relevant information, inputs in unexpected languages or formats. Test the retry behaviour: does the pipeline correctly recover from an initial format failure? Test that validation errors surface cleanly rather than silently passing invalid data downstream.

A simple property-based testing approach generates random valid and invalid inputs and verifies that your pipeline produces correctly structured output for the valid inputs and raises appropriate errors for the invalid ones. Libraries like Hypothesis make this kind of generative testing easy to integrate into a standard pytest suite. Catching structured output failures in your test suite rather than in production is the difference between a reliable pipeline and one that requires constant manual monitoring.
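To keep the following sketch dependency-free, it uses the standard library's random module with a fixed seed in place of Hypothesis, which is a much cruder form of the same idea. The validate_ticket function is a hypothetical stand-in for a pipeline's validation step.

```python
import random

URGENCIES = ["low", "medium", "high", "critical"]

def validate_ticket(data: dict) -> dict:
    """Hypothetical validation step: raises ValueError on bad structure."""
    if data.get("urgency") not in URGENCIES:
        raise ValueError("invalid urgency")
    if not isinstance(data.get("requires_refund"), bool):
        raise ValueError("requires_refund must be a bool")
    return data

def random_case(rng: random.Random) -> tuple:
    """Return (payload, is_valid), corrupting roughly half of the payloads."""
    payload = {"urgency": rng.choice(URGENCIES), "requires_refund": rng.random() < 0.5}
    if rng.random() < 0.5:
        return payload, True
    corrupt = rng.choice(["bad_enum", "bad_type", "missing"])
    if corrupt == "bad_enum":
        payload["urgency"] = "urgent"
    elif corrupt == "bad_type":
        payload["requires_refund"] = "yes"
    else:
        del payload["urgency"]
    return payload, False

def test_validation_matches_expectation():
    rng = random.Random(0)  # fixed seed keeps the test reproducible
    for _ in range(200):
        payload, is_valid = random_case(rng)
        try:
            validate_ticket(payload)
            accepted = True
        except ValueError:
            accepted = False
        assert accepted == is_valid, payload

test_validation_matches_expectation()
```

With Hypothesis, the random_case generator would be replaced by strategies, and failing cases would be shrunk to a minimal example automatically.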

Common Pitfalls and How to Avoid Them

A few failure patterns come up repeatedly when teams first start using structured outputs.

Overly restrictive enums. If your enum values do not cover the full range of what the model might want to say, it will force a value that does not fit, producing silently wrong categorisations. Always include an “other” or “unknown” option for categorical fields unless you are certain the enum is exhaustive.

Missing optional handling. If a field might legitimately be absent from the input (an order ID that was not mentioned, a date that was not provided), mark it as optional in your schema (field: str | None = None). Requiring fields that are sometimes absent causes the model to hallucinate values, which is worse than a null.

Schema complexity creep. Structured output accuracy degrades with very large schemas. If you find yourself building a schema with 30+ fields, split it into multiple smaller extractions. Each extraction is more accurate and the combined result is more reliable than a single complex extraction.

Ignoring retry failures. Even with retry logic, some inputs will exhaust all retries and fail. Always have a fallback path: log the failure, alert on elevated failure rates, and decide whether to discard the item, route it to human review, or use a partial result. Silently swallowing retry exhaustion is a common source of invisible data quality problems in production pipelines.
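A fallback path for exhausted retries might be sketched as follows. The extract callable, the review queue, and the alert threshold are all hypothetical; the point is only that failures are recorded and counted rather than dropped.

```python
from typing import Callable, Dict, List, Tuple

def process_batch(
    items: List[str],
    extract: Callable[[str], Dict],
    alert_threshold: float = 0.05,
) -> Tuple[List[Dict], List[Dict], bool]:
    """Return (results, review_queue, alert) instead of dropping failures silently."""
    results: List[Dict] = []
    review_queue: List[Dict] = []
    for item in items:
        try:
            results.append(extract(item))
        except Exception as exc:  # e.g. retries exhausted
            # Keep the input and the reason so a human can follow up.
            review_queue.append({"input": item, "error": str(exc)})
    failure_rate = len(review_queue) / len(items) if items else 0.0
    return results, review_queue, failure_rate > alert_threshold

# Stand-in extractor for illustration: fails on empty input.
def fake_extract(text: str) -> Dict:
    if not text.strip():
        raise ValueError("retries exhausted: no extractable content")
    return {"summary": text[:20]}

results, review_queue, alert = process_batch(["Order late", "", "Refund please"], fake_extract)
```

Returning the alert flag alongside the results keeps the decision about paging or pausing the pipeline in the caller's hands.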

Structured outputs are not glamorous, but they are the foundation of reliable LLM application engineering — the difference between a system that works consistently and one that requires constant human intervention to catch the cases where the model decided to respond in a different format than you asked for.
