How to Use the Anthropic Claude API: A Complete Guide with Code Examples

Getting Started with the Anthropic API

The Anthropic API provides access to Claude — Anthropic’s family of AI models including Claude Opus, Sonnet, and Haiku. This guide covers everything from initial setup through production-grade usage patterns including streaming, tool use, vision, prompt caching, and async processing.

Install the Python SDK and set your API key:

pip install anthropic

export ANTHROPIC_API_KEY=sk-ant-your-key-here

import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from environment

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the difference between RAG and fine-tuning."}]
)
print(message.content[0].text)

Understanding the Messages API

The Anthropic API uses a Messages structure rather than the OpenAI-style Completions format. Key parameters:

model: The Claude model to use. Current options: claude-opus-4-6 (most capable), claude-sonnet-4-6 (balanced), claude-haiku-4-5-20251001 (fastest/cheapest).

max_tokens: Required. The maximum number of output tokens to generate. Claude will stop at this limit or when it naturally completes.

messages: A list of alternating user/assistant turns. Must start with a user message. The API does not maintain state between calls — send the full conversation history each time.

system: Optional. A system prompt that persists across all turns of the conversation, setting context, persona, or constraints for the model.

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a senior software engineer. Be concise and precise. Use code examples where helpful.",
    messages=[
        {"role": "user", "content": "What are the tradeoffs between PostgreSQL and MongoDB?"},
        {"role": "assistant", "content": "PostgreSQL excels at..."},  # Optional: continue from prior turn
        {"role": "user", "content": "Which would you recommend for a real-time analytics dashboard?"}
    ]
)
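Because the API keeps no state, the caller owns the conversation history. As a rough sketch of one way to track it, the hypothetical Conversation helper below (not part of the SDK) accumulates turns and returns a list that could be passed as the messages argument:

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical helper (not part of the SDK) that accumulates turns."""
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        if not self.turns and role != "user":
            raise ValueError("the first message must come from the user")
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list:
        # Return copies so callers cannot mutate the stored history by accident
        return [dict(turn) for turn in self.turns]


convo = Conversation()
convo.add("user", "What are the tradeoffs between PostgreSQL and MongoDB?")
convo.add("assistant", "PostgreSQL excels at...")
convo.add("user", "Which would you recommend for a real-time analytics dashboard?")
messages = convo.as_messages()  # pass as messages=... on the next call
```

After each API call, append the assistant's reply with convo.add("assistant", ...) so the next request carries the full history.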

Streaming Responses

For interactive applications, streaming delivers tokens as they are generated rather than waiting for the full response:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a Python function to parse ISO 8601 dates."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # newline after stream completes

    # Read the full response metadata while the stream context is still open
    final = stream.get_final_message()
    print(f"Tokens used: {final.usage.input_tokens} in / {final.usage.output_tokens} out")

Async streaming for web applications:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()

@app.post("/chat")
async def chat(message: str):
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6", max_tokens=1024,
            messages=[{"role":"user","content":message}]
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
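One caveat with the handler above: a streamed chunk can itself contain newlines, and a raw newline inside the payload would end the SSE frame early. A minimal sketch of a framing helper (sse_frame is a hypothetical name, not part of FastAPI or the SDK) that could wrap each chunk before it is yielded:

```python
def sse_frame(text: str) -> str:
    """Format one chunk as a Server-Sent Events frame.

    Each line of the payload gets its own "data:" prefix and a blank line
    terminates the frame, so chunks that contain newlines stay intact.
    """
    lines = text.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


frame = sse_frame("def parse(s):\n    return s")
```

SSE clients join consecutive data lines with a newline, so the original chunk is reconstructed on the receiving side.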

Tool Use (Function Calling)

Claude supports tool use — define functions the model can call, execute them in your code, and return results for the model to incorporate into its response:

import json

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location. Use when asked about weather conditions.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City and country, e.g. 'Paris, France'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Process tool call if present
for block in response.content:
    if block.type == "tool_use":
        result = get_weather(**block.input)  # your implementation
        
        # Continue conversation with result
        followup = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024, tools=tools,
            messages=[
                {"role": "user", "content": "What's the weather in Tokyo?"},
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [{"type": "tool_result",
                    "tool_use_id": block.id, "content": json.dumps(result)}]}
            ]
        )
        print(followup.content[0].text)
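With more than one tool, a small dispatch table keeps the tool-handling code manageable. The sketch below assumes a stub get_weather with made-up data and a hypothetical run_tool helper; it wraps either the result or the failure in a tool_result block so that errors are reported back to the model rather than crashing the request:

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    # Stub standing in for a real weather lookup (hypothetical data)
    return {"location": location, "temperature": 18, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool(name: str, tool_input: dict, tool_use_id: str) -> dict:
    """Execute a registered tool and wrap the outcome as a tool_result block."""
    try:
        output = TOOL_REGISTRY[name](**tool_input)
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": json.dumps(output)}
    except Exception as exc:  # report failures to the model instead of crashing
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": f"Tool error: {exc}", "is_error": True}


ok = run_tool("get_weather", {"location": "Tokyo, Japan"}, "toolu_example")
bad = run_tool("unknown_tool", {}, "toolu_example")
```

In real use, name, tool_input, and tool_use_id would come from block.name, block.input, and block.id on each tool_use block.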

Vision: Analysing Images

Pass images directly in the messages array alongside text:

import base64

with open("architecture.png", "rb") as f:
    img_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {
            "type": "base64", "media_type": "image/png", "data": img_data
        }},
        {"type": "text", "text": "Identify the main components and data flow in this architecture diagram."}
    ]}]
)
print(response.content[0].text)

# URL-based image (no upload needed)
response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=512,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "url", "url": "https://example.com/chart.png"}},
        {"type": "text", "text": "What trend does this chart show?"}
    ]}]
)
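When images arrive from user uploads, hardcoding the media type becomes fragile. A small sketch, assuming a hypothetical image_block helper that infers the type from the filename and rejects anything outside the four image formats the API accepts (JPEG, PNG, GIF, WebP):

```python
import base64
import mimetypes

SUPPORTED_IMAGE_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(filename: str, data: bytes) -> dict:
    """Build a base64 image content block, inferring the media type from the name."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"unsupported image type for {filename!r}: {media_type}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("utf-8"),
        },
    }


# Placeholder bytes for illustration; real code would read the file contents
block = image_block("architecture.png", b"\x89PNG placeholder bytes")
```

Guessing from the filename is only a convenience; for untrusted uploads, checking the file's leading bytes is a safer choice.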

Prompt Caching for Cost Reduction

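Mark large shared prefixes (system prompts, document context) for caching. Tokens read from the cache are billed at 10% of the normal input token price, while writing a prefix to the cache costs 25% more than normal input, so caching pays off as soon as a prefix is reused: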

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    system=[{
        "type": "text",
        "text": your_large_system_prompt_or_document,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_question}]
)

print(f"Tokens read from cache: {response.usage.cache_read_input_tokens}")
print(f"Tokens written to cache: {response.usage.cache_creation_input_tokens}")

A cache entry lives for 5 minutes, and that lifetime is refreshed every time the prefix is read, so steady traffic keeps the cache warm continuously. For a 10,000-token system prompt at 50,000 daily requests, caching saves close to 90% of the input token cost for that prefix, provided requests arrive often enough that the entry rarely expires.
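The arithmetic behind that estimate can be checked with a few lines. This is a best-case sketch: it takes the Sonnet input price quoted in this guide, assumes cache writes cost 1.25 times the base input price, and assumes the cache never expires during the day, so only one request pays the write cost:

```python
PRICE_PER_MTOK = 3.00           # base input price in USD (Sonnet figure from this guide)
CACHE_READ_MULTIPLIER = 0.10    # cache reads: 10% of the base input price
CACHE_WRITE_MULTIPLIER = 1.25   # assumption: cache writes cost 25% extra

prefix_tokens = 10_000
daily_requests = 50_000


def cost(tokens: int, multiplier: float = 1.0) -> float:
    """Dollar cost of `tokens` input tokens at the given price multiplier."""
    return tokens / 1_000_000 * PRICE_PER_MTOK * multiplier


uncached = cost(prefix_tokens * daily_requests)
# Best case: one request writes the cache, every other request reads it
cached = (cost(prefix_tokens, CACHE_WRITE_MULTIPLIER)
          + cost(prefix_tokens * (daily_requests - 1), CACHE_READ_MULTIPLIER))
savings = 1 - cached / uncached

print(f"uncached ${uncached:,.2f}/day, cached ${cached:,.2f}/day, saved {savings:.1%}")
```

Every cache expiry adds another write at the higher rate, so bursty traffic with long idle gaps will land somewhat below this figure.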

Batch Processing at 50% Off

The Message Batches API processes requests asynchronously at half the standard price — ideal for offline document processing, nightly analysis jobs, or any workload that does not need real-time responses:

import anthropic

# Each request is a plain dict: a unique custom_id plus the usual Messages API params
batch = client.messages.batches.create(requests=[
    {
        "custom_id": f"doc-{i}",
        "params": {"model": "claude-sonnet-4-6", "max_tokens": 512,
                   "messages": [{"role": "user", "content": f"Summarise: {doc}"}]},
    }
    for i, doc in enumerate(documents)
])

print(f"Batch ID: {batch.id} | Status: {batch.processing_status}")

# Poll until processing has ended (the other states are "in_progress" and "canceling")
import time
while (batch := client.messages.batches.retrieve(batch.id)).processing_status != "ended":
    time.sleep(60)

# Retrieve results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text[:100]}")
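An unbounded polling loop can hang a job forever if something goes wrong upstream. A sketch of a polling helper with a deadline follows; wait_for_batch is a hypothetical name, the 24-hour default timeout is an assumption, and the demo uses a fake retrieve function so it runs without an API call:

```python
import time
from types import SimpleNamespace


def wait_for_batch(retrieve, batch_id, interval=60.0, timeout=86_400.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll retrieve(batch_id) until processing_status is "ended" or time runs out."""
    deadline = clock() + timeout
    while True:
        batch = retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        if clock() >= deadline:
            raise TimeoutError(f"batch {batch_id} still {batch.processing_status}")
        sleep(interval)


# Demo with a fake retrieve; in real use pass client.messages.batches.retrieve
statuses = iter(["in_progress", "in_progress", "ended"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, processing_status=next(statuses))


done = wait_for_batch(fake_retrieve, "batch_demo", sleep=lambda seconds: None)
```

Injecting sleep and clock keeps the helper easy to test and lets a scheduler substitute its own waiting mechanism.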

Error Handling and Retries

from anthropic import RateLimitError, APIStatusError
import time, random

def robust_completion(messages, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6", max_tokens=1024,
                messages=messages, **kwargs
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500 and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
            else:
                raise

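The Anthropic SDK already retries certain failures automatically: by default it retries twice, with a short exponential backoff, on connection errors, rate limits (429), and server errors (5xx). Adjust the count by passing max_retries to the client constructor: client = anthropic.Anthropic(max_retries=5). Note that a hand-written loop like the one above runs on top of the SDK's own retries unless you set max_retries=0. For custom retry behaviour or integration with your existing retry infrastructure, use the pattern above.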
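If you do roll your own retries, the delay calculation is worth isolating so it can be tested without sleeping. The sketch below uses capped exponential backoff with full jitter, which is one common choice rather than the SDK's own algorithm; backoff_delay and its parameters are hypothetical, and retry_after stands for a server-provided hint such as a retry-after header value, when one is available:

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  retry_after: Optional[float] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # Honour an explicit server hint, but never wait longer than the cap
        return min(retry_after, cap)
    # Full jitter: pick uniformly between 0 and the capped exponential ceiling
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


delays = [backoff_delay(n) for n in range(6)]
```

Full jitter spreads retries from many clients across the whole window, which avoids synchronized retry bursts after a shared rate-limit event.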

Model Selection Guide

Claude Haiku 4.5 ($1.00/$5.00 per million input/output tokens): The fastest and cheapest Claude model. Use for high-volume classification, extraction, simple Q&A, format conversion, and any task where response time and cost matter more than maximum quality. Achievable throughput depends on your rate-limit tier.

Claude Sonnet 4.6 ($3.00/$15.00 per million tokens): The balanced production model. Strong coding, analysis, and reasoning at moderate cost. The right default for most production applications requiring consistent quality.

Claude Opus 4.6 ($5.00/$25.00 per million tokens): The most capable model for the hardest tasks, such as complex reasoning, nuanced analysis, and advanced code architecture. Reserve it for tasks where quality materially affects outcomes and cost per query is acceptable. Prices change between model generations, so confirm current figures on Anthropic's pricing page.

For most production systems, run Haiku for simple tasks and Sonnet for standard tasks, reserving Opus for the 5% of queries that genuinely need frontier reasoning capability. This tiered approach typically reduces average API cost by 70–80% compared to using Opus or Sonnet for everything.
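The tiered approach can be sketched as a small routing table. The task categories, the mapping, and the environment variable names below are illustrative assumptions; the model IDs are the ones used elsewhere in this guide:

```python
import os

# Model IDs as used in this guide; overridable via hypothetical env vars
MODEL_TIERS = {
    "simple": os.environ.get("CLAUDE_MODEL_SIMPLE", "claude-haiku-4-5-20251001"),
    "standard": os.environ.get("CLAUDE_MODEL_STANDARD", "claude-sonnet-4-6"),
    "hard": os.environ.get("CLAUDE_MODEL_HARD", "claude-opus-4-6"),
}

# Illustrative mapping from task kind to tier; tune it against your own evaluations
TASK_TIER = {
    "classification": "simple",
    "extraction": "simple",
    "format_conversion": "simple",
    "coding": "standard",
    "analysis": "standard",
    "architecture_review": "hard",
}


def pick_model(task_kind: str) -> str:
    """Return the model for a task kind, defaulting to the standard tier."""
    return MODEL_TIERS[TASK_TIER.get(task_kind, "standard")]
```

A call such as pick_model("classification") then supplies the model argument for messages.create, and unknown task kinds fall back to the standard tier.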

Using Claude with LangChain and LlamaIndex

# LangChain (pip install langchain-anthropic)
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6", max_tokens=1024)
response = llm.invoke("Explain transformer attention.")
print(response.content)

# LlamaIndex (pip install llama-index-llms-anthropic)
from llama_index.llms.anthropic import Anthropic
llm = Anthropic(model="claude-sonnet-4-6")
response = llm.complete("Explain transformer attention.")
print(response.text)

Both frameworks call Claude models through the standard Anthropic SDK under the hood. LangChain chains, agents, and tools generally work with Claude through the ChatAnthropic chat model interface, and LlamaIndex query engines and agents accept Claude as the LLM backend. Switching between Claude models in these frameworks usually requires only a model name change, with no changes to application logic.

Async Client for High-Throughput Applications

Use AsyncAnthropic for concurrent request processing — essential for batch document pipelines and high-traffic web applications:

import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

async def process_document(doc: str, doc_id: str) -> dict:
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role":"user","content":f"Summarise in 3 bullet points:\n\n{doc}"}]
    )
    return {"id": doc_id, "summary": response.content[0].text}

async def batch_process(documents: list[dict]) -> list[dict]:
    tasks = [process_document(d["text"], d["id"]) for d in documents]
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(batch_process(documents))

Running 50 requests concurrently via asyncio.gather reduces wall-clock time from minutes to seconds for document batch workloads. Stay within your API rate limits — Anthropic enforces requests-per-minute and tokens-per-minute limits per tier. If you hit rate limits, add a semaphore to cap concurrency:

semaphore = asyncio.Semaphore(20)  # max 20 concurrent requests

async def rate_limited_call(doc: str):
    async with semaphore:
        return await client.messages.create(
            model="claude-haiku-4-5-20251001", max_tokens=512,
            messages=[{"role":"user","content":doc}]
        )
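To see the semaphore's effect without spending tokens, the sketch below swaps the API call for a fake coroutine that records peak concurrency. run_bounded and fake_call are hypothetical names; the helper also separates successes from failures, since return_exceptions=True mixes exception objects into the result list:

```python
import asyncio


async def run_bounded(items, worker, limit=20):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    results = await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)
    # Split successes from failures so one bad document does not hide the rest
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed


stats = {"in_flight": 0, "peak": 0}


async def fake_call(doc_id):
    # Stand-in for client.messages.create; records peak concurrency
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    if doc_id == 7:
        raise RuntimeError("simulated API failure")
    return doc_id


ok, failed = asyncio.run(run_bounded(range(50), fake_call, limit=5))
```

Creating the semaphore inside the coroutine ties it to the running event loop, which avoids cross-loop errors on older Python versions.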

Extended Thinking for Complex Reasoning

Claude's extended thinking feature allows the model to reason through complex problems step by step before producing a final answer. Enable it for tasks requiring multi-step reasoning, mathematics, or complex analysis:

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{
        "role": "user",
        "content": "Prove by induction that the sum of the first n odd numbers equals n^2."
    }]
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Reasoning: {len(block.thinking)} chars]")
    elif block.type == "text":
        print(f"Answer: {block.text}")

The budget_tokens parameter caps how many tokens the model may spend reasoning before it produces its final answer. It must be at least 1,024 and smaller than max_tokens, because thinking tokens count toward the max_tokens limit. Higher budgets allow more thorough reasoning on harder problems, at the cost of latency and output tokens. Extended thinking is particularly effective for mathematical proofs, complex logical analysis, and multi-step planning tasks where step-by-step reasoning significantly improves final answer quality.
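A mis-sized budget only surfaces as an API error at request time, so it can help to check it up front. The sketch assumes the documented constraints that budget_tokens is at least 1,024 and below max_tokens; thinking_config is a hypothetical helper name:

```python
MIN_THINKING_BUDGET = 1024  # assumed minimum for budget_tokens


def thinking_config(budget_tokens: int, max_tokens: int) -> dict:
    """Build the thinking parameter after checking it against max_tokens."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}


config = thinking_config(budget_tokens=10_000, max_tokens=16_000)
```

The returned dict can be passed directly as the thinking argument to messages.create.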

TypeScript and JavaScript SDK

The Anthropic API has a first-class TypeScript SDK with full type safety:

// npm install @anthropic-ai/sdk
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

const message = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain async/await in JavaScript." }],
});

console.log(message.content[0].type === "text" && message.content[0].text);

// Streaming
const stream = client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write a TypeScript utility type for deep partial." }],
});

for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") {
    process.stdout.write(chunk.delta.text);
  }
}

Rate Limits and Tier Management

The Anthropic API enforces rate limits at three levels: requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM). Default limits depend on your account tier — Tier 1 starts at 50 RPM and 50K ITPM for Claude Sonnet. As you spend more, your limits automatically increase through Tier 2, 3, and 4. For applications requiring higher limits than your current tier provides, contact Anthropic's sales team — higher tiers are approved based on usage history and use case.

Monitor your rate limit consumption by checking the response headers on every API call:

# The SDK exposes rate limit headers via the raw response.
# with_raw_response returns the response object directly; it is not a context manager.
response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-6", max_tokens=100,
    messages=[{"role":"user","content":"Hello"}]
)
print(response.headers.get("anthropic-ratelimit-requests-remaining"))
print(response.headers.get("anthropic-ratelimit-tokens-remaining"))
msg = response.parse()  # the usual Message object
print(msg.content[0].text)
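Header values arrive as strings, so a small parsing step makes them usable for throttling decisions. This sketch uses the two header names shown above; rate_limit_snapshot, should_throttle, and the threshold of 5 are hypothetical choices, and the demo uses a plain dict in place of real response headers:

```python
def rate_limit_snapshot(headers) -> dict:
    """Pull the remaining-quota headers into a dict of ints (None when absent)."""
    names = {
        "requests_remaining": "anthropic-ratelimit-requests-remaining",
        "tokens_remaining": "anthropic-ratelimit-tokens-remaining",
    }
    snapshot = {}
    for key, header in names.items():
        value = headers.get(header)
        snapshot[key] = int(value) if value is not None else None
    return snapshot


def should_throttle(snapshot: dict, min_requests: int = 5) -> bool:
    # Hypothetical policy: slow down when few requests remain in the window
    remaining = snapshot["requests_remaining"]
    return remaining is not None and remaining < min_requests


snap = rate_limit_snapshot({"anthropic-ratelimit-requests-remaining": "3",
                            "anthropic-ratelimit-tokens-remaining": "42000"})
```

In real use, pass response.headers from the raw response instead of the literal dict.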

Production Best Practices

A few patterns consistently improve Claude API deployments in production. Size max_tokens deliberately for each endpoint: the parameter is required, and a generous copy-pasted value risks unexpectedly long and expensive completions. Log the full usage object from every response to track token spend by endpoint and model. Enable prompt caching on any large, stable system prompt; the minimum cacheable length depends on the model and starts at 1,024 tokens. Use Haiku for development and testing to save costs before switching to Sonnet or Opus for production evaluation. Set the model name via environment variable rather than hardcoding it, so that switching between Claude versions in production becomes a config change rather than a code deployment. Handle RateLimitError and APIStatusError explicitly with backoff rather than letting them surface as unhandled exceptions. And version your prompts in source control with the same rigour as code, because prompt changes are one of the most common sources of quality regressions in production LLM applications.
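As an illustration of the usage-logging advice, the sketch below turns a usage object into one JSON log line. usage_record is a hypothetical helper, and a SimpleNamespace stands in for response.usage so the example runs without an API call:

```python
import json
from types import SimpleNamespace


def usage_record(endpoint: str, model: str, usage) -> str:
    """Serialise token usage as one JSON log line."""
    return json.dumps({
        "endpoint": endpoint,
        "model": model,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        # Cache fields may be absent or None when caching is not in use
        "cache_read_input_tokens": getattr(usage, "cache_read_input_tokens", None) or 0,
        "cache_creation_input_tokens": getattr(usage, "cache_creation_input_tokens", None) or 0,
    }, sort_keys=True)


# Stand-in for response.usage from a real call
fake_usage = SimpleNamespace(input_tokens=120, output_tokens=48)
line = usage_record("/chat", "claude-sonnet-4-6", fake_usage)
```

One line per request in a structured format makes it straightforward to aggregate spend by endpoint and model later.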

Getting Your API Key and Billing

Sign up at console.anthropic.com. API access requires a credit card for billing — there is no permanent free tier, though new accounts receive a small credit for initial testing. Usage is billed monthly by token count. Set budget alerts in the console to avoid surprise bills — you can set email alerts at any spend threshold. For enterprise use cases requiring invoicing, volume commitments, or data processing agreements, contact Anthropic's enterprise team through the console. The API is available globally with consistent pricing across regions — unlike some providers, there is no regional pricing variation for Anthropic's standard API.

Structured Outputs via Tool Forcing

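Tool forcing is a dependable way to get structured output from Claude. Force the model to call a specific tool whose schema matches your desired output format, then read the result from the tool call's input. The schema strongly steers the output, but conformance is not strictly guaranteed (a response cut off by max_tokens can be incomplete, for example), so validate the result before using it: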

extract_tool = {
    "name": "extract_summary",
    "description": "Extract structured summary from document",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "key_points": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
            "sentiment": {"type": "string", "enum": ["positive","neutral","negative"]},
            "word_count_estimate": {"type": "integer"}
        },
        "required": ["title", "key_points", "sentiment"]
    }
}

response = client.messages.create(
    model="claude-haiku-4-5-20251001", max_tokens=512,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "extract_summary"},
    messages=[{"role":"user","content":f"Extract structured summary from:\n\n{document}"}]
)

result = response.content[0].input  # forced tool call: first block is tool_use; validate before trusting
print(result["title"])
print(result["key_points"])

The instructor library wraps the Anthropic client to add automatic retry on validation failure, Pydantic model support, and cleaner syntax for structured extraction — highly recommended for any application doing systematic data extraction from text.
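Without an extra dependency, a hand-written check against the schema above is a reasonable safety net. This is a minimal sketch covering only the constraints in extract_tool; validate_summary is a hypothetical helper, and a schema library would be more thorough:

```python
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


def validate_summary(data: dict) -> list:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    for key in ("title", "key_points", "sentiment"):
        if key not in data:
            problems.append(f"missing required field: {key}")
    if not isinstance(data.get("title", ""), str):
        problems.append("title must be a string")
    points = data.get("key_points", [])
    if not isinstance(points, list) or not all(isinstance(p, str) for p in points):
        problems.append("key_points must be a list of strings")
    elif len(points) > 5:
        problems.append("key_points allows at most 5 items")
    if "sentiment" in data and data["sentiment"] not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {data['sentiment']!r}")
    return problems


good = validate_summary({"title": "Q3 report", "key_points": ["Revenue up"],
                         "sentiment": "positive"})
bad = validate_summary({"title": "Q3 report", "key_points": ["a"] * 6,
                        "sentiment": "mixed"})
```

When the list of problems is non-empty, a typical response is to retry the request once with the problems appended to the prompt.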

Document Processing: Reading PDFs Directly

Claude can process PDFs directly without extracting text first — pass the PDF as base64 in the messages array:

import base64

with open("contract.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=2048,
    messages=[{"role":"user","content":[
        {"type":"document","source":{"type":"base64","media_type":"application/pdf","data":pdf_data}},
        {"type":"text","text":"Summarise the key obligations and deadlines in this contract."}
    ]}]
)
print(response.content[0].text)

Claude can process up to 100 pages of PDF content in a single call. For longer documents, chunk them into sections or use Claude's large context window to process the full text extracted from the PDF. Direct PDF processing preserves layout information that plain text extraction loses — tables, headers, and section structure — which improves extraction quality for structured documents like contracts, invoices, and reports.
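For documents over the page limit, the first step is deciding where to cut. The sketch below only computes page ranges, taking the 100-page figure from the text above as the default; page_ranges is a hypothetical helper, and actually splitting the file would need a PDF library, which is not shown here:

```python
def page_ranges(total_pages: int, max_pages: int = 100) -> list:
    """Split 1-based page numbers into inclusive (start, end) ranges."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [(start, min(start + max_pages - 1, total_pages))
            for start in range(1, total_pages + 1, max_pages)]


ranges = page_ranges(250)
```

Cutting on fixed page counts can split a clause or table across chunks, so where the document has clear section boundaries, splitting on those is usually the better choice.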

Comparing Claude to GPT-4o for Production Use

Claude and GPT-4o occupy the same tier in terms of general capability, and the right choice depends on your specific requirements rather than raw benchmark rankings. Claude's strengths in production contexts include superior instruction following and adherence to system prompt constraints, stronger performance on long-form writing and nuanced analysis, a more predictable communication style that tends toward clarity and directness, and the 200K token context window that exceeds GPT-4o's 128K for long-document workloads. GPT-4o's strengths include a more mature tool use and function calling ecosystem, broader third-party integration and documentation, and first access to OpenAI's latest features. For data residency and compliance, both providers offer enterprise tiers with private networking — the choice between them often comes down to which provider's compliance certifications and data handling agreements are a better fit for your organisation's requirements. Running both on a representative sample of your actual queries and measuring quality on your specific task distribution is the only reliable way to determine which performs better for your use case.