How to Build an LLM Agent with Tool Calling: A Step-by-Step Guide with Code Examples

What Is Tool Calling?

Tool calling — sometimes called function calling — is the mechanism that lets an LLM do more than generate text. Instead of producing only a prose response, a model can emit a structured signal indicating it wants to invoke an external function: search the web, query a database, run a calculation, call an API, or execute code. The application receives that signal, runs the function, passes the result back to the model, and the model uses it to complete its response.

This turns a language model into an agent: a system that can plan, act, observe the results of its actions, and iterate until a goal is reached. Tool calling underpins most production LLM applications that act on external systems, from coding assistants that run tests, to customer service bots that look up order status, to research agents that fetch and synthesise live information.

How Tool Calling Works: The Protocol

The mechanics vary slightly by provider, but the core protocol is consistent across OpenAI, Anthropic, Google, and open-source models that support it:

  1. Tool definition — You describe the available tools to the model in your API call. Each tool has a name, a description, and a JSON schema defining its parameters.
  2. Model decision — The model reads the user message and the tool definitions. If it decides a tool is needed, it returns a structured tool-use response instead of (or alongside) a text response.
  3. Tool execution — Your application code receives the tool call, extracts the parameters, runs the actual function, and captures the result.
  4. Result injection — You send the tool result back to the model as a new message. The model then uses it to generate the final response.

This loop can repeat multiple times in a single conversation turn — the model might call three tools in sequence before producing a final answer. That multi-step behavior is what separates a tool-calling agent from a simple one-shot completion.
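
The four steps can be sketched without any provider SDK. The block below is a minimal illustration, not a real integration: fake_model is a hypothetical stand-in for the API call, get_time is a made-up stub tool, and the message shapes are simplified. Because the model is scripted, step 1 (tool definition) is implied rather than sent anywhere.

```python
import json

def get_time(city: str) -> str:
    # Stub tool (hypothetical): a real version would look up the time zone
    return json.dumps({"city": city, "time": "09:30"})

TOOLS = {"get_time": get_time}

def fake_model(messages):
    """Scripted stand-in for a provider API call (illustration only)."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        # Step 2: the "model" decides a tool is needed
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{"id": "call_1", "name": "get_time",
                            "arguments": json.dumps({"city": "Oslo"})}],
        }
    # A tool result is in context, so produce the final answer
    data = json.loads(tool_results[-1]["content"])
    return {"role": "assistant",
            "content": f"It is {data['time']} in {data['city']}.",
            "tool_calls": []}

def run(user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages)
        messages.append(reply)
        if not reply["tool_calls"]:
            return reply["content"], messages
        for call in reply["tool_calls"]:
            # Step 3: extract the parameters and run the function
            args = json.loads(call["arguments"])
            result = TOOLS[call["name"]](**args)
            # Step 4: inject the result as a new message
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})

answer, history = run("What time is it in Oslo?")
print(answer)
```

The history ends up as user, assistant (tool call), tool (result), assistant (final answer), which is the shape every provider-specific loop in this guide follows.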

Defining Tools: JSON Schema

Tools are defined using JSON Schema. Here is a simple example of a weather tool definition in the OpenAI format:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. 'London'"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature units"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

The description fields matter enormously. The model leans on them, more than on the bare parameter names, to decide when and how to call the tool. A vague description like “gets data” will produce unreliable tool selection. A specific description like “Gets the current temperature, wind speed, and weather condition for a named city using the OpenWeatherMap API” gives the model the context it needs to use the tool correctly.

Building a Tool-Calling Agent in Python

Here is a complete, minimal agent loop using the OpenAI Python SDK. It handles multi-step tool calls automatically:

import json
import openai

client = openai.OpenAI()

def get_weather(city: str, units: str = "celsius") -> str:
    # Replace with a real API call
    return json.dumps({"city": city, "temp": 22, "condition": "sunny", "units": units})

def run_conversation(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ]

    while True:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            # No more tool calls — return the final text response
            return msg.content

        # Execute each tool call and add results to messages
        for tool_call in msg.tool_calls:
            name = tool_call.function.name
            args = json.loads(tool_call.function.arguments)

            if name == "get_weather":
                result = get_weather(**args)
            else:
                result = json.dumps({"error": f"Unknown tool: {name}"})

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

print(run_conversation("What's the weather in Tokyo and Paris?"))

A few things to note about this pattern. First, the loop continues as long as the model returns tool calls — this naturally handles multi-step reasoning where the model needs to call several tools before it can answer. Second, all tool results are injected back into the message history with role: "tool" and the matching tool_call_id. Third, the model can request multiple tool calls in a single response — the loop processes all of them before sending the results back.

Parallel Tool Calls

Modern models can request multiple tool calls simultaneously in a single response. In the weather example above, asking about Tokyo and Paris might produce two tool calls in one turn rather than two sequential turns. Your loop already handles this correctly because it iterates over msg.tool_calls — but it is worth knowing that you can also execute them in parallel for speed:

import asyncio

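async def run_tool(tool_call):
    name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    if name == "get_weather":
        # get_weather is a blocking function, so run it in a worker thread;
        # calling it directly here would make the tool calls run one after another
        result = await asyncio.to_thread(get_weather, **args)
        return tool_call.id, result
    return tool_call.id, json.dumps({"error": "unknown tool"})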

# In the async version of the loop:
results = await asyncio.gather(*[run_tool(tc) for tc in msg.tool_calls])
for tool_call_id, result in results:
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": result
    })

For tools that make network calls — APIs, database queries, web searches — running them in parallel can reduce latency significantly, especially when the model requests three or four tools at once.
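
One caveat is worth illustrating: wrapping a blocking function in async def does not overlap the waits by itself. The sketch below assumes a hypothetical blocking tool, slow_lookup, whose sleep stands in for network latency, and uses asyncio.to_thread (Python 3.9+) to run three calls concurrently. The call IDs and cities are made up for the example.

```python
import asyncio
import json
import time

def slow_lookup(city: str) -> str:
    # Hypothetical blocking tool; the sleep stands in for network latency
    time.sleep(0.2)
    return json.dumps({"city": city, "temp": 22})

async def run_all(calls):
    # Each blocking call is moved onto a worker thread so the waits overlap
    tasks = [asyncio.to_thread(slow_lookup, **args) for _, args in calls]
    results = await asyncio.gather(*tasks)  # results keep the order of `calls`
    return [(call_id, result) for (call_id, _), result in zip(calls, results)]

calls = [
    ("call_1", {"city": "Tokyo"}),
    ("call_2", {"city": "Paris"}),
    ("call_3", {"city": "Lima"}),
]

start = time.perf_counter()
pairs = asyncio.run(run_all(calls))
elapsed = time.perf_counter() - start

# Run one after another, three calls would take about 0.6 s
print([call_id for call_id, _ in pairs], f"{elapsed:.2f}s")
```

If the tools are already written as coroutines (for example, using an async HTTP client), asyncio.gather on its own is enough and the thread hop is unnecessary.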

Tool Calling with Anthropic Claude

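The Anthropic API uses a slightly different format. Each tool is a flat object whose parameter schema sits under input_schema, the model's requests arrive as tool_use content blocks, and results are returned as tool_result blocks inside the next user message: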

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

messages = [{"role": "user", "content": "What's the weather in Seoul?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

    if response.stop_reason != "tool_use":
        # Any other stop reason (end_turn, max_tokens, ...) ends the loop;
        # print whatever text blocks came back
        print("".join(b.text for b in response.content if b.type == "text"))
        break

    # Add assistant response to history
    messages.append({"role": "assistant", "content": response.content})

    # Process tool uses
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # get_weather is the stub defined in the OpenAI example above
            result = get_weather(block.input["city"])
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })

    messages.append({"role": "user", "content": tool_results})

The logic is identical — define tools, loop until no more tool calls, inject results — but the message format differs. Anthropic uses stop_reason: "tool_use" to signal a tool call, and results go into the next user message as tool_result blocks.
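
Because the two formats carry the same information, a definition written for one can be converted mechanically to the other. The helper below is a small sketch; to_anthropic_tool is a name invented for this example, not part of either SDK.

```python
def to_anthropic_tool(openai_tool: dict) -> dict:
    """Convert an OpenAI-style function tool into the Anthropic tool shape."""
    fn = openai_tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # The JSON Schema is unchanged; only the key that holds it differs
        "input_schema": fn["parameters"],
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

anthropic_tool = to_anthropic_tool(openai_tool)
print(anthropic_tool)
```

Keeping one canonical definition and converting at the edge avoids the two copies drifting apart when a description or parameter changes.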

Designing Good Tools

The quality of your tool definitions determines the reliability of your agent far more than model choice does. A few principles that make a real difference in practice:

One tool, one purpose. Resist the temptation to build a single “do_everything” tool with a dozen parameters. Models select tools based on descriptions, and a sprawling tool with many optional parameters creates ambiguity. A narrow tool with a clear, single purpose is called correctly far more often.

Descriptions over names. The model reads the description to decide whether to call the tool, and reads the parameter descriptions to fill them in. A clear function name helps, but the description carries most of the weight. Write descriptions as if you were explaining the tool to a colleague who has never seen the codebase.

Return structured data. Tool results go back into the model’s context window. Return clean JSON rather than raw API responses full of irrelevant fields. Strip everything the model does not need — this reduces token usage and makes it easier for the model to extract the relevant information.
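
As a rough illustration of trimming, the sketch below reduces a raw payload to three fields. The payload shape and the slim_weather helper are hypothetical; a real API response will have its own field names.

```python
import json

def slim_weather(raw: dict) -> str:
    """Keep only the fields the model needs to answer a weather question."""
    return json.dumps({
        "city": raw["name"],
        "temp_c": raw["main"]["temp"],
        "condition": raw["weather"][0]["description"],
    })

# Hypothetical raw response with fields the model has no use for
raw = {
    "id": 2643743,
    "name": "London",
    "main": {"temp": 14.2, "humidity": 81, "pressure": 1012},
    "weather": [{"description": "light rain", "icon": "10d"}],
    "sys": {"sunrise": 1700000000, "sunset": 1700030000},
}

slim = slim_weather(raw)
print(slim)
```

The trimmed string is what goes into the tool message, so the model sees three fields instead of the whole payload.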

Make failures explicit. If a tool call fails, return a JSON error object with a clear message rather than raising an exception or returning an empty string. The model can then decide how to handle the failure — retry, ask the user, or report the error — rather than hallucinating a result.
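
One way to apply this is to route every tool call through a wrapper that converts failures into JSON. The sketch below is illustrative: safe_call, the error codes, and the divide tool are all names chosen for this example.

```python
import json

def safe_call(fn, raw_arguments: str) -> str:
    """Run a tool and always return a string the model can read."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": "invalid_arguments",
                           "message": f"Arguments were not valid JSON: {exc.msg}"})
    if not isinstance(args, dict):
        return json.dumps({"error": "invalid_arguments",
                           "message": "Arguments must be a JSON object."})
    try:
        return fn(**args)
    except TypeError as exc:
        # Usually a missing or unexpected parameter name
        return json.dumps({"error": "bad_parameters", "message": str(exc)})
    except Exception as exc:
        return json.dumps({"error": "tool_failed", "message": str(exc)})

def divide(a: float, b: float) -> str:
    # Hypothetical tool used only to trigger the different failure paths
    return json.dumps({"result": a / b})

ok = safe_call(divide, '{"a": 6, "b": 3}')
zero = safe_call(divide, '{"a": 6, "b": 0}')
missing = safe_call(divide, '{"a": 6}')
garbled = safe_call(divide, "not json")
print(ok, zero, missing, garbled, sep="\n")
```

Each failure comes back as a short, labelled object, so the model can tell a malformed request apart from a tool that ran and failed.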

Limit the tool set. Each tool you add lengthens the prompt and gives the model one more option to pick wrongly. As a rule of thumb, four to six well-designed tools outperform twelve mediocre ones. Start small and add tools only when you have a clear need.

Preventing Tool Abuse and Infinite Loops

Two failure modes are common in production tool-calling agents. The first is the model calling a tool repeatedly with the same arguments — a reasoning loop that never terminates. The second is the model hallucinating tool calls for tools that do not exist.

Both can be mitigated with simple safeguards in your agent loop:

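MAX_ITERATIONS = 10

# Map tool names to the Python functions that implement them.
# client, tools, and get_weather are the ones defined in the earlier examples.
REGISTERED_TOOLS = {"get_weather": get_weather}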

def run_agent(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    iterations = 0

    while iterations < MAX_ITERATIONS:
        iterations += 1
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            return msg.content

        for tool_call in msg.tool_calls:
            name = tool_call.function.name
            if name not in REGISTERED_TOOLS:
                result = json.dumps({"error": f"Tool '{name}' does not exist."})
            else:
                args = json.loads(tool_call.function.arguments)
                result = REGISTERED_TOOLS[name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })

    return "Agent reached maximum iterations without completing the task."

The iteration cap prevents runaway loops. The tool registry check catches hallucinated tool names before they reach your execution layer and cause runtime errors.
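
The iteration cap stops a repeated-call loop eventually, but only after the budget is spent. A complementary guard is to notice identical calls directly. The sketch below is one possible approach; RepeatGuard and its max_repeats threshold are hypothetical names, not part of any SDK.

```python
import json
from collections import Counter

class RepeatGuard:
    """Counts identical (tool name, arguments) pairs within one agent run."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, name: str, raw_arguments: str):
        """Return an error string once a call repeats too often, else None."""
        try:
            # Sort keys so argument order does not hide a repeat
            canonical = json.dumps(json.loads(raw_arguments), sort_keys=True)
        except json.JSONDecodeError:
            canonical = raw_arguments
        key = (name, canonical)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return json.dumps({
                "error": f"'{name}' was already called {self.max_repeats} times "
                         "with these arguments. Use the earlier result or try "
                         "different arguments."
            })
        return None

guard = RepeatGuard(max_repeats=2)
first = guard.check("get_weather", '{"city": "Tokyo", "units": "celsius"}')
second = guard.check("get_weather", '{"units": "celsius", "city": "Tokyo"}')
third = guard.check("get_weather", '{"city": "Tokyo", "units": "celsius"}')
other = guard.check("get_weather", '{"city": "Paris"}')
print(first, second, third, other, sep="\n")
```

In the loop, a non-None return value would be appended as the tool result instead of executing the tool again, which gives the model a chance to change course.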

Tool Calling with Open-Source Models

Many open-source models now support tool calling, including Llama 3.1+, Mistral, Qwen2.5, and Gemma 3. When running these locally with Ollama, you can use the OpenAI-compatible endpoint to call them with the same tool format:

client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

Tool calling reliability varies significantly between open-source models. Larger models (70B+) tend to follow tool schemas reliably. Smaller models (7B–13B) can hallucinate parameter values or call the wrong tool, particularly with complex schemas. Test thoroughly with your specific model and tool set before deploying.

When to Use Tool Calling vs. Other Patterns

Tool calling is the right choice when your application needs to take actions or retrieve information that is not in the model's training data. It is not always the best pattern though. If you are simply passing static context to the model — a document, a user profile, a set of facts — RAG or direct context injection is simpler and more reliable. Tool calling adds complexity: you need an agent loop, error handling, and careful tool design. Use it when the dynamic, action-taking capability is genuinely needed, not as a default architecture for every LLM feature.

Structured Outputs and Forced Tool Use

By default, tool_choice: "auto" lets the model decide whether to call a tool. There are two other modes worth knowing. Setting tool_choice: "required" forces the model to call at least one tool — useful when you need a structured extraction and do not want a prose response. Setting tool_choice: {"type": "function", "function": {"name": "my_tool"}} forces the model to call a specific tool, which is a reliable way to extract structured data from unstructured input.

This pattern is commonly used as a cheaper alternative to structured output APIs. Define a tool whose parameters match the schema of the data you want to extract, force the model to call it, and parse the arguments. For example, to extract key fields from a support ticket:

extract_ticket_tool = {
    "type": "function",
    "function": {
        "name": "extract_ticket_fields",
        "description": "Extract structured fields from a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "issue_type": {"type": "string", "enum": ["billing", "technical", "account", "other"]},
                "urgency": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "product_mentioned": {"type": "string"},
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative", "angry"]}
            },
            "required": ["issue_type", "urgency", "sentiment"]
        }
    }
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": ticket_text}],
    tools=[extract_ticket_tool],
    tool_choice={"type": "function", "function": {"name": "extract_ticket_fields"}}
)
fields = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

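This is a practical, cost-effective pattern that works with most providers that support forced tool calls, without needing a separate structured output API. The exact tool_choice syntax differs between APIs, and unless the provider enforces the schema strictly, the parsed arguments should be validated before use.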
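
Forced tool use does not by itself guarantee that the arguments match the schema, so a light check before trusting the parsed fields is reasonable. The sketch below is a hand-rolled check for required keys and enum values only; validate_fields is a name invented here, and a full JSON Schema validator would cover more cases.

```python
def validate_fields(fields: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the fields look valid."""
    problems = []
    props = schema["properties"]
    for key in schema.get("required", []):
        if key not in fields:
            problems.append(f"missing required field: {key}")
    for key, value in fields.items():
        if key not in props:
            problems.append(f"unexpected field: {key}")
            continue
        allowed = props[key].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{key}={value!r} is not one of {allowed}")
    return problems

# Same shape as the "parameters" object of the extraction tool
ticket_schema = {
    "type": "object",
    "properties": {
        "issue_type": {"type": "string", "enum": ["billing", "technical", "account", "other"]},
        "urgency": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "product_mentioned": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative", "angry"]},
    },
    "required": ["issue_type", "urgency", "sentiment"],
}

good = validate_fields(
    {"issue_type": "billing", "urgency": "high", "sentiment": "angry"}, ticket_schema)
bad = validate_fields(
    {"issue_type": "billing", "urgency": "urgent"}, ticket_schema)
print(good)
print(bad)
```

When the list is non-empty, one option is to send the problems back to the model as a follow-up message and ask for a corrected call.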

Observability: Logging Tool Calls in Production

Tool-calling agents are harder to debug than simple completions because failures can happen at multiple points: the model might call the wrong tool, pass invalid parameters, the tool itself might fail, or the model might misinterpret the result. Good logging is essential.

At minimum, log the full message history for every agent run, including all tool calls and their results. This lets you replay any failing run locally and understand exactly what the model saw at each step. Additionally, track per-tool metrics: how often each tool is called, how often it fails, and how long it takes. Anomalies in these metrics — a suddenly high failure rate on one tool, or an unusual increase in average iterations per run — are early signals of problems before they surface as user complaints.
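
A structured log and per-tool counters can be added with the standard library alone. The sketch below is one possible shape: logged_call, the metrics dictionary, and the log field names are all assumptions made for this example.

```python
import json
import logging
import time
from collections import defaultdict

logger = logging.getLogger("agent.tools")

# Per-tool counters: call count, failure count, cumulative latency
metrics = defaultdict(lambda: {"calls": 0, "failures": 0, "total_seconds": 0.0})

def logged_call(name: str, fn, args: dict, run_id: str) -> str:
    """Run a tool, record metrics, and emit one JSON log line per call."""
    start = time.perf_counter()
    ok = True
    try:
        result = fn(**args)
    except Exception as exc:
        ok = False
        result = json.dumps({"error": str(exc)})
    elapsed = time.perf_counter() - start

    stats = metrics[name]
    stats["calls"] += 1
    stats["failures"] += 0 if ok else 1
    stats["total_seconds"] += elapsed

    logger.info(json.dumps({
        "run_id": run_id, "tool": name, "args": args,
        "ok": ok, "seconds": round(elapsed, 4), "result": result,
    }))
    return result

def lookup(city: str) -> str:
    # Hypothetical tool that succeeds
    return json.dumps({"city": city, "temp": 22})

def broken(city: str) -> str:
    # Hypothetical tool that always fails
    raise RuntimeError("upstream timeout")

r1 = logged_call("lookup", lookup, {"city": "Seoul"}, run_id="run-001")
r2 = logged_call("broken", broken, {"city": "Seoul"}, run_id="run-001")
print(dict(metrics))
```

Dividing failures by calls gives the per-tool failure rate, and the run_id field lets all log lines from one agent run be pulled together for replay.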

Tools like LangSmith, Langfuse, and Weave from Weights and Biases all support tracing for multi-step tool-calling agents, capturing the full execution graph automatically. For simpler needs, a structured JSON log written to your existing logging infrastructure is often sufficient and requires no additional dependencies.

A Practical Starting Point

The fastest way to get a tool-calling agent working reliably is to start with the smallest number of tools that solve your actual problem, write meticulous descriptions for each one, add an iteration cap, log everything, and test with adversarial inputs before shipping. The agent loop itself is simple — the complexity is in the tool design and the error handling around it. Once those are solid, you can extend the tool set incrementally, guided by real usage patterns rather than speculation about what the model might need.

Tool Calling vs. MCP

It is worth distinguishing tool calling from the Model Context Protocol (MCP). Tool calling is a low-level API mechanism built into the model provider's API — it is how you define and invoke individual functions. MCP is a higher-level protocol that standardises how tools and data sources are exposed to AI agents across different providers and applications. In practice, MCP servers expose their capabilities as tools that an agent can call, so the two are complementary: MCP provides the discovery and transport layer, while tool calling is the runtime mechanism the model uses to invoke what it finds. If you are building a standalone application, tool calling with direct function definitions is simpler. If you are building a platform where multiple agents need to share tools and data sources, MCP is worth exploring as the connective tissue.
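
To make the relationship concrete: an MCP server describes each tool with a name, a description, and an inputSchema, which maps directly onto the function-tool format used earlier in this guide. The sketch below assumes the tool listing has already been fetched and is available as a plain dict; mcp_tool_to_openai and the search_docs tool are hypothetical names for this example, and no MCP client library is used.

```python
def mcp_tool_to_openai(mcp_tool: dict) -> dict:
    """Wrap an MCP tool description in the OpenAI function-tool shape."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            # MCP's inputSchema is already JSON Schema
            "parameters": mcp_tool["inputSchema"],
        },
    }

# Hypothetical entry as it might appear in a server's tool listing
mcp_tool = {
    "name": "search_docs",
    "description": "Search the internal documentation for a query string.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

openai_tool = mcp_tool_to_openai(mcp_tool)
print(openai_tool)
```

When the model then requests search_docs, the agent loop would forward the call to the MCP server rather than to a local Python function.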
