What Is Function Calling?
Function calling (also called tool use) allows LLMs to request the execution of functions you define, rather than just returning text. Instead of answering “the weather in London is 18°C” from training knowledge, the model can call a get_weather function with the argument {"location": "London"}, receive the actual current data, and incorporate it into a grounded response. This transforms LLMs from static knowledge bases into dynamic systems that can interact with APIs, databases, and external services.
The mechanism works through a structured negotiation: you provide the model with function definitions (name, description, parameter schema), it decides whether to call one, returns a structured tool call object with arguments, you execute the function and return the result, and the model uses that result to produce a final response. The model never directly executes code — it generates a request that your application executes and feeds back.
Defining Functions with JSON Schema
Functions are defined as JSON Schema objects describing their name, purpose, and parameters:
from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location. Use when the user asks about weather conditions.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country, e.g. 'London, UK' or 'Tokyo, Japan'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit. Default to celsius unless user specifies."
                    }
                },
                "required": ["location"]
            }
        }
    }
]
The description fields are critical — they guide the model’s decision about when and how to call the function. Write descriptions from the model’s perspective: “use when the user asks about X” rather than just naming what the function does. Ambiguous or missing descriptions lead to incorrect tool selection.
A Complete Function Calling Loop
def run_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    # First call: the model decides whether to use a tool
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto"  # model decides; or "required" / {"type":"function","function":{"name":"get_weather"}}
    )
    msg = response.choices[0].message

    # If no tool call, return the text response directly
    if not msg.tool_calls:
        return msg.content

    # Process each tool call
    messages.append(msg)  # add assistant message with tool_calls
    for tool_call in msg.tool_calls:
        args = json.loads(tool_call.function.arguments)

        # Execute the actual function
        if tool_call.function.name == "get_weather":
            result = get_weather(**args)  # your real implementation
        else:
            result = {"error": f"Unknown function: {tool_call.function.name}"}

        # Add tool result to messages
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        })

    # Second call: the model synthesises the final response using tool results
    final = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    return final.choices[0].message.content

print(run_with_tools("What's the weather like in Tokyo right now?"))
Multiple Tools and Parallel Calling
You can define multiple tools and the model will choose the most appropriate one — or call several in parallel when the query requires it. GPT-4o supports parallel tool calls, making multiple function calls in a single response when it determines they can be resolved independently:
tools = [
    {"type": "function", "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {"type": "object",
                       "properties": {"location": {"type": "string"}},
                       "required": ["location"]}}},
    {"type": "function", "function": {
        "name": "get_stock_price",
        "description": "Get current stock price",
        "parameters": {"type": "object",
                       "properties": {"ticker": {"type": "string", "description": "Stock ticker symbol e.g. AAPL"}},
                       "required": ["ticker"]}}},
    {"type": "function", "function": {
        "name": "search_news",
        "description": "Search recent news articles",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}, "days": {"type": "integer", "default": 7}},
                       "required": ["query"]}}}
]

# The model may call get_weather AND get_stock_price in parallel for this query
messages = [{"role": "user", "content": "What's the weather in NYC and the current AAPL stock price?"}]

response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools, tool_choice="auto",
    parallel_tool_calls=True  # default True for gpt-4o
)

# Handle all tool calls (tool_calls is None when the model answers in plain text)
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Calling: {tool_call.function.name}({tool_call.function.arguments})")
Process parallel tool calls by executing them concurrently — use asyncio.gather or a thread pool to run them simultaneously rather than sequentially, especially for I/O-bound functions like API calls.
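As a rough sketch of the thread-pool approach, the snippet below runs each tool call in its own worker and returns the "tool" messages in the same order the model requested them. The stub functions, the TOOL_REGISTRY name, and the fake tool-call objects are assumptions made for illustration; in a real application the calls would come from response.choices[0].message.tool_calls.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


# Hypothetical stand-ins for real I/O-bound tools (assumptions for this sketch).
def get_weather(location: str) -> dict:
    return {"location": location, "stub": True}


def get_stock_price(ticker: str) -> dict:
    return {"ticker": ticker, "stub": True}


TOOL_REGISTRY = {"get_weather": get_weather, "get_stock_price": get_stock_price}


def run_one(tool_call) -> dict:
    """Execute a single tool call and wrap the outcome as a 'tool' message."""
    try:
        fn = TOOL_REGISTRY[tool_call.function.name]
        result = fn(**json.loads(tool_call.function.arguments))
    except Exception as exc:  # report failures back to the model instead of crashing
        result = {"error": f"Tool execution failed: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


def run_tool_calls_concurrently(tool_calls, max_workers: int = 8) -> list:
    """Run all tool calls in a thread pool; results keep the input order."""
    if not tool_calls:
        return []
    workers = min(max_workers, len(tool_calls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, tool_calls))  # map() preserves order


# Fake tool calls shaped like the SDK objects (id, function.name, function.arguments).
fake_calls = [
    SimpleNamespace(id="call_1", function=SimpleNamespace(
        name="get_weather", arguments='{"location": "New York, US"}')),
    SimpleNamespace(id="call_2", function=SimpleNamespace(
        name="get_stock_price", arguments='{"ticker": "AAPL"}')),
]
tool_messages = run_tool_calls_concurrently(fake_calls)
for message in tool_messages:
    print(message["tool_call_id"], message["content"])
```

Threads suit blocking HTTP or database clients; if your tool functions are already coroutines, asyncio.gather plays the same role.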
Structured Outputs via Tool Forcing
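Tool forcing (setting tool_choice to a specific function) is a dependable way to extract structured data from unstructured input. The model must respond with a call to that function, so its output arrives as arguments shaped by the function's schema rather than as free text. Forcing the call does not by itself guarantee that the arguments satisfy the schema, so validate them before use, or enable strict schema enforcement on the function definition where the API supports it: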
extract_tool = {
    "type": "function",
    "function": {
        "name": "extract_contact",
        "description": "Extract contact information from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string", "format": "email"},
                "phone": {"type": "string"},
                "company": {"type": "string"}
            },
            "required": ["name"]
        }
    }
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hi, I'm Sarah Chen from Acme Corp. Reach me at sarah@acme.com or 555-0192."}],
    tools=[extract_tool],
    tool_choice={"type": "function", "function": {"name": "extract_contact"}}
)

contact = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(contact)
# Example output (model-generated, so exact values may vary):
# {"name": "Sarah Chen", "email": "sarah@acme.com", "phone": "555-0192", "company": "Acme Corp"}
This pattern serves the same purpose as OpenAI's Structured Outputs feature and is often more reliable than plain JSON mode, because the model is given an explicit parameter schema to fill rather than a general instruction to emit JSON. It works for any extraction or classification task: invoice parsing, sentiment analysis with structured scores, entity recognition, form field extraction.
Function Calling with Anthropic Claude
Claude uses the same conceptual pattern with slightly different API syntax. The tool definition format uses input_schema instead of parameters:
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City and country"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}]
)

# Check for tool use
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}, Input: {block.input}")

        # Execute the function and continue the conversation
        result = get_weather(**block.input)

        # Continue conversation with tool result
        followup = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=1024, tools=tools,
            messages=[
                {"role": "user", "content": "What's the weather in Paris?"},
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [{"type": "tool_result", "tool_use_id": block.id, "content": str(result)}]}
            ]
        )
        print(followup.content[0].text)
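When a single Claude response contains several tool_use blocks, the results are normally returned together in one user message, one tool_result block per tool_use id. The helper below is a minimal sketch of that step. The function name build_tool_result_message and the fake content blocks are assumptions made for illustration; real blocks come from response.content.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def build_tool_result_message(content_blocks, execute: Callable[[str, dict], object]) -> Optional[dict]:
    """Run every tool_use block and collect the results into one user message.

    Returns None when the response contained no tool_use blocks.
    """
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue  # skip text blocks
        try:
            output = execute(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            })
        except Exception as exc:
            # is_error tells the model that this tool call failed
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"Error: {exc}",
                "is_error": True,
            })
    return {"role": "user", "content": results} if results else None


# Fake content blocks shaped like the SDK objects (assumption for this sketch).
fake_blocks = [
    SimpleNamespace(type="text", text="Let me check both cities."),
    SimpleNamespace(type="tool_use", id="toolu_1", name="get_weather", input={"location": "Paris, France"}),
    SimpleNamespace(type="tool_use", id="toolu_2", name="get_weather", input={"location": "Rome, Italy"}),
]


def fake_execute(name: str, args: dict) -> dict:
    return {"tool": name, "location": args["location"], "stub": True}


tool_result_message = build_tool_result_message(fake_blocks, fake_execute)
print(tool_result_message)
```

The returned dictionary can be appended to the messages list after the assistant turn, in place of the single-result user message shown above.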
Agentic Loops: Continuing Until Done
For agents that need to take multiple actions before completing a task, wrap the tool-calling step in a bounded loop that continues until the model returns a text response with no tool calls or an iteration limit is reached:
def run_agent(task: str, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": task}]

    for i in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            return msg.content  # Agent finished

        for tool_call in msg.tool_calls:
            result = execute_tool(tool_call.function.name,
                                  json.loads(tool_call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
        print(f"Iteration {i+1}: called {len(msg.tool_calls)} tool(s)")

    return "Max iterations reached"
Set a reasonable max_iterations ceiling to prevent runaway agents that loop indefinitely on ambiguous tasks. Log each tool call for debugging — agentic failures are much easier to diagnose when you can see exactly what the agent tried to do at each step.
Error Handling and Validation
Function arguments from the model should be validated before execution — the model occasionally produces arguments that violate your schema or contain unexpected values. Always wrap tool execution in try/except and return structured error responses that the model can use to self-correct:
def execute_tool(name: str, args: dict) -> dict:
    try:
        if name == "get_weather":
            if "location" not in args or not args["location"].strip():
                return {"error": "location is required and cannot be empty"}
            return get_weather(args["location"], args.get("unit", "celsius"))
        return {"error": f"Unknown tool: {name}"}
    except Exception as e:
        return {"error": f"Tool execution failed: {str(e)}"}
Returning {"error": "...message..."} rather than raising an exception allows the model to see the failure and attempt a correction in the next iteration — often producing a valid second attempt without any application-level retry logic.
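Hand-written checks like the one above grow quickly as tools are added. One option is to validate arguments against the same JSON Schema you already send to the model. The sketch below is a deliberately small, hand-rolled checker covering only required fields, basic types, and enums; validate_args and TYPE_MAP are names invented for this example, and a full validator such as the jsonschema package covers far more of the specification.

```python
# Maps JSON Schema type names to Python types (only the basics are covered).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_args(schema: dict, args: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    properties = schema.get("properties", {})

    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")

    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            errors.append(f"{name} must be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")

    return errors


WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_args(WEATHER_SCHEMA, {"location": "London, UK", "unit": "kelvin"}))
print(validate_args(WEATHER_SCHEMA, {"unit": "celsius"}))
```

Joining the returned messages into the {"error": ...} response gives the model a concrete description of what to fix on its next attempt.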
When to Use Function Calling
Function calling is the right pattern when your application needs: real-time data (weather, stock prices, live search results), actions in external systems (database writes, API calls, sending emails), structured data extraction from unstructured text, or multi-step agentic workflows that require decisions at each step. It is not the right pattern for static knowledge retrieval where RAG is more appropriate, or for simple prompt-response applications where the model’s training knowledge is sufficient. The best LLM applications combine both — RAG for grounding responses in your document corpus, and function calling for dynamic actions that require real-time data or external system interaction.
Streaming with Function Calls
For responsive UIs, you can stream function calling responses. The model streams the tool call arguments as they are generated, allowing you to start processing before the full response arrives:
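stream = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools, stream=True
)

# Arguments arrive as fragments. Key them by index so that parallel
# tool calls are not concatenated into a single string.
pending = {}  # index -> {"id": str, "name": str, "arguments": str}
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks carry no choices
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        call = pending.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            call["id"] = tc.id
        if tc.function and tc.function.name:
            call["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            call["arguments"] += tc.function.arguments

# Once the stream completes, execute each collected call
for call in pending.values():
    args = json.loads(call["arguments"] or "{}")
    result = execute_tool(call["name"], args)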
Streaming is most useful when the model generates a long text response after receiving tool results — the final answer streams token by token to the user while the tool execution happened in one shot beforehand.
Function Calling Best Practices
A few patterns consistently improve reliability. Keep function descriptions specific and action-oriented: “search the product catalogue by name or category” is better than “search products.” Write parameter descriptions that explain the expected format and constraints: “ISO 8601 date string, e.g. 2026-06-15” stops the model from guessing the format. Keep the tool list short; as a rule of thumb, pass no more than 10–15 tools in a single call, because a long list makes tool selection harder and tends to degrade accuracy. Use tool_choice="required" when the model must call a tool rather than respond with text. And always log tool call arguments alongside execution results: when an agent behaves unexpectedly, the argument log is the fastest way to tell whether the model misunderstood the task or the function implementation was wrong.
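The logging recommendation can be sketched as a thin wrapper around whatever dispatcher you already use. The names call_with_logging and the "tool_calls" logger are assumptions for this example, and the dispatcher passed in is a placeholder for your own execute_tool.

```python
import json
import logging
import time

logger = logging.getLogger("tool_calls")


def call_with_logging(name: str, args: dict, execute):
    """Run execute(name, args) and log arguments, result, and duration as one JSON line."""
    start = time.perf_counter()
    result = execute(name, args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info(json.dumps({
        "tool": name,
        "arguments": args,
        "result": result,
        "elapsed_ms": round(elapsed_ms, 2),
    }, default=str))  # default=str keeps non-JSON values from raising
    return result


# Usage with a placeholder dispatcher (assumption for this sketch).
logging.basicConfig(level=logging.INFO)


def demo_execute(name: str, args: dict) -> dict:
    return {"echo": args}


demo_result = call_with_logging("get_weather", {"location": "London, UK"}, demo_execute)
```

Keeping the arguments and the result on the same log line makes it straightforward to tell a bad argument from a bad implementation when reading the log later.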
Real-World Use Cases
Function calling unlocks a wide range of production applications that would otherwise require complex custom parsing logic. Database query interfaces: define a query_database function that accepts natural language, translates it to SQL internally, and returns results — giving non-technical users database access through plain language without exposing SQL directly. Customer support automation: functions for get_order_status, process_refund, update_shipping_address give the support LLM the ability to take real actions rather than just answering questions. Calendar and scheduling assistants: check_availability, create_event, and send_invite functions allow a conversational scheduling assistant to actually book meetings. Financial data dashboards: functions that query your data warehouse return live metrics into LLM-generated narrative summaries — the model explains what the numbers mean in plain language, calling the data functions it needs to ground its explanation in current figures. In all of these cases, the core value is the same: the model provides the natural language understanding and reasoning, while your functions provide the actual data and actions — a division of labour that plays to each component’s strengths.
Testing Function Calling Applications
Testing LLM function calling requires verifying two distinct things: that the model calls the right function with the right arguments for a given input, and that your function implementations handle the full range of arguments the model might plausibly generate. Write unit tests for each function that cover valid inputs, edge cases, and the error response format. Write integration tests that send representative natural language queries and assert on which tool was called and with what arguments — these tests catch prompt or schema changes that break tool selection. Mock the actual external APIs in tests so your test suite does not depend on network availability or incur API costs. Track tool call accuracy over time the same way you track model response quality — it is a leading indicator of agent reliability that often degrades silently as prompts or schemas change.
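The unit-test half of this can be sketched with the standard library alone. The example below rebuilds a dispatcher similar to the earlier execute_tool, but with the weather backend injected so that a mock can replace the real API. make_executor and the test names are assumptions for illustration, and the model-facing integration tests are not covered here because they need a live or recorded model response.

```python
from unittest.mock import Mock


def make_executor(get_weather):
    """Build an execute_tool function around an injectable weather backend."""
    def execute_tool(name: str, args: dict) -> dict:
        try:
            if name == "get_weather":
                location = args.get("location")
                if not isinstance(location, str) or not location.strip():
                    return {"error": "location is required and cannot be empty"}
                return get_weather(location, args.get("unit", "celsius"))
            return {"error": f"Unknown tool: {name}"}
        except Exception as exc:
            return {"error": f"Tool execution failed: {exc}"}
    return execute_tool


def test_valid_arguments_reach_backend():
    backend = Mock(return_value={"temperature": 18, "unit": "celsius"})
    result = make_executor(backend)("get_weather", {"location": "London, UK"})
    backend.assert_called_once_with("London, UK", "celsius")
    assert result == {"temperature": 18, "unit": "celsius"}


def test_empty_location_is_rejected_without_calling_backend():
    backend = Mock()
    result = make_executor(backend)("get_weather", {"location": "   "})
    backend.assert_not_called()
    assert "error" in result


def test_backend_failure_becomes_structured_error():
    backend = Mock(side_effect=TimeoutError("upstream timed out"))
    result = make_executor(backend)("get_weather", {"location": "Tokyo, Japan"})
    assert result == {"error": "Tool execution failed: upstream timed out"}


def test_unknown_tool_is_reported():
    result = make_executor(Mock())("delete_everything", {})
    assert result == {"error": "Unknown tool: delete_everything"}


ALL_TESTS = (
    test_valid_arguments_reach_backend,
    test_empty_location_is_rejected_without_calling_backend,
    test_backend_failure_becomes_structured_error,
    test_unknown_tool_is_reported,
)
for test in ALL_TESTS:
    test()
print("all tests passed")
```

The same functions run unchanged under pytest; the explicit loop at the end is only there so the file also works as a plain script.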
Function Calling vs. RAG: Choosing the Right Approach
A common architectural question is when to use function calling versus RAG for grounding LLM responses in external data. The distinction comes down to data freshness, structure, and action requirements. RAG is the right pattern when: you have a large corpus of documents that change infrequently, retrieval is the primary need and no actions are required, and semantic similarity search is the appropriate way to find relevant information. Function calling is the right pattern when: you need real-time or frequently changing data that cannot be pre-indexed, the data is structured and returned from APIs or databases rather than unstructured documents, and the application needs to take actions (write, update, send) rather than just read. Many production applications benefit from both simultaneously — RAG handles the document knowledge base while function calling handles live data queries and system actions. The models manage both patterns naturally within the same conversation, calling retrieval functions when they need document context and action functions when they need to do something in an external system.
Security Considerations
Function calling introduces security concerns that do not apply to pure text generation. The model controls what arguments are passed to your functions — and those arguments derive from user input that could be malicious. Always validate and sanitise function arguments before using them in database queries, API calls, or file operations. SQL injection via function arguments is a real risk if you construct queries by string interpolation from model output. File path traversal is a risk for any function that operates on file paths derived from model arguments. Rate limit function calls per user session to prevent resource exhaustion from adversarially crafted inputs that cause expensive tool chains. Treat model-generated arguments with the same distrust you would apply to any other user-controlled input — because they ultimately derive from one.
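Two of these risks have well-known standard-library mitigations, sketched below under simple assumptions: parameterised queries for SQL, and resolving paths against a fixed base directory for file access. The orders table, the uploads directory, and the function names find_orders and resolve_inside are invented for this example. Path.is_relative_to requires Python 3.9 or later.

```python
import sqlite3
from pathlib import Path


def find_orders(conn: sqlite3.Connection, customer_name: str) -> list:
    """Look up orders with a parameterised query; the driver handles escaping."""
    cursor = conn.execute(
        "SELECT id, item FROM orders WHERE customer = ?",  # never build this with f-strings
        (customer_name,),
    )
    return cursor.fetchall()


def resolve_inside(base_dir: Path, requested: str) -> Path:
    """Resolve a model-supplied path and refuse anything outside base_dir."""
    base = base_dir.resolve()
    candidate = (base / requested).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes the allowed directory: {requested}")
    return candidate


# Demonstration with an in-memory database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", "keyboard"), (2, "bob", "monitor")],
)

print(find_orders(conn, "alice"))
# An injection attempt is treated as a literal customer name and matches nothing.
print(find_orders(conn, "alice' OR '1'='1"))

uploads = Path("uploads")
print(resolve_inside(uploads, "reports/q1.txt").name)
try:
    resolve_inside(uploads, "../secrets.txt")
except ValueError as exc:
    print(exc)
```

Neither check replaces authorisation: the function should still confirm that the current user is allowed to see the rows or files being requested.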
Function calling is one of the most powerful and underutilised LLM features available today — the gap between teams that use it well and teams that treat LLMs as pure text generators is often the difference between applications that are impressive demos and applications that deliver real operational value.