How Agents Decide What Tool to Call

The promise of AI agents is autonomy—systems that reason about tasks, select appropriate tools, and execute multi-step workflows without constant human guidance. But watch an agent in action and you’ll often see baffling tool selection: calling a web search when a calculator would work, invoking database queries for information already available in the recent conversation, or repeatedly choosing the wrong tool despite previous failures. Understanding how agents decide which tool to call reveals both the sophisticated reasoning involved and the fundamental limitations that cause these failures.

Tool selection isn’t a simple keyword matching problem. When you ask an agent “What’s 15% of the GDP of France?”, it must recognize that the task requires first finding France’s GDP (web search or knowledge lookup) and then performing a calculation (calculator tool). The agent needs to understand task decomposition, tool capabilities, and execution sequencing. Different agent frameworks use different approaches—from simple prompt engineering to sophisticated reasoning loops—with vastly different success rates. This exploration reveals what actually happens inside agent decision-making and why it so often goes wrong.

The Tool Selection Problem

Before understanding how agents choose tools, we need to appreciate why tool selection is challenging in the first place.

What Makes Tool Selection Hard

Agents face an open-ended selection problem with multiple valid approaches. Consider “Book me a flight to Tokyo.” An agent might:

  • Search flights → Select option → Book
  • Check calendar for dates → Search flights → Book
  • Ask user for preferences → Search flights → Book
  • Check budget → Search flights in price range → Book

All these workflows are reasonable but require different tool sequences. The agent must decide not just which individual tool to call but the entire execution strategy.

Tool descriptions are ambiguous. A “web_search” tool might be described as “searches the internet for information.” But should this be used for:

  • Current news? (Yes)
  • Historical facts? (Maybe, depends on recency needs)
  • Personal calendar events? (No, that’s a different tool)
  • Mathematical calculations? (No, use calculator)

The boundary between tool capabilities isn’t crisp, forcing agents to make judgment calls that sometimes fail.

The Vocabulary Problem

Tool descriptions use natural language that LLMs must interpret. Consider these tool descriptions:

search_web: "Search the internet for current information"
query_database: "Query the internal knowledge database"
get_weather: "Get current weather for a location"

An agent seeing “What’s the weather in Paris?” must map:

  • “weather” in the query to “weather” in tool name (obvious)
  • “Paris” to the required location parameter
  • “current” (implied in query) to “current” in tool description

This seems simple but fails frequently when queries use synonyms, indirect references, or ambiguous language. “How hot is it in the French capital?” requires recognizing “hot” relates to weather and “French capital” means Paris.

Prompt-Based Tool Selection

The simplest agent implementations use prompt engineering to guide tool selection.

Tool Description in Prompts

The agent’s system prompt includes tool descriptions:

You are an AI assistant with access to these tools:

1. web_search(query: str) -> str
   Searches the internet for current information.
   Use when you need recent news, current events, or information 
   that changes frequently.

2. calculator(expression: str) -> float
   Evaluates mathematical expressions.
   Use for any arithmetic, percentage calculations, or math problems.

3. get_weather(location: str) -> dict
   Returns current weather for a location.
   Use when asked about weather, temperature, or conditions.

When answering questions, think about which tool (if any) would help.
Respond with tool calls in this format:
TOOL: tool_name(arguments)

The LLM then reasons about which tool to use based on these descriptions and the user query.

Example internal reasoning (not visible to users):

User: "What's 15% of France's GDP?"

Agent reasoning:
- This requires France's GDP (web_search)
- Then calculating 15% (calculator)
- First action: web_search("France GDP")
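
On the implementation side, the agent runtime has to parse that TOOL: line and run the matching function. A minimal sketch of what that might look like; the tool implementations below are placeholders, not any real API:

import re

# Placeholder tool implementations keyed by the names used in the system prompt
TOOLS = {
    "web_search": lambda query: f"(search results for: {query})",
    "calculator": lambda expression: str(eval(expression, {"__builtins__": {}}, {})),
    "get_weather": lambda location: f"(current weather for: {location})",
}

def handle_agent_reply(reply):
    """Run the requested tool if the reply contains a TOOL: line, else return the reply as-is."""
    match = re.search(r'TOOL:\s*(\w+)\((.*)\)', reply)
    if not match:
        return reply  # no tool call; treat the reply as a direct answer
    name, raw_args = match.group(1), match.group(2).strip('"\' ')
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](raw_args)

# Example
print(handle_agent_reply('TOOL: web_search("France GDP")'))  # (search results for: France GDP)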

Strengths of this approach:

  • Simple to implement
  • Flexible—just update prompts to add tools
  • Leverages LLM’s reasoning abilities
  • Works reasonably well for obvious tool choices

Weaknesses:

  • Relies entirely on LLM interpretation
  • No learning from mistakes
  • Tool descriptions must be very clear
  • Ambiguous cases handled inconsistently

Few-Shot Examples in Prompts

Adding examples improves tool selection:

You have access to: web_search, calculator, database_query

Examples:

User: "What's 20% of 550?"
Agent: TOOL: calculator(550 * 0.20)

User: "Who won the 2023 World Series?"
Agent: TOOL: web_search("2023 World Series winner")

User: "What's in our Q3 sales report?"
Agent: TOOL: database_query("SELECT * FROM reports WHERE quarter='Q3'")

Now answer the user's question:

Examples teach the agent patterns:

  • Calculations → calculator
  • Recent events → web_search
  • Internal data → database_query

This significantly improves accuracy for queries similar to examples, but doesn’t generalize perfectly to novel situations.

Tool Selection Approaches

Prompt-Based Selection
  • Mechanism: Tool descriptions in the system prompt; the LLM chooses based on reasoning
  • Accuracy: 60-75% for typical tasks
  • Pros: Simple, flexible, no training needed
  • Cons: Inconsistent, no learning, relies on prompt quality

ReAct (Reasoning + Acting)
  • Mechanism: Agent reasons explicitly about tool choice before acting
  • Accuracy: 70-80% for typical tasks
  • Pros: Transparent reasoning, better on complex workflows
  • Cons: Slower, more tokens, still makes mistakes

Function Calling (Structured Output)
  • Mechanism: Model outputs structured tool calls in JSON/function format
  • Accuracy: 75-85% for typical tasks
  • Pros: Reliable parsing, better tool chaining
  • Cons: Requires model support, less flexible

Fine-Tuned Routing
  • Mechanism: Separate model trained to classify queries → tools
  • Accuracy: 85-95% for domain-specific tasks
  • Pros: Very accurate within training domain
  • Cons: Requires training data, doesn’t generalize to new tools

ReAct: Reasoning and Acting

The ReAct pattern makes agent reasoning explicit, improving tool selection through structured thought processes.

The ReAct Pattern

ReAct alternates between Reasoning and Acting:

Thought: I need to find France's GDP first
Action: web_search("France GDP 2023")
Observation: France's GDP is approximately €2.7 trillion

Thought: Now I need to calculate 15% of 2.7 trillion
Action: calculator(2.7 * 0.15)
Observation: 0.405

Thought: I have the answer
Final Answer: 15% of France's GDP is approximately €405 billion

Each step includes:

  • Thought: Explicit reasoning about what to do next
  • Action: Tool call with arguments
  • Observation: Tool result
  • Repeat until ready for final answer
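
Putting these steps together, a minimal ReAct loop might look like the sketch below. Here llm_generate and the tools dictionary are placeholders for whatever model client and tool implementations you use, and the Action format is assumed to be Action: tool_name("argument"):

import re

def react_loop(question, llm_generate, tools, max_steps=8):
    """Alternate Thought/Action/Observation until the model emits a Final Answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for the next Thought (which may contain an Action or a Final Answer)
        step = llm_generate(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"

        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()

        match = re.search(r'Action:\s*(\w+)\((.*)\)', step)
        if match:
            tool_name = match.group(1)
            argument = match.group(2).strip('"\' ')
            # Run the tool and append the Observation so the model sees the result next turn
            observation = tools.get(tool_name, lambda _: f"Unknown tool: {tool_name}")(argument)
            transcript += f"Observation: {observation}\n"

    return "Stopped: no final answer within the step limit"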

How ReAct Improves Selection

Explicit reasoning helps the agent:

  • Break down complex queries into steps
  • Recognize when previous tool didn’t work
  • Adjust strategy based on results
  • Avoid repeating failed approaches

Example of self-correction:

User: "Find the email address for the CEO of Anthropic"

Thought: I should search for this information
Action: web_search("Anthropic CEO email address")
Observation: No email addresses found, privacy policy suggests 
contacting through website form

Thought: The web search didn't find an email. I shouldn't make one up.
Instead I should explain how to contact them.
Final Answer: I couldn't find a direct email address for Anthropic's 
CEO. You can contact them through their website contact form at...

Without explicit reasoning, the agent might hallucinate an email or keep searching unsuccessfully.

ReAct Limitations

Reasoning isn’t always correct. The agent might:

  • Think a tool is needed when it’s not
  • Choose the wrong tool based on faulty reasoning
  • Get stuck in reasoning loops

Example failure:

Thought: I need current weather data
Action: web_search("Paris weather")
Observation: [weather website content, messy HTML]

Thought: The search gave me a weather website but not structured data.
Action: web_search("Paris weather today temperature")
Observation: [different weather site, still messy]

Thought: Still not getting clean data. Let me try more specific search.
Action: web_search("Paris weather forecast Celsius")
[Continues searching instead of using the get_weather tool]

The agent correctly recognizes search isn’t working but incorrectly decides to keep searching rather than switching to a weather-specific tool.

Function Calling and Structured Output

Modern LLMs support outputting structured tool calls directly, making agent implementations more reliable.

How Function Calling Works

Instead of freeform text, the model outputs structured JSON:

{
  "tool": "web_search",
  "arguments": {
    "query": "France GDP 2023"
  }
}

Tool schemas are provided to the model:

{
  "name": "web_search",
  "description": "Search the web for current information",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search query"
      }
    },
    "required": ["query"]
  }
}

The model selects tools based on:

  • Query content
  • Tool descriptions
  • Parameter schemas
  • Its training on function calling patterns
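
On the application side, structured output is easy to act on: parse the JSON and dispatch to an ordinary function registered under the tool name. A minimal sketch, where the stub tools and registry are illustrative rather than part of any specific SDK:

import json

# Stub implementations registered under the names from the tool schemas
def web_search(query):
    return f"(search results for: {query})"

def calculator(expression):
    # A production agent should use a safe math parser rather than eval()
    return eval(expression, {"__builtins__": {}}, {})

TOOL_REGISTRY = {"web_search": web_search, "calculator": calculator}

def dispatch_tool_call(raw_model_output):
    """Parse the model's structured tool call and run the matching function."""
    call = json.loads(raw_model_output)
    tool = TOOL_REGISTRY[call["tool"]]   # KeyError here means an unknown tool name
    return tool(**call["arguments"])     # schema properties map directly to kwargs

print(dispatch_tool_call('{"tool": "web_search", "arguments": {"query": "France GDP 2023"}}'))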

Advantages Over Prompt-Based Selection

Reliable parsing: JSON output is structured and parseable, unlike freeform text where agents might format tool calls incorrectly.

Better parameter extraction: The schema guides the model to provide arguments in the correct format:

// Correct parameter extraction
{
  "tool": "get_weather",
  "arguments": {
    "location": "Paris, France",
    "units": "celsius"
  }
}

// vs. freeform text that might be:
// "get weather in Paris"
// "weather(Paris)"  
// "TOOL: Paris weather Celsius"

Tool chaining: Models can output multiple tool calls in sequence:

[
  {
    "tool": "web_search",
    "arguments": {"query": "France GDP"}
  },
  {
    "tool": "calculator",
    "arguments": {"expression": "result * 0.15"}
  }
]
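
How the second call receives the first call’s output is framework-specific. One simple, hypothetical convention is to substitute the previous result for the literal placeholder "result"; many runtimes instead return each observation to the model and let it issue the next call itself. A sketch of the substitution approach, where run_tool is whatever dispatcher executes a single call (for example, the one shown above):

def run_chain(tool_calls, run_tool):
    """Execute tool calls in order, threading each result into the next call's arguments.

    Assumes the placeholder convention above: the literal string "result" in a
    string argument is replaced by the previous tool's output.
    """
    result = None
    for call in tool_calls:
        args = {}
        for key, value in call["arguments"].items():
            if isinstance(value, str) and result is not None:
                value = value.replace("result", str(result))
            args[key] = value
        result = run_tool(call["tool"], args)
    return result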

Function Calling Still Makes Mistakes

The model can call the wrong function despite structured output:

// User: "What's the weather in the French capital?"

// Incorrect selection:
{
  "tool": "web_search",
  "arguments": {"query": "weather French capital"}
}

// Correct selection:
{
  "tool": "get_weather",
  "arguments": {"location": "Paris, France"}
}

The structure is correct, but tool choice is wrong. Function calling improves parameter handling, not necessarily tool selection reasoning.

Tool Selection in Multi-Tool Scenarios

When agents have many tools, selection becomes significantly harder.

The Overwhelming Choice Problem

With 20+ tools available, even humans struggle to remember which tool does what. Agents face similar challenges.

Example tool set:

  • web_search
  • calculator
  • get_weather
  • send_email
  • create_calendar_event
  • query_database
  • file_search
  • image_generation
  • code_execution
  • stock_price_lookup
  • translation
  • summarization
  • [10 more tools…]

Given “Schedule a meeting with John”, the agent must recognize this needs create_calendar_event, not:

  • send_email (to invite John)
  • query_database (to find John’s availability)
  • web_search (to find meeting times)

All these seem plausible without careful consideration.

Tool Categorization

Organizing tools into categories helps selection:

Communication tools: send_email, send_slack, create_ticket
Data retrieval: web_search, database_query, file_search
Computation: calculator, code_execution, data_analysis
Scheduling: create_event, check_calendar, find_availability

The agent prompt includes categories:

When selecting tools, first identify the task category:
- Communication: Sending messages or notifications
- Data retrieval: Finding information
- Computation: Calculations or data processing
- Scheduling: Calendar and time management

Then choose the appropriate tool within that category.

This two-stage selection (category → specific tool) reduces errors by narrowing choices.
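
A sketch of two-stage selection, assuming a hypothetical category map and an llm_generate callable that returns the model's text response:

TOOL_CATEGORIES = {
    "communication": ["send_email", "send_slack", "create_ticket"],
    "data_retrieval": ["web_search", "database_query", "file_search"],
    "computation": ["calculator", "code_execution", "data_analysis"],
    "scheduling": ["create_event", "check_calendar", "find_availability"],
}

def select_tool_two_stage(query, llm_generate):
    """Stage 1: choose a task category. Stage 2: choose a tool within that category."""
    category = llm_generate(
        f"Task: {query}\n"
        f"Which category fits best: {', '.join(TOOL_CATEGORIES)}?\n"
        "Answer with the category name only."
    ).strip().lower()

    # Fall back to the full tool list if the model names an unknown category
    candidates = TOOL_CATEGORIES.get(category, sum(TOOL_CATEGORIES.values(), []))

    return llm_generate(
        f"Task: {query}\n"
        f"Choose exactly one tool from: {', '.join(candidates)}\n"
        "Answer with the tool name only."
    ).strip()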

Dynamic Tool Filtering

Not all tools are relevant for every query. Smart agents filter before selecting:

from dataclasses import dataclass, field

@dataclass
class Tool:
    """Minimal tool record: a name plus keywords that hint at relevance."""
    name: str
    keywords: list = field(default_factory=list)

def filter_relevant_tools(query, all_tools):
    """Filter tools likely relevant to the query."""

    # Simple keyword matching against the lowercased query
    query_lower = query.lower()

    relevant = []
    for tool in all_tools:
        # Check if the tool name or any of its keywords appear in the query
        name_words = tool.name.replace("_", " ")
        if name_words in query_lower or any(keyword in query_lower for keyword in tool.keywords):
            relevant.append(tool)

    # If no matches, include all tools (conservative fallback)
    return relevant if relevant else all_tools

# Usage
all_tools = [
    Tool("get_weather", ["weather", "temperature", "forecast"]),
    Tool("web_search", ["search", "news", "weather", "who", "when"]),
    Tool("calculator", ["calculate", "percent", "sum"]),
    Tool("send_email", ["email", "send", "message"]),
    Tool("database_query", ["database", "report", "records"]),
]
query = "What's the weather in Tokyo?"
filtered = filter_relevant_tools(query, all_tools)
# Returns: [get_weather, web_search]
# Excludes: calculator, send_email, database_query

The agent then chooses among 2-3 tools instead of 20, dramatically improving selection accuracy.

Learning from Mistakes

Basic agents never learn from errors. Advanced approaches incorporate feedback loops.

Error Detection and Retry Logic

When a tool call fails, better agents recognize the failure and try alternatives:

User: "What's the population of Tokyo?"

Attempt 1:
Action: calculator("Tokyo population")
Observation: Error - calculator expects numeric expression

Thought: Calculator was wrong tool. This needs data lookup.
Action: web_search("Tokyo population")
Observation: Tokyo has approximately 14 million people

Final Answer: Tokyo's population is approximately 14 million.

The agent detected calculator was inappropriate and switched to search. This requires:

  • Recognizing the error message
  • Understanding why it failed
  • Knowing what alternative to try

Many agents fail at this and either give up or keep retrying the same wrong tool.
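
A sketch of the retry behavior, where select_tool and run_tool stand in for the agent's own selection and execution logic:

def call_with_fallback(query, candidate_tools, select_tool, run_tool):
    """Try the selected tool; if it errors, exclude it and re-select rather than retrying it."""
    remaining = list(candidate_tools)
    while remaining:
        tool = select_tool(query, remaining)
        try:
            return run_tool(tool, query)
        except Exception as error:
            # Record the failure and make sure the same tool isn't chosen again
            print(f"{tool} failed ({error}); trying a different tool")
            remaining.remove(tool)
    return None  # every candidate failed; surface this to the user instead of guessing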

Confidence Scoring

Some frameworks score tool selection confidence:

Query: "What's 2+2?"

Tool scores:
- calculator: 0.95 (high confidence - obvious math)
- web_search: 0.15 (low - no need to search)
- database_query: 0.05 (very low - not internal data)

When confidence is low (<0.6), the agent might:

  • Ask the user to clarify
  • Try multiple tools in parallel
  • Request more information

This prevents:

  • Confidently choosing the wrong tool
  • Wasting time on likely-wrong approaches
  • Frustrating users with obviously incorrect tool calls
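
A sketch of how such thresholding might be implemented, assuming the model can be prompted to return per-tool scores as JSON (llm_generate is a placeholder callable):

import json

CONFIDENCE_THRESHOLD = 0.6  # below this, clarify with the user instead of guessing

def choose_tool_with_confidence(query, tool_names, llm_generate):
    """Ask the model to score each tool; act only when the top score clears the threshold."""
    prompt = (
        f"Query: {query}\n"
        f"Tools: {', '.join(tool_names)}\n"
        "Return a JSON object mapping each tool name to a confidence score between 0 and 1."
    )
    scores = json.loads(llm_generate(prompt))

    best_tool, best_score = max(scores.items(), key=lambda item: item[1])
    if best_score < CONFIDENCE_THRESHOLD:
        return None  # caller should ask a clarifying question or try tools in parallel
    return best_tool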

Usage Logs and Patterns

Production agents can learn from logs:

from fnmatch import fnmatch

# Track which tools work for different query patterns (success rates from production logs)
query_patterns = {
    "what's * weather*": "get_weather",  # 95% success rate
    "calculate *": "calculator",         # 98% success rate
    "who won *": "web_search",           # 87% success rate
}

def suggest_tool(query):
    """Suggest a tool based on historical patterns (case-insensitive wildcard match)."""
    for pattern, tool in query_patterns.items():
        if fnmatch(query.lower(), pattern):
            return tool
    return None  # Fall back to LLM reasoning

This pattern matching augments LLM selection, catching common cases quickly and accurately.

Why Tool Selection Fails

Ambiguous Tool Descriptions
Tool descriptions overlap or are vague. “Search for information” could mean web search, database search, or file search, so the agent guesses wrong.

Poor Query Understanding
The agent misinterprets user intent. “How hot is it?” could mean temperature (weather tool) or spiciness level (different context).

No Feedback Loop
The agent doesn’t recognize when tool selection was wrong and keeps retrying the same failed approach without switching strategies.

Context Limitations
Tool descriptions and examples consume context. With 20+ tools, descriptions might be truncated or omitted, causing selection errors.

Overfitting to Examples
An agent given few-shot examples works well for queries similar to those examples but fails on patterns they don’t cover.

Practical Strategies for Better Tool Selection

Developers can significantly improve agent tool selection through thoughtful design.

Clear, Distinct Tool Descriptions

Make tool purposes unambiguous:

Bad descriptions:

search: "Search for information"
lookup: "Look up data"
find: "Find things"

Good descriptions:

web_search: "Search the public internet for current information, 
news, and facts. Use for recent events, public knowledge, and 
real-time data."

database_query: "Query the company's internal PostgreSQL database. 
Use ONLY for proprietary data, customer records, and internal reports. 
Never use for public information."

file_search: "Search uploaded documents and files in the current 
project. Use for finding information in PDFs, docs, and other 
files the user has provided."

Clear boundaries prevent overlap and confusion.

Provide Negative Examples

Show when NOT to use tools:

calculator tool:
✓ Use for: "What's 15% of 200?", "Calculate 45 * 23"
✗ Don't use for: "What's the population of France?" (needs data lookup),
"Schedule a meeting" (wrong tool entirely)

Negative examples teach boundaries, helping agents avoid incorrect selections.

Tool Selection Debugging

Log tool selection reasoning:

import logging

logger = logging.getLogger(__name__)

class Agent:
    # Assumes self.llm exposes a generate(prompt) -> str method
    def select_tool(self, query):
        reasoning = self.llm.generate(
            f"Think step by step about which tool to use for: {query}"
        )
        
        tool_choice = self.llm.generate(
            f"Based on this reasoning: {reasoning}\nSelect tool:"
        )
        
        # Log query, reasoning, and selection for later failure analysis
        logger.info(f"Query: {query}")
        logger.info(f"Reasoning: {reasoning}")
        logger.info(f"Selected: {tool_choice}")
        
        return tool_choice

Reviewing logs reveals patterns in mistakes, enabling prompt improvements.

Progressive Tool Introduction

Don’t give agents all tools at once. Start with core tools:

Phase 1: web_search, calculator (core capabilities)
Phase 2: Add database_query, get_weather (common needs)
Phase 3: Add specialized tools as needed

Smaller tool sets have higher selection accuracy. Add tools incrementally as the agent proves reliable with existing ones.

Conclusion

How agents decide which tool to call depends on their architecture—ranging from simple prompt-based reasoning to sophisticated function calling with structured output. The core challenge remains constant: interpreting user intent, understanding tool capabilities, and mapping queries to appropriate tools despite ambiguous descriptions and overlapping functionality. ReAct patterns improve selection through explicit reasoning, function calling provides structured reliability, and multi-tool scenarios benefit from categorization and filtering. Yet all approaches still fail regularly, choosing web search when calculator would work or database queries for public information.

Improving agent tool selection requires clear tool descriptions with explicit boundaries, negative examples showing when not to use tools, strategic tool set sizing to avoid overwhelming choice, and feedback loops that recognize and correct mistakes. The goal isn’t perfect tool selection—that may be impossible with current approaches—but rather failing gracefully, recovering from errors, and achieving sufficient accuracy for practical applications. Build agents with tool selection debugging built in, monitor failure patterns, and iterate on descriptions until selection reliability meets your use case requirements. Tool selection is often the difference between impressive demos and production-ready agents.
