How To Build An AI Agent With Tool Use

By Elysiate · Updated May 6, 2026

Tags: ai-engineering-llm-development, ai, llms, ai-agents-and-mcp, agents, tool-calling

Level: intermediate · ~14 min read

Audience: software engineers, developers, product teams

Prerequisites

  • comfort with Python or JavaScript
  • basic understanding of LLMs

Key takeaways

  • Tool use turns an LLM from a text generator into an execution layer that can fetch data, call APIs, and take constrained actions.
  • The safest production agents use narrow tool schemas, explicit orchestration loops, validation, retries, approvals, and observability instead of relying on prompting alone.

This guide explains how to build an AI agent with tool use, with practical examples, edge cases, and production patterns for developers building AI applications with LLMs, agents, and modern AI tooling.

Overview

The fastest way to misunderstand agents is to treat them as magical autonomous workers. In practice, a useful agent is usually a very specific system:

  1. a model receives an instruction and some context,
  2. the model decides whether it should call a tool,
  3. your application executes that tool,
  4. the result is returned to the model,
  5. the model either finishes or chooses the next step.

That loop is the core of tool-using agents.

Without tools, an LLM can only generate text. It can explain how to look up an invoice, summarize what an API might return, or guess how to schedule a meeting. With tools, the model can request the invoice from your system, query the calendar API, validate dates, and draft the correct response using real data. That is why tool use is one of the most important building blocks in production AI systems.

A good tool-using agent does not mean “give the model unlimited access to everything.” The production goal is the opposite: give the model a narrow set of well-designed capabilities, keep execution deterministic where possible, and make the agent reliable enough that humans can trust it.

A simple mental model looks like this:

  • Model: decides what to do next.
  • Instructions: define role, boundaries, and goals.
  • Tools: the allowed actions and data access paths.
  • Execution loop: runs tool calls and feeds results back.
  • Guardrails: prevent unsafe or invalid behavior.
  • State: preserves short-term context and, when truly needed, long-term memory.
  • Observability: lets you see why the agent did what it did.

If you remember one thing from this article, remember this: tool use is not the whole agent; it is the most important capability inside the agent runtime.

What tool use actually means

Tool use is the pattern where the model can choose from one or more tools that you define. Each tool has a name, a purpose, and a structured input schema. The model does not run the tool by itself. Instead, it emits a structured tool call, and your application executes it.
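The exact payload shape varies by provider, but an emitted tool call is typically a small structured request along these lines (an illustrative shape, not any specific vendor's API; the customer ID is made up):

{
  "name": "list_open_invoices",
  "arguments": { "customer_id": "cus_1842" }
}

Your application parses that payload, runs the matching function, and returns the result to the model as a new message.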

That distinction matters because it keeps the model inside a controlled environment. The model can suggest actions, but your runtime stays in charge of:

  • validating arguments,
  • enforcing permissions,
  • handling retries,
  • preventing duplicate side effects,
  • and deciding whether a requested action should actually happen.

This is the core difference between a toy demo and a production system.

For example, imagine a support agent with these tools:

  • get_customer_account(customer_id)
  • list_open_invoices(customer_id)
  • create_refund_request(invoice_id, reason)
  • escalate_to_human(ticket_id, summary)

The model receives a user message like:

“I was billed twice for March. Can you check and fix it?”

A weak implementation may hallucinate a refund flow and ask the user to wait. A tool-using implementation can:

  1. identify the customer,
  2. fetch invoices,
  3. detect a duplicate charge,
  4. create a refund request if policy conditions are met,
  5. or escalate if confidence is low.

That is real agent behavior, but only because the system was designed carefully around tool use.

When to use an agent with tool use

You do not need an agent for every AI feature. Many features work better as single-call structured workflows. Use an agent with tool use when the task requires one or more of these:

1. Live data access

If the answer depends on current state, the model needs tools. This includes:

  • account balances,
  • shipment tracking,
  • CRM records,
  • order status,
  • calendars,
  • support systems,
  • internal docs,
  • web search or file retrieval.

2. Multi-step decisions

If the job requires “check X, then decide Y, then maybe do Z,” tool use becomes valuable. Common examples include:

  • troubleshooting assistants,
  • procurement copilots,
  • operations dashboards,
  • internal research assistants,
  • support resolution workflows.

3. Controlled action-taking

If the system must do something, not just say something, you need tools. Examples:

  • create tickets,
  • update records,
  • schedule meetings,
  • send approved emails,
  • run searches,
  • trigger internal workflows.

4. Dynamic workflow selection

If there is no single linear path and the system must choose among possible next actions, agents become a better fit than hard-coded chains.

When not to use one

Do not build an agent just because the word sounds advanced.

A regular prompt or structured output workflow is often better when:

  • the task is a one-shot transformation,
  • there is no need for external data,
  • the output format is fixed,
  • deterministic business logic should own the decisions,
  • or the risk of tool misuse is too high.

For example, generating release notes from a known input document probably does not need an agent. Summarizing meeting notes usually does not need an agent. Turning form data into JSON definitely does not need an agent. Over-agenting simple problems makes systems slower, more expensive, and harder to debug.

The core architecture of a tool-using agent

A useful production architecture usually has six layers.

1. User interface or entry point

This can be a chat UI, an API endpoint, a Slack bot, an internal dashboard, or a background workflow trigger.

2. Agent runtime

This is the orchestration layer that sends instructions and context to the model, receives tool calls, executes them, and loops until the run is complete.

3. Tool registry

This is the collection of available tools and their schemas. Every tool should have:

  • a stable name,
  • a clear description,
  • strict argument types,
  • explicit permissions,
  • and well-defined failure modes.
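In code, that contract can be a small data structure. A minimal Python sketch (the field names are illustrative, not from any specific framework):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str                                   # stable identifier the model sees
    description: str                            # operational contract, including when NOT to use it
    parameters: dict[str, Any]                  # JSON Schema for the arguments
    required_role: str = "reader"               # permission needed to execute
    handler: Callable[..., Any] | None = None   # the function your runtime actually calls

REGISTRY: dict[str, Tool] = {}

def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool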

4. Execution and policy layer

This layer validates arguments, checks user authorization, handles approvals, applies rate limits, and prevents unsafe actions.

5. State layer

This stores the conversation history, intermediate outputs, and optional long-term memory or session data.

6. Observability and eval layer

This captures traces, tool usage, failures, latency, token costs, and human-review signals so the system can improve over time.

Step-by-step workflow

Step 1: Define the job before you define the tools

Start with the business task, not the model.

Ask:

  • What job should the agent accomplish?
  • What decisions must it make?
  • Which decisions should stay deterministic?
  • Which systems does it need to read from?
  • Which systems can it write to?
  • What actions require approval?

A weak scope sounds like this:

“Build an internal ops agent that can do everything.”

A better scope sounds like this:

“Build a finance support agent that can read invoice status, verify billing records, and draft refund requests, but cannot send refunds without approval.”

That level of clarity will shape the entire system.

Step 2: Start with the smallest useful tool set

Most teams add too many tools too early. That makes routing worse, increases token load, and raises error rates.

Start with three to five tools that cover the core job. For example, for a document operations assistant:

  • search_documents
  • get_document
  • compare_versions
  • draft_summary
  • escalate_to_human

Every tool should be narrow. Avoid giant tools like:

  • admin_api
  • query_database
  • perform_action
  • run_anything

These are dangerous because they make it easy for the model to choose the wrong path and hard for your runtime to enforce policy.

Step 3: Write tool descriptions like API contracts

Tool descriptions should be explicit and operational.

Bad:

“Gets data from the CRM.”

Better:

“Retrieve a customer profile by exact customer ID. Use this only when the ID is already known. Do not use this tool to search by name or email.”

Good tool design reduces model confusion. In production, clear descriptions often improve performance more than adding another round of prompt tuning.

Step 4: Use strict schemas for arguments

A tool call should be easy to validate before execution. That means:

  • required fields must be required,
  • enums should be enums,
  • booleans should be booleans,
  • optional fields should be truly optional,
  • and ambiguous free-text arguments should be minimized.

For example:

{
  "name": "create_refund_request",
  "description": "Create a refund request for a verified duplicate charge. Requires invoice_id and a policy-valid refund reason.",
  "parameters": {
    "type": "object",
    "properties": {
      "invoice_id": { "type": "string" },
      "reason": {
        "type": "string",
        "enum": ["duplicate_charge", "service_not_delivered", "pricing_error"]
      },
      "customer_message": { "type": "string" }
    },
    "required": ["invoice_id", "reason"],
    "additionalProperties": false
  }
}

Strict schemas reduce execution errors and make downstream logging much easier.
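Because the parameters block is plain JSON Schema, you can validate a call before executing it. A minimal sketch using the Python jsonschema package, with the schema defined above:

from jsonschema import ValidationError, validate

def validate_arguments(args: dict, schema: dict) -> dict:
    try:
        validate(instance=args, schema=schema)  # raises on missing, extra, or invalid fields
    except ValidationError as err:
        # Surface a structured error that can be passed back to the model
        raise ValueError(f"invalid tool arguments: {err.message}") from err
    return args

Rejecting a bad call here, with a clear error message returned to the model, usually lets the model correct itself on the next turn instead of triggering a broken side effect.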

Step 5: Build the execution loop

The execution loop is the heart of the agent.

A basic pattern looks like this:

  1. send instructions, user input, and available tools to the model,
  2. inspect the model response,
  3. if the model returns a tool call, validate it,
  4. execute the tool,
  5. pass the tool result back to the model,
  6. continue until the model returns a final answer or the loop limit is reached.

Pseudo-flow in Python (call_model, validate_args, authorize, execute_tool, and tool_result_message are placeholders for your own runtime):

def run_agent(messages, tools, user, max_steps=8):
    for _ in range(max_steps):                               # hard cap on loop iterations
        response = call_model(messages, tools)
        if not response.tool_calls:
            return response.text                             # final answer
        for call in response.tool_calls:
            args = validate_args(call.name, call.arguments)  # schema + allowlist check
            authorize(call.name, args, user)                 # deterministic permission gate
            result = execute_tool(call.name, args)           # your code runs the tool, not the model
            messages.append(tool_result_message(call.id, result))
    raise RuntimeError("step limit reached; hand off to a human")

A production loop also needs:

  • maximum step count,
  • timeout handling,
  • duplicate tool call detection,
  • per-tool retry rules,
  • and human handoff triggers.
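Duplicate detection in particular is cheap to add to the loop above. One sketch: fingerprint each call by tool name plus canonicalized arguments, and refuse to re-run an identical call within the same run (names are illustrative):

import json

seen_calls: set[str] = set()  # reset this per agent run

def fingerprint(name: str, args: dict) -> str:
    return name + ":" + json.dumps(args, sort_keys=True)  # canonical form

def check_not_duplicate(name: str, args: dict) -> None:
    key = fingerprint(name, args)
    if key in seen_calls:
        raise RuntimeError(f"duplicate tool call blocked: {name}")
    seen_calls.add(key)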

Step 6: Decide what stays model-driven and what stays deterministic

This is where strong systems separate themselves from fragile ones.

Let the model do:

  • classification,
  • summarization,
  • query reformulation,
  • next-step selection,
  • natural-language reasoning,
  • and user-facing explanations.

Keep deterministic logic for:

  • permission checks,
  • pricing rules,
  • policy enforcement,
  • side-effect execution,
  • workflow state transitions,
  • and idempotency control.

The model should help decide what might happen next, but your code should decide what is actually allowed to happen.
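As a concrete sketch, a refund policy gate might look like this. The limits and roles are hypothetical; the point is that these rules live in deterministic code, not in the prompt:

def authorize_refund(invoice: dict, reason: str, user: dict) -> bool:
    # The model proposed this action; code decides whether it is allowed.
    if reason not in {"duplicate_charge", "service_not_delivered", "pricing_error"}:
        return False
    if invoice["amount"] > 500:                 # hypothetical limit: large refunds need a human
        return False
    if user["role"] not in {"support", "admin"}:
        return False
    return True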

Step 7: Add retrieval only when the agent needs it

A lot of teams merge RAG and tool-using agents too early. Keep them conceptually separate.

  • Tool use lets the model take actions or fetch structured data.
  • RAG lets the model retrieve knowledge from documents or indexed content.

Some agents need both. For example, an HR assistant may:

  1. search policy docs,
  2. read the relevant policy section,
  3. compare it with the employee’s request,
  4. and submit an approval workflow.

That is a valid multi-tool design. But do not assume every agent needs a vector database on day one.

Step 8: Add memory only when the product needs continuity

Memory is useful, but it is also one of the fastest ways to complicate the system.

You usually need only one of these at first:

  • short-term state: current conversation context, tool outputs, temporary notes;
  • session memory: user preferences for the current workflow;
  • long-term memory: cross-session facts, stable preferences, or prior decisions.

Many teams should start with no long-term memory at all. A stateless tool-using agent is easier to debug, safer to deploy, and often good enough.

Step 9: Add guardrails before autonomy

Tool-using agents should never run in a trust vacuum.

Useful guardrails include:

  • input validation,
  • tool allowlists,
  • argument validation,
  • content safety checks,
  • approval requirements for write actions,
  • confidence thresholds,
  • loop limits,
  • and hard stops on risky combinations.

Examples of actions that should often require approval:

  • sending money,
  • deleting records,
  • emailing external users,
  • changing permissions,
  • publishing content,
  • booking costly travel,
  • executing shell commands,
  • writing to production systems.
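One common implementation is an approval gate in the execution layer: risky write actions are queued for a human instead of executed. A sketch, with a hypothetical review queue and tool names:

import queue

review_queue: queue.Queue = queue.Queue()  # stand-in for a real review system

APPROVAL_REQUIRED = {"create_refund_request", "delete_record", "send_external_email"}

def execute_with_approval(name: str, args: dict, run_id: str):
    if name in APPROVAL_REQUIRED:
        review_queue.put({"run_id": run_id, "tool": name, "args": args})
        return {"status": "pending_approval"}  # the model sees this and can tell the user
    return execute_tool(name, args)            # execute_tool as in the loop sketch above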

Step 10: Add observability from day one

If you cannot answer “why did the agent do that?”, the system is not production-ready.

Log at least:

  • user request,
  • chosen model,
  • tools exposed,
  • tools actually called,
  • arguments used,
  • execution result,
  • retries,
  • latency,
  • token usage,
  • final response,
  • and whether a human intervened.

This is how you debug bad behavior and build evals later.
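A per-step trace record can be as simple as one structured log line per loop iteration. A minimal sketch using only the standard library:

import json, logging, time

logger = logging.getLogger("agent.trace")

def log_step(run_id: str, tool: str, args: dict, status: str,
             latency_ms: float, tokens: int) -> None:
    logger.info(json.dumps({
        "run_id": run_id,
        "tool": tool,
        "args": args,                 # redact sensitive fields before logging in production
        "status": status,
        "latency_ms": round(latency_ms, 1),
        "tokens": tokens,
        "ts": time.time(),
    }))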

A practical example: building a support agent

Imagine you want to build an internal support agent for customer billing.

The job

The agent should help support reps investigate billing questions faster. It can read account data and draft actions, but it cannot directly issue refunds.

The tools

  • get_customer_by_email
  • list_invoices
  • get_payment_events
  • draft_refund_request
  • create_support_note
  • escalate_case

The workflow

A rep asks:

“Can you check whether this customer was charged twice this month and prepare the refund note?”

The agent:

  1. identifies the customer,
  2. pulls invoices,
  3. inspects payment events,
  4. detects a duplicate payment,
  5. drafts a refund request,
  6. writes a support note,
  7. returns a concise explanation.

Why this works well

  • The tool surface is narrow.
  • The agent can only read billing data and draft an internal request.
  • The side effect is controlled.
  • Approval still belongs to a human.
  • Every step is auditable.

That is exactly the kind of workflow where tool-using agents create real value.

Common design mistakes

Mistake 1: Exposing too many tools

When the model sees a huge tool menu, it often chooses poorly or wastes tokens evaluating irrelevant options.

Fix: expose only the tools needed for the current task or user role.

Mistake 2: Weak tool descriptions

If several tools look semantically similar, the model will confuse them.

Fix: write precise descriptions and state when not to use each tool.

Mistake 3: Letting the model handle policy

A model should not be the final authority on permissions, refunds, compliance, or risk.

Fix: keep policy and authorization in deterministic code.

Mistake 4: No approval layer

Autonomous write actions can turn small errors into serious incidents.

Fix: require approval for high-risk actions and store approval events in logs.

Mistake 5: No step limit

Agents can loop, re-query the same system, or bounce between tools.

Fix: enforce a strict maximum number of turns and detect repeated tool calls.

Mistake 6: Adding memory too soon

Teams often add persistent memory before they understand their core workflow.

Fix: ship stateless first, then add narrowly scoped memory with explicit retention rules.

Mistake 7: Measuring only answer quality

An agent can sound good while still choosing the wrong tools, quietly driving up cost, or introducing latency problems.

Fix: evaluate tool choice, argument accuracy, failure recovery, approval correctness, latency, and business outcomes.

Production patterns that work well

Pattern 1: Router plus worker

Use a top-level classifier or router that decides which specialist workflow should handle the task. Then let a worker agent use a small tool set.

This works better than one giant general-purpose agent for most enterprise systems.
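A sketch of the shape, where classify_intent is a cheap classifier or small model call you supply, run_agent is the loop from Step 5, and the prompts and tool sets are hypothetical constants:

WORKFLOWS = {
    "billing":  {"instructions": BILLING_PROMPT,  "tools": BILLING_TOOLS},
    "shipping": {"instructions": SHIPPING_PROMPT, "tools": SHIPPING_TOOLS},
}

def handle(request: str, user: dict) -> str:
    route = classify_intent(request)           # cheap classifier or small model call
    if route not in WORKFLOWS:
        return escalate_to_human(request)      # no confident route: hand off
    wf = WORKFLOWS[route]
    messages = [{"role": "system", "content": wf["instructions"]},
                {"role": "user", "content": request}]
    return run_agent(messages, wf["tools"], user)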

Pattern 2: Read tools first, write tools later

Start with agents that retrieve information and propose actions. Add direct write capabilities only after you trust the execution loop.

This is one of the safest ways to mature an agent product.

Pattern 3: Human-in-the-loop checkpoints

Use approval gates for risky actions and uncertain cases. This keeps the product useful without pretending the model is always correct.

Pattern 4: Tool results as structured state

Do not pass giant raw payloads back to the model if a summarized structured result will do. Clean the output before feeding it back into the loop.

This improves latency, reduces token usage, and lowers reasoning noise.
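In practice this can be a per-tool "shaper" that keeps only the fields the model needs. A sketch with hypothetical field names:

def shape_invoice_result(raw: dict) -> dict:
    # Keep only what the model needs to reason about; drop the rest of the payload.
    return {
        "invoice_id": raw["id"],
        "amount": raw["amount_due"],
        "currency": raw["currency"],
        "status": raw["status"],
        "paid_at": raw.get("paid_at"),
    }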

Pattern 5: Task-specific tool exposure

Only show the tools relevant to the current workflow, user role, or session state. Dynamic tool exposure often improves routing quality immediately.

Edge cases to handle

Ambiguous user intent

User: “Fix my billing issue.”

That may mean investigate, explain, waive a fee, or escalate.

What to do: let the agent ask a clarifying question or start with safe read-only tools first.

Missing identifiers

User: “Check invoice 2049.”

What if invoice IDs are tenant-specific or the user lacks access?

What to do: verify scope, identity, and authorization before retrieval.

Partial tool failures

A calendar lookup might succeed while CRM lookup fails.

What to do: let the agent continue with partial context when safe, but surface uncertainty clearly.

Duplicate side effects

A retry after timeout could create two tickets or two refund requests.

What to do: use idempotency keys and server-side deduplication.
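A sketch of idempotency for a write tool: derive a stable key from the run and the arguments, and let the server deduplicate. The key scheme and API helper are hypothetical; the "Idempotency-Key" header follows a common payment-API convention:

import hashlib, json

def idempotency_key(run_id: str, tool: str, args: dict) -> str:
    payload = json.dumps({"run": run_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def create_ticket(args: dict, run_id: str):
    key = idempotency_key(run_id, "create_ticket", args)
    # The server stores the key with the created record; a retry with the
    # same key returns the original record instead of creating a second one.
    return api_post("/tickets", json=args, headers={"Idempotency-Key": key})  # hypothetical API helper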

Hallucinated tool arguments

The model may invent IDs, dates, or enum values.

What to do: validate every argument and fail safely when values do not match expected formats.

A minimal build plan for most teams

If your team wants to ship a first tool-using agent without overengineering, use this order:

Phase 1: read-only assistant

  • one agent,
  • three to five read tools,
  • strict schemas,
  • logging,
  • no memory,
  • no write access.

Phase 2: guided workflows

  • better routing,
  • step limits,
  • retries,
  • session state,
  • evals for tool choice and answer quality.

Phase 3: controlled action-taking

  • add write tools,
  • add approvals,
  • add per-tool policies,
  • add human handoff,
  • add audit trails.

Phase 4: broader orchestration

  • specialist agents or workflows,
  • dynamic tool exposure,
  • long-term memory only where justified,
  • deeper observability and regression testing.

This staged path is far safer than trying to launch a full autonomous agent on day one.

How to know your agent is working

A tool-using agent is not “good” just because it produces nice answers. It is good when it behaves reliably under real conditions.

Track metrics like:

  • tool-call success rate,
  • tool selection accuracy,
  • argument validation failure rate,
  • human override rate,
  • task completion rate,
  • latency per tool,
  • cost per successful task,
  • escalation rate,
  • and user satisfaction for resolved workflows.

You should also run evals on:

  • correct tool selection,
  • refusal when tools are inappropriate,
  • safe handling of missing data,
  • approval behavior,
  • and final-answer faithfulness to tool outputs.

FAQ

What is tool use in an AI agent?

Tool use is the pattern where a model chooses from a defined set of tools, produces structured arguments, and then waits for your application to execute the tool and return results. The model is responsible for deciding what it wants to call, but your runtime remains responsible for validation, permissions, execution, and logging.

Do I need memory to build an AI agent with tool use?

No. Many useful agents work well without long-term memory. Start with a stateless tool-using workflow and add memory only when the product truly needs personalization, saved preferences, or continuity across sessions. Memory makes systems more powerful, but it also adds complexity, governance concerns, and debugging overhead.

What is the difference between tool calling and a real agent?

Tool calling is one capability. A real agent is the broader system around it: instructions, tools, orchestration, execution control, guardrails, approvals, retries, state handling, and observability. In other words, function calling helps the model request actions, but the surrounding runtime is what makes the whole system agentic and reliable.

What is the safest way to deploy a tool-using agent?

The safest rollout is narrow and staged. Start with read-only tools, validate all arguments, enforce deterministic permission checks, require approval for risky actions, log every step, and ship with evals and rollback paths. Expand autonomy only after you can measure tool behavior, not just answer quality.

Final thoughts

Building an AI agent with tool use is less about clever prompts and more about disciplined systems design. The model matters, but the real product quality comes from everything around it: tool design, schemas, orchestration, approvals, retries, observability, and a clear boundary between model judgment and deterministic business logic.

That is why the best tool-using agents do not try to be magical. They try to be dependable.

Start small. Give the agent a narrow job. Expose only the tools it truly needs. Keep write access constrained. Measure every step. Add memory only when the product proves it needs memory. Add autonomy only when the logs show the workflow is stable.

That approach is slower than demo-driven development, but it is how you build an agent that a real team can trust in production.

About the author

Elysiate publishes practical guides and privacy-first tools for data workflows, developer tooling, SEO, and product engineering.
