Tool Calling is not Architecture

This content originally appeared on Level Up Coding - Medium and was authored by Ricardo Cataldi

Why Python engineers need boundaries before agent demos become systems


Tool calling is the easiest part of an agent system to demo and the easiest part to mistake for architecture.

In a demo, the agent receives a prompt, chooses a tool, passes a few arguments, and returns a pleasing answer. The moment feels complete because the interaction loop is visible. Something asked for work. Something performed work. Something came back.

Production is less impressed.

A production system needs to answer a different set of questions. Who owns the capability? What happens when the tool is unavailable? Which context is allowed to cross the boundary? How is the call traced? How does the caller know whether retrying is safe? What is the contract between the probabilistic side of the system and the deterministic side?

Those questions are not about tool calling. They are about architecture.

A Demo Proves Reachability

A tool is a capability interface. Architecture is the set of boundaries, contracts, policies, and feedback loops that make that capability safe to use repeatedly.

That distinction matters because agent systems often fail in the space between a successful demo and a repeatable operation. A demo can ignore latency spikes. A system cannot. A demo can tolerate a vague schema. A system cannot. A demo can let an LLM pass whatever it inferred into a backend service. A system needs validation, translation, and a clear domain boundary.

This is where the microservices lens becomes useful again. Microservices did not become valuable because teams learned how to expose HTTP endpoints. They became valuable when teams learned to draw ownership boundaries, version contracts, isolate failure, and operate services independently.

Agent systems need the same discipline.

When an agent calls a tool that creates an order, reads a profile, changes a ticket, or recommends an action, it is not merely calling a function. It is crossing from cognitive context into operational context. The cognitive side may be exploratory. The operational side usually needs invariants.

That crossing deserves a design.

A Boundary Has Responsibilities

A good tool boundary narrows intent. The agent should not get a general-purpose escape hatch when the domain needs a specific action. A tool named execute_business_operation is not a boundary. It is a disguised remote shell. A tool named quote_shipping_options or summarize_support_case is closer to a contract because the purpose, input shape, and expected output can be reviewed.

A good tool boundary also translates models. Agents often reason in flexible language. Services often operate on strict domain models. The boundary should convert between those worlds deliberately instead of letting informal natural language leak into deterministic services.

It also absorbs failure. A tool boundary should know when to time out, when to retry, when to return a partial result, and when to stop. Without that policy, the agent becomes responsible for infrastructure judgment it does not actually own.

And it creates evidence. Every call should leave enough trace data to explain what happened later: the input contract, the tool version, the result category, the latency, and the decision path that followed.

The boundary is the first place where a nice demo becomes an engineered system. It is where the model-facing description, the domain command, the permission model, the service adapter, the retry policy, and the trace record meet. If all of that lives only in the prompt, the architecture is invisible. If it lives in code, the system can be reviewed, tested, monitored, and changed.

A Tool Boundary In Python

Here is a compact example. The agent is allowed to ask for shipping options, but it is not allowed to call the shipping provider directly. The tool boundary validates the request, calls a narrow domain service, classifies the result, and returns a trace record that a production system could send to logs or OpenTelemetry.

from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from time import perf_counter
from uuid import uuid4


class ResultCategory(str, Enum):
    """Closed set of outcomes that downstream code can branch on."""

    OK = "ok"
    INVALID_INPUT = "invalid_input"
    TEMPORARY_FAILURE = "temporary_failure"
    REFUSED = "refused"


@dataclass(frozen=True)
class ShippingQuoteRequest:
    order_id: str
    destination_country: str
    weight_kg: float


@dataclass(frozen=True)
class ShippingQuote:
    carrier: str
    price_usd: float
    estimated_days: int


@dataclass(frozen=True)
class ToolTrace:
    correlation_id: str
    tool_name: str
    tool_version: str
    latency_ms: float
    category: ResultCategory


@dataclass(frozen=True)
class ToolResult:
    category: ResultCategory
    quotes: tuple[ShippingQuote, ...]
    message: str
    trace: ToolTrace


class ShippingProvider:
    """Stand-in for the real provider; returns fixed quotes."""

    def quote(self, request: ShippingQuoteRequest) -> tuple[ShippingQuote, ...]:
        if request.destination_country == "BR":
            return (ShippingQuote("Standard", 19.90, 6), ShippingQuote("Express", 39.90, 2))
        return (ShippingQuote("International", 59.90, 10),)


def validate_quote_request(request: ShippingQuoteRequest) -> str | None:
    """Return a reason for rejection, or None when the request is acceptable."""
    if not request.order_id.strip():
        return "order_id is required"
    if len(request.destination_country) != 2:
        return "destination_country must use ISO alpha-2 shape"
    # Written as a chained comparison so that NaN is rejected as well.
    if not (0 < request.weight_kg <= 50):
        return "weight_kg must be greater than 0 and at most 50"
    return None


def quote_shipping_options(
    request: ShippingQuoteRequest,
    provider: ShippingProvider,
    *,
    correlation_id: str | None = None,
) -> ToolResult:
    started = perf_counter()
    # Reuse the caller's correlation ID so retries group under one operation.
    correlation = correlation_id or str(uuid4())
    quotes: tuple[ShippingQuote, ...] = ()
    validation_error = validate_quote_request(request)
    if validation_error is not None:
        category = ResultCategory.INVALID_INPUT
        message = validation_error
    else:
        try:
            quotes = provider.quote(request)
            category = ResultCategory.OK
            message = "quotes returned"
        except TimeoutError:
            # Only timeouts are classified here; other exceptions propagate.
            category = ResultCategory.TEMPORARY_FAILURE
            message = "shipping provider timed out"

    trace = ToolTrace(
        correlation_id=correlation,
        tool_name="quote_shipping_options",
        tool_version="2026-05-10",
        latency_ms=(perf_counter() - started) * 1000,
        category=category,
    )
    return ToolResult(category=category, quotes=quotes, message=message, trace=trace)

This is not a lot of code, but it establishes important ownership lines. The agent can ask for a bounded capability. The tool boundary owns validation and result classification. The provider owns shipping data. The trace owns evidence. The model does not need to remember all of that policy because the boundary does.
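The example returns the trace but leaves shipping it somewhere to the caller. One way to do that, as a minimal sketch using only the standard library: serialize the trace to a JSON log line. The helper names `trace_to_json` and `emit_trace` and the logger name `agent.tools` are assumptions for illustration, and the two types are repeated only so the snippet runs on its own.

```python
from __future__ import annotations

import json
import logging
from dataclasses import asdict, dataclass
from enum import Enum


class ResultCategory(str, Enum):
    OK = "ok"
    INVALID_INPUT = "invalid_input"
    TEMPORARY_FAILURE = "temporary_failure"
    REFUSED = "refused"


@dataclass(frozen=True)
class ToolTrace:
    correlation_id: str
    tool_name: str
    tool_version: str
    latency_ms: float
    category: ResultCategory


# Hypothetical logger name; use whatever your logging setup expects.
logger = logging.getLogger("agent.tools")


def trace_to_json(trace: ToolTrace) -> str:
    """Render the trace as one JSON object with stable key order."""
    payload = asdict(trace)
    payload["category"] = trace.category.value  # plain string, not the enum
    payload["latency_ms"] = round(trace.latency_ms, 2)
    return json.dumps(payload, sort_keys=True)


def emit_trace(trace: ToolTrace) -> None:
    """Log successful calls at INFO and everything else at WARNING."""
    level = logging.INFO if trace.category is ResultCategory.OK else logging.WARNING
    logger.log(level, trace_to_json(trace))
```

A caller would run `emit_trace(result.trace)` right after the tool returns. Keeping serialization separate from emission means the same payload can later feed a tracing backend without touching the boundary itself.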

This is also why vague tool names are so dangerous. A tool named execute_business_operation has no obvious contract. A tool named quote_shipping_options has a reviewable purpose, a stable input shape, and a result category that downstream code can reason about.
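The example starts from an already typed ShippingQuoteRequest. In practice the model side usually hands over a loosely typed payload, so the translation step deserves its own code. Here is a minimal sketch, assuming the arguments arrive as a JSON-like mapping. The function name `parse_quote_arguments` is hypothetical, and the request type is repeated so the snippet is self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ShippingQuoteRequest:
    order_id: str
    destination_country: str
    weight_kg: float


def parse_quote_arguments(raw: Mapping[str, Any]) -> ShippingQuoteRequest | str:
    """Translate loose tool arguments into the strict domain request.

    Returns the request on success, or an error message describing
    why the payload cannot cross the boundary.
    """
    allowed = {"order_id", "destination_country", "weight_kg"}
    unexpected = sorted(set(raw) - allowed)
    if unexpected:
        # Extra keys are rejected instead of silently ignored.
        return f"unexpected arguments: {', '.join(unexpected)}"
    missing = sorted(allowed - set(raw))
    if missing:
        return f"missing arguments: {', '.join(missing)}"
    try:
        weight = float(raw["weight_kg"])
    except (TypeError, ValueError):
        return "weight_kg must be a number"
    return ShippingQuoteRequest(
        order_id=str(raw["order_id"]).strip(),
        destination_country=str(raw["destination_country"]).strip().upper(),
        weight_kg=weight,
    )
```

The parser only normalizes shape and type. Range checks such as the weight limit stay in validate_quote_request, so translation and domain validation can be reviewed and tested separately.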

Failure Policy Is Part Of The Contract

Most agent demos treat failure as a conversational inconvenience: the agent apologizes and tries again. Production systems need a more explicit policy. Some operations are safe to retry. Some are safe only with an idempotency key. Some should return partial data. Some should refuse the request. Some should escalate to a person.

The retry decision should not be invented by the model at the moment of failure. It should be part of the tool contract.

def call_with_single_retry(
    request: ShippingQuoteRequest,
    provider: ShippingProvider,
) -> ToolResult:
    first = quote_shipping_options(request, provider)
    if first.category != ResultCategory.TEMPORARY_FAILURE:
        return first

    # Retry once, reusing the first attempt's correlation ID.
    second = quote_shipping_options(
        request,
        provider,
        correlation_id=first.trace.correlation_id,
    )
    if second.category == ResultCategory.OK:
        return second

    return ToolResult(
        category=ResultCategory.TEMPORARY_FAILURE,
        quotes=(),
        message="shipping provider failed after retry",
        trace=second.trace,
    )

Notice the small but important detail: the retry preserves the correlation ID. That lets an operator see the original attempt and the retry as one logical operation. In a real system you would also attach tenant, user, policy, and version metadata. The central design point stays the same: the architecture records why the system did what it did.
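The single-retry helper hard-codes one policy. A possible next step is to declare the policy as data on the tool contract, so the decision is reviewable and the model never improvises it. The sketch below is one way to do that under stated assumptions: `NextStep` and `FailurePolicy` are hypothetical names, and treating every temporary failure that cannot be retried as an escalation is a simplification.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ResultCategory(str, Enum):
    OK = "ok"
    INVALID_INPUT = "invalid_input"
    TEMPORARY_FAILURE = "temporary_failure"
    REFUSED = "refused"


class NextStep(str, Enum):
    RETURN = "return"      # hand the result back to the workflow as-is
    RETRY = "retry"        # repeat the call
    ESCALATE = "escalate"  # stop and involve a person or another process


@dataclass(frozen=True)
class FailurePolicy:
    max_attempts: int
    idempotent: bool  # True when repeating the call cannot duplicate a side effect

    def next_step(
        self,
        category: ResultCategory,
        attempt: int,
        has_idempotency_key: bool = False,
    ) -> NextStep:
        if category is not ResultCategory.TEMPORARY_FAILURE:
            # OK, INVALID_INPUT, and REFUSED will not change on a retry.
            return NextStep.RETURN
        safe_to_repeat = self.idempotent or has_idempotency_key
        if safe_to_repeat and attempt < self.max_attempts:
            return NextStep.RETRY
        return NextStep.ESCALATE
```

A read-only quote tool might declare `FailurePolicy(max_attempts=2, idempotent=True)`, while a tool that creates an order would set `idempotent=False` and retry only when the caller supplies an idempotency key.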

Testing The Boundary

A tool boundary should be boring to test. That is a compliment. If the only way to test the system is to run the whole agent loop, the boundary is too implicit.

class TimeoutProvider(ShippingProvider):
    """Provider stub that always times out."""

    def quote(self, request: ShippingQuoteRequest) -> tuple[ShippingQuote, ...]:
        raise TimeoutError("simulated timeout")


def test_rejects_invalid_weight() -> None:
    result = quote_shipping_options(
        ShippingQuoteRequest("ORD-1", "BR", -1),
        ShippingProvider(),
    )

    assert result.category == ResultCategory.INVALID_INPUT
    assert result.quotes == ()
    assert "weight_kg" in result.message


def test_returns_quotes_for_valid_request() -> None:
    result = quote_shipping_options(
        ShippingQuoteRequest("ORD-1", "BR", 2.5),
        ShippingProvider(),
    )

    assert result.category == ResultCategory.OK
    assert result.quotes[0].carrier == "Standard"
    assert result.trace.tool_name == "quote_shipping_options"


def test_retry_preserves_correlation_id() -> None:
    result = call_with_single_retry(
        ShippingQuoteRequest("ORD-1", "BR", 2.5),
        TimeoutProvider(),
    )

    assert result.category == ResultCategory.TEMPORARY_FAILURE
    # This message is only produced after the second attempt also fails.
    assert result.message == "shipping provider failed after retry"
    # The helper generates the ID internally, so this checks only that the
    # final trace carries one.
    assert result.trace.correlation_id

This test suite does not prove the model will always choose the right tool. It proves something more useful: when the tool is called, the operational boundary behaves predictably. The agent layer can now be evaluated separately: tool selection, prompt quality, context quality, and policy instruction adherence.

That separation is a serious engineering advantage. It gives you smaller failure domains. When a production incident happens, you can ask whether the model chose the wrong capability, whether the tool rejected valid input, whether the provider failed, or whether the workflow made a bad follow-up decision. Without a boundary, all of those questions collapse into “the agent did something weird.”
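One gap is worth naming: the last test in the suite checks that a correlation ID is present on the final trace, not that both attempts shared it. To assert that directly, each attempt has to be observable. The sketch below is a simplified stand-in rather than the helper from this article: `run_with_single_retry`, `AttemptRecord`, and the `record` callback are hypothetical, and the operation is reduced to a plain callable.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable
from uuid import uuid4


@dataclass(frozen=True)
class AttemptRecord:
    correlation_id: str
    attempt: int
    succeeded: bool


def run_with_single_retry(
    operation: Callable[[], str],
    record: Callable[[AttemptRecord], None],
) -> str | None:
    """Call operation at most twice and record each attempt under one ID."""
    correlation_id = str(uuid4())
    for attempt in (1, 2):
        try:
            result = operation()
        except TimeoutError:
            record(AttemptRecord(correlation_id, attempt, succeeded=False))
            continue
        record(AttemptRecord(correlation_id, attempt, succeeded=True))
        return result
    return None


def test_both_attempts_share_one_correlation_id() -> None:
    calls = {"count": 0}

    def flaky_operation() -> str:
        # Times out on the first call and succeeds on the second.
        calls["count"] += 1
        if calls["count"] == 1:
            raise TimeoutError("simulated timeout")
        return "quotes returned"

    attempts: list[AttemptRecord] = []
    result = run_with_single_retry(flaky_operation, attempts.append)

    assert result == "quotes returned"
    assert [a.succeeded for a in attempts] == [False, True]
    assert attempts[0].correlation_id == attempts[1].correlation_id
```

Passing `attempts.append` as the recorder keeps the test free of logging setup. In a running system the same callback could forward each record to whatever trace sink is in use.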

Protocols Need Architecture

This is why I prefer to teach MCP and A2A as protocol fluency rather than as a collection of clever integrations. The question is not only, “Can my agent call this?” The better question is, “Can this capability become a reliable boundary in a larger system?”

MCP helps structure how tools and resources are exposed to models. A2A-style thinking helps reason about agent-to-agent collaboration. But neither protocol removes the need for architecture. Protocols create the language. Architecture decides what should be said, who is allowed to say it, and what happens when the conversation fails.

The practical move for engineers is to review every agent tool as if it were a service boundary.

Ask what the tool owns. Ask what it refuses. Ask which data can enter and leave. Ask whether the operation is idempotent. Ask how failure is represented. Ask what a future maintainer would need to debug the call at 2am.

If those answers are missing, the system is still a demo, even if the tool call works.

Review Checklist For Production Agent Tools

Before publishing a tool to an agent runtime, review it like a service contract:

Ask what business capability the tool exposes. This prevents generic escape hatches from sneaking into the system under a friendly tool name.
Ask who owns the data and behavior behind the tool. That keeps domain responsibility clear instead of hiding ownership inside the agent prompt.
Ask what input is rejected before execution. Prompt ambiguity should not become operational damage just because the payload had the right JSON shape.
Ask which failures are retryable. This prevents duplicate writes, accidental loops, and the classic “the agent tried to be helpful and made accounting sad” incident.
Ask what the result category means. Workflows need deterministic follow-up decisions, not vibes wrapped in a string.
Ask what trace metadata is emitted. Incidents become much less mystical when the call leaves evidence.
Ask what policy or permission check happens before side effects. Governance belongs in a reviewable boundary, not in the model’s short-term memory.

This checklist is not ceremony. It is how you decide whether the tool belongs in a production system or only in a demo notebook.
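The checklist itself can live in code, so an unanswered question blocks a release instead of surfacing during an incident. A minimal sketch, assuming each answer is recorded as free text: `ToolContractReview` and `unanswered` are hypothetical names, and the fields mirror the seven questions above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToolContractReview:
    capability: str = ""          # what business capability the tool exposes
    owner: str = ""               # who owns the data and behavior behind it
    rejected_input: str = ""      # what input is rejected before execution
    retryable_failures: str = ""  # which failures are safe to retry
    result_categories: str = ""   # what each result category means
    trace_metadata: str = ""      # what trace metadata is emitted
    permission_check: str = ""    # what check runs before side effects


def unanswered(review: ToolContractReview) -> list[str]:
    """Return the checklist questions that still have no answer."""
    return [f.name for f in fields(review) if not getattr(review, f.name).strip()]
```

A continuous integration step could fail the build whenever `unanswered(review)` is non-empty for any tool registered with the agent runtime.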

What To Try Next

Python agent work is not short on tools. It is short on contracts.

If this made agent protocols feel like architecture rather than glue code, the next step is to build the boundary yourself. My Udemy course, MCP and A2A in Python, walks through MCP servers, clients, tool integrations, async workflows, and A2A-style collaboration in Python.

