Rate Limits, Retries, Timeouts, and Token Budgets: The Unglamorous Plumbing of Production AI Agents

Production AI agents usually fail because the runtime around the model is too naive. This article explains how to design agent systems with queues, idempotency, classified retries, deadlines, token budgets, circuit breakers, and suppress-on-failure behavior.


This content originally appeared on HackerNoon and was authored by Raju Dandigam

Most AI agent tutorials end at the exciting part.

They show how to call a model, connect a tool, maybe add memory, and return a useful response. The demo works. The prompt looks clever. The agent calls the right API. Everyone feels good.

Then the same pattern moves into a real product and fails for reasons that have almost nothing to do with model intelligence.

The hotel search tool takes too long. The API gateway times out. The client retries. A second job starts. Both jobs call the same downstream provider. Both generate summaries. Both try to notify the user. Token usage doubles. The queue fills up. A fallback message says something friendly but inaccurate.

The model did not hallucinate. The architecture did.

That is where production agent engineering actually begins. The hard part is not always making the agent smarter. The hard part is building a runtime around the agent that behaves safely when the world is slow, duplicated, rate-limited, inconsistent, or partially broken.

This article is about that unglamorous layer: rate limits, retries, timeouts, idempotency, token budgets, circuit breakers, and suppress-on-failure behavior.

It is not the flashy part of agent development. But it is often the difference between an impressive demo and an agent system users can trust.

A Simple Agent That Fails in a Very Real Way

Imagine a price-watch agent for travel deals.

A user asks the system to monitor hotel prices for a destination. The application accepts the request, checks hotel availability, compares prices, asks a model to summarize the result, and sends a notification when there is something useful to report.

The first implementation may look completely reasonable.

async function runPriceWatchAgent(input: PriceWatchInput) {
  const hotels = await searchHotels(input);

  const deals = await comparePrices(hotels);

  const summary = await llm.generate({
    prompt: `Summarize these hotel deals: ${JSON.stringify(deals)}`
  });

  await sendNotification(input.userId, summary.text);
}

This works beautifully in a local demo. Each step has a clear purpose. The agent does exactly what the user asked.

The problem appears when one dependency slows down.

Suppose the hotel search API normally responds in two seconds but suddenly takes twenty. The HTTP request waits. The API gateway reaches its timeout. The client assumes the request failed and retries. Your backend receives the same request again and starts another run.

Now two agent workflows are running for the same user intent. Both workflows call the same hotel provider. Both spend tokens. Both may write state. Both may send notifications. If one succeeds and the other fails, the final user experience becomes nondeterministic.

This is not a prompt-engineering problem. It is a runtime problem.

The Three Failure Modes I Watch For

The first failure mode is the rate-limit death spiral. You process a batch of users, every run calls the model, and everything works until traffic increases. Suddenly the provider starts returning 429 responses. Your retry logic kicks in. More requests pile up. The retries themselves create more pressure.

async function processUsers(users: User[]) {
  for (const user of users) {
    await agent.execute(user); // model + tools inside
  }
}

This looks harmless with ten users. With hundreds of users, or many such batches running at once, it becomes a quota problem.
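One way to keep a batch from turning into a quota problem is to cap how many agent runs are in flight at once. The sketch below is a minimal illustration, not a definitive implementation: `mapWithConcurrency` is a hypothetical helper, and the right limit depends on your provider's quota.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each lane claims the next unprocessed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));

  // Results keep the order of the input, regardless of completion order.
  return results;
}
```

For example, `await mapWithConcurrency(users, 5, (user) => agent.execute(user))` would keep at most five runs active. If one call rejects, the returned promise rejects, but the other lanes keep working through the list, so add cancellation if that matters for your workload.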

The second failure mode is the timeout cascade. A parent workflow waits thirty seconds for a child operation. The child operation has a sixty-second timeout. The parent fails at thirty seconds, but the child keeps running and writes state later.

const result = await childAgent.execute(task); // parent gives up first

Now the workflow says it failed, but a side effect still happened.

The third failure mode is the accidental duplicate. The first execution succeeds, but the response is lost because of a network hiccup. The caller retries. The same action executes again.

async function executeWithRetry(task: Task) {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await execute(task);
    } catch (error) {
      if (attempt === 2) throw error;
    }
  }
}

This is fine for read-only work. It is dangerous for notifications, bookings, refunds, or anything with side effects.

Move Long Agent Work Out of the Request Path

The first rule is simple: long-running agent work should not live inside the original HTTP request.

The request path should validate input, create an idempotent job, and return quickly. The actual agent should run asynchronously in a worker. This gives you a better place to control retries, backoff, timeouts, and failures.

Here is a simplified BullMQ example.

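import express from "express";
import { Queue } from "bullmq";
import { z } from "zod";

// Assumes an Express app with JSON body parsing enabled.
const app = express();
app.use(express.json());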

const PriceWatchSchema = z.object({
  userId: z.string(),
  destination: z.string(),
  checkIn: z.string(),
  checkOut: z.string()
});

const priceWatchQueue = new Queue("price-watch", {
  connection: { host: "localhost", port: 6379 }
});

app.post("/price-watch", async (req, res) => {
  const input = PriceWatchSchema.parse(req.body);

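  // BullMQ's docs say custom job IDs must not contain ":",
  // so the fields are joined with a different separator.
  const idempotencyKey = [
    input.userId,
    input.destination,
    input.checkIn,
    input.checkOut
  ].join("|");

  await priceWatchQueue.add("run-price-watch-agent", input, {
    jobId: idempotencyKey,
    attempts: 3,
    backoff: {
      type: "exponential",
      delay: 2_000
    },
    // Keep completed jobs for an hour (age is in seconds) so a late client
    // retry still matches the existing jobId instead of starting a new run.
    removeOnComplete: { age: 60 * 60 },
    removeOnFail: false
  });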

  res.status(202).json({
    status: "accepted",
    jobId: idempotencyKey
  });
});

The important part is the jobId.

If the client retries the same request, the queue can treat it as the same logical job instead of creating unlimited duplicate work. This does not solve every duplicate side effect, but it gives the workflow a practical idempotency boundary.

The HTTP request no longer waits for the model, the hotel API, or the notification provider. It only confirms that the job was accepted.
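Joining raw fields works when the fields are already normalized and contain no separator characters. A slightly more defensive variant, sketched below under the assumption that destinations should match case-insensitively, normalizes the input and hashes it. The names `PriceWatchKeyInput` and `priceWatchJobId` are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape, mirroring the fields validated by the schema above.
interface PriceWatchKeyInput {
  userId: string;
  destination: string;
  checkIn: string;
  checkOut: string;
}

// Normalize fields so " Paris " and "paris" map to the same logical job,
// then hash so the key has a fixed length and no separator characters.
function priceWatchJobId(input: PriceWatchKeyInput): string {
  const canonical = JSON.stringify([
    input.userId,
    input.destination.trim().toLowerCase(),
    input.checkIn,
    input.checkOut
  ]);

  return createHash("sha256").update(canonical).digest("hex");
}
```

Whether two requests count as "the same" is a product decision. If a user should be able to re-run the same watch tomorrow, the key would also need a time bucket or a client-supplied request ID.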

Retry Less Than You Think

Retries are useful. They are also dangerous.

A temporary 503 from a downstream provider may deserve a retry. A network timeout may deserve a retry. A rate-limit response may deserve a delayed retry after respecting the provider’s cooldown window.

But a validation error should not retry. A policy violation should not retry. A malformed tool argument should not retry forever. A schema failure caused by your own code should probably fail fast and alert the team.

A better pattern is to classify the error before deciding what happens next.

type RetryDecision =
  | { action: "retry"; delayMs: number; reason: string }
  | { action: "fail_fast"; reason: string }
  | { action: "suppress"; reason: string };

class ValidationError extends Error {}
class PolicyViolationError extends Error {}
class TimeoutError extends Error {}

class RateLimitError extends Error {
  constructor(message: string, public retryAfterMs = 30_000) {
    super(message);
  }
}

function classifyError(error: unknown): RetryDecision {
  if (error instanceof ValidationError) {
    return {
      action: "fail_fast",
      reason: "Invalid input will not succeed on retry"
    };
  }

  if (error instanceof PolicyViolationError) {
    return {
      action: "suppress",
      reason: "The agent attempted a disallowed action"
    };
  }

  if (error instanceof RateLimitError) {
    return {
      action: "retry",
      delayMs: error.retryAfterMs,
      reason: "Provider rate limit"
    };
  }

  if (error instanceof TimeoutError) {
    return {
      action: "retry",
      delayMs: 5_000,
      reason: "Downstream timeout"
    };
  }

  return {
    action: "fail_fast",
    reason: "Unknown error type"
  };
}

Retries should be reserved for failures that are likely to be temporary. They should not be used to hide broken assumptions.
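To show how a classifier like this might drive the retry loop, here is a small sketch. `runWithPolicy`, `Outcome`, and the injectable `wait` function are hypothetical names introduced for illustration, and the decision type is repeated so the block stands alone.

```typescript
// Same shape as the RetryDecision union above, repeated so this sketch is self-contained.
type Decision =
  | { action: "retry"; delayMs: number; reason: string }
  | { action: "fail_fast"; reason: string }
  | { action: "suppress"; reason: string };

type Outcome<T> =
  | { kind: "ok"; value: T; attempts: number }
  | { kind: "suppressed"; reason: string; attempts: number };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries only when the classifier says so, and never makes more than
// `maxAttempts` calls in total.
async function runWithPolicy<T>(
  operation: () => Promise<T>,
  classify: (error: unknown) => Decision,
  maxAttempts = 3,
  wait: (ms: number) => Promise<void> = sleep
): Promise<Outcome<T>> {
  for (let attempt = 1; ; attempt++) {
    try {
      return { kind: "ok", value: await operation(), attempts: attempt };
    } catch (error) {
      const decision = classify(error);

      if (decision.action === "suppress") {
        // Stop quietly and report why, without surfacing an error.
        return { kind: "suppressed", reason: decision.reason, attempts: attempt };
      }

      if (decision.action === "fail_fast") throw error;
      if (attempt >= maxAttempts) throw error;

      await wait(decision.delayMs);
    }
  }
}
```

Passing `wait` as a parameter is a design choice for testability: tests can record the requested delays instead of sleeping. In a queue-based setup the delay would more likely be handed back to the queue than awaited inside the worker.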

Add Backoff and Jitter

If every failed job retries after exactly five seconds, you may create a thundering herd. All workers wake up together and hit the same dependency again.

A small amount of jitter spreads the retry load.

function retryDelay(attempt: number) {
  const baseDelay = 1_000;
  const maxDelay = 30_000;

  const exponential = Math.min(
    baseDelay * 2 ** attempt,
    maxDelay
  );

  const jitter = Math.random() * 0.3 * exponential;

  return exponential + jitter;
}

This is not agent-specific. It is distributed-systems hygiene. But agent systems need it because a single user action can trigger model calls, tool calls, validators, callbacks, and follow-up jobs.
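Note that the function above adds jitter on top of the capped value, so a delay can exceed `maxDelay` by up to 30 percent. If a hard ceiling matters, one common alternative is the "full jitter" approach, which picks a random delay between zero and the capped exponential value. The sketch below is that swapped-in variant, not the article's original function; `fullJitterDelay` and its injectable `random` parameter are hypothetical.

```typescript
// "Full jitter" variant: pick a random delay between 0 and the capped
// exponential value, so the result never exceeds maxDelayMs.
function fullJitterDelay(
  attempt: number,
  baseDelayMs = 1_000,
  maxDelayMs = 30_000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
  return Math.floor(random() * ceiling);
}
```

The trade-off is that some retries fire almost immediately, which spreads load well but gives a struggling dependency less guaranteed breathing room than a delay with a fixed minimum.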

Add Deadlines, Not Just Timeouts

Most systems add a timeout to an HTTP request and stop there. Production agents need more than that.

An agent run should have a total deadline. Each tool call should have its own timeout. Model calls may need a separate timeout. The remaining time should be propagated down to child operations.

Otherwise, a child operation can keep running after the parent workflow has already failed.

interface AgentRunContext {
  runId: string;
  userId: string;
  deadlineAt: number;
  tokenBudget: TokenBudget;
}

function remainingTime(ctx: AgentRunContext) {
  return Math.max(0, ctx.deadlineAt - Date.now());
}

async function withTimeout<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const signal = AbortSignal.timeout(timeoutMs);

  try {
    return await operation(signal);
  } catch (error) {
    if (signal.aborted) {
      throw new TimeoutError(`Timed out after ${timeoutMs}ms`);
    }

    throw error;
  }
}

async function executeTool<T>(
  ctx: AgentRunContext,
  name: string,
  fn: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const timeoutMs = Math.min(remainingTime(ctx), 10_000);

  if (timeoutMs <= 0) {
    throw new TimeoutError(`No time left before executing ${name}`);
  }

  return withTimeout(fn, timeoutMs);
}

If the agent has only two seconds left before its parent deadline, it should not start a tool call that normally takes ten seconds. It should fail predictably, record the reason, and let the runtime decide whether to retry, suppress, or notify the user that the operation could not be completed.

Deadlines prevent agent workflows from pretending they have infinite time.
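The same idea applies when a parent agent hands work to a child agent, as in the earlier case of a 30-second parent and a 60-second child. A minimal sketch of deriving the child's deadline from the parent's follows; `DeadlineContext`, `childDeadline`, and the 500 ms reserve are assumptions for illustration.

```typescript
// Minimal context for this sketch; the AgentRunContext above carries more fields.
interface DeadlineContext {
  deadlineAt: number; // epoch milliseconds
}

// A child may use at most `maxChildMs`, and never more than what the parent
// has left minus a small reserve for cleanup and logging.
function childDeadline(
  parent: DeadlineContext,
  maxChildMs: number,
  reserveMs = 500,
  now: number = Date.now()
): DeadlineContext {
  const parentLimit = parent.deadlineAt - reserveMs;

  // Never return a deadline in the past; "now" simply means no time is left.
  return {
    deadlineAt: Math.max(now, Math.min(now + maxChildMs, parentLimit))
  };
}
```

With this in place, a child configured for 60 seconds under a parent with 30 seconds left receives roughly 29.5 seconds, so it gives up before the parent does instead of writing state afterwards.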

Protect Side Effects With Idempotency

Queue-level idempotency helps prevent duplicate jobs. It does not automatically protect side effects.

Sending a notification, creating a booking, issuing a refund, updating a CRM record, or calling a callback URL should have its own idempotency key. Retrying the same logical action should not produce a different real-world outcome unless you explicitly allow it.

Redis is often enough for a practical first version.

async function runOnce<T>(
  key: string,
  ttlSeconds: number,
  operation: () => Promise<T>
): Promise<T | null> {
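  // ioredis-style arguments: EX sets the expiry, NX makes the write succeed
  // only for the first caller that claims this key.
  const acquired = await redis.set(
    `idempotency:${key}`,
    "running",
    "EX",
    ttlSeconds,
    "NX"
  );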

  if (!acquired) {
    return null;
  }

  try {
    const result = await operation();

    await redis.set(
      `idempotency:${key}`,
      "completed",
      "EX",
      ttlSeconds
    );

    return result;
  } catch (error) {
    await redis.del(`idempotency:${key}`);
    throw error;
  }
}

Then wrap the side effect.

await runOnce(
  `notify:${ctx.runId}:${ctx.userId}`,
  60 * 60,
  () => sendNotification(ctx.userId, finalMessage)
);

Retries are not theoretical. Workers crash. Network calls time out. Queues redeliver work. Users refresh browsers. API clients retry requests automatically.

If side effects are not idempotent, every retry becomes a risk.
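One limitation of returning `null` for duplicates is that the caller cannot tell whether the first attempt is still running or already finished. The sketch below shows one possible refinement that stores the result and reports the state. It uses an in-memory `Map` as a stand-in for Redis purely for illustration, and `IdempotencyStore` and `IdempotencyRecord` are hypothetical names.

```typescript
type IdempotencyRecord<T> =
  | { state: "running" }
  | { state: "completed"; result: T };

// In-memory stand-in for Redis, for illustration only. A real deployment
// needs a shared store with atomic set-if-absent and expiry.
class IdempotencyStore<T> {
  private records = new Map<string, IdempotencyRecord<T>>();

  async runOnce(
    key: string,
    operation: () => Promise<T>
  ): Promise<IdempotencyRecord<T>> {
    const existing = this.records.get(key);
    if (existing) {
      // Duplicate call: report what the first caller did instead of re-running.
      return existing;
    }

    this.records.set(key, { state: "running" });

    try {
      const completed: IdempotencyRecord<T> = {
        state: "completed",
        result: await operation()
      };
      this.records.set(key, completed);
      return completed;
    } catch (error) {
      // Release the key so a later retry can attempt the operation again.
      this.records.delete(key);
      throw error;
    }
  }
}
```

A duplicate caller that sees `running` can wait or back off, while one that sees `completed` can reuse the stored result. Releasing the key on failure is only safe when a failed operation is known not to have produced the side effect.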

Preserve State During Retries

Retries should not always restart the whole agent from step one.

If the agent already searched hotels successfully and only failed during summarization, it should not necessarily search hotels again. Save state after safe checkpoints.

type AgentState = {
  runId: string;
  completedSteps: string[];
  data: Record<string, unknown>;
};

async function executeStep(
  state: AgentState,
  stepName: string,
  fn: () => Promise<Record<string, unknown>>
) {
  if (state.completedSteps.includes(stepName)) {
    return state;
  }

  const result = await fn();

  const nextState = {
    ...state,
    completedSteps: [...state.completedSteps, stepName],
    data: { ...state.data, ...result }
  };

  await redis.set(
    `agent-state:${state.runId}`,
    JSON.stringify(nextState),
    "EX",
    60 * 60
  );

  return nextState;
}

This is especially useful when some steps are expensive, rate-limited, or have side effects.

A retry should resume from the last safe checkpoint, not blindly repeat the entire workflow.
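Saving checkpoints only helps if the retry loads them before running. The sketch below shows the load-then-resume flow with an in-memory `Map` standing in for the Redis key used above. `SavedState`, `loadState`, and `runStep` are hypothetical names, and a real version would read from the shared store.

```typescript
type SavedState = {
  runId: string;
  completedSteps: string[];
  data: Record<string, unknown>;
};

// In-memory stand-in for the Redis key `agent-state:<runId>`.
const savedStates = new Map<string, string>();

// Load the last checkpoint, or start fresh if this run has none.
function loadState(runId: string): SavedState {
  const raw = savedStates.get(runId);
  return raw
    ? (JSON.parse(raw) as SavedState)
    : { runId, completedSteps: [], data: {} };
}

async function runStep(
  state: SavedState,
  stepName: string,
  fn: () => Promise<Record<string, unknown>>
): Promise<SavedState> {
  // Skip work that an earlier attempt already finished.
  if (state.completedSteps.includes(stepName)) return state;

  const result = await fn();
  const next: SavedState = {
    ...state,
    completedSteps: [...state.completedSteps, stepName],
    data: { ...state.data, ...result }
  };

  // Persist only after the step succeeded, so a crash mid-step repeats it.
  savedStates.set(state.runId, JSON.stringify(next));
  return next;
}
```

Each attempt would begin with `loadState(runId)` and then call `runStep` for every step in order. Steps completed by an earlier attempt return immediately, so only the failed step and those after it actually run.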

Token Budgets Are Runtime Budgets

A traditional web request usually has a latency budget. A production agent also needs a token budget.

Token usage should not be something you discover at the end of the month in a billing dashboard. It should be part of the runtime contract for each run, session, or user.

interface TokenBudget {
  maxInputTokens: number;
  maxOutputTokens: number;
  usedInputTokens: number;
  usedOutputTokens: number;
}

class BudgetExceededError extends Error {}

function assertBudget(
  budget: TokenBudget,
  nextInputTokens: number,
  nextOutputTokens: number
) {
  if (budget.usedInputTokens + nextInputTokens > budget.maxInputTokens) {
    throw new BudgetExceededError("Input token budget exceeded");
  }

  if (budget.usedOutputTokens + nextOutputTokens > budget.maxOutputTokens) {
    throw new BudgetExceededError("Output token budget exceeded");
  }
}

Before calling the model, estimate the cost and reserve enough output space.

async function callModelWithBudget(
  ctx: AgentRunContext,
  prompt: string
) {
  const estimatedInputTokens = estimateTokens(prompt);
  const maxOutputTokens = 500;

  assertBudget(ctx.tokenBudget, estimatedInputTokens, maxOutputTokens);

  const result = await llm.generate({
    prompt,
    maxOutputTokens
  });

  ctx.tokenBudget.usedInputTokens += result.usage.inputTokens;
  ctx.tokenBudget.usedOutputTokens += result.usage.outputTokens;

  return result;
}

Without a budget, “try again” becomes an expensive architecture.

A good runtime should be able to say: this run is out of budget, stop safely.
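The `estimateTokens` call above is not defined in the snippet. As a placeholder, a rough character-count heuristic is sketched below. The four-characters-per-token ratio is a commonly quoted rule of thumb for English text, not an exact figure, and `remainingOutputTokens` is a hypothetical helper.

```typescript
// Rough heuristic: about four characters per token for English text.
// This is an approximation, not a tokenizer. Prefer the provider's tokenizer
// or its reported usage when accuracy matters.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Hypothetical helper: how many output tokens a run can still afford.
function remainingOutputTokens(budget: {
  maxOutputTokens: number;
  usedOutputTokens: number;
}): number {
  return Math.max(0, budget.maxOutputTokens - budget.usedOutputTokens);
}
```

Because the estimate can be off, especially for code, JSON, or non-English text, it makes sense to treat it as a pre-flight check and to reconcile against the usage the provider reports after each call, as the function above already does.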

Use Circuit Breakers for Broken Dependencies

Agents are good at trying alternatives. That can be useful when alternatives are safe. It is dangerous when the system keeps hammering a dependency that is already failing.

A circuit breaker protects both your system and the downstream provider. After repeated failures, the circuit opens and blocks calls for a cooldown period.

class CircuitBreaker {
  private failures = 0;
  private openedUntil = 0;

  constructor(
    private threshold = 5,
    private cooldownMs = 30_000
  ) {}

  canCall() {
    return Date.now() > this.openedUntil;
  }

  recordSuccess() {
    this.failures = 0;
  }

  recordFailure() {
    this.failures += 1;

    if (this.failures >= this.threshold) {
      this.openedUntil = Date.now() + this.cooldownMs;
    }
  }
}

Use it before external calls.

const hotelApiBreaker = new CircuitBreaker();

async function searchHotelsSafely(
  ctx: AgentRunContext,
  input: SearchHotelsInput
) {
  if (!hotelApiBreaker.canCall()) {
    throw new Error("Hotel API circuit is open");
  }

  try {
    const result = await executeTool(ctx, "search-hotels", signal =>
      fetch("https://hotel-api.example.com/search", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(input),
        signal
      }).then(res => {
        if (!res.ok) {
          throw new Error(`Hotel API failed with ${res.status}`);
        }

        return res.json();
      })
    );

    hotelApiBreaker.recordSuccess();
    return result;
  } catch (error) {
    hotelApiBreaker.recordFailure();
    throw error;
  }
}

If a dependency is failing, the agent should not keep calling it just because it has more steps available.
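The breaker above lets every caller through again as soon as the cooldown ends. A common refinement is a half-open state that admits a single probe call first. The sketch below is one possible variant under that assumption, not the article's class; `HalfOpenCircuitBreaker` is a hypothetical name and the clock is injectable so the behavior can be tested without waiting.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class HalfOpenCircuitBreaker {
  private failures = 0;
  private openedUntil = 0;
  private probing = false;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    threshold = 5,
    cooldownMs = 30_000,
    now: () => number = Date.now
  ) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.failures < this.threshold) return "closed";
    return this.now() < this.openedUntil ? "open" : "half_open";
  }

  // In half-open, allow exactly one probe call until it reports back.
  canCall(): boolean {
    const current = this.state();
    if (current === "closed") return true;
    if (current === "open" || this.probing) return false;
    this.probing = true;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.probing = false;
  }

  recordFailure(): void {
    this.failures += 1;
    this.probing = false;
    if (this.failures >= this.threshold) {
      // A failed probe re-opens the circuit for another full cooldown.
      this.openedUntil = this.now() + this.cooldownMs;
    }
  }
}
```

This state lives in one process. With several workers, each would hold its own breaker unless the counters are moved into a shared store.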

Sometimes the Correct Fallback Is Silence

Fallbacks sound user-friendly, but in agent systems they can be dangerous.

If a price-watch agent cannot verify the latest hotel price, it should not generate a cheerful fallback like:

Good news, your hotel price may have dropped.

That is worse than failing.

In some workflows, the safest behavior is suppress-on-failure. Do not send the notification. Do not guess. Do not use stale data. Record why the action was suppressed.

async function finalizeNotification(
  ctx: AgentRunContext,
  result: AgentResult
) {
  if (result.status === "dependency_failed") {
    await logDecision({
      runId: ctx.runId,
      userId: ctx.userId,
      step: "finalize-notification",
      decision: "suppress_notification",
      reason: "Could not verify latest hotel price"
    });

    return;
  }

  await runOnce(
    `notify:${ctx.runId}:${ctx.userId}`,
    60 * 60,
    () => sendNotification(ctx.userId, result.message)
  );
}

The agent does not always need to say something. Sometimes reliability means refusing to act when the system cannot verify the facts.

Make Runtime Decisions Observable

Once retries, budgets, deadlines, idempotency, and circuit breakers exist, they need to be visible.

When an agent fails, “something went wrong” is not enough. Engineers need to know whether the run failed because the token budget was exhausted, the hotel API circuit was open, a side effect was suppressed, a retry limit was reached, or the parent deadline expired.

A simple decision log can make debugging much easier.

type RuntimeDecision = {
  runId: string;
  userId: string;
  step: string;
  decision: string;
  reason: string;
  timestamp: string;
};

async function logDecision(
  decision: Omit<RuntimeDecision, "timestamp">
) {
  await redis.rpush(
    `agent-decisions:${decision.runId}`,
    JSON.stringify({
      ...decision,
      timestamp: new Date().toISOString()
    })
  );
}

This is not a replacement for full observability, tracing, or metrics. But it creates a habit that matters: every important runtime decision should be explainable.
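Once decisions are logged, even a simple tally can answer the question of why runs are failing. The sketch below assumes the entries are read back as JSON strings, for example with an `LRANGE` on the `agent-decisions:<runId>` list; `summarizeDecisions` is a hypothetical helper.

```typescript
// Count logged decisions by their `decision` field.
// Entries that fail to parse, or lack a string `decision`, are counted
// under "unparseable" so corrupt log lines stay visible.
function summarizeDecisions(rawEntries: string[]): Record<string, number> {
  const counts: Record<string, number> = {};

  for (const raw of rawEntries) {
    let label = "unparseable";

    try {
      const parsed = JSON.parse(raw) as { decision?: unknown };
      if (typeof parsed.decision === "string") {
        label = parsed.decision;
      }
    } catch {
      // Keep the "unparseable" label.
    }

    counts[label] = (counts[label] ?? 0) + 1;
  }

  return counts;
}
```

A sudden rise in one label, such as `suppress_notification`, is often a faster signal of a failing dependency than scanning individual traces.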

The Runtime Shape That Actually Works

A production agent runtime usually looks less like a chatbot and more like a distributed workflow.

[Figure: a production agent runtime chart]

The model is only one box in that diagram.

The rest of the system decides whether the model should be called, how long a tool can run, whether a retry is safe, whether a dependency is healthy, whether a notification should be sent, and whether the run has already spent too much money.

That is the unglamorous part.

It is also the part that makes the system production-ready.

A Small Test That Catches a Big Mistake

This kind of runtime code is worth testing directly.

At minimum, test that retryable errors retry and non-retryable errors do not.

describe("classifyError", () => {
  it("retries rate limits", () => {
    const decision = classifyError(
      new RateLimitError("Too many requests", 10_000)
    );

    expect(decision.action).toBe("retry");
  });

  it("does not retry validation errors", () => {
    const decision = classifyError(
      new ValidationError("Invalid destination")
    );

    expect(decision.action).toBe("fail_fast");
  });

  it("suppresses policy violations", () => {
    const decision = classifyError(
      new PolicyViolationError("Unsafe notification")
    );

    expect(decision.action).toBe("suppress");
  });
});

These tests are simple, but they protect the behavior that matters. Your retry policy should not change accidentally because someone added a broad catch block.

What I Check Before Trusting an Agent

Before I trust an agent workflow in production, I want to know whether every run has an idempotency key. I want to know whether the workflow survives a client retry without duplicating work. I want retries to be classified by error type, not sprayed across every failure.

I want every external tool call to have a timeout. I want the parent deadline to propagate to child operations. I want side effects to be idempotent. I want token budgets to exist per run, session, or user. I want the system to stop calling a dependency that is clearly failing.

Most importantly, I want the agent to know when not to act.

If a dependency fails, the system should not invent confidence. If data cannot be verified, the agent should not produce a polished guess. If a user-facing action is unsafe, the workflow should suppress it and record why.

That is what separates production readiness from demo readiness.

Final Thoughts

Production AI agents are not just prompts with tools attached.

They are distributed systems with probabilistic decision-making inside them. That means the old engineering problems still matter: retries, rate limits, timeouts, queues, locks, budgets, observability, and failure handling.

The model may be the most visible part of the system, but it is rarely the only part that fails.

If you want agents users can trust, spend less time asking whether the agent sounds intelligent and more time asking whether the runtime behaves safely when everything around it is slow, duplicated, rate-limited, or partially broken.

That is where real production readiness starts.

APA

Raju Dandigam | Sciencx (2026-06-02T02:43:36+00:00) Rate Limits, Retries, Timeouts, and Token Budgets: The Unglamorous Plumbing of Production AI Agents. Retrieved from https://www.scien.cx/2026/06/02/rate-limits-retries-timeouts-and-token-budgets-the-unglamorous-plumbing-of-production-ai-agents/
