Designing agents that don't loop forever

The most common production agent failure is the infinite or pseudo-infinite loop: an agent that retries, branches, and burns through tokens without making progress. This article covers the architectural patterns that prevent this and produce agents that finish, even on hard tasks.

What you should be able to do

Evaluate the implementation pattern, failure modes, and guardrails before building.

May 15, 2026

A common shape of expensive production AI failure: it's 3 AM on a Tuesday and a customer-support agent goes into an infinite loop. It calls the same tool, gets the same error, retries with a slightly different parameter, gets the same error, repeats. Hundreds of times per minute. By morning, the team has a four- or five-figure surprise bill — depending on the model and the call rate.

This isn't rare. Agents that loop are one of the most common — and most consequential — production failures. They're insidious because they often "look like they're working" — the agent is taking actions, calls are succeeding (or failing predictably). Only when you check the trace do you see the same pattern repeating.

Building agents that don't loop forever requires deliberate architectural choices. Most agent failures are predictable; the patterns to prevent them are well-known. The teams that ship reliable agents are the ones who implement these patterns rigorously.

This article covers what makes agents loop, the architectural patterns that prevent it, and the operational guardrails that catch loops when they slip through.

Why agents loop

A few mechanisms behind agent loops:

1. Confusion about progress

The agent doesn't have a clear sense of "have I made progress?" It tries something, observes a result, decides to try something else. Without explicit progress tracking, "try something else" can mean "the same thing slightly differently."

2. Lack of termination criteria

The agent's prompt says "help the user," but doesn't say "stop when X is true." Without clear termination, the agent keeps "helping" — searching for one more piece of information, trying one more tool, refining further.

3. Retry pathology

When something fails, agents naturally retry. Without retry budgets, the same failure can be retried indefinitely. The agent perceives "I haven't succeeded yet" not "I've already tried this 10 times."

4. State amnesia

The agent's working memory only contains recent turns. If a loop has cycled through 20 attempts, the agent might only see the last 5 in context — losing the pattern that's now obvious to outside observers.

5. Tool inconsistency

A tool returns confusing or contradictory results. The agent tries again. The tool still returns confusing results. The agent reasons "maybe my previous interpretation was wrong" and tries differently. The tool still returns confusing results. Loop.

6. Excessive optimism

The agent's training makes it persistent — it keeps trying when a better strategy would be to stop and ask for help. Particularly bad in long-horizon tasks where small confusions compound.

7. Goal drift

The agent gradually loses sight of what it was trying to accomplish. It branches into sub-tasks, then sub-sub-tasks, exploring tangentially related areas without returning to the main goal.

Different agents fail in different ways. The defenses overlap.

Pattern 1: Hard step budgets

The simplest and most important defense: a maximum number of steps. The agent has, say, 20 tool calls available. After 20, it must produce a final answer (or escalate).

Implementation:

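def agent_loop(query, max_steps=20):
    # call_llm, execute_tool, force_final_answer, and available_tools are
    # placeholders for your own model client and tool layer.
    messages = [{"role": "user", "content": query}]
    for step in range(max_steps):
        response = call_llm(messages, tools=available_tools)
        if response.is_final_answer:
            return response.content
        # Record the model's tool request, then the tool's result, in order.
        messages.append(response)
        result = execute_tool(response.tool_call)
        messages.append({"role": "tool", "content": result})
    # Budget exhausted: force a final answer from what is known so far.
    return force_final_answer(messages)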

The budget should be calibrated to the task. Simple tasks: 5-10 steps. Complex multi-source tasks: 20-30. Open-ended research: 50+. But always bounded.

When the budget is hit, the agent produces its best answer with what it knows. Or escalates to a human.

This single pattern prevents most catastrophic loops. Always implement it.

Variants

Token budget. Instead of (or in addition to) a step count, limit total tokens. This catches agents that take few steps but spend a 50K-token reasoning trace on each one.

Cost budget. Translates step and token budgets into euros, so an overrun shows up as a concrete amount rather than an abstract count.

Time budget. Wall-clock limit. Useful for user-facing flows ("respond within 30 seconds").

Most production agents have all four budgets in some form. Hitting any one terminates the run.
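A minimal sketch of how the four budgets could be tracked together. The class name `RunBudget`, its fields, and the default limits are all assumptions made for illustration; the per-step token count and cost are expected to come from whatever your model client reports.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunBudget:
    """Tracks four limits for one agent run. All defaults are illustrative."""
    max_steps: int = 20
    max_tokens: int = 100_000
    max_cost_eur: float = 1.0
    max_seconds: float = 30.0
    steps: int = 0
    tokens: int = 0
    cost_eur: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, tokens: int, cost_eur: float) -> None:
        """Call once per step with that step's token count and cost."""
        self.steps += 1
        self.tokens += tokens
        self.cost_eur += cost_eur

    def exceeded(self) -> Optional[str]:
        """Return the name of the first exhausted budget, or None."""
        if self.steps >= self.max_steps:
            return "steps"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.cost_eur >= self.max_cost_eur:
            return "cost"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return "time"
        return None
```

The agent loop would call `record(...)` after each model call and stop as soon as `exceeded()` returns a name, which also gives you a reason to log.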

Pattern 2: Progress tracking

A budget alone doesn't tell the agent it's stuck — it just stops it eventually. Progress tracking helps the agent recognize a loop and break out.

A simple implementation: the agent maintains an explicit "progress" log. Each step, it states what new information it gained or what changed.

Step 1: Searched for customer "Smith". Found 12 matches.
Step 2: Filtered to active accounts. 4 remain.
Step 3: Checked recent activity. Customer 234 had a recent ticket about pricing.
Step 4: Pulled the ticket details. The complaint was about a recent price change.
Step 5: Drafted response. Ready to send.

Each step adds new information. If the agent does step 6 and the progress log gains nothing — same search, same results, same conclusion — it's spinning.

The prompt can include:

Before deciding the next action, summarize what you've learned in the last few steps. If you haven't gained new information in the last 3 steps, stop and either:
- Produce your best answer with current information.
- Escalate the issue: explain what you've tried and what's missing.

This makes "lack of progress" visible to the agent so it can react.
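The same check can also be run outside the prompt. The sketch below assumes each step's new findings have been reduced to a set of short fact strings (how you extract them is up to you); the function name `is_spinning` and the 3-step window are illustrative.

```python
def is_spinning(findings_per_step: list[set[str]], window: int = 3) -> bool:
    """True if the last `window` steps added no fact that was not already known."""
    if len(findings_per_step) < window:
        return False
    # Everything learned before the window counts as already known.
    known: set[str] = set()
    for facts in findings_per_step[:-window]:
        known |= facts
    # Walk the window in order; any genuinely new fact means progress.
    for facts in findings_per_step[-window:]:
        if facts - known:
            return False
        known |= facts
    return True
```

For example, a log whose last three entries only repeat earlier facts (or are empty) returns `True`, which is the signal to stop or escalate.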

Pattern 3: Repeat detection

Sometimes agents repeat the exact same tool call. This is easy to detect programmatically.

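import json

def detect_repeat(history):
    # Look only at tool calls among the last 5 history entries.
    recent_calls = [c for c in history[-5:] if c.is_tool_call]
    if len(recent_calls) < 3:
        return False
    # Serialize arguments with sorted keys so key order does not hide identical calls.
    call_signatures = [(c.tool, json.dumps(c.args, sort_keys=True)) for c in recent_calls]
    # Flag a repeat when fewer than half of the recent calls are distinct.
    return len(set(call_signatures)) < len(call_signatures) / 2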

If a repeat is detected, intervene:

Inject a message: "You've called this tool with these parameters recently. The results haven't changed. Try a different approach or terminate."
Or force termination.

This catches the most obvious loops automatically.

Pattern 4: Stuck-state detection

Beyond exact repeats, you can detect more subtle stuck states:

Pattern recognition. Use a separate LLM call to evaluate: "Looking at the last 5 steps, is this agent making progress?" If not, break out.

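def is_stuck(history):
    # call_llm and format_history are placeholders for your own client and formatter.
    recent = format_history(history[-5:])
    response = call_llm(
        system="You are evaluating whether an agent is making progress.",
        user=f"Recent agent steps:\n{recent}\n\nIs the agent making meaningful progress or stuck in a loop? Answer: progressing | stuck."
    )
    # Normalize before comparing; the model may add capitalization or punctuation.
    return response.content.strip().lower().startswith("stuck")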

Run this check every few steps. If it returns "stuck," intervene.

Tool diversity. If the agent has called only 1 tool for 5+ steps, that's suspicious. Force it to try something else or stop.

Error patterns. If the same tool has returned the same error 3+ times, stop using it. The agent isn't going to figure out the missing input from more retries.
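The tool-diversity and error-pattern checks need no LLM call. A rough sketch, assuming you keep a list of tool names in call order and a list of `(tool, error_message)` pairs; both function names and the thresholds are illustrative.

```python
from collections import Counter


def single_tool_streak(tool_names: list[str], min_streak: int = 5) -> bool:
    """True if the last `min_streak` calls all used the same tool."""
    recent = tool_names[-min_streak:]
    return len(recent) == min_streak and len(set(recent)) == 1


def tools_to_disable(errors: list[tuple[str, str]], threshold: int = 3) -> set[str]:
    """Return tools that produced the same error message `threshold` or more times."""
    counts = Counter(errors)
    return {tool for (tool, _message), n in counts.items() if n >= threshold}
```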

Pattern 5: Reflection points

At specific points in the agent's run, force explicit reflection.

After every 5 steps, the agent must produce a reflection:

1. What was my original goal?
2. What have I learned so far?
3. What do I still need to know?
4. Am I making progress, or repeating?
5. Should I continue or stop?

The reflection forces the agent to step back from the immediate next-action thinking and evaluate the bigger picture.

This is particularly effective for long-horizon tasks. Without forced reflection, agents drift; with it, they catch their own drift.
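One way to wire this into a loop is to append the reflection questions as an extra message at fixed intervals. This is a sketch only: the message format mirrors the chat-style dictionaries used earlier, and `maybe_add_reflection` is a hypothetical helper name.

```python
REFLECTION_PROMPT = """Pause and reflect before your next action:
1. What was my original goal?
2. What have I learned so far?
3. What do I still need to know?
4. Am I making progress, or repeating?
5. Should I continue or stop?"""


def maybe_add_reflection(messages: list[dict], step: int, every: int = 5) -> list[dict]:
    """Return a new message list with the reflection prompt after every `every` steps."""
    if step > 0 and step % every == 0:
        return messages + [{"role": "user", "content": REFLECTION_PROMPT}]
    return messages
```

Returning a new list rather than mutating the input keeps the reflection prompt out of the permanent history if you prefer it to be transient.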

Pattern 6: Goal anchoring

In long agent runs, the original goal gets lost. The agent's context window fills with intermediate steps; the original question becomes a small part of a large context.

Counter this by anchoring the goal repeatedly:

Include the original goal at the top of every system message.
Have the agent restate the goal every N steps.
Use a separate "goal tracker" that confirms each step is aligned with the goal.

Example prompt addition:

Original goal: [verbatim user request]

Before each action, confirm:
- Is this action helping me toward the original goal?
- If yes, proceed.
- If no, return to the goal directly.
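A small sketch of the first two anchoring techniques, assuming the system message is rebuilt on every call. Both function names and the reminder wording are hypothetical.

```python
from typing import Optional


def anchored_system_message(goal: str, instructions: str) -> str:
    """Put the verbatim goal first so later context cannot bury it."""
    return (
        f"Original goal: {goal}\n\n"
        f"{instructions}\n\n"
        "Before each action, confirm that it helps you toward the original goal. "
        "If it does not, return to the goal directly."
    )


def restate_goal_reminder(goal: str, step: int, every: int = 10) -> Optional[dict]:
    """Return a reminder message every `every` steps, otherwise None."""
    if step > 0 and step % every == 0:
        return {"role": "user", "content": f"Reminder. Original goal: {goal}"}
    return None
```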

Pattern 7: Sub-task delimitation

Long agents naturally break into sub-tasks. Without structure, sub-tasks can spawn sub-sub-tasks recursively until the agent is lost.

Provide structure:

The agent identifies sub-tasks explicitly.
Each sub-task has its own budget.
After completing (or failing) a sub-task, the agent returns to the main task.
Sub-tasks cannot spawn unbounded sub-sub-tasks.

This is what frameworks like LangGraph try to formalize — a state machine where each node is a clear step, with explicit transitions.

For complex agents, this structure is essential. For simple ones, it's overkill.
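The rules above can be enforced with a small task tree, independent of any framework. In this sketch (all names are illustrative, and this is not LangGraph's API) each sub-task's budget is carved out of its parent's, and nesting depth is capped.

```python
from dataclasses import dataclass


class SubtaskLimitError(Exception):
    """Raised when a sub-task would exceed the depth or budget limits."""


@dataclass
class TaskNode:
    name: str
    step_budget: int
    depth: int = 0

    def spawn(self, name: str, step_budget: int, max_depth: int = 2) -> "TaskNode":
        """Create a child task whose budget is deducted from this task's budget."""
        if self.depth + 1 > max_depth:
            raise SubtaskLimitError(f"{name!r} would exceed max depth {max_depth}")
        if step_budget > self.step_budget:
            raise SubtaskLimitError(f"{name!r} asks for more steps than remain")
        self.step_budget -= step_budget
        return TaskNode(name, step_budget, self.depth + 1)
```

Because children draw from the parent's budget, the total number of steps across the whole tree can never exceed the root budget, however the agent decomposes the work.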

Pattern 8: Escape hatches

When an agent gets stuck, it needs explicit ways to stop:

Escalate. "I cannot complete this task. Here's what I've tried and what's missing." The agent stops and surfaces the issue.

Partial completion. "I've completed parts A and B. C is blocked by X." The agent doesn't have to fully succeed; it can produce useful partial output.

Clarification. "I need more information from the user: ..." The agent pauses and asks.

These should be first-class options for the agent, not last resorts. The agent's prompt should mention them and encourage their use when stuck.

A useful prompt addition:

If you encounter any of these situations, stop trying and respond appropriately:
- A tool consistently returns the same error.
- You've tried 3 different approaches without progress.
- You need information only the user can provide.
- The task is more complex than your tools support.

In these cases:
- For tool errors: explain the issue, suggest the user contacts support.
- For lack of progress: report what you've tried and ask for guidance.
- For missing information: ask the user a specific question.
- For complexity: escalate to human assistance with a summary.

Pattern 9: Confidence-aware actions

The agent should know when it's confident and when it's not. Acting on low confidence is how loops start.

A pattern: every consequential action requires explicit confidence.

Before calling delete_record, state your confidence on a 1-5 scale that this is the right action. If <4, do not call. Instead, ask for human confirmation.

This works particularly well for destructive or expensive actions. The agent has to commit to high confidence before taking them.

Combined with reflection, this catches cases where the agent is "trying things" rather than "executing a plan."
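The prompt instruction can be backed by a check in code so a low-confidence destructive call never reaches the tool. A minimal sketch, assuming the agent's stated confidence has already been parsed into an integer from 1 to 5; `gate_action` and the default tool set are illustrative.

```python
def gate_action(
    tool_name: str,
    confidence: int,
    destructive_tools: frozenset = frozenset({"delete_record"}),
    min_confidence: int = 4,
) -> str:
    """Return "execute" or "ask_human" for a proposed tool call."""
    if tool_name in destructive_tools and confidence < min_confidence:
        return "ask_human"
    return "execute"
```

Self-reported confidence is only a weak signal, so this gate is best treated as one layer alongside the human checkpoints described later, not a replacement for them.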

Pattern 10: Tool-level guards

Beyond agent-level patterns, tools themselves can have guards:

Rate limiting per session. A tool can only be called N times in a session. After N, returns "rate limit." Forces the agent to do something else.

Idempotency. Repeated identical calls return the cached result without re-executing. Prevents loops that hammer a tool.

Cost caps. Expensive tools (heavy DB queries, third-party APIs with usage costs) have per-session limits.

Failure circuit-breakers. A tool that's failed 3 times in this session is disabled. The agent can no longer call it.

These complement agent-level patterns. The agent might try to loop, but the tool prevents it.
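Three of these guards (per-session rate limit, cached identical calls, failure circuit-breaker) fit in one wrapper. This is a sketch under assumptions: `GuardedTool` is a hypothetical class, tool arguments are assumed to be JSON-serializable keyword arguments, and caching is only appropriate for read-only tools.

```python
import json


class GuardedTool:
    """Wraps a tool function with per-session guards. Limits are illustrative."""

    def __init__(self, fn, max_calls: int = 10, max_failures: int = 3):
        self.fn = fn
        self.max_calls = max_calls
        self.max_failures = max_failures
        self.calls = 0
        self.failures = 0
        self.cache = {}

    def __call__(self, **kwargs):
        # Circuit breaker: refuse once the tool has failed too often.
        if self.failures >= self.max_failures:
            return {"error": "tool disabled after repeated failures"}
        # Identical calls return the cached result without re-executing.
        key = json.dumps(kwargs, sort_keys=True)
        if key in self.cache:
            return self.cache[key]
        # Rate limit counts only real executions, not cache hits.
        if self.calls >= self.max_calls:
            return {"error": "rate limit reached for this session"}
        self.calls += 1
        try:
            result = self.fn(**kwargs)
        except Exception as exc:
            self.failures += 1
            return {"error": str(exc)}
        self.cache[key] = result
        return result
```

Returning an error payload instead of raising keeps the refusal visible to the agent, which can then pick a different action or an escape hatch.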

Pattern 11: External monitoring

For all the in-agent patterns, an external monitor catches what slips through.

A monitoring process watches all running agents. It checks:

Step count per agent.
Token usage per agent.
Cost per agent.
Time per agent.
Tool call patterns.

When any agent exceeds thresholds, kill it. Send an alert.

This is the last line of defense. Even if the agent itself is broken, the monitor catches it before it damages the wallet.

In implementation:

A timeseries database tracks agent metrics.
Rules trigger kill orders ("if agent has run for > 5 minutes, kill").
A small service watches and enforces.

For systems running many agents simultaneously, this is essential.
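The rule-evaluation part of such a monitor can be very small. The sketch below assumes metrics arrive as periodic snapshots; `AgentMetrics`, `LIMITS`, and `agents_to_kill` are illustrative names, the thresholds are placeholders, and the actual kill and alert mechanisms are left to your infrastructure.

```python
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    agent_id: str
    steps: int
    tokens: int
    cost_eur: float
    runtime_seconds: float


# Illustrative thresholds; tune them to your own workloads.
LIMITS = {"steps": 60, "tokens": 300_000, "cost_eur": 2.0, "runtime_seconds": 300.0}


def agents_to_kill(snapshot: list[AgentMetrics], limits: dict = LIMITS) -> list[tuple[str, str]]:
    """Return (agent_id, reason) for every agent over at least one limit."""
    kills = []
    for metrics in snapshot:
        for name, limit in limits.items():
            if getattr(metrics, name) > limit:
                kills.append((metrics.agent_id, name))
                break  # one reason per agent is enough to act on
    return kills
```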

Pattern 12: Human-in-the-loop checkpoints

For high-stakes agents, build in human checkpoints. The agent runs until a checkpoint, then waits for human approval.

Typical checkpoints:

Before destructive actions.
After a decision the agent can't reverse.
At major milestones in a long task.
When confidence drops.

This isn't about distrust — it's about catching errors when they're cheap to fix.

A practical workflow: the agent does prep work autonomously, surfaces a summary and proposed actions, human approves, agent executes. The human is in the loop for decisions, not in the loop for every step.

A worked example: a long-running research agent

To illustrate, here is how the patterns apply to a real agent:

Task: Research a competitor and produce a brief.

Estimated work: 10-30 web searches, 20-50 page reads, synthesis into a 1000-word brief.

Patterns applied:

Step budget: 60 steps total.

Token budget: 300K tokens (context + actions). If exceeded, summarize current findings and continue.

Cost budget: €2 per run. If exceeded, stop and return partial brief.

Time budget: 5 minutes wall-clock.

Progress tracking: Every step, agent updates a "findings log" with new information. If 3 steps pass without new findings, escape.

Repeat detection: If the same search query is run twice with similar results, force a different approach.

Reflection points: Every 10 steps, agent reflects on progress and remaining work.

Goal anchoring: Original brief target at top of every system message.

Escape hatches: "I have enough information" or "I cannot find sufficient information" both terminate the agent gracefully.

External monitor: Independent watcher kills agents exceeding budgets.

Outcome: Median run time 3 minutes. Median cost €0.40. Failure rate (loops or timeouts) < 1%. Briefs produced are 700-1200 words, factually grounded, useful starting points.

Without these patterns: occasional 30-minute runs, occasional €20+ costs, occasional bricked sessions. The patterns reduce the tail dramatically.

Detection in production

Even with patterns, occasional issues slip through. Detect them:

Alerts on long-running agents. Any agent > 2x median duration triggers alert.

Alerts on cost spikes. Per-agent or aggregate cost above threshold.

Alerts on repeat patterns. Tool call patterns that suggest loops.

Daily review of long-running traces. A human glances at the longest 10 traces per day. Catches issues evals miss.

Aggregate metrics: loop rate over time. Catches when something changes (model update, prompt change) that increases loop frequency.

A useful dashboard: distribution of agent run lengths. The tail tells you about loop incidence.
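The long-running alert is a one-liner over run durations. A sketch, assuming durations in seconds keyed by agent ID; `long_running` is a hypothetical helper and the 2x factor comes from the rule above.

```python
import statistics


def long_running(durations: dict[str, float], factor: float = 2.0) -> list[str]:
    """Return IDs of agents whose duration exceeds `factor` times the median."""
    if not durations:
        return []
    median = statistics.median(durations.values())
    return sorted(agent for agent, d in durations.items() if d > factor * median)
```

Using the median rather than the mean matters here: a single runaway agent inflates the mean and can hide itself.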

Common mistakes

A few patterns we see repeatedly:

Mistake 1: No step budget. "We'll add it if we need it." Then the agent loops at 3 AM and you wish you'd added it. Always add it from day one.

Mistake 2: Too-high budgets. "100 steps should be plenty," but a loop will fill whatever budget it is given. Set budgets at roughly 2-3x the median run, not at the worst case.

Mistake 3: No external monitor. Trusting the agent to stop itself. Sometimes it doesn't. An external monitor is essential for production.

Mistake 4: Catching loops without analysis. A loop happens, the monitor kills it, the team moves on, and the same loop happens next week. Always do post-mortems on caught loops: what triggered it, what changed, and can we prevent the class of failure?

Mistake 5: Aggressive reflection on simple tasks. Forcing reflection every 5 steps on a 5-step task is overhead without benefit. Calibrate to task complexity.

Mistake 6: Goals lost in long contexts. A goal mentioned once at step 1 doesn't survive to step 50. Re-anchor regularly.

Mistake 7: Trusting agent self-reports of progress. Agents will say they're making progress when they're not. Verify externally where possible.

Mistake 8: Letting agents call themselves recursively. "Decompose this task into sub-agents" can produce exponential agent spawns. If you allow it, budget it strictly.

When loops are acceptable

Not all loops are bad. Some tasks legitimately require many iterations:

Iterative refinement of code (write, test, fix, repeat).
Multi-step research with branching.
Optimization tasks (try variations, evaluate, refine).

For these, loops are the work, not a failure. The patterns shift:

Generous step budgets (50-200 steps).
Explicit "iteration" framing, not "loop" framing.
Quality improvement tracking — each iteration should improve a metric.
Hard stop when improvement plateaus.

The principle: distinguish "intended iterative work" from "unintended loops." Apply the patterns appropriately to each.
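For intended iterative work, the "hard stop when improvement plateaus" rule can be sketched as follows. It assumes a single higher-is-better score per iteration; `has_plateaued`, the patience of 3, and the minimum gain of 0.01 are illustrative choices.

```python
def has_plateaued(scores: list[float], patience: int = 3, min_gain: float = 0.01) -> bool:
    """True if none of the last `patience` iterations beat the earlier best by `min_gain`."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_gain
```

With the defaults, `[0.5, 0.6, 0.7, 0.8]` is still improving, while `[0.5, 0.8, 0.8, 0.805, 0.8]` has plateaued because the last three iterations gained less than 0.01 over the earlier best.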

The takeaway

Agents that loop forever are predictable, common, and preventable. The patterns are well-known: step budgets, progress tracking, repeat detection, reflection, goal anchoring, escape hatches, tool guards, external monitors.

These aren't optional polish. They're the difference between agents that ship and agents that produce surprise four-figure bills.

For any production agent, the checklist:

[ ] Maximum step budget.
[ ] Maximum token budget.
[ ] Maximum cost budget.
[ ] Maximum time budget.
[ ] Repeat call detection.
[ ] Progress tracking.
[ ] Periodic reflection.
[ ] Goal anchoring.
[ ] Multiple escape hatches.
[ ] External monitor with kill capability.

Each is simple to implement. Together, they make the difference between "this agent is dangerous to leave running" and "this agent is reliable in production."

Build the patterns in. Test them. The tail risk you eliminate is worth the work many times over.
