
Loop Engineering: From Prompts to Verified Agent Loops
Summary
Build a plan-act-verify agent loop with an external check, retry budget, and clear stop rules.
In the second week of June 2026 a single sentence reorganized how a lot of developers talk about AI: stop prompting your coding agent, and start designing the loop that prompts it for you. The phrase that stuck was loop engineering, and within 48 hours it was the dominant topic across r/LocalLLaMA, Hacker News, and developer X, with thousands of posts and a wave of hot takes.
Here is the thing the hype mostly gets wrong. Loop engineering did not blow up because looping is new. The pattern of act, check, retry is years old. It went viral because one-shot prompts quietly stopped being good enough for real work, and "design the loop" is the most legible name anyone has given to the fix. The catch: a loop is only as good as the thing that decides when to stop. Get the verifier wrong and you have built an expensive way to be confidently broken.
This guide teaches the part that actually matters and that most posts skip: building a loop whose stopping decision comes from an external, objective check instead of the model grading its own homework. You will build a working plan-act-verify loop in Python, run it against a real test suite, watch it self-correct, then swap the stand-in model for a real Claude agent using tool use. Every code block here runs.
What loop engineering actually is
Prompt engineering optimizes a single message. Loop engineering optimizes the system around the model: how it gathers context, takes an action, observes a result, and decides whether to go again. The model is one component; the loop is the product.
Most modern agent loops trace back to the ReAct pattern (Reason + Act) from Princeton and Google: interleave a reasoning step with an action step, feed the observation back, repeat. Loop engineering is ReAct grown up for production, with three things bolted on that decide whether it survives contact with reality:
- An objective verifier — a test suite, compiler, type checker, schema validator, or any check that returns pass/fail without asking the model's opinion.
- A retry budget — a hard cap on iterations (and ideally on tokens and wall-clock time) so a stuck agent fails loudly instead of burning your bill.
- Explicit stop rules — clear conditions for done, blocked, and give-up, so the loop always terminates in a known state.
The single most common failure in the wild is letting the agent verify itself: "Does this look correct? Yes." That is not verification, it is vibes. The check in a well-engineered loop is external and objective. The agent proposes; something outside the agent disposes.
Prerequisites
- Python 3.10 or newer.
- Comfort with functions, subprocess, and basic exceptions.
- For the production section: pip install anthropic and an API key in ANTHROPIC_API_KEY. The local demo needs no key and no network.
The anatomy of a verified loop
Every verified loop is the same four beats, regardless of framework:
- Act: the model produces a candidate (code, a plan, a patch, a structured answer).
- Verify: an external check runs against the candidate and returns a pass/fail plus concrete feedback.
- Adjust: on failure, the feedback (the actual error, not a summary) goes back into the next turn.
- Stop: pass, or budget exhausted, or a blocked condition — the loop never spins forever.
We will build each beat as a small, testable function so you can swap any piece without rewriting the rest. That separation is the whole point: the loop is infrastructure, the model is a plug-in.
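As a rough sketch of that separation, the four beats can be written once against two plug-in functions. Everything named here (Outcome, LoopResult, run_verified_loop, and the toy actor and verifier) is hypothetical and not taken from any framework; the convention that an actor returns None to signal "blocked" is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    DONE = "done"        # the verifier returned pass
    BLOCKED = "blocked"  # the actor reported it cannot proceed
    GIVE_UP = "give_up"  # the retry budget ran out


@dataclass
class LoopResult:
    outcome: Outcome
    candidate: Optional[str]
    iterations: int


# Assumed plug-in signatures (illustrative only):
Actor = Callable[[list], Optional[str]]      # history -> candidate, or None if blocked
Verifier = Callable[[str], tuple]            # candidate -> (passed, feedback)


def run_verified_loop(act: Actor, verify: Verifier, max_iters: int = 4) -> LoopResult:
    history = []
    for i in range(1, max_iters + 1):
        candidate = act(history)                          # ACT
        if candidate is None:                             # STOP: blocked
            return LoopResult(Outcome.BLOCKED, None, i)
        passed, feedback = verify(candidate)              # VERIFY
        if passed:                                        # STOP: done
            return LoopResult(Outcome.DONE, candidate, i)
        history.append({"candidate": candidate, "feedback": feedback})  # ADJUST
    return LoopResult(Outcome.GIVE_UP, None, max_iters)   # STOP: give up


# Toy plug-ins: the actor counts upward, the verifier accepts only "3".
def counting_actor(history):
    return str(len(history) + 1)


def is_three(candidate):
    return candidate == "3", f"got {candidate}, want 3"


result = run_verified_loop(counting_actor, is_three, max_iters=5)
print(result.outcome.value, result.candidate, result.iterations)
```

The loop body never inspects what the candidate is, which is what lets you replace the actor or the verifier without touching the control flow.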
Step 1 - Write an external, objective verifier
Start here, not with the prompt. The verifier defines what "done" means, so it is the contract the whole loop is built around. Ours runs candidate code against a fixed test suite in an isolated subprocess and returns (passed, feedback). Crucially, the model never sees the inside of this function and never gets a vote.
import os, subprocess, sys, tempfile, textwrap

# Raw string: the \t and \n escapes must reach tests.py as written,
# not be expanded while this module is parsed.
TESTS = textwrap.dedent(r"""
    from solution import slugify
    assert slugify("Hello World") == "hello-world"
    assert slugify(" Trim Me ") == "trim-me"
    assert slugify("Already-slug") == "already-slug"
    assert slugify("Multiple   spaces") == "multiple-spaces"
    assert slugify("Tabs\tand\nnewlines") == "tabs-and-newlines"
    print("ALL_TESTS_PASSED")
""")

def verify(code: str):
    """Run candidate code against the suite in an isolated subprocess.
    Returns (passed, feedback). The agent never grades itself."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "solution.py"), "w") as f:
            f.write(code)
        with open(os.path.join(d, "tests.py"), "w") as f:
            f.write(TESTS)
        try:
            r = subprocess.run([sys.executable, "tests.py"], cwd=d,
                               capture_output=True, text=True, timeout=15)
        except subprocess.TimeoutExpired:
            return False, "timed out after 15s"  # a hang counts as a failure
    if r.returncode == 0 and "ALL_TESTS_PASSED" in r.stdout:
        return True, "all tests passed"
    err = (r.stderr or r.stdout).strip()
    return False, (err.splitlines()[-1] if err else "unknown failure")
Three properties make this a good gate: it is isolated (a temp dir + subprocess, so a crash or infinite import cannot take down your loop), it is objective (exit code and a sentinel string, no model judgment), and it returns concrete feedback (the last line of the traceback) that the next turn can act on. A timeout is non-negotiable: candidate code can hang.
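One detail worth seeing in isolation: subprocess.run raises subprocess.TimeoutExpired when the timeout elapses, so a verifier that wants a hang to count as an ordinary failure has to catch it. A minimal sketch, assuming a hypothetical helper named run_with_timeout that runs a snippet with python -c:

```python
import subprocess
import sys


def run_with_timeout(source: str, timeout: float = 2.0):
    """Run a Python snippet in a child process; report a hang as a failure."""
    try:
        r = subprocess.run([sys.executable, "-c", source],
                           capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False, f"timed out after {timeout}s"
    if r.returncode == 0:
        return True, r.stdout.strip()
    err = r.stderr.strip()
    return False, (err.splitlines()[-1] if err else "unknown failure")


hang = run_with_timeout("while True: pass", timeout=1.0)
crash = run_with_timeout("raise ValueError('boom')")
ok = run_with_timeout("print('fine')")
print(hang, crash, ok, sep="\n")
```

All three cases come back as the same (passed, feedback) tuple, so the loop does not need a separate code path for hangs.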
Step 2 - Build the loop harness with a budget and stop rules
Now the loop itself. For a runnable demo with no API key, we plug in a deterministic fake_model that improves as it sees feedback. This lets you study the control flow before adding a real LLM and a network.
ATTEMPTS = [
    # 1) naive: only replaces spaces, ignores trimming + repeated whitespace
    'def slugify(text):\n    return text.lower().replace(" ", "-")',
    # 2) split() collapses all runs of whitespace and strips the ends
    'def slugify(text):\n    return "-".join(text.lower().split())',
]

def fake_model(history):
    """Stand-in for an LLM. Returns the next candidate given past feedback."""
    return ATTEMPTS[min(len(history), len(ATTEMPTS) - 1)]

def run_loop(max_iters=4):
    history = []
    for i in range(1, max_iters + 1):
        code = fake_model(history)                       # ACT
        passed, feedback = verify(code)                  # VERIFY (external gate)
        print(f"[iter {i}] verify -> {'PASS' if passed else 'FAIL'} | {feedback}")
        if passed:                                       # STOP: done
            print(f"[done] accepted after {i} iteration(s)")
            return code
        history.append({"code": code, "feedback": feedback})  # ADJUST
    print(f"[stop] retry budget exhausted after {max_iters} iterations")
    return None                                          # STOP: give up cleanly

if __name__ == "__main__":
    final = run_loop()
    print("---- final accepted solution ----")
    print(final)
Notice the loop has exactly one success exit and one failure exit, and the failure exit is reachable. An agent loop with no give-up path is a bug, not a feature.
Step 3 - Run it and watch the self-correction
Save both snippets into one file (loop_demo.py here) and run it. The first candidate fails an assertion (it never trims or collapses repeated spaces), the feedback comes back, and the second candidate passes:
$ python loop_demo.py
[iter 1] verify -> FAIL | AssertionError
[iter 2] verify -> PASS | all tests passed
[done] accepted after 2 iteration(s)
---- final accepted solution ----
def slugify(text):
    return "-".join(text.lower().split())
That is the entire idea in six lines of output. The loop did not stop because something felt right. It stopped because an external check returned green. Swap the test suite and the same harness now solves a different problem with zero changes to the loop.
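To make "swap the verifier" concrete, here is a sketch of a different gate with the same (passed, feedback) contract, this time checking structured JSON output instead of running tests. The REQUIRED spec and the verify_json name are made up for illustration, and the hand-rolled type check stands in for a real schema validator.

```python
import json

# Hypothetical spec: required keys and their expected Python types.
REQUIRED = {"title": str, "tags": list, "published": bool}


def verify_json(candidate: str):
    """Return (passed, feedback) for a JSON candidate."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e.msg} at line {e.lineno} column {e.colno}"
    if not isinstance(data, dict):
        return False, f"expected a JSON object, got {type(data).__name__}"
    for key, expected in REQUIRED.items():
        if key not in data:
            return False, f"missing required key: {key!r}"
        if not isinstance(data[key], expected):
            return False, (f"key {key!r} should be {expected.__name__}, "
                           f"got {type(data[key]).__name__}")
    return True, "all checks passed"


print(verify_json('{"title": "Loops", "tags": ["ai"], "published": true}'))
print(verify_json('{"title": "Loops", "tags": ["ai"]}'))
print(verify_json('{"title": 1, "tags": [], "published": true}'))
```

Because the return shape matches the test-suite verifier, a loop that only looks at passed and feedback can use either one unchanged.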
Step 4 - Swap in a real Claude agent with tool use
The only thing that changes for production is fake_model. Instead of returning a canned string, the model proposes code; we run the same verify gate and feed failures back as the next user turn. The agent's job is to write code, not to decide whether it passed — that authority stays with your verifier.
The pattern below follows the standard Anthropic Messages API loop: call the model, on stop_reason == "tool_use" execute the tool, append the assistant turn plus a tool_result, and continue. Verified against the official tool-use docs.
import os, anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

TOOLS = [{
    "name": "run_tests",
    "description": "Run the candidate slugify implementation against the hidden "
                   "test suite. Returns pass/fail and the failing error if any.",
    "input_schema": {
        "type": "object",
        "properties": {"code": {"type": "string",
                                "description": "Full source for solution.py"}},
        "required": ["code"],
    },
}]

SYSTEM = ("You are a coding agent. Write a Python function slugify(text). "
          "Call run_tests with your full source. If it fails, read the error "
          "and fix the code, then call run_tests again. Stop when tests pass.")

def agent_loop(max_iters=6):
    messages = [{"role": "user", "content": "Implement and verify slugify(text)."}]
    for i in range(max_iters):
        resp = client.messages.create(                   # ACT
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":               # STOP: model quit early
            print("[stop] model ended without a tool call")
            return None
        results = []
        for block in resp.content:
            if block.type == "tool_use" and block.name == "run_tests":
                passed, feedback = verify(block.input["code"])  # external gate
                print(f"[iter {i+1}] verify -> {'PASS' if passed else 'FAIL'} | {feedback}")
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": f"passed={passed}; feedback={feedback}",
                })
                if passed:                               # STOP: done
                    return block.input["code"]
        messages.append({"role": "user", "content": results})  # ADJUST
    print("[stop] retry budget exhausted")
    return None                                          # STOP: give up cleanly
This is loop engineering in production: the model reasons and acts, your code observes via a real test run, and the conversation only ends when the objective gate says green or the budget runs out. The harness is identical to the demo — you replaced the actor, not the loop.
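One hardening step to consider: as far as I understand the Messages API, every tool_use block in an assistant turn should be answered by a tool_result with the matching tool_use_id in the next user turn, and a result can be flagged with is_error. The sketch below answers every block, including calls to tools you never defined. build_tool_results is a hypothetical helper, and the SimpleNamespace objects are stand-ins for the SDK's content blocks, so no API key is needed to try it.

```python
from types import SimpleNamespace


def build_tool_results(blocks, verify):
    """Return (results, accepted_code); every tool_use block gets a tool_result."""
    results, accepted = [], None
    for block in blocks:
        if block.type != "tool_use":
            continue  # text blocks need no reply
        code = block.input.get("code") if block.name == "run_tests" else None
        if code is None:
            results.append({"type": "tool_result", "tool_use_id": block.id,
                            "content": "unknown tool or missing 'code' argument",
                            "is_error": True})
            continue
        passed, feedback = verify(code)  # external gate, same as before
        results.append({"type": "tool_result", "tool_use_id": block.id,
                        "content": f"passed={passed}; feedback={feedback}"})
        if passed and accepted is None:
            accepted = code
    return results, accepted


# Stand-in blocks and a stub verifier, for illustration only.
demo_blocks = [
    SimpleNamespace(type="text", text="Trying a fix."),
    SimpleNamespace(type="tool_use", id="t1", name="run_tests", input={"code": "good"}),
    SimpleNamespace(type="tool_use", id="t2", name="delete_repo", input={}),
]
demo_results, demo_accepted = build_tool_results(
    demo_blocks, lambda code: (code == "good", "ok"))
print(demo_accepted, [r["tool_use_id"] for r in demo_results])
```

In a loop, you would append the returned results as the next user turn and stop when accepted_code is not None.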
A worked real-world example
Point the same harness at a bug-fix task instead of a greenfield function. Your verifier becomes pytest -q on the repo; "act" becomes "propose a unified diff"; "adjust" feeds the failing test names and assertion diffs back. The loop runs unattended: apply patch, run tests, on red feed the output back, on green open a pull request. This is exactly the "agent that runs while you sleep" workflow people are posting about — and it is safe to leave alone only because the stop condition is a real test run, not the agent's self-assessment. Same four beats, bigger blast radius, same discipline.
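A repo-level verifier for that workflow might look like the sketch below. The verify_repo name, the default python -m pytest -q command, the five-minute timeout, and the 30-line feedback tail are all assumptions to tune for your project; the only pytest behavior relied on is that exit code 0 means every collected test passed.

```python
import subprocess
import sys


def verify_repo(repo_dir, cmd=None, timeout=300, tail_lines=30):
    """Run the project's test command in repo_dir and return (passed, feedback)."""
    cmd = cmd or [sys.executable, "-m", "pytest", "-q"]
    try:
        r = subprocess.run(cmd, cwd=repo_dir, capture_output=True,
                           text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"test run timed out after {timeout}s"
    if r.returncode == 0:
        return True, "all tests passed"
    # Keep the tail of the output: failing test names and assertion diffs live there.
    output = (r.stdout + "\n" + r.stderr).strip().splitlines()
    return False, "\n".join(output[-tail_lines:])


# Exercised with a stand-in command so the sketch runs without pytest installed.
green = verify_repo(".", cmd=[sys.executable, "-c", "print('ok')"])
red = verify_repo(".", cmd=[sys.executable, "-c",
                            "import sys; print('FAILED test_x'); sys.exit(1)"])
print(green, red, sep="\n")
```

Passing the command as a parameter keeps the gate reusable for other runners, such as a type checker or a build step.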
Common pitfalls and gotchas
- Self-grading. The cardinal sin. If your stop condition is the model answering "is this correct?", you do not have a verifier, you have a yes-man. Always gate on something external: tests, a compiler, a schema, a diff that applies cleanly.
- No retry budget. A loop without a hard iteration cap is an unbounded bill and a hang waiting to happen. Cap iterations, and in production cap tokens and wall-clock time too.
- Thin feedback. Our demo returns only the last traceback line, so the model has to infer a lot. The richer and more specific the feedback (failing input, expected vs actual), the fewer iterations to converge. Feedback quality is a tuning knob, not an afterthought.
- No isolation. Running candidate code in your main process means one infinite loop or os.remove ruins your day. Subprocess plus timeout plus a temp dir is the minimum; a container is better for untrusted code.
- Oscillation. Agents can flip between two wrong answers forever. Detect repeats (hash the candidate), and when you see one, change the prompt, raise the temperature, or stop. A retry budget is your safety net here.
- Verifier gaps. The agent optimizes for whatever your tests check. Weak tests yield code that passes and is still wrong. Treat the test suite as the spec and harden it like one.
- Ignoring the blocked state. Some tasks cannot be finished (missing dependency, ambiguous spec). Make "blocked" a first-class outcome that escalates to a human instead of retrying into the budget ceiling.
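The oscillation check from the list above can be sketched in a few lines. This is one possible approach, not a standard recipe: fingerprint and RepeatDetector are hypothetical names, and normalizing only trailing whitespace is an assumption; you may want a stricter or looser notion of "the same candidate".

```python
import hashlib


def fingerprint(code: str) -> str:
    """Hash a candidate after trimming trailing whitespace, so trivial edits still match."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RepeatDetector:
    """Remembers every candidate seen in a run and flags exact repeats."""

    def __init__(self):
        self.seen = set()

    def is_repeat(self, code: str) -> bool:
        fp = fingerprint(code)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


detector = RepeatDetector()
first = detector.is_repeat("def f():\n    return 1")
second = detector.is_repeat("def f():\n    return 2")
again = detector.is_repeat("def f():\n    return 1   \n")  # same code, extra whitespace
print(first, second, again)
```

Call is_repeat on each new candidate before verifying it; a True result is the cue to change the prompt, raise the temperature, or stop.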
Quick reference
| Concept | What it is | Get it wrong and... |
|---|---|---|
| Act | Model proposes a candidate | Too much per step; hard to verify |
| Verify | External objective check (tests/compiler) | Self-grading -> confidently broken output |
| Adjust | Feed concrete failure back into next turn | Thin feedback -> slow or no convergence |
| Retry budget | Hard cap on iterations / tokens / time | Runaway cost, hangs, infinite loops |
| Stop rules | Done / blocked / give-up outcomes | Loop never terminates cleanly |
| Isolation | Subprocess + timeout + temp dir | One bad candidate crashes the harness |
Next steps
- Replace the test-suite verifier with one for your domain: mypy for types, JSON Schema for structured output, a linter for style.
- Add oscillation detection by hashing candidates and stopping on a repeat.
- Track tokens and wall-clock per run; turn the iteration cap into a real budget.
- Promote the loop to multi-tool: let the agent read files, grep, and run a subset of tests, not just submit one blob.
- Read the original ReAct paper, then map its reason/act/observe steps onto the act/verify/adjust beats you just built.
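For the budget item above, one way to turn the iteration cap into a real budget is a small object the loop charges on every turn. The Budget class and its default limits are illustrative assumptions, not recommendations.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    max_iters: int = 6
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    iters: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int = 0) -> None:
        """Record one finished iteration and the tokens it used."""
        self.iters += 1
        self.tokens += tokens

    def exhausted(self):
        """Return the reason the budget is spent, or None if there is room left."""
        if self.iters >= self.max_iters:
            return "iteration cap reached"
        if self.tokens >= self.max_tokens:
            return "token cap reached"
        if time.monotonic() - self.started >= self.max_seconds:
            return "time cap reached"
        return None


budget = Budget(max_iters=10, max_tokens=100)
budget.charge(tokens=40)
before = budget.exhausted()   # still room on every axis
budget.charge(tokens=70)
after = budget.exhausted()    # 110 tokens used against a cap of 100
print(before, after)
```

With the Anthropic SDK, the token count to charge would likely come from resp.usage.input_tokens plus resp.usage.output_tokens; check budget.exhausted() at the top of each iteration and stop with the returned reason.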
Loop engineering is not a new model or a new framework. It is the discipline of putting an objective gate between an agent and the word "done." Build the verifier first, cap the budget, make every stop state explicit — and your agent can run while you sleep, because the thing deciding it succeeded is no longer the thing that wrote the code.