A ralph loop is a coding agent in a while loop. That's the whole idea, and most pages stop there.
The hard part isn't the loop. It's the gate that decides when a task is actually done, and the
budget that keeps it from quietly spending $200 of your tokens overnight. This is the version written by someone
who built a harness for it and has been running it against real work for months.
The name isn't mine. Geoffrey Huntley coined it on 2025-07-14 in a post called "Ralph Wiggum as a software engineer," and the joke is the point: the technique is dumb in the best way.[1] You loop a coding agent against a goal until the goal is met or you run out of patience. I built ralphctl, one runnable implementation of that idea. So this is a practitioner's explainer, not a press release. Every claim below is runnable, points at a source, or is labelled an opinion.
What is a ralph loop?
Here's the one-sentence version, the one I'd want lifted verbatim:
A ralph loop is an autonomous coding workflow where the agent runs in a fresh context each iteration, with state living on disk (git, files) rather than in the model's context window.
The bare mechanism is a shell loop. Strip everything away and it looks like this:
    while :; do
      cat PROMPT.md | claude -p
    done

You write the goal into PROMPT.md, the agent reads it, does one pass of work, commits, and the
loop hands a clean slate to the next iteration. The git history and the files on disk are the
memory. The context window is disposable on purpose. That's the entire trick, and you can run it
tonight.
The interesting questions start right after that snippet: how does the loop know when to stop, who decides a task is actually finished, and what keeps it from looping forever on something it can't do? The rest of this post is those questions.
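As a minimal sketch of the first of those questions, here is the bare loop with two stop conditions bolted on: an iteration cap and a marker file on disk. The function name run_loop and the marker .ralph-done are hypothetical, not part of any tool mentioned here. The marker is written by the agent itself, so this is a self-reported "done", which is weaker than the referee covered later in this post.

```shell
#!/usr/bin/env bash
# Sketch only: run_loop and the marker file are hypothetical, not from any tool.

# run_loop <max_iterations> <done_file> <agent command...>
# Feeds PROMPT.md to the agent command once per iteration and stops when
# the marker file exists or the iteration cap is hit.
run_loop() {
  local max="$1" done_file="$2" i
  shift 2
  for ((i = 1; i <= max; i++)); do
    if [[ -e "$done_file" ]]; then
      echo "done after $((i - 1)) iteration(s)"
      return 0
    fi
    # Fresh process each lap; a failed lap is logged, not fatal.
    "$@" < PROMPT.md || echo "iteration $i: agent exited non-zero" >&2
  done
  if [[ -e "$done_file" ]]; then
    echo "done after $max iteration(s)"
    return 0
  fi
  echo "stopped at the cap of $max iteration(s)" >&2
  return 1
}

# Usage, assuming PROMPT.md tells the agent to create .ralph-done when finished:
#   run_loop 25 .ralph-done claude -p
```

The cap is the part that protects your bill: even if the marker never appears, the loop ends after a known number of laps.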
Where the ralph loop came from
The full name is the Ralph Wiggum loop, and Huntley's original post didn't just name it; it named the thing that keeps it honest. He pairs the loop with backpressure: a build or a test gate the agent has to satisfy before the work counts, which is what stops the loop wandering.
Through early 2026 the pattern got picked up and hardened across a wave of open implementations (I compare four of them at the end). The one that matters most for this post is Anthropic's engineering write-up on harness design, which names the Ralph Wiggum loop directly and frames the maker/checker split as a GAN (two models pitted against each other, one generating, one judging): the model that generates is not the model that evaluates.[2] That is third-party validation for the thing most ralph posts skip, the referee that decides when a task is actually done. For the dated, blow-by-blow timeline, HumanLayer keeps a good one.[3]
Why a fresh context each iteration?
This is the design choice that separates a ralph loop from "just keep chatting with the agent."
Coding agents work in discrete sessions, like engineers working in shifts: each new one arrives with no memory of what the last one did. You have two ways to handle that gap. You can compact (summarise the conversation in place and keep going), or you can reset (wipe the window and start a fresh agent that boots from a structured handoff on disk).[2]
Ralph chooses reset. The reason is a failure mode you only see after you've watched a few long runs: an agent nearing its perceived context limit starts wrapping up early, declaring things done that aren't, because it can feel the window filling. A fresh context every iteration cures that. The agent that boots into iteration 40 has the same clean head the agent in iteration 1 had. The state it needs (what's been built, what's left, what passed) lives in git and in files, so the reset costs nothing.
The part everyone skips: the referee
Exit detection is the hard part. The bare while :; do loop above never stops on its own, which is
exactly the "ralph loop not stopping" problem people hit. Stopping cleanly needs a done-condition
the loop can check: a task list where items flip from todo to done only after they're verified, and
a gate that decides whether "verified" is true. That gate is the referee.
If one decision holds the whole thing up, it's this: the model that writes the code is not the model that decides whether the code is done.
An agent grading its own homework gives itself an A every time.
Anthropic found the same thing and put it plainly: out of the box, the model is a poor QA agent. It finds a real bug, then talks itself into approving the work anyway.[2]
So in ralphctl an independent reviewer checks each task against its
verification criteria. If it fails, the generator gets the critique back (the actual reasons it failed) and tries again. It retries up to harness.maxAttempts, which defaults to 3,
before the task is flagged blocked rather than done.[4]
Three is a deliberate default. Set it too high and a confused agent will burn tokens forever talking a tired evaluator into a pass. Set it too low and you flag work that one more round would have fixed. It's the line between a loop that converges and a loop that spins.
The honest limit: the evaluator can be wrong too. It can pass code that satisfies thin criteria and still miss the point. The harness doesn't rescue badly written criteria. It stops the agent from talking you out of good ones.
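The retry-with-critique loop described above can be sketched in a few lines of shell. This is not ralphctl's implementation: run_task, the generator, and the evaluator are hypothetical stand-ins, and the only things taken from the post are the shape (generate, judge, feed the critique back) and the attempt cap.

```shell
#!/usr/bin/env bash
# Sketch only: run_task and both commands are hypothetical stand-ins.

# run_task <max_attempts> <generate_cmd> <evaluate_cmd>
#   generate_cmd gets the previous critique as $1 (empty on the first attempt).
#   evaluate_cmd prints a critique on stdout and exits 0 only if the work passes.
# Prints "done" or "blocked" and returns 0 or 1 to match.
run_task() {
  local max_attempts="$1" generate_cmd="$2" evaluate_cmd="$3"
  local attempt critique=""
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    "$generate_cmd" "$critique"
    # The evaluator is a separate command: the generator never grades itself.
    if critique="$("$evaluate_cmd")"; then
      echo "done"
      return 0
    fi
  done
  echo "blocked"
  return 1
}

# Usage, with two scripts of your own:
#   run_task 3 ./generate.sh ./evaluate.sh
```

The detail that matters is that a failed evaluation produces text, and that text is the next attempt's input. Without it, a retry is just the same mistake again.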
How to run a ralph loop, concretely
You've already seen the bare loop at the top: while :; do cat PROMPT.md | claude -p; done. Run
that before reaching for anything heavier. When it stops carrying you (no done-condition, no
reviewer, no recovery when it crashes at 2am), here's the upgrade:
    npm install -g ralphctl   # Node >= 24
    ralphctl

ralphctl is at v0.14.0 as of 2026-06-30.[4] You hand it a sprint description in plain language (how I built that decomposition). It decomposes that into a dependency-ordered task graph, groups independent tasks into waves, and drives each task through the generator-evaluator loop (a separate model grades each task against its verification criteria) before the task counts as done. State (the sprint, the branch, per-task progress) persists across sessions, so an interrupted run resumes the in-progress task instead of losing the work.[4]
The difference between the two snippets is everything this post is about. The first is the idea. The second is the idea plus exit detection, a referee, and crash recovery. Start with the first. Reach for the second when it stops being enough. The "should you build your own?" section below spells out exactly when that is.
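To make "dependency-ordered task graph, grouped into waves" concrete, here is a small sketch of the grouping step. It is my illustration, not ralphctl's code: plan_waves and the "task: deps" input format are hypothetical. A task joins a wave once everything it depends on finished in an earlier wave.

```shell
#!/usr/bin/env bash
# Sketch only: plan_waves and the "task: dep dep" input format are hypothetical.

# plan_waves reads "task: dep dep ..." lines on stdin and prints one wave per
# line. Tasks on the same line do not depend on each other.
plan_waves() {
  local pending=() wave next line task dep ready done_list=" "
  while IFS= read -r line; do
    if [[ -n "$line" ]]; then pending+=("$line"); fi
  done
  while ((${#pending[@]} > 0)); do
    wave=()
    next=()
    for line in "${pending[@]}"; do
      task="${line%%:*}"
      ready=1
      for dep in ${line#*:}; do
        # A dependency counts only if it finished in an earlier wave.
        [[ "$done_list" == *" $dep "* ]] || ready=0
      done
      if ((ready)); then wave+=("$task"); else next+=("$line"); fi
    done
    if ((${#wave[@]} == 0)); then
      echo "cycle or missing dependency" >&2
      return 1
    fi
    echo "${wave[*]}"
    done_list+="${wave[*]} "
    pending=(${next[@]+"${next[@]}"})
  done
}

# Usage:
#   printf '%s\n' 'schema:' 'api: schema' 'client: schema' 'e2e: api client' | plan_waves
```

For that usage line the waves come out as schema, then api and client together, then e2e. The middle wave is the only place parallelism would be safe.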
The dials that keep it from burning your tokens
If you pay the API bill, read this section first. It's the part the origin pages don't cover.
Quality has a price, and it's measurable. Anthropic's harness write-up put numbers on the same task two ways: a quick solo run came in around 20 minutes and $9 and produced a broken core, while the full multi-agent harness ran roughly 6 hours and $200 and produced something that actually worked.[2] More than twenty times the cost. That gap is the whole reason a ralph harness needs a budget in the first place.
ralphctl's answer is presets. It ships 20 of them: five families (standard, economic,
strong-gate, fast, frontier) across four provider variants each.[4] Each family encodes a
stance. strong-gate puts a cheap generator behind a permanently top-tier evaluator. fast sets
harness.escalateOnPlateau = false, so when the model stops improving the loop settles instead of
reaching for a more expensive one; every other family sets that true.[4] The default
posture is "use the cheap model, let the task earn the expensive one." On a stall the harness climbs
the model ladder one rung at a time, carrying the specific critique upward, and only falls back to
"change your approach" when it's out of rungs.
The other budget lever is diff-scoped verify gates. Each module declares its own gates: a
pathPrefix, a command, an optional timeout. Before a task runs, ralphctl runs all of them once
to set a clean baseline. After the task, it runs only the gates whose pathPrefix matches what the
diff actually touched, and fails fast.[4] You don't re-run the whole test suite because the
agent edited one file. You run what the change could plausibly have broken.
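A rough sketch of that selection step, assuming a made-up "pathPrefix|command" string per gate (ralphctl's real config format is not shown in this post, and select_gates and run_gates are hypothetical names). The baseline run and the optional timeout are left out to keep it short.

```shell
#!/usr/bin/env bash
# Sketch only: the "pathPrefix|command" gate format is made up for this example.

# select_gates <gate...>  (changed file paths on stdin, one per line)
# Prints the command of every gate whose prefix matches a changed path.
select_gates() {
  local changed=() line gate prefix file
  while IFS= read -r line; do
    if [[ -n "$line" ]]; then changed+=("$line"); fi
  done
  for gate in "$@"; do
    prefix="${gate%%|*}"
    for file in ${changed[@]+"${changed[@]}"}; do
      if [[ "$file" == "$prefix"* ]]; then
        echo "${gate#*|}"
        break # one matching file is enough to trigger this gate
      fi
    done
  done
}

# run_gates (commands on stdin): runs each in order, stops at the first failure.
run_gates() {
  local cmd
  while IFS= read -r cmd; do
    echo "gate: $cmd"
    if ! bash -c "$cmd" < /dev/null; then
      echo "gate failed: $cmd" >&2
      return 1
    fi
  done
}

# Usage, with $BASE as the commit the task started from:
#   gates=("api/|npm test --prefix api" "web/|npm test --prefix web")
#   git diff --name-only "$BASE" | select_gates "${gates[@]}" | run_gates
```

A diff that only touches api/ never pays for the web/ suite, and the first red gate ends the check instead of running the rest.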
I've written the month-by-month version of how these budget dials evolved (and the lock bug that parallelism exposed) in the field report on taking ralphctl from 0.8 to 0.13. This section is the mechanism; that post is the lived ledger.
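Going back to the model ladder from the presets paragraph, the decision it makes on a stall is small enough to sketch. The function next_step and the model names are placeholders of mine, not ralphctl's API; only the three outcomes (settle, climb one rung, change approach) come from the behaviour described above. Carrying the critique upward is the caller's job and is not shown.

```shell
#!/usr/bin/env bash
# Sketch only: next_step and the model names are placeholders, not ralphctl's API.

# next_step <escalate_on_plateau> <current_model> <ladder, cheapest first...>
# Prints what to do when the current model has stopped improving.
next_step() {
  local escalate="$1" current="$2" i
  shift 2
  local ladder=("$@")
  if [[ "$escalate" != "true" ]]; then
    echo "settle" # escalation switched off: stay on the current model
    return 0
  fi
  for ((i = 0; i < ${#ladder[@]}; i++)); do
    if [[ "${ladder[i]}" == "$current" ]]; then
      if ((i + 1 < ${#ladder[@]})); then
        echo "escalate:${ladder[i + 1]}" # exactly one rung up
      else
        echo "change-approach" # already on the top rung
      fi
      return 0
    fi
  done
  echo "unknown model: $current" >&2
  return 2
}

# Usage:
#   next_step true small small medium large
```

Climbing one rung at a time is the cost control: a stall on the cheapest model never jumps straight to the most expensive one.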
Running a ralph loop on Copilot CLI and Codex
ralphctl orchestrates three providers: Claude Code, GitHub Copilot CLI, and OpenAI Codex.[4] From the README's three logos you'd assume they're peers. They're not, and I'd rather you hear that from me than find out mid-sprint.
Only Claude Code is verified end-to-end. Copilot CLI and Codex ship as preview. They run, but
the full loop isn't formally validated on them yet. Bundled skill injection quietly no-ops on the
preview providers, and Codex can't do fine-grained edit denials on existing files because its
sandbox is binary, so the only safety envelope there is path scope (cwd plus
--add-dir).[4]
If you got here from a "ralph loop copilot cli" search: it works, it's preview, and Claude is the path I'd trust with your repo today.
Where the ralph loop still breaks
A short, honest list, because you'll meet these.
- It can loop forever on unverifiable criteria. If "done" can't be checked, the evaluator can't end the run, and maxAttempts is the only thing standing between you and an overnight token fire. Write checkable criteria.
- The harness runs out of memory before the model does. "Long-running" is a property of your own process, not just the agent's context. The bugs that cost me the most weren't in the agent, they were in the orchestrator I built to babysit it: heap leaks, an over-large git diff that OOM'd the process, a paste big enough to scroll-lock the terminal. The model never runs out of memory. Your harness does.
- Unattended loops make unattended mistakes. Running --yolo has a blast radius. Sub-agent verification is a claim, not proof.
- Parallelism corrupts branches if you let it. ralphctl runs serial by default (concurrency.maxParallelTasks is 1, range 1 to 5). Turn it up and independent same-wave tasks run each in its own git worktree, folding back to one branch, but two tasks editing the same file must be serialised or the second clobbers the first.[4]
The tooling has real failure modes of its own. The official Anthropic ralph-loop plugin has had its stop-hook misbehave; there's a tracked issue for it.[5] Worth knowing the loop's exit machinery is a known soft spot across implementations, not any one tool's quirk.
Do you actually need a ralph harness?
So should you build your own? My answer, and it's an opinion: reach for a real ralph harness when
three things are true at once. The work is too big for one context window. You can't trust the agent
to grade itself. And "it crashed at 2am" has to mean "it resumed at 2:01," not "I lost the run."
That's the seam a harness is built for: a dependency-ordered graph, an independent referee, and
state that survives a crash. If none of those are true, the bare shell loop and a git worktree (a
second checkout of the same repo on its own branch) will carry you further than you'd think, and you
should stay there.
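If you do stay on the bare loop, the worktree half of that advice is a couple of stock git commands. A minimal sketch follows; start_worktree is a hypothetical helper, and the branch and directory names are placeholders. It assumes PROMPT.md is committed, so it exists in the new checkout.

```shell
#!/usr/bin/env bash
# Sketch only: start_worktree is a hypothetical helper; names are placeholders.

# start_worktree <repo_dir> <branch> <worktree_dir>
# Creates a new branch checked out in its own directory, so the loop can
# commit there without touching the checkout you are working in.
start_worktree() {
  local repo="$1" branch="$2" dir="$3"
  git -C "$repo" worktree add -b "$branch" "$dir"
}

# Usage, from inside your repo:
#   start_worktree . ralph/overnight ../myrepo-ralph
#   (cd ../myrepo-ralph && while :; do cat PROMPT.md | claude -p; done)
#   git merge ralph/overnight            # after you stop the loop and review the branch
#   git worktree remove ../myrepo-ralph
```

The loop gets its own branch and directory, so a bad night costs you a branch you can delete rather than your working tree.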
How ralphctl compares to the other ralph harnesses
I built ralphctl, so read this as a feature inventory, not a neutral review, with the bias declared up front.
| Harness | Independent evaluator | Budget / presets | Parallel worktrees | Multi-repo |
|---|---|---|---|---|
| snarktank/ralph | No | No | No | No |
| umputun/ralphex | Yes (5-agent review) | No | Yes | No |
| iannuttall/ralph | No | No | No | No |
| ralphctl | Yes (single reviewer) | Yes (20 presets) | Yes | Yes |
Two columns are where ralphctl genuinely stands alone: budget presets and multi-repo sprints. No other implementation I checked is budget-aware, ralphex included.[6] The evaluator column is not a clean win: ralphex's five-specialist review pipeline is arguably deeper than ralphctl's single reviewer. So the claim that holds up is the combination, not any single cell: ralphctl is the one that puts a review gate, a token budget, parallel worktrees, and multi-repo in the same tool.
Multi-repo is the one worth spelling out, because it's the least visible. One sprint can span several repositories at once (a service and the client SDK that consumes it), each with its own setup and verify scripts, so a change that only makes sense if it lands in both gets driven, verified, and committed as one unit instead of two runs you reconcile by hand. No other ralph implementation I checked does this.
Is a ralph loop just orchestration?
Sort of, and the difference is the point. A classic orchestrator runs a fixed DAG of steps you wired up in advance. A ralph loop is dumber and more stubborn: it re-derives what to do from disk each lap and keeps going until a gate says stop. The task graph gives it structure, but the loop, the fresh context, and the referee are what make it a ralph loop rather than a workflow engine. Building that structure out into something you can trust overnight is the harness itself, and the AI Agent Harnesses field guide is where the fresh context, the referee, and the check gates get assembled into one.
Run the bare while loop tonight. When it stops carrying you, npm install -g ralphctl, point it
at something real, and find the place this explainer is too kind:
open an issue, or star it so it's there the night
you hit the wall.
Source on GitHub · npm · MIT license.
Footnotes
1. Geoffrey Huntley, "Ralph Wiggum as a software engineer" (2025-07-14). The "ralph" name, the loop-until-done technique, and "backpressure" are his.
2. Anthropic Engineering, "Harness design for long-running application development". Names the Ralph Wiggum loop; frames the generator/evaluator split as a GAN; reports the "poor QA agent" finding and the ~20 min / $9 (broken) vs ~6 hr / $200 (working) cost comparison on the same build.
3. HumanLayer, "A brief history of Ralph". A dated timeline of the technique.
4. Every ralphctl version and behaviour claim here is drawn from the project's CHANGELOG and source: v0.14.0 (2026-06-30); harness.maxAttempts default 3; the 20-preset matrix (five families across four provider variants: standard, economic, strong-gate, fast, frontier); escalateOnPlateau false for fast, true elsewhere; diff-scoped per-module verify gates; concurrency.maxParallelTasks default 1, range 1 to 5, with per-task git worktrees; three providers with Claude Code verified end-to-end and Copilot CLI + Codex in preview; Node >= 24.
5. anthropics/claude-plugins-official, issue #394: stop-hook behavior in the official ralph-loop plugin.
6. Competitor feature inventory compiled from each project's repository and documented features as of 2026-06-27. Budget-awareness verified against each repo's documented features on that date.
