What a ralph loop actually is, and how to run one without burning your tokens

A ralph loop is a coding agent in a while loop. What it is, why it resets context each pass, and how to run one without it spinning forever or costing $200 overnight.

Lukas Grigis · Software Architect · July 1, 2026 · 13 min read


Tags: ai, ai-agents, claude-code, cli, developer-tooling, open-source

Part of the guide: AI Agent Harnesses: A Field Guide

On this page

What is a ralph loop?
Where the ralph loop came from
Why a fresh context each iteration?
The part everyone skips: the referee
How to run a ralph loop, concretely
The dials that keep it from burning your tokens
Running a ralph loop on Copilot CLI and Codex
Where the ralph loop still breaks
Do you actually need a ralph harness?
How ralphctl compares to the other ralph harnesses
Is a ralph loop just orchestration?

Key takeaways

State lives on disk (git history, files), not in the model's context window, so the agent resets to a clean head every iteration instead of dragging a filling window forward.
The hard part isn't the loop, it's the gate that decides when a task is done: a separate model grades the work against its verification criteria, so the agent can't give itself an A.
Start with the bare `while :; do cat PROMPT.md | claude -p; done` loop tonight; reach for a real harness only when the work outgrows one context window and you need crash recovery.
The cost gap is real (Anthropic measured ~$9 for a quick-but-broken run vs ~$200 for a working one), which is why a ralph harness enforces a token budget the run can't exceed.

A ralph loop is a coding agent in a while loop. That's the whole idea, and most pages stop there. The hard part isn't the loop. It's the gate that decides when a task is actually done, and the part where it doesn't quietly spend $200 of your tokens overnight. This is the version written by someone who built a harness for it and has been running it against real work for months.

The name isn't mine. Geoffrey Huntley coined it on 2025-07-14 in a post called "Ralph Wiggum as a software engineer," and the joke is the point: the technique is dumb in the best way.¹ You loop a coding agent against a goal until the goal is met or you run out of patience. I built ralphctl, one runnable implementation of that idea. So this is a practitioner's explainer, not a press release. Every claim below is runnable, points at a source, or is labelled an opinion.

What is a ralph loop?

Here's the one-sentence version, the one I'd want lifted verbatim:

A ralph loop is an autonomous coding workflow where the agent runs in a fresh context each iteration, with state living on disk (git, files) rather than in the model's context window.

The bare mechanism is a shell loop. Strip everything away and it looks like this:

```bash
while :; do
  cat PROMPT.md | claude -p
done
```

You write the goal into PROMPT.md, the agent reads it, does one pass of work, commits, and the loop hands a clean slate to the next iteration. The git history and the files on disk are the memory. The context window is disposable on purpose. That's the entire trick, and you can run it tonight.

The interesting questions start right after that snippet: how does the loop know when to stop, who decides a task is actually finished, and what keeps it from looping forever on something it can't do? The rest of this post is those questions.

Where the ralph loop came from

The full name is the Ralph Wiggum loop, and Huntley's original post didn't just name it; it named the thing that keeps it honest. He pairs the loop with backpressure: a build or a test gate the agent has to satisfy before the work counts, which is what stops the loop wandering.

Through early 2026 the pattern got picked up and hardened across a wave of open implementations (I compare four of them at the end). The one that matters most for this post is Anthropic's engineering write-up on harness design, which names the Ralph Wiggum loop directly and frames the maker/checker split as a GAN (two models pitted against each other, one generating, one judging): the model that generates is not the model that evaluates.² That is third-party validation for the thing most ralph posts skip, the referee that decides when a task is actually done. For the dated, blow-by-blow timeline, HumanLayer keeps a good one.³

Why a fresh context each iteration?

This is the design choice that separates a ralph loop from "just keep chatting with the agent."

Coding agents work in discrete sessions, like engineers working in shifts: each new one arrives with no memory of what the last one did. You have two ways to handle that gap. You can compact (summarise the conversation in place and keep going), or you can reset (wipe the window and start a fresh agent that boots from a structured handoff on disk).²

Figure: Compaction lets the window climb until the agent wraps up early (red). Reset drops it to zero each lap (green) and trusts the disk instead, which grows every lap (orange).

Ralph chooses reset. The reason is a failure mode you only see after you've watched a few long runs: an agent nearing its perceived context limit starts wrapping up early, declaring things done that aren't, because it can feel the window filling. A fresh context every iteration cures that. The agent that boots into iteration 40 has the same clean head the agent in iteration 1 had. The state it needs (what's been built, what's left, what passed) lives in git and in files, so the reset costs nothing.

The part everyone skips: the referee

Exit detection is the hard part. The bare `while :; do` loop above never stops on its own, which is exactly the "ralph loop not stopping" problem people hit. Stopping cleanly needs a done-condition the loop can check: a task list where items flip from todo to done only after they're verified, and a gate that decides whether "verified" is true. That gate is the referee.
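A minimal sketch of that done-condition in shell, under assumptions that are mine rather than any harness's: a hypothetical TASKS.md checklist where `- [ ]` marks todo and `- [x]` marks done, and a hard iteration cap as the backstop. The agent command is passed in by name so the loop itself can be exercised without spending tokens.

```bash
#!/usr/bin/env bash
# Sketch only. TASKS.md and its "- [ ]" / "- [x]" checklist format are
# assumptions for illustration, not a convention of any particular harness.

# Count the items still marked todo in a checklist file.
count_todo() {
  grep -c '^- \[ \]' "$1" || true   # grep -c prints 0 but exits 1 on no match
}

# One pass of the bare loop from above (ignores the tasks-file argument).
run_claude() {
  cat PROMPT.md | claude -p
}

# Loop until nothing is left todo, or until the iteration cap is hit.
# usage: ralph_loop <tasks-file> <max-iterations> <agent-function>
ralph_loop() {
  local tasks_file=$1 max_iters=$2 agent=$3 i=0
  while :; do
    if [ "$(count_todo "$tasks_file")" -eq 0 ]; then
      echo "done after $i iterations"
      return 0
    fi
    if [ "$i" -ge "$max_iters" ]; then
      echo "stopped at the cap ($max_iters) with work still todo" >&2
      return 1
    fi
    i=$((i + 1))
    "$agent" "$tasks_file" || echo "iteration $i exited non-zero" >&2
  done
}

# Example: ralph_loop TASKS.md 25 run_claude
```

This only answers "when does the loop stop". It still trusts whoever ticks the box, which is the problem the referee exists to solve.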

If one decision holds the whole thing up, it's this: the model that writes the code is not the model that decides whether the code is done.

An agent grading its own homework gives itself an A every time.

Anthropic found the same thing and put it plainly: out of the box, the model is a poor QA agent. It finds a real bug, then talks itself into approving the work anyway.²

So in ralphctl an independent reviewer checks each task against its verification criteria. If it fails, the generator gets the critique back (the actual reasons it failed) and tries again. It retries up to `harness.maxAttempts`, which defaults to 3, before the task is flagged blocked rather than done.⁴

Figure: The generator writes; the evaluator gate grades against the task's criteria. Pass goes to done (green); fail returns the critique, up to three tries, then blocked (red).

Three is a deliberate default. Set it too high and a confused agent will burn tokens forever talking a tired evaluator into a pass. Set it too low and you flag work that one more round would have fixed. It's the line between a loop that converges and a loop that spins.
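The retry gate is small enough to sketch in shell. This illustrates the shape and is not ralphctl's implementation: the generator and evaluator are hypothetical functions passed in by the caller, the evaluator is assumed to print its critique on stdout and signal pass or fail through its exit status, and the default of 3 mirrors `harness.maxAttempts`.

```bash
#!/usr/bin/env bash
# Sketch of a generator/evaluator gate. Not ralphctl's code: the function
# names and the "critique on stdout, verdict in exit status" contract are
# assumptions made for this example.

# usage: run_with_referee <generate-fn> <evaluate-fn> [max-attempts]
run_with_referee() {
  local generate=$1 evaluate=$2 max_attempts=${3:-3}
  local attempt critique=""
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # The generator sees the previous critique (empty on the first try).
    "$generate" "$critique"
    # The evaluator is a separate command; its exit status is the verdict.
    if critique=$("$evaluate"); then
      echo "done (attempt $attempt)"
      return 0
    fi
  done
  echo "blocked after $max_attempts attempts: $critique"
  return 1
}
```

The point of the structure is that the generator never gets to report success itself: only the evaluator's exit status moves a task to done, and the cap turns "still failing" into a visible blocked state instead of another lap.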

The honest limit: the evaluator can be wrong too. It can pass code that satisfies thin criteria and still miss the point. The harness doesn't rescue badly written criteria. It stops the agent from talking you out of good ones.

How to run a ralph loop, concretely

You've already seen the bare loop at the top: `while :; do cat PROMPT.md | claude -p; done`. Run that before reaching for anything heavier. When it stops carrying you (no done-condition, no reviewer, no recovery when it crashes at 2am), here's the upgrade:

```bash
npm install -g ralphctl   # Node >= 24
ralphctl
```

ralphctl is at v0.14.0 as of 2026-06-30.⁴ You hand it a sprint description in plain language; how I built the decomposition step is covered in a separate post. ralphctl breaks the description into a dependency-ordered task graph, groups independent tasks into waves, and drives each task through the generator-evaluator loop (a separate model grades each task against its verification criteria) before the task counts as done. State (the sprint, the branch, per-task progress) persists across sessions, so an interrupted run resumes the in-progress task instead of losing the work.⁴
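Grouping a dependency graph into waves is a layered topological sort, and a rough shell sketch makes the idea concrete. Everything here is illustrative: the `task dep dep...` input format and the task names are invented for the example, and ralphctl's actual planner is not shown.

```bash
#!/usr/bin/env bash
# Sketch: group tasks into waves so that every task runs after its
# dependencies. Input format (one "task dep1 dep2 ..." per line) and the
# task names are hypothetical.

plan_waves() {
  local remaining=$1 done_tasks=" " wave next line task deps dep ready n=0
  while [ -n "$remaining" ]; do
    wave="" next=""
    while IFS= read -r line; do
      [ -n "$line" ] || continue
      read -r task deps <<< "$line"
      ready=1
      # A task is ready only if every dependency finished in an earlier wave.
      for dep in $deps; do
        case $done_tasks in
          *" $dep "*) ;;
          *) ready=0 ;;
        esac
      done
      if [ "$ready" -eq 1 ]; then
        wave="$wave $task"
      else
        next="$next$line"$'\n'
      fi
    done <<< "$remaining"
    if [ -z "$wave" ]; then
      echo "cycle or missing dependency" >&2
      return 1
    fi
    n=$((n + 1))
    echo "wave $n:$wave"
    done_tasks="$done_tasks${wave# } "
    remaining=$next
  done
}

plan_waves $'schema\napi schema\nsdk api\ndocs'
# wave 1: schema docs
# wave 2: api
# wave 3: sdk
```

Tasks inside one wave have no dependencies on each other, which is what makes them candidates for parallel execution; tasks in later waves wait.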

The difference between the two snippets is everything this post is about. The first is the idea. The second is the idea plus exit detection, a referee, and crash recovery. Start with the first. Reach for the second when it stops being enough. The "Do you actually need a ralph harness?" section below spells out exactly when that is.

The dials that keep it from burning your tokens

If you pay the API bill, read this section first. It's the part the origin pages don't cover.

Quality has a price, and it's measurable. Anthropic's harness write-up put numbers on the same task two ways: a quick solo run came in around 20 minutes and $9 and produced a broken core, while the full multi-agent harness ran roughly 6 hours and $200 and produced something that actually worked.² More than twenty times the cost. That gap is the whole reason a ralph harness needs a budget in the first place.

ralphctl's answer is presets. It ships 20 of them: five families (standard, economic, strong-gate, fast, frontier) across four provider variants each.⁴ Each family encodes a stance. strong-gate puts a cheap generator behind a permanently top-tier evaluator. fast sets harness.escalateOnPlateau = false, so when the model stops improving the loop settles instead of reaching for a more expensive one; every other family sets that true.⁴ The default posture is "use the cheap model, let the task earn the expensive one." On a stall the harness climbs the model ladder one rung at a time, carrying the specific critique upward, and only falls back to "change your approach" when it's out of rungs.
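The escalation rule reads as a small decision function. A hedged sketch, assuming a ladder passed as an ordered list of model names (`small`, `medium`, `large` are placeholders, not ralphctl preset values) and a flag standing in for `harness.escalateOnPlateau`:

```bash
#!/usr/bin/env bash
# Sketch of "climb one rung on a plateau". The ladder contents and the
# output strings are invented for illustration; only the rule itself
# (settle, climb one rung, or change approach) comes from the text above.

# usage: on_plateau <current-model> <escalate: true|false> <ladder...>
on_plateau() {
  local current=$1 escalate=$2
  shift 2
  local ladder=("$@") i
  if [ "$escalate" != "true" ]; then
    echo "settle"                      # fast-style stance: never escalate
    return 0
  fi
  for i in "${!ladder[@]}"; do
    if [ "${ladder[i]}" = "$current" ]; then
      if [ $((i + 1)) -lt "${#ladder[@]}" ]; then
        echo "escalate ${ladder[i + 1]}"   # exactly one rung up
      else
        echo "change-approach"             # out of rungs
      fi
      return 0
    fi
  done
  echo "unknown model: $current" >&2
  return 1
}

on_plateau small true small medium large    # escalate medium
on_plateau large true small medium large    # change-approach
on_plateau small false small medium large   # settle
```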

The other budget lever is diff-scoped verify gates. Each module declares its own gates: a pathPrefix, a command, an optional timeout. Before a task runs, ralphctl runs all of them once to set a clean baseline. After the task, it runs only the gates whose pathPrefix matches what the diff actually touched, and fails fast.⁴ You don't re-run the whole test suite because the agent edited one file. You run what the change could plausibly have broken.
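The selection step can be sketched in a few lines of shell. This assumes a made-up gate format of one `pathPrefix|command` pair per line and a newline-separated list of changed paths such as `git diff --name-only` produces; ralphctl's real configuration schema is not shown here.

```bash
#!/usr/bin/env bash
# Sketch of diff-scoped gate selection. The "pathPrefix|command" line
# format is an assumption for this example, not ralphctl's config schema.

# Print the command of every gate whose prefix matches a changed path.
# usage: select_gates <gates> <changed-paths>
select_gates() {
  local gates=$1 changed=$2 prefix cmd file
  while IFS='|' read -r prefix cmd; do
    [ -n "$prefix" ] || continue
    while IFS= read -r file; do
      case $file in
        "$prefix"*) printf '%s\n' "$cmd"; break ;;
      esac
    done <<< "$changed"
  done <<< "$gates"
}

# Run the selected commands in order and stop at the first failure.
run_gates() {
  local cmd
  while IFS= read -r cmd; do
    echo "gate: $cmd"
    bash -c "$cmd" </dev/null || return 1
  done
}

# Example:
#   gates=$'api/|npm test --prefix api\nweb/|npm test --prefix web'
#   select_gates "$gates" "$(git diff --name-only HEAD~1)" | run_gates
```

A change that touches only `api/` never pays for the `web/` suite, and the first failing gate ends the run before the remaining ones start.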

I've written the month-by-month version of how these budget dials evolved (and the lock bug that parallelism exposed) in the field report on taking ralphctl from 0.8 to 0.13. This section is the mechanism; that post is the lived ledger.

Running a ralph loop on Copilot CLI and Codex

ralphctl orchestrates three providers: Claude Code, GitHub Copilot CLI, and OpenAI Codex.⁴ From the README's three logos you'd assume they're peers. They're not, and I'd rather you hear that from me than find out mid-sprint.

Only Claude Code is verified end-to-end. Copilot CLI and Codex ship as preview. They run, but the full loop isn't formally validated on them yet. Bundled skill injection quietly no-ops on the preview providers, and Codex can't do fine-grained edit denials on existing files because its sandbox is binary, so the only safety envelope there is path scope (cwd plus --add-dir).⁴

If you got here from a "ralph loop copilot cli" search: it works, it's preview, and Claude is the path I'd trust with your repo today.

Where the ralph loop still breaks

A short, honest list, because you'll meet these.

It can loop forever on unverifiable criteria. If "done" can't be checked, the evaluator can't end the run, and maxAttempts is the only thing standing between you and an overnight token fire. Write checkable criteria.
The harness runs out of memory before the model does. "Long-running" is a property of your own process, not just the agent's context. The bugs that cost me the most weren't in the agent, they were in the orchestrator I built to babysit it: heap leaks, an over-large git diff that OOM'd the process, a paste big enough to scroll-lock the terminal. The model never runs out of memory. Your harness does.
Unattended loops make unattended mistakes. Running --yolo has a blast radius. Sub-agent verification is a claim, not proof.
Parallelism corrupts branches if you let it. ralphctl runs serial by default (concurrency.maxParallelTasks is 1, range 1 to 5). Turn it up and independent same-wave tasks run each in its own git worktree, folding back to one branch, but two tasks editing the same file must be serialised or the second clobbers the first.⁴
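One cheap guard against that clobbering, sketched under the assumption that each task can list the paths it expects to touch (this post does not describe how ralphctl detects overlap, so treat it as an illustration): compare the two lists and only run in parallel when they share nothing.

```bash
#!/usr/bin/env bash
# Sketch: decide whether two same-wave tasks may run in parallel.
# Assumes each task declares its expected paths as a newline-separated
# list; that declaration is hypothetical.

# Print the paths that appear in both lists.
shared_files() {
  comm -12 <(sort -u <<< "$1") <(sort -u <<< "$2")
}

# Succeeds only when the two tasks touch disjoint sets of files.
can_parallelise() {
  [ -z "$(shared_files "$1" "$2")" ]
}

task_a=$'src/auth.ts\nsrc/session.ts'
task_b=$'src/session.ts\nsrc/routes.ts'
if can_parallelise "$task_a" "$task_b"; then
  echo "run in parallel"
else
  echo "serialise: both touch $(shared_files "$task_a" "$task_b")"
fi
```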

The tooling has real failure modes of its own. The official Anthropic ralph-loop plugin has had its stop-hook misbehave; there's a tracked issue for it.⁵ Worth knowing the loop's exit machinery is a known soft spot across implementations, not any one tool's quirk.

Do you actually need a ralph harness?

So should you build your own? My answer, and it's an opinion: reach for a real ralph harness when three things are true at once. The work is too big for one context window. You can't trust the agent to grade itself. And "it crashed at 2am" has to mean "it resumed at 2:01," not "I lost the run." That's the seam a harness is built for: a dependency-ordered graph, an independent referee, and state that survives a crash. If none of those are true, the bare shell loop and a git worktree (a second checkout of the same repo on its own branch) will carry you further than you'd think, and you should stay there.

How ralphctl compares to the other ralph harnesses

I built ralphctl, so read this as a feature inventory, not a neutral review, with the bias declared up front.

| Harness | Independent evaluator | Budget / presets | Parallel worktrees | Multi-repo |
| --- | --- | --- | --- | --- |
| `snarktank/ralph` | No | No | No | No |
| `umputun/ralphex` | Yes (5-agent review) | No | Yes | No |
| `iannuttall/ralph` | No | No | No | No |
| `ralphctl` | Yes (single reviewer) | Yes (20 presets) | Yes | Yes |

Two columns are where ralphctl genuinely stands alone: budget presets and multi-repo sprints. No other implementation I checked is budget-aware, ralphex included.⁶ The evaluator column is not a clean win: ralphex's five-specialist review pipeline is arguably deeper than ralphctl's single reviewer. So the claim that holds up is the combination, not any single cell: ralphctl is the one that puts a review gate, a token budget, parallel worktrees, and multi-repo in the same tool.

Multi-repo is the one worth spelling out, because it's the least visible. One sprint can span several repositories at once (a service and the client SDK that consumes it), each with its own setup and verify scripts, so a change that only makes sense if it lands in both gets driven, verified, and committed as one unit instead of two runs you reconcile by hand. No other ralph implementation I checked does this.

Is a ralph loop just orchestration?

Sort of, and the difference is the point. A classic orchestrator runs a fixed DAG of steps you wired up in advance. A ralph loop is dumber and more stubborn: it re-derives what to do from disk each lap and keeps going until a gate says stop. The task graph gives it structure, but the loop, the fresh context, and the referee are what make it a ralph loop rather than a workflow engine. Building that structure out into something you can trust overnight is the harness itself, and the AI Agent Harnesses field guide is where the fresh context, the referee, and the check gates get assembled into one.

Run the bare `while` loop tonight. When it stops carrying you, `npm install -g ralphctl`, point it at something real, and find the place this explainer is too kind: open an issue, or star it so it's there the night you hit the wall.

Source on GitHub · npm · MIT license.

Footnotes

¹ Geoffrey Huntley, "Ralph Wiggum as a software engineer" (2025-07-14). The "ralph" name, the loop-until-done technique, and "backpressure" are his.
² Anthropic Engineering, "Harness design for long-running application development". Names the Ralph Wiggum loop; frames the generator/evaluator split as a GAN; reports the "poor QA agent" finding and the ~20 min / $9 (broken) vs ~6 hr / $200 (working) cost comparison on the same build.
³ HumanLayer, "A brief history of Ralph". A dated timeline of the technique.
⁴ Every ralphctl version and behaviour claim here is drawn from the project's CHANGELOG and source: v0.14.0 (2026-06-30); `harness.maxAttempts` default 3; the 20-preset matrix (five families across four provider variants: standard, economic, strong-gate, fast, frontier); `escalateOnPlateau` false for fast, true elsewhere; diff-scoped per-module verify gates; `concurrency.maxParallelTasks` default 1, range 1 to 5, with per-task git worktrees; three providers with Claude Code verified end-to-end and Copilot CLI + Codex in preview; Node >= 24.
⁵ anthropics/claude-plugins-official, issue #394: stop-hook behavior in the official ralph-loop plugin.
⁶ Competitor feature inventory compiled from each project's repository and documented features as of 2026-06-27. Budget-awareness verified against each repo's documented features on that date.

Frequently asked questions

What is a ralph loop?

A ralph loop is an autonomous coding workflow where an AI agent runs in a fresh context each iteration, with state living on disk (git history and files) rather than in the model's context window. The bare mechanism is a shell loop that pipes a goal file into a coding agent over and over until the goal is met. Geoffrey Huntley named the technique on 2025-07-14.

Why is it called a ralph loop, or the Ralph Wiggum loop?

Geoffrey Huntley named it in a 2025-07-14 post, 'Ralph Wiggum as a software engineer.' The joke is the point: a loop this dumb, running a coding agent against a goal until it's met, isn't supposed to work, and the fact that it does is the lesson. 'Ralph loop' and 'Ralph Wiggum loop' are the same technique.

How do you stop a ralph loop?

A bare `while :; do` loop never stops on its own, which is the 'ralph loop not stopping' problem. Clean exit needs a done-condition the loop can check: a task list whose items flip from todo to done only after an independent evaluator verifies them against their criteria. In ralphctl the generator retries up to harness.maxAttempts (default 3) before a task is flagged blocked instead of looping forever.

Why does a ralph loop use a fresh context each iteration?

Because an agent nearing its context limit starts wrapping up early and declaring things done that aren't. A fresh context every iteration cures that: the agent booting into iteration 40 has the same clean head the agent in iteration 1 had. The state it needs lives in git and files, so the reset costs nothing. This is the 'reset' strategy, as opposed to 'compaction' (summarising the conversation in place).

Can you run a ralph loop on GitHub Copilot CLI or OpenAI Codex?

Yes. ralphctl orchestrates three providers: Claude Code, GitHub Copilot CLI, and OpenAI Codex. Only Claude Code is verified end-to-end; Copilot CLI and Codex ship as preview. Bundled skill injection no-ops on the preview providers, and Codex can only be scoped by path (its sandbox is binary), not by fine-grained edit denials.

How do you keep a ralph loop from burning through your token budget?

Set a budget the run must stay inside. ralphctl ships cost-tiered presets that default to a cheap generator and only climb to a pricier model when a task genuinely stalls, one rung at a time. Diff-scoped verify gates re-run only the checks whose path prefix matches what the change actually touched, instead of the whole test suite. Write checkable criteria so the evaluator can end the run.

How do you run a ralph loop with Claude Code?

The bare loop is `while :; do cat PROMPT.md | claude -p; done`: write the goal into PROMPT.md, let the agent do one pass, commit, and reset the context each iteration. Reach for a harness like ralphctl once you need exit detection, an independent reviewer, and crash recovery across more than one context window.

Resources

Repository

RalphCTL

One runnable ralph loop: a cross-provider generator-evaluator harness across Claude Code, GitHub Copilot CLI, and OpenAI Codex, with cost-tiered presets and multi-repo sprints.


Tool

ralphctl on npm

Install the harness globally: npm install -g ralphctl (Node >= 24).


Article

Geoffrey Huntley: "Ralph Wiggum as a software engineer"

The 2025-07-14 origin post that named the ralph loop, the loop-until-done technique, and backpressure.


Article

Anthropic: Harness design for long-running application development

The engineering write-up that names the Ralph Wiggum loop, frames the generator/evaluator split as a GAN, and reports the ~$9-broken vs ~$200-working cost comparison.


Article

HumanLayer: A brief history of Ralph

A dated timeline of the ralph technique, from the origin post through the 2026 harness wave.

