The harness era caught up: ralphctl and the convergence I bet on

When I started ralphctl, 'harness' was a word from one Anthropic post. Now it's the third phase of AI engineering, there's an arXiv paper, and the whole field has standardized on the patterns I built early.

Lukas Grigis, Software Architect · May 28, 2026 · Updated June 30, 2026 · 13 min read

Tags: ai, ai-agents, claude-code, cli, developer-tooling, typescript, open-source

Part of the guide: AI Agent Harnesses: A Field Guide

Key takeaways

ralphctl v0.8.4 defaults to a cross-provider loop: Claude Opus 4.8 generates, OpenAI Codex on GPT-5.5 evaluates, so a rival lab grades the work.
Harness engineering became AI engineering's third phase after prompt and context engineering; the field standardized on the generator-evaluator split ralphctl bet on early.
The 0.7.0 rewrite broke ralphctl's on-disk schema: one sprint.json became three files, and stdout parsing gave way to a validated file-based signals.json contract.
The 0.7.0 rewrite replaced ralphctl's plain CLI with an Ink TUI dashboard (kanban view, pipeline map, live token budget) you sit in front of.

Where this picks up

Two posts ago, ralphctl was a sprint CLI: take a ticket, refine it into requirements with AI, plan dependency-ordered tasks, run them through Claude Code or Copilot, gate the result with a check script. That's the first article. One post ago, I added a second AI pass that reviews each task before the harness marks it done, and the tool stopped being a sprint CLI and became an agent harness. That's the second one.

If you read neither, the short version is this: ralphctl wraps an AI coding CLI in structure. One model generates the code, another model checks it against the spec, the loop repeats until the work passes or a budget runs out. State lives in files under ~/.ralphctl/. I use it to ship real code, including ralphctl itself.
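As a rough sketch of that loop, and not ralphctl's actual code: every name below (runTask, LoopDeps, Verdict) is hypothetical, and the generator and evaluator are synchronous stand-ins for what would really be spawned CLI processes. The budget here is a simple round count; a token budget would slot into the same place.

```typescript
// Hypothetical types for illustration; none of these names come from ralphctl.
type Verdict = { passed: boolean; feedback: string };

interface LoopDeps {
  // Produces a candidate for the task, given feedback from the previous round.
  generate: (task: string, feedback: string | null) => string;
  // Checks the candidate against the spec in a separate pass.
  evaluate: (task: string, candidate: string) => Verdict;
}

type LoopOutcome =
  | { status: "passed"; rounds: number; candidate: string }
  | { status: "budget-exhausted"; rounds: number; lastFeedback: string };

function runTask(task: string, maxRounds: number, deps: LoopDeps): LoopOutcome {
  let feedback: string | null = null;
  for (let round = 1; round <= maxRounds; round++) {
    const candidate = deps.generate(task, feedback);
    const verdict = deps.evaluate(task, candidate);
    if (verdict.passed) {
      return { status: "passed", rounds: round, candidate };
    }
    // The evaluator's objections become input to the next generation round.
    feedback = verdict.feedback;
  }
  return { status: "budget-exhausted", rounds: maxRounds, lastFeedback: feedback ?? "" };
}
```

The point of the shape is that the only way out of the loop is a pass from the evaluator or an exhausted budget; the generator never gets to declare itself done.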

This post is about what happened to the word "harness" while I was busy building one.

I didn't invent this. I bet on it early.

Let me be honest about the framing, because it would be easy to write the smug version of this post and I don't want to.

When I started, "harness" was a term I'd picked up from a single Anthropic engineering article. It described a way of wrapping long-running agents in structure, and it matched a problem I already had. So I built something around it. Then Anthropic published a second article on generator-evaluator loops, and I built that too. The patterns came from labs with far more data than I'll ever have. OpenAI landed on the same separation independently. I read the posts, recognized the problem from my own messy runs, and implemented the ideas in a tool I actually use.

What I got right was timing. I bet on the harness as the unit of work back when most people were still tuning prompts, and the field has since standardized on exactly the structure I built. That's the vindication. Not "I thought of it first." More like: I read the room early and put in the work while the idea was still a blog post instead of a movement.

So when I say the harness era caught up, I mean the rest of the field arrived where I'd already set up camp. That feels good. It's also just what happens when you pay attention to the right people.

The three phases

Here's the shape the field settled into, roughly. First everyone was doing prompt engineering: getting the wording right, coaxing better output from a single call. Then context engineering became the thing: what you feed the model, how you manage the window, what survives between turns. Now the framing I keep seeing is harness engineering as the third phase, the layer that wraps the agent in structure and verification so it can run long without falling over.¹

Same craft, a bigger wrapper each time.

There's even an arXiv paper now on automatically evolving harnesses from observability data.² When your hobby project's core concept shows up in a paper title, you've either aged into the mainstream or the mainstream aged into you. Either way, the word isn't niche anymore.

The pattern that converged hardest is the one I wrote the last post about. Separate the generator from the evaluator, because an agent grading its own homework gives itself an A. Both Anthropic and OpenAI got there independently, which is usually a sign the idea is load-bearing rather than fashionable.³ The current best-practice version is a fresh-context evaluator that starts every criterion at fail and has to open the actual evidence, run the checks, read the diff, before it's allowed to pass anything.⁴ Default-fail. The agent has to earn the green. That's exactly the posture I want the reviewer to take, and it's why ralphctl's evaluator gets full tool access and instructions to assume problems exist until it proves otherwise.
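The default-fail posture is easy to state in code. This is a minimal sketch under my own assumptions, not the evaluator's real data model: CriterionResult, initCriteria, recordPass, and allPassed are all hypothetical names, and "evidence" is just a list of strings standing in for check output or diff references.

```typescript
// Hypothetical shapes for illustration; not ralphctl's actual types.
type CriterionResult = { id: string; passed: boolean; evidence: string[] };

// Every criterion starts as failed with no evidence attached.
function initCriteria(ids: string[]): CriterionResult[] {
  return ids.map((id) => ({ id, passed: false, evidence: [] }));
}

// A pass only counts if it cites at least one piece of evidence
// (a check's output, a diff hunk, a file that was actually opened).
function recordPass(results: CriterionResult[], id: string, evidence: string[]): CriterionResult[] {
  return results.map((r) =>
    r.id === id && evidence.length > 0 ? { ...r, passed: true, evidence } : r
  );
}

// The task is green only when every criterion has been explicitly passed.
function allPassed(results: CriterionResult[]): boolean {
  return results.length > 0 && results.every((r) => r.passed);
}
```

A criterion the evaluator never looked at stays red, which is the behaviour you want from a reviewer that is supposed to earn the green rather than assume it.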

Reading that the rest of the field had named and formalized the thing I'd been hand-rolling was a strange feeling. Validating, mostly. A little bit like watching someone publish the recipe for a meal you've been cooking from instinct.

The build since then

The last post ended at v0.2.3. ralphctl is at v0.8.4 now. A lot happened in between, and not all of it was clean. Four things are worth telling.

Cross-provider generator-evaluator

This is the one I'm proudest of and the one I'm least able to dress up as a master plan.

The default loop now runs Claude Opus 4.8 as the generator and OpenAI Codex on GPT-5.5 as the evaluator.⁵ A rival lab's model grades the work my primary lab's model produces. People hear that and assume I designed it as some clever "outside eyes" strategy. I didn't, not at first.

The real story: curiosity was the main driver. I first had the harness working for Claude, then Copilot, then Codex, each one working ok-ish. Prompts are prompts and the models do what you tell them. Mostly. But then I had the idea to interleave. Why not? Once each provider could play any role, nothing stopped me from putting a different one on each side of the loop. Cost and capability are reasons too, sure: why not use the best model for each part of the flow? But that justification came after the tinkering, not before it.

Now that it runs, I'll admit the outside-perspective benefit is real. An evaluator from a different lab doesn't share the generator's blind spots or its house style, so it pushes back on things a same-family reviewer would wave through. I won't pretend I planned that. I'll just say it's a good thing to have stumbled into.

The cost: this only works because every provider speaks the same file-based contract underneath, and keeping that contract honest across three CLIs is ongoing work. More on that in the "still hard" section.

The 0.7.x rewrite, and the part where I broke things

Around 0.6.0 I rewrote the internals into a 5-module Clean Architecture: a kernel chain framework, Result<T, E> threaded end to end so every exit is a typed error, strict ESLint fences between layers.⁵ Then 0.7.0 went further and broke the on-disk schema. The single sprint.json became three files. The old layout simply doesn't parse anymore. I also replaced stdout parsing with a file-based contract: the AI writes a signals.json envelope, the harness validates it after the spawn. Cleaner. More robust against a vendor tweaking their JSON shape. Also a hard break for anyone with existing data.
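To make the file-based contract concrete, here is a small sketch of validating an envelope after the spawn and returning a typed result instead of throwing. The envelope shape (a version number plus a list of typed messages) is an assumption for illustration; the post does not document the real signals.json schema, and this Result type is a minimal stand-in for whatever ralphctl threads through its layers. The function takes the file's contents as a string so the sketch stays free of filesystem calls.

```typescript
// A minimal Result type; ralphctl's real Result<T, E> may differ.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Hypothetical envelope shape: the real signals.json schema is not shown in the post.
interface SignalsEnvelope {
  version: number;
  signals: { type: string; message: string }[];
}

type SignalsError =
  | { kind: "missing" }
  | { kind: "malformed"; detail: string }
  | { kind: "schema-invalid"; detail: string };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function schemaInvalid(detail: string): Result<SignalsEnvelope, SignalsError> {
  return { ok: false, error: { kind: "schema-invalid", detail } };
}

// `raw` is the file's contents, or null if the agent never wrote the file.
function validateSignals(raw: string | null): Result<SignalsEnvelope, SignalsError> {
  if (raw === null) return { ok: false, error: { kind: "missing" } };

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return { ok: false, error: { kind: "malformed", detail: String(e) } };
  }

  if (!isRecord(parsed)) return schemaInvalid("expected a JSON object");
  const version = parsed["version"];
  const list = parsed["signals"];
  if (typeof version !== "number" || !Array.isArray(list)) {
    return schemaInvalid("expected { version: number, signals: [] }");
  }

  const signals: { type: string; message: string }[] = [];
  for (const item of list) {
    if (!isRecord(item)) return schemaInvalid("each signal must be an object");
    const type = item["type"];
    const message = item["message"];
    if (typeof type !== "string" || typeof message !== "string") {
      return schemaInvalid("each signal needs a string type and message");
    }
    signals.push({ type, message });
  }
  return { ok: true, value: { version, signals } };
}
```

Because the harness only trusts what passes validation, a vendor changing its stdout format does not matter; only the file the agent was asked to write does.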

The 0.7.x line shipped an on-disk rewrite, a settings-schema refactor, a domain rename, and a signal-pipeline migration across four releases in six days. I wrote an apology into the 0.8.0 changelog notes for the churn, because it was a lot to keep up with and some of it landed on people who were actually using the tool.

Was moving that fast worth it? I go back and forth. The architecture is genuinely better now, and I'd rather take the breakage early while the user count is small than carry a bad schema to 1.0. But "I broke your data four times in a week, sorry" is not a sentence I want to make a habit of. The honest answer is that I optimized for the tool I wanted to maintain over the people already depending on it, and that's a trade with a real cost even when it's the right call.

The TUI changed what the tool is

ralphctl started as a plain CLI. You typed commands, it printed output. Around the 0.7.0 rewrite I replaced that with a full responsive dashboard built on Ink: a kanban-style sprint detail view, a pipeline-map home screen, a live token-budget card, a stream of signals as the agent emits them, all keyboard-driven.⁵

Why a terminal dashboard for an agent harness? Because the agent runs long, and "long" plus "no visibility" is how you end up with eight tasks done overnight and no idea which one went sideways. The old CLI told you the result. The dashboard tells you the state: which task is active, what the evaluator just found, how many tokens you've burned, whether a baseline went red. When you're letting agents work unattended, the thing you actually need is a window into what they're doing while they do it.

It changed what the tool is. It used to be something you invoked. Now it's something you sit in front of while it works, the way you'd watch a build.

Codex as the third provider

The provider order went Claude Code, then GitHub Copilot, then OpenAI Codex, with per-flow selection so you can put a different backend on refine, plan, and each side of the implement loop.⁵
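Per-flow selection boils down to a lookup with a fallback. The sketch below assumes a settings shape I made up for illustration: the flow keys, provider ids, and the providerFor helper are hypothetical and not ralphctl's actual settings schema.

```typescript
// Hypothetical names: flow keys and provider ids are illustrative only.
type Provider = "claude" | "copilot" | "codex";
type Flow = "refine" | "plan" | "generate" | "evaluate";

type ProviderSettings = {
  defaultProvider: Provider;
  // Any flow not listed here falls back to defaultProvider.
  perFlow: Partial<Record<Flow, Provider>>;
};

function providerFor(flow: Flow, settings: ProviderSettings): Provider {
  return settings.perFlow[flow] ?? settings.defaultProvider;
}

// Example: Claude everywhere, except Codex on the evaluator side of the loop.
const settings: ProviderSettings = {
  defaultProvider: "claude",
  perFlow: { evaluate: "codex" },
};
```

With this shape, the cross-provider default is just one override on the evaluate flow, and rearranging the loop means editing one entry.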

Multi-provider parity sounds like an abstraction problem and is actually a hundred small ones. Codex reports its session id as thread_id, not session_id, so resume silently broke until I read its output format closely. Copilot's autopilot mode caps continuations at five by default, which is fewer than a real implement task needs, so generators were halting mid-task with no signals file. Codex's sandbox is binary, read-only or workspace-write, with no fine-grained deny, so path scope is the only safety envelope it gets. Each of these was a specific bug found by running real work through a real provider, not something the type system caught.
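The thread_id bug is the kind of thing a small per-provider mapping makes visible. In this sketch only the Codex field name comes from the text above; the field names for the other two providers, the event shape, and the extractSessionId helper are assumptions for illustration, not a description of any CLI's real output format.

```typescript
type Provider = "claude" | "copilot" | "codex";

// Which field carries the resumable id. Only the Codex `thread_id` name comes
// from the post; the other two entries are assumptions for illustration.
const SESSION_ID_FIELD: Record<Provider, string> = {
  claude: "session_id",
  copilot: "session_id",
  codex: "thread_id",
};

// `event` is one already-parsed JSON object from the provider's output.
function extractSessionId(provider: Provider, event: Record<string, unknown>): string | null {
  const value = event[SESSION_ID_FIELD[provider]];
  // Return null rather than a guess, so a missing id surfaces before resume is attempted.
  return typeof value === "string" && value.length > 0 ? value : null;
}
```

Returning null on a miss turns "resume silently broke" into something the caller has to handle explicitly.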

Claude Code is the stable, verified path. Copilot and Codex are labeled preview in the README, and the tool tells you when you're on one. Parity is real, but keeping it that way is constant work.

The Ralph movement, and what it's rediscovering

ralphctl is named after Geoffrey Huntley's "Ralph Wiggum technique," which I wrote about in the first post. His version is deliberately minimal: a single agent in a bash while loop, anti-orchestration by design. ralphctl went the other way, into structured multi-agent orchestration with verification gates.

Since then, Ralph has gone viral. There's a ralph-wiggum.ai site, an awesome-ralph list, a Laracasts series, and a steady stream of "Ralph plus spec-driven development" pairings.⁶ The loop caught on. People are running agents in tight iterate-and-refine cycles and getting real work out of them.

I'll say this gently, because the energy is good and the technique genuinely works: the Ralph-plus-SDD crowd is rediscovering what the harness already solved. A loop and a spec get you to the starting line, not the finish. The loop has no opinion about whether the work is correct, so the agent satisfies the linter and misses the point. Something independent has to check that it did what the spec asked. That's the gap the evaluator and the verification gate close, and it's the whole difference between "the agent ran" and "the work is right."

None of this is a dunk. I think the Ralph community is going to converge on harnesses the same way the labs did, because the problem pushes you there whether you planned for it or not. I just got pushed there earlier.

What's still hard

Two things, honestly.

The first is the tension between moving fast and not breaking the people who depend on you. I broke ralphctl's on-disk format repeatedly across 0.7.x. Each break was defensible on its own and the sum was a tool that occasionally ate your data between Tuesday and Friday. I still don't have a clean answer. Pre-1.0 software gets to break things, and a small user base is the cheapest time to do it, but every break spends trust I'd rather keep. I lean toward fast because the architecture payoff is real, and I try to make the upgrade path a single "back up and start fresh" command. It's not a solved problem. It's a trade I keep making with my eyes open.

The second is that non-Claude evaluators are flakier. This is documented in the 0.8.2 changelog.⁵ Codex and Copilot trip the strict signals.json file contract more often than Claude does: a missing, malformed, or schema-invalid envelope shows up more frequently when a non-Claude provider is on the evaluator side. I changed the failure handling so one bad turn blocks just that task and surfaces it for a rerun, instead of tearing down the whole implement run. But the underlying fact stands: cross-provider parity leaks. The contract is the same on paper and the providers honor it unevenly in practice. Using a rival lab's model as the judge is a nice property, and it's also the source of the most common failure I see.
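The failure-handling change can be sketched as containing the error at the task boundary. This is a simplified illustration, not the real implement flow: runAll, TaskOutcome, and the runOne callback are hypothetical, the callback is synchronous, and the sketch ignores task dependencies, which a real run would have to respect when a task is blocked.

```typescript
type TaskStatus = "done" | "blocked";
type TaskOutcome = { taskId: string; status: TaskStatus; reason?: string };

// `runOne` stands in for a full generate/evaluate cycle; it throws when the
// evaluator's envelope is missing, malformed, or schema-invalid.
function runAll(taskIds: string[], runOne: (taskId: string) => void): TaskOutcome[] {
  const outcomes: TaskOutcome[] = [];
  for (const taskId of taskIds) {
    try {
      runOne(taskId);
      outcomes.push({ taskId, status: "done" });
    } catch (e) {
      // Contain the failure: mark this task for a rerun and keep going.
      const reason = e instanceof Error ? e.message : String(e);
      outcomes.push({ taskId, status: "blocked", reason });
    }
  }
  return outcomes;
}
```

One flaky evaluator turn then costs a single rerun instead of the whole run.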

Neither of these is going away soon. Both are the kind of problem you manage rather than fix.

Where it stands

ralphctl is at v0.8.4, with three providers, and it's on npm: npm install -g ralphctl and it runs. By default, Claude Opus 4.8 generates and OpenAI Codex on GPT-5.5 evaluates, and you can rearrange that per flow. The TUI is the primary surface now. The architecture is the cleanest it's been.

I started this expecting to build a small tool for a small problem. The problem turned out to be one the whole field was walking toward, and the structure I bet on early is the structure everyone seems to be landing on now. I don't think that makes me a visionary. I think it makes me someone who read the right blog posts and did the work before it was obvious.

What's next is the same as it's always been: keep using it, see what breaks, fix what's worth fixing. The harness era arrived. I'm going to keep sanding the one I've got.

The deeper question this raises, why I trust a harness enough to ship code I didn't read line by line, is its own post: "Trusting AI-generated code: the harness, not the model."

Source on GitHub. npm package. MIT license.

¹ Faros AI, Harness Engineering: Making AI Coding Agents Work in 2026. Frames harness engineering as the third phase of AI-engineering maturity, after prompt engineering and context engineering.
² Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, arXiv. Treats the harness itself as something that can be evolved automatically from observability data.
³ Anthropic Engineering, Harness design for long-running application development. The generator-evaluator article that pushed me to build ralphctl's evaluator. Separates the model that produces work from the model that reviews it, because agents overpraise their own output.
⁴ MindStudio, The Planner-Generator-Evaluator Pattern. Describes the plan → generate → evaluate structure and the fresh-context, default-fail evaluator: every criterion starts false and the evaluator must open evidence to pass it.
⁵ All ralphctl version and behaviour claims here are drawn from the project's CHANGELOG: the 0.6.0 Clean Architecture rewrite, the 0.7.0 on-disk schema break and file-based signals.json contract, the 0.8.0 churn note, the 0.8.1 cross-provider gen/eval defaults (Claude Opus 4.8 generator, OpenAI Codex / GPT-5.5 evaluator), and the 0.8.2 fix for non-Claude evaluators tripping the strict contract.
⁶ Geoffrey Huntley's Ralph technique gone wide: ralph-wiggum.ai and the awesome-ralph list. The original Ralph Wiggum Technique is the namesake; the first post covers how ralphctl differs from the minimal single-agent loop.