From sprint CLI to agent harness: how ralphctl got an evaluator
Anthropic's first harness article inspired ralphctl. Their second one on generator-evaluator loops pushed me to build an evaluator for v0.2.0, and the tool changed identity.
Where things stood
I was using ralphctl regularly at this point. It worked. Not perfectly, not every time, but well enough that it had become part of how I ship code. Multi-repo support turned out to work surprisingly well once it's set up, though edge cases still trip it up. They trip me up too when I do it manually. Errors happen. You fix them and move on.
If you haven't read the first article, here's the short version:
ralphctl is a CLI that takes tickets, refines them into structured requirements with AI, plans
dependency-ordered tasks, and executes them through Claude Code or GitHub Copilot. State lives in
JSON files under ~/.ralphctl/. Check scripts run after each task to verify the work.
Verification was binary. The check script ran pnpm typecheck && pnpm lint && pnpm test (or
whatever you configured), and the task either passed or failed. Pass means done. Fail means
execution stops and waits for me.
But passing tests doesn't mean the work is correct. It means the existing logic that was already properly tested still works. I kept catching things in review that no automated check would find. The agent had satisfied the linter and the type checker but produced something that didn't match what the ticket asked for. Or it matched the letter of the requirements but missed the point entirely. The check script was a floor, not a ceiling.
Vibe-coding vs. AI-augmented coding
This is something I think about a lot. It matters for why the evaluator exists.
Vibe-coding is when you don't really know what's happening in the code and you don't particularly care. You prompt, accept whatever comes back, ship it. If it works, good enough. Maintainability? Not your problem. For throwaway scripts or prototypes nobody will maintain, that's fine. Sometimes it's the right call.
AI-augmented coding, at least how I see it, is different. I know what's going on. I've read through the changes. I care about maintainability. I understand the decisions being made, and I'm responsible for them. When I commit code, my name is on it. Sometimes those commits are signed.
The evaluator exists because I want to ship the right thing at adequate quality: what gets built should actually match what I asked for. The evaluator is there to catch the gap between "technically passes" and "actually right" before I even open the diff.
Two articles from Anthropic
The first article that shaped ralphctl was Effective harnesses for long-running agents. It describes harness architecture for AI coding agents: two-agent splits, context resets between phases, structured handoff. That's where the original design came from.
More recently, Anthropic published Harness design for long-running apps, which goes deeper on the generator-evaluator pattern. One model produces work, a second model reviews it, the loop repeats until the reviewer is satisfied or a budget runs out. This is what pushed me to build the evaluator for v0.2.0.
I'd been thinking about the review problem for a while at that point. The check script told me whether tests passed. It couldn't tell me whether the code was good, whether the agent had actually addressed the ticket's intent or just made the linter happy. I was doing that review myself, every time, and it added up.
The second article made the next step obvious. I closed the branch I was on and started building.
Building the evaluator
I built it in layers. Wiring everything up at once in a feature like this is how you end up debugging integration issues when you should be debugging logic.
First the plumbing. An evaluationIterations setting in the config. evaluated and
evaluationOutput fields on the task schema. A model field threaded through the spawn path so the
evaluator knows what model generated the code. A verificationCriteria field on the task itself, so
the plan phase can spell out exactly what "done" looks like before any code gets written. I added a
doctor health check too, so ralphctl warns you if evaluation isn't configured.
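Roughly, the new surface looks like this. A sketch of the shapes only: the field names called out above are the real ones, everything else (the base task fields like id and steps) is illustrative.

```ts
// Sketch of the new config and task fields. Shapes are illustrative,
// not the actual ralphctl schema definitions.
interface RalphctlConfig {
  // How many generator/evaluator round trips a task gets.
  evaluationIterations?: number;
}

interface TaskRecord {
  // Hypothetical base fields, just for context.
  id: string;
  description: string;
  steps: string[];
  // Emitted by the plan phase: the rubric the evaluator grades against.
  verificationCriteria?: string[];
  // Which model generated the code, threaded through the spawn path.
  model?: string;
  // Filled in once the evaluator has run.
  evaluated?: boolean;
  evaluationOutput?: string;
}
```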
The verificationCriteria field is doing more work than it looks like. Without it, the evaluator
was inferring the grading bar from the task description and steps, which worked until it didn't. Now
the plan phase emits explicit criteria, and those criteria become the rubric the evaluator grades
against. It's closer to how I'd write a test: here is the thing, here is exactly what done looks
like.
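For a feel of what that rubric looks like, here's a made-up example: criteria the plan phase might emit for a hypothetical "add retries to the HTTP client" task.

```ts
// Hypothetical verificationCriteria for an invented task; not output
// from a real ralphctl run.
const verificationCriteria: string[] = [
  "Failed requests are retried up to 3 times with exponential backoff",
  "4xx responses are not retried",
  "The retry path is covered by unit tests, including the give-up case",
];
```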
Then the evaluator module. getEvaluatorModel() implements a model ladder: if Opus generated the
code, Sonnet evaluates. Sonnet generated? Haiku evaluates. Haiku generated? Haiku evaluates itself.
Copilot doesn't expose model selection, so that falls back to the default. parseEvaluationResult()
looks for signals in the output. buildEvaluatorContext() assembles the task spec, the verification
criteria, and the check script into a review brief.
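The ladder itself is just a lookup. A minimal sketch with model names as plain strings; this is not the actual getEvaluatorModel() implementation, and the default constant is a stand-in.

```ts
// Sketch of the evaluator model ladder: one step cheaper than the
// generator, bottoming out at Haiku.
const DEFAULT_EVALUATOR_MODEL = "sonnet"; // stand-in for the configured default

function pickEvaluatorModel(generatorModel?: string): string {
  switch (generatorModel) {
    case "opus":
      return "sonnet"; // Opus generated, Sonnet evaluates
    case "sonnet":
    case "haiku":
      return "haiku"; // Sonnet gets Haiku; Haiku evaluates itself
    default:
      return DEFAULT_EVALUATOR_MODEL; // e.g. Copilot, which exposes no model selection
  }
}
```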
The verdict isn't a single yes/no. The evaluator scores four dimensions: correctness, completeness, safety, and consistency, each with a one-line finding. The overall result fails if any one of them fails. I tried a single pass/fail first and kept reading evaluator output where the verdict was "fail" and I had to hunt for why. Splitting it into four lines made the output useful at a glance, and it gave the generator a clearer contract when the fix prompt came back: address the failing dimension, leave the rest alone.
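In code terms, the verdict is roughly this shape. My paraphrase of what's described above, not the actual types.

```ts
// Sketch of the evaluation verdict: four dimensions, each with a
// one-line finding; the overall result fails if any dimension fails.
type Dimension = "correctness" | "completeness" | "safety" | "consistency";

interface DimensionFinding {
  pass: boolean;
  finding: string; // one line: what passed, or why it failed
}

interface EvaluationVerdict {
  dimensions: Record<Dimension, DimensionFinding>;
}

function overallPass(verdict: EvaluationVerdict): boolean {
  return Object.values(verdict.dimensions).every((d) => d.pass);
}
```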
The prompt tells the evaluator to act as an independent reviewer and to assume problems exist until
it proves otherwise. It gets full tool access, can run git diff, read files, explore the project.
The instructions say: run the check script yourself, read the actual changes, compare them against
the verification criteria, only pass if the work genuinely meets the bar.
I rewrote the prompts more than once. The current versions lean on Anthropic's prompting guidance: less urgent language, more "here's the why," XML tags for structure, harness context up front. They read better. I think the outputs are tighter for it, though that's hard to measure cleanly.
Wiring it into the executors was messy. The first version had duplicated loop logic between the
sequential and parallel executors. Ugly, but it worked well enough to test against. Same pass also
added a --max-turns safety net so a runaway agent can't chew through tokens forever, and proper
session ID tracking across evaluation iterations so the "fix the critique" pass always resumes the
right session with the right model. Both were bugs waiting to happen.
Tests went in next. Signal parsing edge cases, model ladder mapping, schema backward compatibility, config roundtrip, doctor diagnostics.
Last pass was cleanup and bug fixes.
The review bottleneck
An agent can produce code that compiles, passes every test, follows every lint rule, and still completely misses what the ticket was about. Or it gets the behavior right but introduces a pattern that clashes with how the rest of the codebase works.
Before the evaluator, I was the only thing catching that. Eight tasks overnight meant eight reviews in the morning, and some of them needed rework the agent could have caught if anyone had told it to look.
The evaluator doesn't replace my review. But I open fewer diffs now where the first thing I see is something obvious the agent should have caught on its own.
There's also a sprint insights command that reads across the evaluation results in a sprint and
surfaces patterns. Recurring failures, dimensions that keep tripping, that kind of thing. A single
evaluation output tells you about one task. Ten of them tell you about how the agent works on your
codebase. Different signal.
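Mechanically, the aggregation is simple. A sketch, assuming each task record carries its parsed verdict; not the actual command implementation.

```ts
// Sketch of sprint insights aggregation: count how often each
// dimension fails across a sprint's evaluated tasks.
interface EvaluatedTask {
  evaluation?: {
    dimensions: Record<string, { pass: boolean; finding: string }>;
  };
}

function failingDimensionCounts(tasks: EvaluatedTask[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const task of tasks) {
    if (!task.evaluation) continue;
    for (const [dimension, finding] of Object.entries(task.evaluation.dimensions)) {
      if (!finding.pass) counts[dimension] = (counts[dimension] ?? 0) + 1;
    }
  }
  return counts; // e.g. { consistency: 4, completeness: 1 }
}
```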
The name didn't fit anymore
Ralphctl started as a "sprint and task management CLI." That described what it helped with, not what it was. With the generator-evaluator loop, the tool had outgrown the label. It wasn't just queuing and running tasks anymore. It was orchestrating agents in different roles with feedback flowing between them.
Old tagline: "Sprint and task management CLI for AI-assisted coding." New one: "Agent harness for long-running AI coding tasks."
When the name doesn't match what the tool does, you spend half the conversation explaining the gap. That got old.
How the evaluation loop works
When a task completes and passes its check gate, the evaluator runs against the task's
verificationCriteria. It scores the four dimensions, produces an overall pass or fail, and writes
the findings back into the task record. If every dimension passes, done. If any one of them fails,
the generator gets resumed in the same session with the evaluator's critique and instructions to fix
the failing dimensions specifically. Check gate runs again, evaluator runs again. Repeats up to
evaluationIterations times.
It's an autonomous retry loop. The generator gets the feedback, attempts a fix, gets re-checked and re-evaluated. No human in the loop at that point.
But the loop is bounded. The iteration budget stays low (default: 1), and the whole thing is non-blocking. If the evaluator still fails after exhausting its budget, the task moves to "done" anyway. The evaluation output gets persisted so I can read what it found, but the pipeline never permanently stalls on an evaluation failure.
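Put together, the control flow is roughly this. A sketch only: the helper names are placeholders, not ralphctl internals, and it assumes the task has already passed its check gate once before evaluation starts.

```ts
// Sketch of the bounded, non-blocking evaluation loop: evaluate, feed
// the critique back, re-check, re-evaluate, up to the budget.
interface Task {
  evaluated?: boolean;
  evaluationOutput?: string;
}
interface Verdict {
  overallPass: boolean;
  raw: string;
}

// Placeholders for the real harness calls.
declare function runEvaluator(task: Task): Promise<Verdict>;
declare function resumeGeneratorWithCritique(task: Task, verdict: Verdict): Promise<void>;
declare function runCheckScript(task: Task): Promise<void>;

async function evaluateTask(task: Task, evaluationIterations: number): Promise<void> {
  let verdict = await runEvaluator(task);
  for (let i = 0; i < evaluationIterations && !verdict.overallPass; i++) {
    // Same session, same model: the generator gets the critique plus
    // instructions to fix the failing dimensions specifically.
    await resumeGeneratorWithCritique(task, verdict);
    await runCheckScript(task); // the check gate runs again before re-evaluating
    verdict = await runEvaluator(task);
  }
  task.evaluated = true;
  task.evaluationOutput = verdict.raw.slice(0, 2_000); // persisted, capped
  // Non-blocking: even on a final fail the task moves to "done", with
  // the findings saved for human review.
}
```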
Why bounded? The model ladder. A cheaper model reviews the work (Sonnet for Opus output, Haiku for Sonnet output), which keeps costs reasonable but means the evaluator isn't always as capable as the generator. That's fine for obvious mistakes: missing tests, type errors, logic that doesn't match the spec. Less reliable for subtle architectural judgment. A less capable model confidently telling a more capable one to rewrite something it got right is a real failure mode, and each retry burns tokens.
So the loop catches the obvious stuff on its own, and anything it can't fix within the budget gets surfaced for me. Evaluation never blocks because false positives exist and burning tokens on cycles that might not converge is not a tradeoff I want to make.
Things I learned building this
The sixty lines of duplication between executors. I knew it was wrong when I wrote it. Shipped it
anyway because I wanted to see the feature work end to end before deciding what to extract. The
runEvaluationLoop() extraction happened once I could see the real shape of the shared interface.
Extracting it upfront would have meant guessing at the parameters, and I would have guessed wrong.
The disk I/O thing was dumb in hindsight. getEvaluationIterations() was getting called inside each
task's evaluation path, reading from disk every time. Fifteen tasks, fifteen reads. Cached it at
loop start once I profiled a real run. Similarly, the evaluator prompt originally said
git diff HEAD~1, which breaks when a task produces multiple commits (and the agent doesn't always
squash). Changed it to git log --oneline -10 first, then diff the actual range. Both of these only
showed up when I stopped testing with toy inputs.
Output parsing was annoying. The AI wraps things in markdown code fences sometimes, or returns a
bare JSON array instead of the expected object. None of this showed up in testing. All of it showed
up in real usage when a different model version formatted its output slightly differently. The
parser had to handle what it actually receives, not what the prompt asks for. Same story with
truncation: without a cap, evaluation output was getting persisted into tasks.json at whatever length the model felt like producing. 50KB blobs per task are not fun to work with. A 2,000-character cap went in after the first real run made the case for one.
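The parser ended up as mostly defensive plumbing. A minimal sketch, assuming the evaluator is asked to emit a JSON object; this is not the actual parseEvaluationResult().

```ts
// Sketch of tolerant parsing for evaluator output: drop markdown
// fence lines, accept a bare array where an object was expected,
// and cap what gets persisted into tasks.json.
const MAX_PERSISTED_LENGTH = 2_000;

function parseEvaluatorOutput(raw: string): Record<string, unknown> | null {
  // Models sometimes wrap the JSON in markdown code fences; drop those lines.
  const unfenced = raw
    .split("\n")
    .filter((line) => !/^\s*`{3,}/.test(line))
    .join("\n")
    .trim();
  try {
    const parsed: unknown = JSON.parse(unfenced);
    if (Array.isArray(parsed)) {
      // A bare array of findings instead of the expected object.
      return { findings: parsed };
    }
    return parsed as Record<string, unknown>;
  } catch {
    return null; // unparseable output is treated as a failed evaluation, not a crash
  }
}

function truncateForPersistence(output: string): string {
  return output.slice(0, MAX_PERSISTED_LENGTH);
}
```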
What's next
Ralphctl is at v0.2.3. Rebrand done, loop in, a handful of patch releases of real-world sanding behind me.
What's next? See what time brings. I'll keep using it, see what works and what doesn't, improve what needs improving. I try to be a better version of myself tomorrow than I am today.
Source on GitHub. npm package. MIT license.