
From task CLI to agent harness: how ralphctl got an evaluator

11 min read
ai · developer-tooling · typescript · open-source

Where things stood

I was using ralphctl regularly at this point. It worked. Not perfectly, not every time, but well enough that it had become part of how I ship code. Multi-repo support turned out surprisingly well once set up, though edge cases still trip it up. They trip me up too when I do things manually. Errors happen. You fix them and move on.

If you haven't read the first article, here's the short version: ralphctl is a CLI that takes tickets, refines them into structured requirements with AI, plans dependency-ordered tasks, and executes them through Claude Code or GitHub Copilot. State lives in JSON files under ~/.ralphctl/. Check scripts run after each task to verify the work.

Verification was binary. The check script ran pnpm typecheck && pnpm lint && pnpm test (or whatever you configured), and the task either passed or failed. Pass means done. Fail means execution stops and waits for me.
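That binary gate can be sketched as a single function: shell out to the configured command and reduce everything to pass or fail. This is a minimal sketch, not ralphctl's actual internals; the function name and signature are assumptions.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical sketch of the binary check gate: run the configured
// check command and collapse the result to pass/fail. There is no
// middle ground — exit code 0 passes, anything else fails.
function runCheckGate(command: string): "pass" | "fail" {
  const result = spawnSync(command, { shell: true, stdio: "ignore" });
  return result.status === 0 ? "pass" : "fail";
}
```

A passing gate moves the task forward; a failing one stops execution and waits for a human.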

But passing tests doesn't mean the work is correct. It means the existing logic that was already properly tested still works. I kept catching things in review that no automated check would find. The agent had satisfied the linter and the type checker but produced something that didn't match what the ticket asked for. Or it matched the letter of the requirements but missed the point entirely. The check script was a floor, not a ceiling.

Vibe-coding vs. AI-augmented coding

This is something I think about a lot. It matters for why the evaluator exists.

Vibe-coding is when you don't really know what's happening in the code and you don't particularly care. You prompt, accept whatever comes back, ship it. If it works, good enough. Maintainability? Not your problem. For throwaway scripts or prototypes nobody will maintain, that's fine. Sometimes it's the right call.

AI-augmented coding, at least how I see it, is different. I know what's going on. I've read through the changes. I care about maintainability. I understand the decisions being made, and I'm responsible for them. When I commit code, my name is on it. Sometimes those commits are signed.

The evaluator exists because I want to ship the correct "what" at adequate quality. What gets built should actually match what I asked for. The evaluator is there to catch the gap between "technically passes" and "actually right" before I even open the diff.

Two articles from Anthropic

The first article that shaped ralphctl was Effective harnesses for long-running agents. It describes harness architecture for AI coding agents: two-agent splits, context resets between phases, structured handoff. That's where the original design came from.

More recently, Anthropic published Harness design for long-running apps, which goes deeper on the generator-evaluator pattern. One model produces work, a second model reviews it, the loop repeats until the reviewer is satisfied or a budget runs out. This is what pushed me to build the evaluator for v0.2.0.

I'd been thinking about the review problem for a while at that point. The check script told me whether tests passed. It couldn't tell me whether the code was good, whether the agent had actually addressed the ticket's intent or just made the linter happy. I was doing that review myself, every time, and it added up.

The second article made the next step obvious. I closed the branch I was on and started building.

Building the evaluator

I built it in layers. Wiring everything at once in this kind of feature is how you end up debugging integration issues when you should be debugging logic.

First the plumbing. An evaluationIterations setting in the config. evaluated and evaluationOutput fields on the task schema. A model field threaded through the spawn path so the evaluator knows what model generated the code. I added a doctor health check too, so ralphctl warns you if evaluation isn't configured.
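As a sketch, the plumbing might look like the shapes below. Only evaluationIterations, evaluated, evaluationOutput, and model come from the description above; every other field name is an illustrative assumption.

```typescript
// Hypothetical config and task shapes for the evaluator plumbing.
interface RalphctlConfig {
  checkScript: string;          // illustrative: the configured check command
  evaluationIterations: number; // retry budget for the evaluator loop
}

interface Task {
  id: string;                   // illustrative
  spec: string;                 // illustrative
  model?: string;               // which model generated the code
  evaluated?: boolean;          // did the evaluator run for this task?
  evaluationOutput?: string;    // persisted critique
}

// Illustrative default: a low iteration budget keeps retries bounded.
const defaultConfig: RalphctlConfig = {
  checkScript: "pnpm typecheck && pnpm lint && pnpm test",
  evaluationIterations: 1,
};
```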

Then the evaluator module. getEvaluatorModel() implements a model ladder: if Opus generated the code, Sonnet evaluates. Sonnet generated? Haiku evaluates. Haiku generated? Haiku evaluates itself. Copilot doesn't expose model selection, so that falls back to the default. parseEvaluationResult() looks for signals in the output. buildEvaluatorContext() assembles the task spec and check script into a review brief.
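The ladder itself is small enough to sketch. The mapping matches what's described above; the exact signature and model identifiers are assumptions.

```typescript
type Model = "opus" | "sonnet" | "haiku";

// Sketch of the model ladder: a cheaper model reviews each
// generator's output, and the cheapest reviews itself.
function getEvaluatorModel(generator: Model): Model {
  const ladder: Record<Model, Model> = {
    opus: "sonnet",  // Opus output is reviewed by Sonnet
    sonnet: "haiku", // Sonnet output is reviewed by Haiku
    haiku: "haiku",  // Haiku reviews its own output
  };
  return ladder[generator];
}
```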

The prompt tells the evaluator to be a "skeptical code reviewer." It gets full tool access, can run git diff, read files, explore the project. The instructions say: run the check script yourself, read the actual changes, compare them against the task requirements, only pass if the work genuinely meets the bar.

Wiring it into the executors was messy. The first version had duplicated loop logic between the sequential and parallel executors. Ugly, but it worked well enough to test against.

Tests went in next. Signal parsing edge cases, model ladder mapping, schema backward compatibility, config roundtrip, doctor diagnostics.

Last pass was cleanup and bug fixes. More on that below.

The review bottleneck

An agent can produce code that compiles, passes every test, follows every lint rule, and still completely misses what the ticket was about. Or it gets the behavior right but introduces a pattern that clashes with how the rest of the codebase works.

Before the evaluator, I was the only thing catching that. Eight tasks overnight meant eight reviews in the morning, and some of them needed rework the agent could have caught if anyone had told it to look.

The evaluator doesn't replace my review. But I open fewer diffs now where the first thing I see is something obvious the agent should have caught on its own.

The name didn't fit anymore

Ralphctl started as a "sprint and task management CLI." That described what it helped with, not what it was. With the generator-evaluator loop, the tool had outgrown the label. It wasn't just queuing and running tasks anymore. It was orchestrating agents in different roles with feedback flowing between them.

Old tagline: "Sprint and task management CLI for AI-assisted coding." New one: "Agent harness for long-running AI coding tasks."

When the name doesn't match what the tool does, you spend half the conversation explaining the gap. That got old.

How the evaluation loop works

When a task completes and passes its check gate, the evaluator runs. If it passes, done. If it fails, the generator gets resumed in the same session with the evaluator's critique and instructions to fix the issues. Check gate runs again, evaluator runs again. This repeats up to evaluationIterations times.

It's an autonomous retry loop. The generator gets the feedback, attempts a fix, gets re-checked and re-evaluated. No human in the loop at that point.

But the loop is bounded. The iteration budget stays low (default: 1), and the whole thing is non-blocking. If the evaluator still fails after exhausting its budget, the task moves to "done" anyway. The evaluation output gets persisted so I can read what it found, but the pipeline never permanently stalls on an evaluation failure.
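Put together, the bounded, non-blocking loop looks roughly like this. It's a sketch: runCheck, runEvaluator, and resumeGenerator stand in for executor internals I'm simplifying, and the return shape is assumed.

```typescript
type Verdict = { pass: boolean; critique: string };

// Sketch of the bounded evaluation loop: check gate, then evaluator,
// then (within budget) resume the generator with the critique. It
// never blocks — the task finishes either way, critique persisted.
async function runEvaluationLoop(
  budget: number,
  runCheck: () => Promise<boolean>,
  runEvaluator: () => Promise<Verdict>,
  resumeGenerator: (critique: string) => Promise<void>,
): Promise<{ evaluated: boolean; evaluationOutput: string }> {
  let output = "";
  for (let attempt = 0; attempt <= budget; attempt++) {
    if (!(await runCheck())) break;            // check gate must pass first
    const verdict = await runEvaluator();
    output = verdict.critique;
    if (verdict.pass) break;                   // evaluator satisfied: done
    if (attempt < budget) {
      await resumeGenerator(verdict.critique); // same session, with feedback
    }
  }
  return { evaluated: true, evaluationOutput: output };
}
```

With the default budget of 1, a failed evaluation buys exactly one autonomous fix attempt before the result is surfaced for a human.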

Why bounded? The model ladder. A cheaper model reviews the work (Sonnet for Opus output, Haiku for Sonnet output), which keeps costs reasonable but means the evaluator isn't always as capable as the generator. That's fine for obvious mistakes: missing tests, type errors, logic that doesn't match the spec. Less reliable for subtle architectural judgment. A less capable model confidently telling a more capable one to rewrite something it got right is a real failure mode, and each retry burns tokens.

So the loop catches the obvious stuff on its own, and anything it can't fix within the budget gets surfaced for me. Evaluation never blocks because false positives exist and burning tokens on cycles that might not converge is not a tradeoff I want to make.

Things I learned building this

The sixty lines of duplication between executors. I knew it was wrong when I wrote it. Shipped it anyway because I wanted to see the feature work end to end before deciding what to extract. The runEvaluationLoop() extraction happened once I could see the real shape of the shared interface. Extracting it upfront would have meant guessing at the parameters, and I would have guessed wrong.

The disk I/O thing was dumb in hindsight. getEvaluationIterations() was getting called inside each task's evaluation path, reading from disk every time. Fifteen tasks, fifteen reads. Cached it at loop start once I profiled a real run. Similarly, the evaluator prompt originally said git diff HEAD~1, which breaks when a task produces multiple commits (and the agent doesn't always squash). Changed it to git log --oneline -10 first, then diff the actual range. Both of these only showed up when I stopped testing with toy inputs.
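The diff-range fix reduces to two small steps: count the task's commits from the log output, then diff that range instead of assuming a single commit. The helper names below are illustrative, not ralphctl's internals.

```typescript
// Count commits from `git log --oneline` output (one commit per
// non-empty line).
function countCommits(logOneline: string): number {
  return logOneline.split("\n").filter((l) => l.trim() !== "").length;
}

// Build a diff command covering the whole range. `git diff HEAD~1`
// breaks when a task produced several commits; this diffs them all.
function diffRange(commitCount: number): string {
  return commitCount > 0 ? `git diff HEAD~${commitCount}..HEAD` : "";
}
```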

Output parsing was annoying. The AI wraps things in markdown code fences sometimes, or returns a bare JSON array instead of the expected object. None of this showed up in testing. All of it showed up in real usage when a different model version formatted its output slightly differently. The parser had to handle what it actually receives, not what the prompt asks for. Same story with truncation: without a cap, evaluation output was getting persisted into tasks.json at whatever length the model felt like producing. 50KB blobs per task are not fun to work with. 2,000 character cap, added after the first real run convinced me.
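A defensive parser along those lines might look like this. The JSON shape is an assumption (the real parser looks for signals in free-form output); the fence-stripping, bare-array handling, and 2,000-character cap follow the description above.

```typescript
// Sketch of defensive evaluation-output parsing: strip markdown
// fences, accept a bare array as well as an object, and return null
// rather than crash on unparseable output.
function parseEvaluationResult(raw: string): { pass: boolean; critique: string } | null {
  const stripped = raw
    .replace(/^```(?:json)?\s*/m, "") // models sometimes fence their JSON
    .replace(/```\s*$/m, "")
    .trim();
  try {
    const parsed = JSON.parse(stripped);
    // Some models return a bare array instead of the expected object.
    const obj = Array.isArray(parsed) ? parsed[0] : parsed;
    if (obj == null || typeof obj.pass !== "boolean") return null;
    return { pass: obj.pass, critique: truncate(String(obj.critique ?? "")) };
  } catch {
    return null; // unparseable output gets surfaced, not crashed on
  }
}

// Cap persisted output so tasks.json doesn't accumulate huge blobs.
function truncate(text: string, max = 2000): string {
  return text.length > max ? text.slice(0, max) + "…" : text;
}
```

The shape here is the lesson from the article: handle what the model actually sends, not what the prompt asked for.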

What's next

Ralphctl is at v0.2.0. The evaluator is in, the rebrand is done.

What's next? See what time brings. I'll keep using it, see what works and what doesn't, improve what needs improving. I try to be a better version of myself tomorrow than I am today.

Source on GitHub. npm package. MIT license.
