
Earning the overnight run: ralphctl from 0.8 to 0.13

Last post I confessed ralphctl broke users' data four times in a week with no clean fix. Ten releases later, it has one. The unglamorous month that took the harness from 'works in the demo' to something I trust to run overnight: safe migrations, a memory, parallel tasks, a real budget.

Lukas Grigis, Software Architect · 10 min read
Tags: ai, ai-agents, claude-code, cli, developer-tooling, typescript, open-source

Where this picks up

Three posts in, the shape of ralphctl is settled. It wraps an AI coding CLI in structure: one model generates the code, a second model checks it against the spec, the loop repeats until the work passes or a budget runs out. State lives in files under ~/.ralphctl/. I use it to ship real code, including ralphctl itself. The first post built the sprint CLI, the second added the evaluator that turned it into a harness, and the third was the victory lap: the rest of the field standardised on the structure I had bet on early.

The victory lap had a confession buried in it. I had broken ralphctl's on-disk data format four times in one week across the 0.7.x line, written an apology into the changelog, and admitted I did not have a clean answer. "Pre-1.0 software gets to break things," I wrote, "and every break spends trust I'd rather keep." I left it there as an open problem.

This post is the month I closed it. Ten releases, v0.8.4 to v0.13.1, late May into June.[1] Most of them are the unglamorous kind nobody screenshots. They are also the ones that turn a validated prototype into something you can leave running overnight without flinching.

The apology I made good on

The old sin was simple. I changed the on-disk schema, and your data did not survive the change. The closest thing to a safety net I offered was "back up and start fresh," which is a polite way of saying the safety work was your problem.

Version 0.13.0 replaced that with an actual migration. The first time you open the dashboard after upgrading, ralphctl stops and shows you a consent screen. It explains what is about to change, runs a dry-run summary so you see the scope before anything moves, and takes a full backup of your data/ directory before it touches a single file. Then it performs the renames atomically and writes a version stamp so it never runs the same migration twice. If anything fails partway, you get a rollback screen instead of a corrupted directory. The whole thing is idempotent and crash-safe. Run it from a script with no terminal and it skips the gate entirely, because a CI job should never block on a consent prompt.

Every step is reversible until the version stamp is written.
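The order of operations is the whole trick, so here is a minimal sketch of it. This is not ralphctl's actual migration code: the `Fs` interface, the stamp path, the `data.backup` location, and every function name are assumptions made for illustration, and "skips the gate" is read here as "proceeds without prompting".

```typescript
// Hypothetical sketch: names, paths, and the Fs interface are illustrative only.
interface Fs {
  exists(path: string): boolean;
  list(dir: string): string[];            // entry names directly under dir
  rename(from: string, to: string): void; // assumed atomic on one filesystem
  copyDir(from: string, to: string): void;
  write(path: string, text: string): void;
}

interface Rename { from: string; to: string }
type Outcome = "skipped" | "declined" | "migrated" | "rolled-back";

const STAMP = "data/.schema-version"; // assumed stamp location
const TARGET_VERSION = "2";           // assumed version label

// Dry run: compute the renames without touching anything on disk.
function planRenames(fs: Fs, slugFor: (id: string) => string): Rename[] {
  return fs
    .list("data")
    .filter((name) => !name.startsWith(".") && !name.includes("--")) // skip already-migrated entries
    .map((id) => ({ from: `data/${id}`, to: `data/${id}--${slugFor(id)}` }));
}

function migrate(
  fs: Fs,
  slugFor: (id: string) => string,
  confirm: (plan: Rename[]) => boolean, // the consent screen, shown the dry-run plan
  interactive = true,                   // false for non-TTY runs, which skip the gate
): Outcome {
  if (fs.exists(STAMP)) return "skipped"; // idempotent: never run twice
  const plan = planRenames(fs, slugFor);
  if (interactive && !confirm(plan)) return "declined";

  fs.copyDir("data", "data.backup"); // full backup before the first change
  const done: Rename[] = [];
  try {
    for (const step of plan) {
      fs.rename(step.from, step.to);
      done.push(step);
    }
    fs.write(STAMP, TARGET_VERSION); // last step: the point of no return
    return "migrated";
  } catch {
    // Undo completed renames in reverse order; the backup stays in place.
    for (const step of done.reverse()) fs.rename(step.to, step.from);
    return "rolled-back";
  }
}
```

Writing the stamp last is what makes a crash recoverable in this sketch: a run that dies before the stamp simply plans again on the next launch, and entries that already carry a slug are filtered out of the plan.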

What it migrates to is worth a sentence, because it is the other half of the same idea. Files used to be named by a raw id, the kind you cannot read at a glance. Now they are <id>--<slug>: still chronologically sortable, but you can finally tell which file is which project by looking at it. The learning ledger got a learnings.md mirror next to its machine-readable .ndjson, so you can read what the harness learned without a tool. Ids switched to uuidv7 so two things created in the same millisecond still sort in the order they happened.
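To make the naming scheme concrete, here is a small sketch of how an `<id>--<slug>` name could be built and split again. The `slugify` rules and function names are assumptions, not ralphctl's real implementation; the only thing taken from the layout above is the double-hyphen separator.

```typescript
// Hypothetical helpers: the slug rules here are an assumption for illustration.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse every non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, "");    // trim hyphens at both ends
}

function fileStem(id: string, title: string): string {
  return `${id}--${slugify(title)}`;
}

// The slug never contains "--" and a UUID only has single hyphens,
// so the first "--" is always the separator.
function parseStem(stem: string): { id: string; slug: string } {
  const cut = stem.indexOf("--");
  return cut === -1
    ? { id: stem, slug: "" }
    : { id: stem.slice(0, cut), slug: stem.slice(cut + 2) };
}
```

Because the id is a fixed-length prefix, a plain lexicographic sort of these names orders them by id first, which is what keeps the listing chronological when the ids themselves are time-ordered.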

None of that is clever. It is the boring engineering I skipped the first time, because the user count was small and moving fast felt free. "Back up and start fresh" was me offloading the safety work onto the user, and the consent splash is me finally doing it myself. Same instinct that broke the data four times, move fast and fix the architecture, but with a seatbelt I should have built earlier.

The harness remembers now

Until this stretch, every sprint started cold. Whatever the harness figured out in one run (that this repo needs a running Postgres before its tests pass, that a particular build flakes under load) was gone by the next run. It relearned the same lessons on my time and my token budget.

Version 0.9.0 gave it a memory. Per-attempt learning signals are appended to a project-scoped ledger, NDJSON at <dataRoot>/memory/<projectId>/learnings.ndjson. At sprint close, an opt-in step that defaults to "no" distils the curated learnings into each provider's native context file, so the next sprint starts already knowing what the last one discovered.
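An append-only NDJSON ledger is simple enough to sketch in a few lines. The `Learning` record shape below is an assumption (the real fields are not described here), and the functions work on strings so the file I/O stays out of the picture.

```typescript
// Hypothetical record shape: the actual ledger fields are not specified in this post.
interface Learning {
  taskId: string;
  attempt: number;
  note: string;
}

// One JSON object per line. JSON.stringify escapes newlines inside strings,
// so a record can never span two lines.
function toLine(entry: Learning): string {
  return JSON.stringify(entry) + "\n";
}

// Read the whole ledger top to bottom, tolerating a torn last line
// left behind by a crash in the middle of an append.
function parseLedger(text: string): Learning[] {
  const entries: Learning[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line) as Learning);
    } catch {
      // Skip unparseable lines instead of failing the whole read.
    }
  }
  return entries;
}
```

Appending then amounts to writing `toLine(entry)` to the end of the file, and a full read is a single pass over it, with no index to keep in sync.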

I want to be honest about how deliberately dumb this is. There is no retrieval engine, no embeddings, no vector store. One file per provider, read top to bottom. I considered the fancy version and did not build it, because the whole point is that a project's worth of notes fits in a file the model can just read, and the moment you add retrieval you add a second thing that can be wrong. The step is human-gated for a reason too: memory that writes itself is memory that quietly poisons itself, so I approve what gets remembered. It is the smallest version of memory that does the job, and over a multi-day project it noticeably cut down the Groundhog Day feeling.

Going parallel without clobbering myself

Version 0.9.0 also let tasks run in parallel. It is opt-in, maxParallelTasks from 1 to 5, default 1, so serial behaviour is unchanged byte for byte. Each dependency wave runs its tasks concurrently up to the cap; waves stay sequential, because a task that depends on another still has to wait its turn. Every parallel task gets its own isolated git worktree, and its commits fold back onto one shared sprint branch, so it is still one pull request per sprint.
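The wave model can be sketched as two small functions: one that groups tasks into dependency waves, and one that runs each wave with a concurrency cap. The `Task` shape and function names are assumptions, and the worktree and branch handling is left out entirely; only `maxParallelTasks` is a name from ralphctl itself.

```typescript
// Hypothetical task shape; git worktrees and commit folding are not modelled here.
interface Task {
  id: string;
  dependsOn: string[];
}

// Group tasks into waves: a task joins the first wave in which
// all of its dependencies have already completed.
function planWaves(tasks: Task[]): string[][] {
  const remaining = new Map(tasks.map((t) => [t.id, t] as const));
  const done = new Set<string>();
  const waves: string[][] = [];
  while (remaining.size > 0) {
    const ready = Array.from(remaining.values())
      .filter((t) => t.dependsOn.every((d) => done.has(d)))
      .map((t) => t.id);
    if (ready.length === 0) throw new Error("dependency cycle or unknown dependency");
    for (const id of ready) {
      remaining.delete(id);
      done.add(id);
    }
    waves.push(ready);
  }
  return waves;
}

// Waves run one after another; inside a wave, at most maxParallelTasks run at once.
async function runWaves(
  waves: string[][],
  maxParallelTasks: number,
  runTask: (id: string) => Promise<void>,
): Promise<void> {
  for (const wave of waves) {
    const queue = [...wave];
    const workerCount = Math.max(1, Math.min(maxParallelTasks, queue.length));
    const workers = Array.from({ length: workerCount }, async () => {
      while (queue.length > 0) {
        const id = queue.shift()!; // no await between the check and the shift, so this is safe
        await runTask(id);
      }
    });
    await Promise.all(workers);
  }
}
```

With `maxParallelTasks` set to 1 this degenerates to a plain serial loop over the tasks in wave order, which matches the idea that the default leaves serial behaviour alone.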

Going parallel is the easy part. The hard part is two agents not corrupting each other's work, and turning concurrency on surfaced a bug that had been sitting in serial mode the whole time. The cross-process lock that is supposed to stop two ralphctl runs from racing the same branch judged staleness by age: thirty seconds since the lock was taken and you were fair game. But an implement run holds the lock for its entire duration, which is routinely much longer than thirty seconds. So a long, perfectly healthy run became eligible for takeover while it was still going. A second process could then grab the same branch and silently break the one-pull-request guarantee.

The fix was to back the lock with a heartbeat. A live holder refreshes it in the background and is never falsely stolen, no matter how long the run lasts, while a crashed holder still gets reclaimed once it goes quiet. And if a run does lose its lock mid-flight, it now aborts instead of continuing to mutate a branch another process may already own. Parallelism did not create that race. It just guaranteed I would finally hit it, because nothing exposes a concurrency bug like running the thing concurrently on purpose.
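The difference between the two staleness rules fits in a few lines. This is a sketch under assumptions: the `LockRecord` fields and function names are invented for illustration, and reusing thirty seconds as the heartbeat timeout is a guess, since only the old age-based threshold is stated above.

```typescript
// Hypothetical lock record; field names are illustrative only.
interface LockRecord {
  pid: number;
  acquiredAt: number;  // ms timestamp when the lock was taken
  heartbeatAt: number; // ms timestamp of the holder's last refresh
}

const STALE_AFTER_MS = 30_000; // assumed to be the same window for both rules

// Old rule: any lock older than the window is fair game,
// even if its holder is alive and working.
function isStaleByAge(lock: LockRecord, now: number): boolean {
  return now - lock.acquiredAt > STALE_AFTER_MS;
}

// New rule: only a holder that has stopped refreshing can be reclaimed.
function isStaleByHeartbeat(lock: LockRecord, now: number): boolean {
  return now - lock.heartbeatAt > STALE_AFTER_MS;
}

// Called periodically in the background by a live holder.
function beat(lock: LockRecord, now: number): LockRecord {
  return { ...lock, heartbeatAt: now };
}

// A run checks this before mutating the branch and aborts when it returns false.
function stillOwns(current: LockRecord | undefined, pid: number): boolean {
  return current !== undefined && current.pid === pid;
}
```

For a healthy ten-minute run that refreshed five seconds ago, the age rule reports the lock as stale while the heartbeat rule does not, which is exactly the false takeover described above.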

The tool grew opinions about money

When I started, ralphctl ran the best model I had on every step. That is the naive default, and it is expensive in a way you do not notice until the bill or the rate limit shows up.

Two releases gave the tool an opinion about cost. The presets went from a handful to twenty, organised into families: an economic family that runs the generator one tier below flagship and only climbs higher when a task genuinely stalls, a fast family at low effort for light work, a frontier family that runs flagship everywhere, and strong-gate variants that pair a cheap generator with a permanently top-tier evaluator. The default posture flipped. Use the cheap model, and let the task earn the expensive one.

"Earn it" is the part I am happiest with. When the generator-evaluator loop stops making progress, the harness used to bump the model once and hope. Now it climbs the ladder one rung at a time across successive stalls, carrying the specific critique up to each stronger model, and only when it runs out of rungs does it fall back to telling the model to change its approach. A task that still cannot pass keeps its work and ships flagged "done with warnings" rather than getting thrown away, and that flag now shows up everywhere it should: the journal, the pull-request body, a glyph on the task card, each with a plain-language reason. The per-module verify gates fit the same theme. A task in a monorepo that touches one module no longer pays to re-run every other module's test suite.
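The one-rung-at-a-time rule can be written as a small pure function. The model names, the `Action` type, and `onStall` are all hypothetical; this sketch only captures the decision described above, not how ralphctl detects a stall or phrases the critique.

```typescript
// Hypothetical types and names; stall detection itself is out of scope here.
type Action =
  | { kind: "escalate"; model: string; critique: string }
  | { kind: "change-approach"; model: string; critique: string };

// ladder is ordered from cheapest to strongest model.
function onStall(ladder: string[], current: string, critique: string): Action {
  const index = ladder.indexOf(current);
  const next = index === -1 ? undefined : ladder[index + 1];
  return next !== undefined
    ? { kind: "escalate", model: next, critique }           // climb exactly one rung
    : { kind: "change-approach", model: current, critique }; // out of rungs: new strategy, same model
}

// Usage: successive stalls climb the ladder one step each time.
const ladder = ["small", "mid", "flagship"]; // placeholder tier names
const firstStall = onStall(ladder, "small", "tests for the parser still fail");
const lastStall = onStall(ladder, "flagship", "tests for the parser still fail");
```

Carrying the critique in every action is the point of the design: the stronger model starts from what the evaluator objected to rather than from a blank prompt.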

It is the difference between a tool that spends your money and a tool that has a budget. The budget is what makes it usable on real work instead of demos.

What's still hard: the ground moves under you

The open problem this month was not mine. It belonged to the vendors.

Mid-release-train, a model that sat in my own frontier preset got export-controlled out from under me. Anthropic's Fable 5 went down server-side on June 12, and a preset I shipped pointed straight at it. I could not just delete the model, because persisted configs referenced it and deleting it would break them, so I added a suspended-models list: the catalog entries stay valid, the adapters reject those models fast at launch with a clear message, and the pickers flag them (suspended). Switching the model back on is a one-line revert for the day it returns.
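The three behaviours of a suspended-models list (configs stay valid, launches fail fast, pickers flag the entry) can be sketched like this. The model ids, messages, and function names are placeholders, not ralphctl's catalog or API.

```typescript
// Hypothetical catalog and ids; the real entries are not reproduced here.
const CATALOG = new Set<string>(["vendor-fast", "vendor-flagship"]);

// Suspended models stay in the catalog; this map only records why they cannot run.
const SUSPENDED = new Map<string, string>([
  ["vendor-flagship", "unavailable upstream"],
]);

// Persisted configs that reference a suspended model still validate.
function isKnownModel(id: string): boolean {
  return CATALOG.has(id);
}

// Adapters call this at launch so the failure is immediate and readable.
function assertLaunchable(id: string): void {
  const reason = SUSPENDED.get(id);
  if (reason !== undefined) {
    throw new Error(`Model "${id}" is suspended: ${reason}. Pick another model or preset.`);
  }
}

// Pickers keep the entry visible but mark it.
function pickerLabel(id: string): string {
  return SUSPENDED.has(id) ? `${id} (suspended)` : id;
}
```

In this shape, bringing a model back is a one-line change: delete its entry from `SUSPENDED` and nothing else has to move.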

That is the standing tax of building on top of other people's model catalogs. They change without warning. Copilot's supported-models list got reshuffled the same month, so I reconciled against it, dropping the models that were de-listed and adding the ones that were renamed. I also added a per-session probe that narrows each picker to the models your account can actually run, and made it fail open, so a probe error never hides every model from you. None of this is a feature anyone asked for. It is the maintenance cost of a tool whose whole premise is "use any provider," and it does not go away. You reconcile against their lists again, and again after that.
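"Fail open" is easy to get backwards, so here is a minimal sketch of the picker narrowing. The function name and the synchronous `probe` callback are assumptions for illustration; a real probe would be an asynchronous network call.

```typescript
// Hypothetical helper: probe stands in for a per-session account check.
function narrowPicker(catalog: string[], probe: () => string[]): string[] {
  try {
    const runnable = new Set(probe());
    return catalog.filter((model) => runnable.has(model));
  } catch {
    // Fail open: a broken probe must never hide every model from the user.
    return catalog;
  }
}

// Usage with placeholder model ids.
const catalog = ["vendor-fast", "vendor-flagship"];
const narrowed = narrowPicker(catalog, () => ["vendor-fast"]);
const afterProbeError = narrowPicker(catalog, () => {
  throw new Error("probe timed out");
});
```

The trade-off is deliberate: a failed probe may show a model the account cannot run, and that mistake surfaces as a clear error at launch, whereas failing closed would leave an empty picker and no way forward.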

Where it stands

ralphctl is at v0.13.1, on npm, three providers, MIT. npm install -g ralphctl and it runs. The default loop still generates with one lab's flagship and evaluates with another's, the TUI is still the surface you sit in front of, and the architecture is the cleanest it has been.

The thing I keep relearning is that the interesting work and the important work are rarely the same work. The third post was the interesting story: the field caught up to a bet I made early. This one is the important story, and it is almost entirely unglamorous. A consent screen. A heartbeat on a lock. A flat file the model reads top to bottom. A list of models I am not allowed to use this week. None of it is going in a paper. All of it is what moves a tool from "works in the demo" to "I left it running overnight and trusted what I found in the morning."

That trust is the whole game, and it is a post of its own. The harness era arrived last month. This month I spent making the harness boring enough to depend on.

Source on GitHub. npm package. MIT license.

Footnotes

  1. Every ralphctl version and behaviour claim here is drawn from the project's CHANGELOG: the 0.9.0 parallel execution, learning ledger, and lock-heartbeat fix; the 0.11.0 economic presets and patient stall recovery; the 0.12.0 twenty-preset matrix, per-module verify gates, and Fable 5 suspension; and the 0.13.0 consent-gated data migration and <id>--<slug> layout.

Frequently asked questions

What changed in ralphctl between v0.8 and v0.13?

Ten releases over four weeks: opt-in parallel task execution in isolated git worktrees (0.9.0), a per-project learning ledger so the harness remembers across sprints (0.9.0), a skills subsystem (0.10.0), patient one-rung-at-a-time model escalation (0.11.0), a preset matrix that grew to twenty cost-tiered configurations (0.12.0), and a consent-gated, crash-safe data migration (0.13.0).

Can ralphctl run coding tasks in parallel?

Yes, since 0.9.0, opt-in via settings.concurrency.maxParallelTasks (1 to 5). The default is 1, so serial behaviour is unchanged. Each dependency wave runs its tasks concurrently up to the cap while waves stay sequential. Every parallel task gets its own isolated git worktree and its commits fold back onto a single shared sprint branch, so it is still one pull request per sprint.

How does ralphctl avoid corrupting your data when you upgrade?

As of 0.13.0, the first dashboard launch after an upgrade shows a consent screen, runs a dry-run summary, takes a full backup of your data directory before touching anything, performs atomic renames, and writes a version stamp so a migration never runs twice. It is idempotent and crash-safe, with a rollback screen on failure. Non-interactive (non-TTY) runs skip the gate so scripts are never blocked.

Does ralphctl remember anything between sprints?

Yes, since 0.9.0. Per-attempt learning signals are appended to a project-scoped NDJSON ledger. At sprint close, an opt-in, human-gated step distils the curated learnings into each provider's native context file. It is deliberately simple: one file per provider, full-file read-back, no retrieval engine or vector store.
