Trusting AI-generated code: the harness, not the model

Trusting AI-generated code you didn't read line by line: trust the harness you built, not the model. AI now assists ~42% of committed code.

Lukas Grigis, Software Architect · June 7, 2026 · Updated June 30, 2026 · 8 min read

Part of the guide: AI Agent Harnesses: A Field Guide

Key takeaways

Trust the harness you built around the AI, not the model: the scripts, gates, and checks are deterministic, inspectable, and yours.
Sonar's 2026 survey: AI assists about 42% of committed code, yet 96% of developers don't fully trust the output and only 48% always verify it before committing.
Wrap the model like a flaky network call: clear interface, structured result, executable verification, then trust the wrapper, not the stochastic core.
A few things still get read line by line regardless of green checks: anything that costs money, security and authorization, and irreversible data migrations.

The last few releases of ralphctl were built, in part, by ralphctl

ralphctl is a CLI I wrote to run AI coding agents through a structured sprint. At some point I started using the released version to build the next one. I change something, point a dev build at the work, and let it run. The tool I ship for everyone else now ships itself.

The first time I noticed, it felt like a small magic trick. Then it felt like proof. I let it build itself because I trust it, and I only trust it because of how it's built.

Here's the uncomfortable part. I ship code I didn't read line by line. Not all of it, and not the parts that matter most (more on those later), but far more than I'd have admitted two years ago. And I sleep fine.

My trust moved. It came off the code the agent writes and went onto the harness I built around it. The model is the only non-deterministic line in an otherwise deterministic script. I trust the script.

You can't read it all, and "read it all" was never the plan

AI assistance now touches a lot of the code that reaches main. Sonar's 2026 survey puts AI-assisted code at around 42% of what developers commit, by the developers' own estimate. Of that same group, 96% say they don't fully trust that the output is correct, and only 48% always verify it before committing.¹

That gap is the whole problem, and the right reading of it isn't "developers are lazy." Line-by-line review never scaled to this volume. Pretending it does just means the reading gets shallower while everyone tells themselves it hasn't.

"Never trust it" and "just trust it" are both wrong

The two loudest answers are both dead ends. "Never trust AI code" throws away the leverage and doesn't survive contact with the volume. "Just trust it" is how unreviewed nonsense reaches production.

The smarter answer, the one I keep seeing from people who actually ship with agents, is "build guardrails, review the spec not the diff." That's right, and it's half the answer. It tells you to move trust to a harness. What it skips is how the harness earns that trust, and the fact that the harness lies too.

Trust didn't vanish. It moved.

Here's the reframe that made it click for me. The AI provider is just another unreliable I/O boundary, like a flaky network call to a service you don't control. You already know how to handle one of those. You don't trust a network call. You wrap it in timeouts, contracts, retries, and tests, and then you trust the wrapper.

Do the same to the model. Make each operation a deterministically callable unit: a script that calls the model through a clear interface and takes a structured result back. It hands that result to the next step and verifies it with executable checks. The model in the middle is stochastic. Everything around it is deterministic, inspectable, and yours. That wrapper is the harness.
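A minimal sketch of that wrapper in TypeScript, under stated assumptions: the model call is injected as a plain function, and the result contract (`summary`, `filesChanged`) is a hypothetical shape invented for this example, not ralphctl's actual types. It shows the three parts named above: a clear interface in, a structured result out, and an executable check before anything is passed on.

```typescript
// Hypothetical shapes for illustration only; not ralphctl's real contract.
type ModelCall = (prompt: string) => string; // the stochastic step, injected
type StepSuccess = { ok: true; summary: string; filesChanged: string[] };
type StepFailure = { ok: false; reason: string };
type StepResult = StepSuccess | StepFailure;

// Turn raw model output into a structured result, rejecting anything
// that does not match the contract.
function parseResult(raw: string): StepResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "output was not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, reason: "output was not a JSON object" };
  }
  const { summary, filesChanged } = data as Record<string, unknown>;
  if (typeof summary !== "string") {
    return { ok: false, reason: "summary must be a string" };
  }
  if (!Array.isArray(filesChanged) || !filesChanged.every((f) => typeof f === "string")) {
    return { ok: false, reason: "filesChanged must be a string array" };
  }
  return { ok: true, summary, filesChanged };
}

// Call the model, parse, verify. Retry a bounded number of times, the same
// way you would treat a flaky network call.
function runStep(
  callModel: ModelCall,
  prompt: string,
  verify: (result: StepSuccess) => boolean,
  maxAttempts = 3,
): StepResult {
  let last: StepResult = { ok: false, reason: "no attempts made" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let parsed: StepResult;
    try {
      parsed = parseResult(callModel(prompt));
    } catch (err) {
      parsed = { ok: false, reason: `model call failed: ${String(err)}` };
    }
    if (parsed.ok && verify(parsed)) return parsed;
    last = parsed.ok
      ? { ok: false, reason: "executable verification failed" }
      : parsed;
  }
  return last;
}

// Usage: a fake model that returns garbage once, then a valid result.
let demoCalls = 0;
const demoModel: ModelCall = () => {
  demoCalls += 1;
  return demoCalls < 2
    ? "sure, here you go!"
    : JSON.stringify({ summary: "renamed helper", filesChanged: ["src/util.ts"] });
};
const demoResult = runStep(demoModel, "rename the helper", (r) => r.filesChanged.length > 0);
console.log(demoResult.ok, demoCalls);
```

The caller only ever sees a `StepResult`. Whatever the model does in the middle, the next step gets either a value that passed the contract and the check, or a reason it didn't.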

I don't trust the AI. I trust the pipe I built around it.

The pipe in practice: gates, an evaluator, and signals

In ralphctl, a sprint is a sequence of deterministic steps, not one big "go" prompt. A verifyScript gate brackets every task: work starts from a verified-green baseline and has to end green (typecheck && lint && test), so "the agent finished" and "the code is still green" stay two separate facts. A generator model produces the work, a separate evaluator model grades it against explicit criteria, and structured signals pass between stages through a file-based signals.json contract instead of vibes.² Clear interfaces in, clear signals out, executable verification at each seam.
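The bracketing gate can be sketched in a few lines of TypeScript. This is an illustration, not ralphctl's implementation: `runGatedTask`, `Verify`, and the outcome names are hypothetical, and the verify step is injected as a function that returns true when the checks pass (in real use it would run something like the typecheck, lint, and test command).

```typescript
// Hypothetical names for illustration; not ralphctl's actual API.
type Verify = () => boolean; // true when typecheck, lint, and tests all pass
type Task = () => void;      // the agent's work; may throw

type GateOutcome =
  | { status: "baseline-red" }                 // never started: repo was not green
  | { status: "agent-failed"; error: string }  // the agent itself errored
  | { status: "finished-but-red" }             // agent finished, checks now fail
  | { status: "green" };                       // agent finished, checks still pass

function runGatedTask(verify: Verify, task: Task): GateOutcome {
  // Gate 1: refuse to start from a broken baseline, otherwise a red result
  // afterwards could not be attributed to the task.
  if (!verify()) return { status: "baseline-red" };

  try {
    task();
  } catch (err) {
    return { status: "agent-failed", error: String(err) };
  }

  // Gate 2: "the agent finished" has been established above;
  // "the code is still green" is checked separately here.
  return verify() ? { status: "green" } : { status: "finished-but-red" };
}

// Usage: a task that completes without error but breaks the build.
let broken = false;
const outcome = runGatedTask(
  () => !broken,
  () => {
    broken = true;
  },
);
console.log(outcome.status);
```

Keeping `finished-but-red` as its own outcome is the point of the sketch: a task that returns cleanly and a repository that still passes its checks are recorded as two different facts.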

The model never gets to be the thing I trust. It gets to be one step I can re-run, inspect, and gate. That pipeline is one station in a bigger delivery harness that orchestrates the whole path to production, but the trust argument is the same at every step.

There's one more part no vendor's guardrails pitch can offer you: I built this one. I know its contracts and its failure modes because I wrote them. That's the difference between trusting a harness and trusting a black box that happens to be called a harness. You can borrow someone else's, but then the homework is understanding it well enough to know where it'll let you down. The source is on GitHub if you want the shape of it.

The harness lies too

This is the part the "just build guardrails" crowd skips, and it's the honest center of the whole thing: the scaffolding isn't infallible. Trust moved, it didn't disappear.

My own evaluator can be confidently wrong. A weaker evaluator model will sometimes tell a stronger generator to rewrite code that was already correct, and say so with complete conviction. I cap that retry loop on purpose (a bounded turn budget, and a red verdict never blocks the run), because a confident wrong "fix" is a real failure mode, not a hypothetical. Green can mean nothing. Red can be noise.
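A rough TypeScript sketch of a capped, non-blocking review loop, assuming the generator and evaluator are injected as plain functions. The names (`reviewLoop`, `Verdict`, `maxTurns`) are made up for this example and are not ralphctl's API; the source only states that the turn budget is bounded and that a red verdict never blocks the run.

```typescript
// Hypothetical shapes for illustration; not ralphctl's actual types.
type Verdict = { pass: boolean; feedback: string };
type Generate = (feedback: string | null) => string; // returns candidate work
type Evaluate = (work: string) => Verdict;

interface LoopResult {
  work: string;
  turnsUsed: number;
  lastVerdict: Verdict;
}

function reviewLoop(generate: Generate, evaluate: Evaluate, maxTurns: number): LoopResult {
  if (maxTurns < 1) throw new Error("maxTurns must be at least 1");

  let work = generate(null);
  let verdict = evaluate(work);
  let turns = 1;

  // Retry only while the evaluator objects AND budget remains. A confidently
  // wrong evaluator can waste at most maxTurns - 1 rewrites.
  while (!verdict.pass && turns < maxTurns) {
    work = generate(verdict.feedback);
    verdict = evaluate(work);
    turns += 1;
  }

  // A red verdict is returned as information, not thrown as an error:
  // the evaluator advises, it does not block the run.
  return { work, turnsUsed: turns, lastVerdict: verdict };
}

// Usage: an evaluator that is never satisfied still terminates.
let drafts = 0;
const stubborn = reviewLoop(
  () => {
    drafts += 1;
    return `draft ${drafts}`;
  },
  () => ({ pass: false, feedback: "rewrite it" }),
  3,
);
console.log(stubborn.turnsUsed, stubborn.lastVerdict.pass);
```

The loop hands back the last draft together with the last verdict, so whoever reads the result can see that the evaluator was still red when the budget ran out and weigh that against the deterministic checks.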

Green is also necessary, not sufficient. The deterministic checks catch a whole class of failures in seconds, but some confidence only ever comes from manual steps I do myself. The clearest case for me is visual design. I'm a backend developer, not a designer, yet I can tell when something works and still looks wrong. An agent usually hands me a component that renders and passes its tests but needs a human eye before I'd show it to anyone, because no check encodes "this feels right." That's not an AI thing: every piece of software still needs a few manual passes for the final polish that no executable test captures. The harness gets me to a candidate. I get it to shippable.

So I don't trust the harness blindly either. I trust it the way I trust any system I built: knowing exactly where it's weak, and where it ends.

Where I still read every line

So what stays on the no-skim list, no matter how green the checks are? A few things, every single time.

Anything that costs money. Not just payment code in the obvious sense, but anything that drives spend, including how the app uses models: which model a path picks, how many times it calls, and the tests that pin that down. A broken feature fails loudly the first time someone runs it. A loop that quietly calls an expensive model ten times, where one call would do, fails on the invoice weeks later, and the harness reported green the whole way.
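One way to pin that down in a test, sketched in TypeScript under assumptions: `withCallBudget`, `summarise`, and the model name `"cheap-model"` are all hypothetical, and the client is a stand-in function rather than any real provider SDK. The wrapper records which model each call used and refuses to go past a call budget, so a path that loops where one call would do fails in the test run instead of on the invoice.

```typescript
// Hypothetical client signature; a stand-in for whatever SDK the app uses.
type Client = (model: string, prompt: string) => string;

interface RecordedCall {
  model: string;
  prompt: string;
}

// Wrap a client so every call is recorded and a hard call budget is enforced.
function withCallBudget(client: Client, maxCalls: number) {
  const calls: RecordedCall[] = [];
  const guarded: Client = (model, prompt) => {
    if (calls.length >= maxCalls) {
      throw new Error(`call budget of ${maxCalls} exceeded`);
    }
    calls.push({ model, prompt });
    return client(model, prompt);
  };
  return { client: guarded, calls };
}

// Hypothetical code path under review: one batched call for all items,
// rather than one call per item.
function summarise(client: Client, items: string[]): string {
  return client("cheap-model", `Summarise these items:\n${items.join("\n")}`);
}

// Usage: the test pins both the model choice and the number of calls.
const fake: Client = () => "summary";
const budgeted = withCallBudget(fake, 1);
summarise(budgeted.client, ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]);
console.log(budgeted.calls.length, budgeted.calls.map((c) => c.model));
```

A test like this only protects the paths someone thought to write it for, which is why the code that decides model and call count still gets read by hand.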

Security and authorization. When I'm not a hundred percent sure an auth path actually works, I check it myself: how roles are evaluated, how the identity provider is wired (Keycloak or otherwise), how security is configured across the app. This is where "the code runs" and "the code is correct" sit furthest apart, and no verifyScript catches "this works perfectly and authorizes the wrong person."

Anything irreversible, data migrations above all. Those get tested locally more than once, but testing isn't really the point. Before a migration reaches production I want to know what's hot: what could fail, how big the blast radius is, and what the plan is when it does. If I've read it, I can write tests for the scary cases up front and walk in prepared, instead of reverse-engineering a 2am incident.

The agent can often do all three. I read these to keep my own model of the system intact, so when something breaks I already know why, and I still know what to expect from my own application. The day I can't say how authorization flows through my app, or what a migration will do under load, is the day I've outsourced something I shouldn't have.

The skill now is knowing which layer to distrust

The new discipline is knowing which layer to distrust, and owning the one you trust. Wrap the stochastic step in a deterministic shell. Build that shell, or, if you borrow one, understand it well enough to name its weak spots.

The models will keep getting better. The wrapping is what stays. I trust ralphctl enough to let it build itself, and I distrust it exactly where I know it's blind. That's not a contradiction. That's the job now.

Footnotes

¹ Sonar, Sonar Data Reveals Critical "Verification Gap" in AI Coding: 96% Don't Fully Trust Output, Yet Only 48% Verify It (2026 State of Code Developer Survey; fieldwork Oct 2025, 1,149 developers per the full report). Source of the 96% / 48% trust-versus-verify split and the 42% AI-assisted share-of-commits figure.
² ralphctl behaviour claims here are drawn from the project's CHANGELOG and npm package (MIT): the file-based signals.json contract, the generator-evaluator defaults (Claude Opus 4.8 generator, OpenAI Codex / GPT-5.5 evaluator), and the bounded, non-blocking evaluator retry.

Frequently asked questions

Should you read every line of AI-generated code before merging?

Not all of it, and reading everything stops scaling once an agent writes most of your code. The workable rule is to read every line of the high-blast-radius parts (auth, security, data migrations, anything that costs money), trust executable verification for the rest, and keep a few manual confirmation steps for the final shippable-to-a-real-user polish.

Why does trust belong in the harness instead of the model?

The harness is everything around the model: the scripts, interfaces, signals, and checks that turn a stochastic model call into a repeatable, verifiable pipeline. Those parts are deterministic, inspectable, and yours, so they can earn trust the way any system you built does. The model stays one step you can re-run, inspect, and gate.

How can you trust code an AI wrote?

You don't trust the model. You wrap it like any unreliable I/O boundary: a deterministic step calls the model through a clear interface, takes a structured result, hands it to the next step, and verifies it with executable checks. You trust the deterministic shell, not the stochastic core.

Resources

Repository

RalphCTL

The agent harness this post is about. Deterministic sprint steps wrapping a generator-evaluator loop across Claude Code, OpenAI Codex, and GitHub Copilot.

Tool

ralphctl on npm

Install ralphctl globally via npm.

Article

Sonar: the AI-coding "verification gap"

96% of developers don't fully trust AI-generated code, yet only 48% always verify it before committing. The trust-versus-verify gap behind this post.

Article

Harness design for long-running application development

Anthropic Engineering on the generator-evaluator pattern: separate the model that produces work from the one that reviews it. The basis for ralphctl's evaluator.
