AI Agent Harnesses: A Field Guide

An AI agent harness is the layer around a coding model (Claude, GPT, or similar) that turns one-shot prompting into a repeatable workflow. Instead of asking a model for code and hoping it's right, a harness runs a loop: plan the work, generate a change, then check it against gates like tests, type checks, linters, or a reviewer model before calling it done. The model writes the code. The harness decides whether to ship it, try again, or stop.
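Here is a minimal sketch of that loop in Python. It is an illustration under assumptions, not ralphctl's code: `run_harness`, `Outcome`, `GateResult`, and the `Model` and `Gate` callables are hypothetical names, and the model is treated as a plain function from prompt text to output text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SHIP = "ship"
    STOP = "stop"


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


# Hypothetical signatures: a model call maps a prompt to text,
# and a gate inspects a proposed change and reports pass or fail.
Model = Callable[[str], str]
Gate = Callable[[str], GateResult]


def run_harness(
    task: str, model: Model, gates: list[Gate], max_attempts: int = 3
) -> tuple[Outcome, str, int]:
    """Plan once, then generate and check until the gates pass or attempts run out."""
    plan = model(f"Plan the work for: {task}")
    feedback = ""
    change = ""
    for attempt in range(1, max_attempts + 1):
        change = model(f"Task: {task}\nPlan: {plan}\nPrevious failures: {feedback}")
        failures = [r for r in (gate(change) for gate in gates) if not r.passed]
        if not failures:
            # Every gate passed, so the harness (not the model) calls it done.
            return Outcome.SHIP, change, attempt
        # Feed the failing gates back into the next attempt.
        feedback = "; ".join(f"{r.name}: {r.detail}" for r in failures)
    # Attempts exhausted: stop instead of shipping an unverified change.
    return Outcome.STOP, change, max_attempts
```

The retry budget is the part worth noticing: without `max_attempts`, a change that can never pass its gates would loop forever, so "stop" has to be an explicit outcome.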

This guide is what I picked up building ralphctl, a sprint CLI for AI-assisted coding, while the idea grew from a glorified prompt runner into a real harness with a generator-evaluator loop. Read it start to finish, or skip to whatever you're wrestling with.

Key ideas

  • The loop matters more than the prompt. A harness owns the whole cycle (plan, generate, evaluate, verify), so its output answers to the same checks a human engineer would run.
  • Generator and evaluator. One model proposes a change; a second model, or just a deterministic gate, critiques it and sends it back for a supervised retry when it falls short. That's what catches the plausible-but-wrong code single-pass generation waves through.
  • Check gates decide "done." Tests, type checks, and linters define when a change is finished. The model's confidence doesn't.
  • Agent harness vs delivery harness. The agent harness gets a correct change produced and verified. The delivery harness is the bigger pipeline around it: requirement, plan, ship, review, release.
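To make "check gates decide done" concrete, here is a rough sketch that treats each gate as a shell command whose exit code is the verdict. The helper names (`run_gate`, `is_done`) and the `EXAMPLE_GATES` commands are assumptions for a Python project with pytest, mypy, and ruff installed; swap in whatever your project actually runs.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    output: str


def run_gate(name: str, command: list[str], cwd: str = ".") -> GateResult:
    """Run one check command; exit code 0 means the gate passed."""
    proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return GateResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def is_done(gates: dict[str, list[str]], cwd: str = ".") -> tuple[bool, list[GateResult]]:
    """A change is done only when every gate exits cleanly."""
    results = [run_gate(name, command, cwd) for name, command in gates.items()]
    return all(r.passed for r in results), results


# Example gate set. These commands are assumptions about the project,
# not something the harness can know on its own.
EXAMPLE_GATES = {
    "tests": [sys.executable, "-m", "pytest", "-q"],
    "types": [sys.executable, "-m", "mypy", "."],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
}
```

Calling `is_done(EXAMPLE_GATES)` in a repository returns the overall verdict plus each gate's captured output, which is the text a harness would hand back to the model on a retry. Nothing in that verdict depends on how confident the model sounded.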

In this guide

  1. Building RalphCTL: a sprint CLI for AI-assisted coding — I built a sprint CLI for Claude Code and GitHub Copilot. The problems I didn't see coming and what it took to make AI-assisted coding feel like a real workflow.
  2. From sprint CLI to agent harness: how ralphctl got an evaluator — Anthropic's first harness article inspired ralphctl. Their second one on generator-evaluator loops pushed me to build an evaluator for v0.2.0, and the tool changed identity.
  3. The delivery harness: your agent writes code, but who's shipping it? — Agent harnesses orchestrate code generation. Delivery is a longer pipeline. Here's the mental model I've been using to plug team, tools, context, and models into one value stream, and where the agent harness fits inside it.
  4. The harness era caught up: ralphctl and the convergence I bet on — When I started ralphctl, 'harness' was a word from one Anthropic post. Now it's the third phase of AI engineering, there's an arXiv paper, and the whole field has standardized on the patterns I built early.

Frequently asked questions

What is an AI agent harness?

An AI agent harness is the layer around a coding model that turns one-shot prompting into a repeatable loop: plan, generate, evaluate, verify. Check gates like tests, type checks, and linters decide when a change is actually done. The model writes the code; the harness decides whether to ship it, retry, or stop.

How is an agent harness different from a coding assistant?

A coding assistant answers a prompt; a harness owns a workflow. It controls the loop, feeds the model context, runs the gates, and retries when something fails, so the result answers to the same checks a human engineer would run instead of to a single model reply.

What is a generator-evaluator loop?

A generator-evaluator loop splits the work between two roles: a generator proposes a change, and a separate evaluator (another model, or just a deterministic gate) judges it against explicit criteria. The change only moves forward when the evaluator passes it; otherwise it goes back for another supervised attempt. That's how it catches the plausible-but-wrong output single-pass generation lets through.
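A small sketch of the two roles, using a deterministic evaluator for simplicity. Everything named here (`Verdict`, `evaluate`, `generate_until_accepted`, the `Generator` and `Criterion` types) is hypothetical; a model-based evaluator would replace `evaluate` with a second model call that returns the same kind of verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical roles: the generator maps (task, critique) to a candidate change,
# and each criterion is a named predicate over that candidate.
Generator = Callable[[str, list[str]], str]
Criterion = tuple[str, Callable[[str], bool]]


@dataclass
class Verdict:
    passed: bool
    critique: list[str] = field(default_factory=list)


def evaluate(candidate: str, criteria: list[Criterion]) -> Verdict:
    """Deterministic evaluator: name every explicit criterion the candidate fails."""
    failed = [name for name, check in criteria if not check(candidate)]
    return Verdict(passed=not failed, critique=failed)


def generate_until_accepted(
    task: str, generator: Generator, criteria: list[Criterion], max_rounds: int = 3
) -> tuple[Optional[str], list[Verdict]]:
    """Return the first candidate the evaluator passes, plus the verdict history."""
    history: list[Verdict] = []
    critique: list[str] = []
    for _ in range(max_rounds):
        candidate = generator(task, critique)
        verdict = evaluate(candidate, criteria)
        history.append(verdict)
        if verdict.passed:
            return candidate, history
        # The critique, not just "try again", goes back to the generator.
        critique = verdict.critique
    return None, history
```

Keeping the verdict history is a design choice: it gives whoever supervises the retries a record of which criteria failed on each round.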

What is the difference between an agent harness and a delivery harness?

An agent harness handles code generation: getting a correct change produced and verified. A delivery harness is the longer pipeline around it: turning a requirement into a plan, shipping the change, then running review and release. The agent harness is one stage inside it.
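One way to picture the nesting is a pipeline of stages where the agent harness is a single entry. This is a toy sketch under assumptions: the stage functions, `WorkItem`, and `deliver` are hypothetical, the stage bodies are stand-ins, and placing the agent harness between planning and shipping is my reading of the description above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WorkItem:
    requirement: str
    plan: str = ""
    change: str = ""
    verified: bool = False
    log: list[str] = field(default_factory=list)


# A stage mutates the work item and returns False to halt delivery.
Stage = Callable[[WorkItem], bool]


def plan_stage(item: WorkItem) -> bool:
    item.plan = f"steps for: {item.requirement}"
    return True


def agent_harness_stage(item: WorkItem) -> bool:
    # Stand-in for the whole generate/evaluate/verify loop.
    item.change = f"diff implementing {item.plan}"
    item.verified = True
    return item.verified


def ship_stage(item: WorkItem) -> bool:
    # Only verified changes are allowed to ship.
    return item.verified


def review_and_release_stage(item: WorkItem) -> bool:
    return bool(item.change)


PIPELINE: list[tuple[str, Stage]] = [
    ("plan", plan_stage),
    ("agent harness", agent_harness_stage),
    ("ship", ship_stage),
    ("review and release", review_and_release_stage),
]


def deliver(item: WorkItem, stages: list[tuple[str, Stage]]) -> WorkItem:
    """Run stages in order and stop at the first one that fails."""
    for name, stage in stages:
        ok = stage(item)
        item.log.append(f"{name}: {'ok' if ok else 'halted'}")
        if not ok:
            break
    return item
```

Running `deliver(WorkItem("add CSV export"), PIPELINE)` walks all four stages. If the agent harness stage returns `False`, nothing after it runs.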
