The delivery harness: your agent writes code, but who's shipping it?
Agent harnesses orchestrate code generation. Delivery is a longer pipeline. Here's the mental model I've been using to plug team, tools, context, and models into one value stream, and where the agent harness fits inside it.
The agent wrote the code. Now what.
I've spent the last couple of weeks building and using an agent harness. If you've read the ralphctl posts, you know the story: sprint CLI, generator and evaluator loop, headless Claude Code runs, multi-repo sprints, check scripts that tell me whether the work actually passed. It works. I ship real features through it. Some mornings I wake up to eight completed tasks and spend an hour reviewing them instead of writing them.
And yet. There is a gap I keep walking into.
The agent harness produces code. That is what it is for. But "code written" is not "feature shipped." Somewhere between the prompt and the production deploy there are still five or six things that need doing, and my harness has opinions about exactly zero of them. Business analysis. Architecture. UI and UX. QA. DevOps. Operating the thing once it's live. An entire SDLC wrapped around the one phase I automated.
So I kept asking myself: what about the rest of the value stream?
The SDLC did not change. The mental model did.
Whenever I talk to people about this, the conversation drifts toward "AI changes everything." It doesn't. The SDLC is still the SDLC. You analyse, you design, you build, you verify, you ship, you operate. Open any textbook definition of the lifecycle, from GitHub's explainer [1] to AWS's [2], and you get the same loop, same phases, same reasons they exist. That loop has survived waterfall, agile, DevOps, microservices, the cloud, containerization, and it will survive this one too.
What's changing is the mental model. How we frame the work. Where cognitive effort moves. Which parts a human sits with and which parts an agent chews through overnight. None of that is the first mental-model shift software has gone through. Containers were one. Continuous delivery was one. Agile was a big one, and the arguments still haven't settled down.
This is just the next one. Feasible, not trivial.
Agent harness vs delivery harness
Here is the distinction I keep coming back to.
An agent harness orchestrates models to produce code. You give it a task, it plans, it executes, it verifies. That is a local optimum. Ralphctl is mine. Claude Code is one underneath it. There are plenty of others.
A delivery harness is a different thing. It orchestrates the whole value stream, from vague idea to deployed product. The agent harness is one station inside it. Code generation matters, but it matters about as much as the CI pipeline mattered in 2015: a necessary piece, not the interesting one.
When I put it that way it almost sounds obvious. The thing that surprised me is how rare it is to see teams think about delivery this way. Most are still stuck at "Copilot makes me 20% faster at typing." That is a tool conversation, not an architecture conversation.
What plugs into the harness
A harness is machinery. You plug things into it. In mine, there are four slots.
Team
People with roles and capacities. Architect, analyst, designer, engineer, QA, DevOps, product manager, whoever else is on the manifest. Each role contributes cognitive work. The point isn't whether a role is "replaceable by AI" (mostly a useless question). The point is which role owns which artifact and what good looks like when that artifact lands.
Capacity matters too. People have days off. Context switches hurt. A senior engineer doing three things in parallel is producing at maybe 60% of what they'd produce on one focused problem. The harness has to model that, or the plan it spits out is fiction.
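As a toy sketch of what "the harness has to model that" could mean: the 60% figure comes from the paragraph above, and the rest of the penalty curve is a made-up assumption, not a measurement.

```python
# Toy capacity model: effective output drops as parallel workstreams rise.
# Only the 3-task ~60% point is from the text; the other factors are
# illustrative assumptions.
CONTEXT_SWITCH_FACTOR = {1: 1.0, 2: 0.8, 3: 0.6}

def effective_capacity(days_available: float, parallel_tasks: int) -> float:
    """Days of focused output left after the context-switch penalty."""
    factor = CONTEXT_SWITCH_FACTOR.get(parallel_tasks, 0.4)  # worse beyond 3
    return days_available * factor

# A senior engineer with 8 working days, juggling 3 problems in parallel:
print(effective_capacity(8, 3))  # 4.8 focused days, not 8
```

A plan built on the raw 8 days is the fiction the paragraph warns about; a plan built on the discounted number at least has a chance of surviving contact with the sprint.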
Tools
IDEs, coding CLIs, linters, static analysis, build systems, CI, observability stacks, ticketing systems, chat, wikis. And yes, AI assistants and agents. They are tools too, just a new kind.
The thing I want out of a tool in this picture is that it takes something in and produces something the next role can read. Preferably as text. A Jira ticket, a Figma export with notes, an ADR, a diff, a deploy log. Text is the interoperability protocol. Anything that ends up as a binary blob or a screenshot nobody describes is a dead end for the rest of the pipeline.
Context
Context was the word of 2025 and it's doing a lot of heavy lifting. In this frame I mean it literally: the text that describes the problem, the product, the environment, the constraints, the decisions, the history. Requirements, ADRs, runbooks, domain glossaries, postmortems, the annoying Slack message where someone explained why the payment provider does the thing.
Andrej Karpathy's "LLM wiki" gist [3] is the framing I keep coming back to. His setup is "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase." The wiki is a persistent, compounding artifact. Cross-references already there. Contradictions already flagged. Humans curate sources and ask questions. Models do the summarising and the bookkeeping. Both read and write the same files. That is context as shared substrate, not a pile of PDFs in SharePoint and not prompt templates stashed in a folder.
Something like MCP wires that corpus into each role. The analyst's agent reads the parts it needs, the QA agent reads its own slice, the architect's agent writes new ADRs back into the same place. Each role gets the tools that fit it, pointed at the context that matters to it, with write-back where it makes sense. The corpus is shaped per role: the analyst section does not look like the runbook section, even if they share vocabulary. Same underlying text, different views on it. Written for models to consume, but still readable by a human at 3am when something is on fire.
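A minimal sketch of what role-scoped slices could look like. The corpus is an in-memory dict and the section names are invented for illustration; a real setup would expose these slices as MCP resources rather than Python functions.

```python
# Hypothetical shared corpus, keyed by path. Section names are examples.
CORPUS = {
    "requirements/checkout.md": "analyst view of the checkout flow...",
    "adr/007-payment-provider.md": "why the provider does the thing...",
    "runbooks/payments.md": "what to do at 3am when payments fail...",
}

# Which slices each role's agent may read, and where it may write back.
ROLE_SCOPES = {
    "analyst":   {"read": ("requirements/", "adr/"), "write": ("requirements/",)},
    "architect": {"read": ("requirements/", "adr/"), "write": ("adr/",)},
    "qa":        {"read": ("requirements/",),        "write": ("requirements/",)},
}

def read_slice(role: str) -> dict[str, str]:
    """The view of the corpus this role's agent is allowed to see."""
    prefixes = ROLE_SCOPES[role]["read"]
    return {path: text for path, text in CORPUS.items() if path.startswith(prefixes)}

def write_back(role: str, path: str, text: str) -> None:
    """Write-back only where it makes sense for this role."""
    if not path.startswith(ROLE_SCOPES[role]["write"]):
        raise PermissionError(f"{role} may not write {path}")
    CORPUS[path] = text

# The analyst's agent never sees the runbooks:
assert "runbooks/payments.md" not in read_slice("analyst")
```

Same underlying text, different views on it: each role gets a projection, and the write-back rules are what keep the corpus compounding instead of sprawling.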
Context decays if you do not maintain it. So does a codebase, but codebases have compilers. Context has nothing except the discipline of the team writing it down.
Models
Chat models, embedding models, rerankers, vision models. Each role inside the harness reaches for different ones. The architect might want something deep that can chew through an RFC. The QA agent might do fine on a cheaper model with tight prompts. The embedding model under your retrieval layer is a different choice from the one generating the code.
This is already how I think about ralphctl internally. Opus for refine and plan, cheaper models for implementation and evaluation. A model ladder. The delivery harness is the same idea, just wider: each station on the value stream has its own model profile, not "pick one LLM for everything."
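The model-ladder idea can be sketched as a per-station profile. The model names and profile fields below are placeholders, not real ralphctl configuration.

```python
# Illustrative model ladder: each value-stream station gets its own model
# profile instead of one LLM for everything. All names are examples.
MODEL_PROFILES = {
    "analyse": {"chat": "deep-reasoning-model"},
    "design":  {"chat": "deep-reasoning-model", "vision": "vision-model"},
    "build":   {"plan": "deep-reasoning-model", "implement": "cheap-fast-model"},
    "verify":  {"chat": "cheap-fast-model"},  # tight prompts, cheap model
    "operate": {"summarise": "cheap-fast-model", "embed": "embedding-model"},
}

def model_for(station: str, task: str) -> str:
    """Pick the model for a task at a station, falling back down the ladder."""
    profile = MODEL_PROFILES[station]
    return profile.get(task, profile.get("chat", "cheap-fast-model"))
```

The point of the indirection is that "which model" becomes a per-station decision you can read in one place, instead of a default baked into every tool.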
The arch above all four: the value stream
If team, tools, context, and models are the vertical slots, the value stream is the horizontal arch sitting over them. The picture in my head looks roughly like this:
analyse -> design -> build -> verify -> ship -> operate
   ^         ^         ^        ^        ^        ^
   |         |         |        |        |        |
   v         v         v        v        v        v
=============== shared context bus (MCP) ===============
Context is not one box on the diagram. It is the bus underneath. Every station reads from it and writes back to it. MCP (or whatever successor protocol ends up winning) is the wiring. It exposes role-appropriate slices of the corpus to role-appropriate tools. The analyst's agent does not need to see the runbooks. The QA agent does not need the revenue model. Both need to write back what they learned, so the next station starts with something better than last time.
Martin Fowler has been pointing at this piece for years, under a different name. His definition of delivery [4] is "the steps from a developer finishing work on a new feature, to that feature being used in production." Continuous delivery, value stream thinking, DevOps, all attempts to get the whole arch moving together instead of six disconnected stations. The delivery harness is the same instinct updated for a world where some of the stations are staffed by agents.
Each station produces cognitive work, and the best version of that work, for the purposes of the rest of the pipeline, is text. LLM-friendly, human-friendly. Whether a human produced it, or an AI agent, or a pair of the two, matters only secondarily. The work has to be done. The form matters more than the author.
That last bit is the part that sneaks up on people. If your analyst produces a Word document buried in SharePoint, your designer produces a Figma file with no notes, and your architect produces a wall of diagrams with no prose, the agent harness at the build station has nothing to eat. Garbage in, garbage out, scaled up by token counts.
Cross-cutting concerns live in the tools
Security, privacy, compliance, observability. These cut across every station. You do not want each role or each agent to re-solve them.
This is where the tooling layer earns its keep. Put the data classification in the tools that ingest. Put secrets handling in the tools that deploy. Put the redaction pipeline in the tools that feed models. Put auditability in the tools that orchestrate. The prompt is not where you enforce GDPR. The prompt is where you remind the model to behave because you already enforced GDPR upstream.
Run the thought experiment. A team puts "please do not include PII" at the top of an agent prompt and ships it. The model reads the instruction and obediently redacts what it sees as sensitive in the conversation, then happily quotes back PII that reached it through a retrieval tool, an MCP resource, or a file the orchestrator loaded. The prompt did exactly what it said. The prompt was not the right place for the rule. The fix is boring: scrub the inputs before they reach the model. Once the tool owns it, the prompt no longer has to.
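A minimal sketch of what "the tool owns it" could look like, with two illustrative regex patterns standing in for a real PII classifier.

```python
import re

# Scrub inputs in the tool layer, before they reach the model, so the prompt
# does not have to carry the privacy rule. These two patterns are
# illustrative only; a real deployment would use a proper PII classifier.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

def load_resource(raw: str) -> str:
    """Every retrieval result passes through scrub() before the orchestrator
    appends it to the prompt -- the rule lives here, not in the prompt."""
    return scrub(raw)

print(load_resource("Ticket from jane.doe@example.com, SSN 123-45-6789"))
# Ticket from [EMAIL REDACTED], SSN [SSN REDACTED]
```

Because the scrubbing sits in the ingestion path, it catches the retrieval tool, the MCP resource, and the orchestrator-loaded file alike, which is exactly where the prompt-only version failed.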
Privacy and licensing are the cross-cutting concerns most legal reviews get stuck on. Worth being concrete about what they mean inside a harness.
What is sent to whom. Every prompt is an egress event. The useful question is not "do we use AI," it is "which bytes leave our tenant, into which endpoint, hosted where." A code suggestion that round-trips through a US-hosted model is a different risk profile from the same model served out of Frankfurt. The harness has to make that choice visible per station, not bury it in a default.
Where context is stored. The shared context bus is great until you ask where it lives. A vendor-hosted vector store full of your customer's domain data is a contract with that vendor. A self-hosted MCP server against a local knowledge base is a different contract with yourself. Neither is wrong. Both have to be a deliberate call, not an accident of whichever tutorial someone followed on a Thursday.
Licensing and agreements. Model terms of service vary wildly. Training-data opt-outs, retention windows, audit logs, indemnification for generated code, all of it lives in the fine print. On top of that, your own customer contracts constrain what can leave their tenant. If the contract says "no customer data leaves our infrastructure," the harness is the thing that enforces it, by refusing to route certain slices of context to certain models.
The harness does not decide any of this for you. Its job is to make it configurable. Different customers get different wiring. A bank and a consumer SaaS will run the same value stream with completely different egress rules, and whoever looks at the config should be able to see which is which.
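One way the wiring could be made visible, with hypothetical customer names, regions, and slice prefixes.

```python
# Illustrative egress wiring: same value stream, different rules per
# customer. Customer names, regions, and slice prefixes are hypothetical.
EGRESS_RULES = {
    "bank": {
        "allowed_regions": {"eu-frankfurt"},
        "offsite_slices": set(),  # no customer data leaves their infrastructure
    },
    "consumer-saas": {
        "allowed_regions": {"eu-frankfurt", "us-east"},
        "offsite_slices": {"requirements/", "adr/"},
    },
}

def may_route(customer: str, slice_prefix: str, endpoint_region: str) -> bool:
    """The harness refuses the route instead of trusting the prompt."""
    rules = EGRESS_RULES[customer]
    return (endpoint_region in rules["allowed_regions"]
            and slice_prefix in rules["offsite_slices"])

assert not may_route("bank", "requirements/", "us-east")
assert may_route("consumer-saas", "adr/", "eu-frankfurt")
```

Whoever reads this config can see at a glance which customer is the bank: the empty `offsite_slices` set is the "nothing leaves our tenant" clause, enforced rather than promised.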
What a concrete harness looks like
If I sketch the delivery harness for a team shipping a typical SaaS product, it ends up something like this. Yours will look different, which is the point.
Analyse. Product managers and analysts own this, with an MCP-connected agent that reads the ticketing system, pulls customer support transcripts, and drafts a requirements brief. The brief goes into the context corpus. Model: something reasoning-capable, called sparingly.
Design. UI and UX folks, with an agent that can read the context and the design system over MCP, draft flow diagrams, and produce a short design rationale as text. Figma still renders the pixels, but every decision is also captured in prose next to it, written back into the corpus.
Build. The agent harness. Ralphctl in my case. Takes refined tickets, generates tasks, executes them across repos, verifies with a check script, evaluates the output. This is the one station already well-tooled today. Its agents read the same corpus the analyst wrote into.
Verify. A QA agent that drives integration tests against the preview deployment, reads the refined requirements, and surfaces gaps. Humans still own exploratory testing, accessibility, and anything with real stakes.
Ship. Your CD pipeline, with an agent that can write release notes, tag versions, and draft the incident-response runbook if something is different in this release. DevOps engineers own the pipeline itself.
Operate. Observability plus an agent watching the feedback loop: log patterns, metric shifts, support tickets. Summarised back into the context corpus as text, so the next analyse phase starts with real data.
Every station reads and writes the same context corpus, over a protocol like MCP that makes role-appropriate slices available to the tool best suited for that role. That is the spine of the thing. Without it, you have six disconnected tools and a lot of hope.
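The spine can be sketched as a loop over stations, each reading the corpus and writing its artifact back. Station names follow the diagram above; everything else is a placeholder.

```python
# Sketch of the spine: every station reads context, produces a text
# artifact, and writes it back for the next station. Artifact contents
# are placeholders.
corpus: dict[str, str] = {"requirements/brief.md": "initial brief"}

STATIONS = ["analyse", "design", "build", "verify", "ship", "operate"]

def run_station(name: str, corpus: dict[str, str]) -> None:
    context = "\n".join(corpus.values())   # role-appropriate slice, simplified
    artifact = f"{name} output, derived from {len(context)} chars of context"
    corpus[f"{name}/output.md"] = artifact  # write-back for the next station

for station in STATIONS:
    run_station(station, corpus)

# Each later station starts with more context than the one before it.
assert len(corpus) == 1 + len(STATIONS)
```

Strip out the write-back line and you get the failure mode from the paragraph above: six disconnected tools and a lot of hope.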
Notice what this does not say. Nothing here forces a human out of a station. If your designer wants to keep sketching in Figma and skip the agent entirely, the station still produces its artifact, the context corpus still gets written, and the next station still has something to eat. AI-augmentation is a slot, not a requirement. The harness cares about the shape of the output, not who produced it.
Why this matters now
Models are getting more capable across domains, not just code. The jump from "write me a function" to "read this PRD, reconcile it with the ADRs, draft the verification criteria" is smaller than it sounds. We're already there for some tasks. We're close enough that designing for it makes sense.
And here is the inconvenient part: the productivity gain from an agent harness alone caps out. You can only write code so much faster than you can think about what code to write. The next order of magnitude isn't in the build station. It's in the rest of the stream.
That's what makes "delivery harness" worth naming. It puts the conversation in the right place. Not "which IDE plugin should I use" but "what does our value stream actually look like, and which parts are ready to be augmented."
The machinery, not the magic
I don't think "delivery harness" is a new category of software. I don't want to sell you a product. It's a mental model, and a framing I've found useful when I'm sitting with a team trying to figure out what to do next.
The pieces exist. Your team. Your tools. Your context. Your models. Your value stream. The harness is what you build to plug them together. Nobody else's harness will fit your org because nobody else's constraints are yours.
The agent harness is one station on that stream. It's the one I happen to have built. The next interesting work, for me at least, is the stations on either side of it.
I'll write about those as I build them. Expect the word "context" to show up a lot.
Footnotes
1. GitHub, What is SDLC?. Textbook definition of the lifecycle phases, used here only to anchor the claim that the loop has not changed.
2. AWS, What is SDLC?. Same loop, different vendor. Worth noting the two biggest dev-tool vendors agree on what the phases are even while disagreeing on most other things.
3. Andrej Karpathy, The LLM wiki (gist). Argues for treating the wiki as a persistent, compounding artifact that humans and models both read and write: "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase."
4. Martin Fowler, Software Delivery Guide. Defines delivery as "the steps from a developer finishing work on a new feature, to that feature being used in production." The value-stream framing in this post borrows heavily from Fowler's continuous delivery writing.