Estimation in the age of AI (or: how many prompts is a delete dialog?)
Suppose you're in a sprint planning session. Someone opens the ticket: "add a confirmation dialog before deleting a record." Four engineers, a PM, a designer, a tech lead. On the whiteboard behind you, someone has written the word "velocity" in big letters, circled twice.
Sounds like an afternoon. Maybe a 3.
Except then someone asks whether delete is permanent or soft. Nobody is sure. The PM says "just a dialog" but the designer wants to check the design system first. One engineer mentions that the mobile app handles delete differently. Another quietly opens a new tab to look at what Notion does. Fifteen minutes in, you are now having a product discussion about soft-delete, undo flows, data retention policy, and, this is real, "what does delete even mean for our users."
The ticket is still open on the screen. It still says "add a confirmation dialog."
You settle on 8. Deliver it in a day and a half. Three weeks later someone files a bug because the dialog doesn't appear on the bulk-delete endpoint, which apparently exists, and nobody in that planning session knew about it.
Sound familiar? If not, give it a sprint or two.
Now add AI to the picture. Same meeting, same ticket, same confused room, but now someone says: "well, if I use Claude it's probably a 3, but if I pair-program it with a junior it's still an 8, and if we factor in the review time it's back to a 5, maybe." Progress.
Next sprint planning, ask your team: what exactly are we pointing? If nobody can answer without looking at each other nervously, this post is for you.
What are we even estimating?
The first honest question in any estimation session is the one nobody asks out loud. Are we estimating complexity? Effort? Calendar time? Risk? All of the above, blended into a single vague number so management has something to put in a Gantt chart?
In classical Scrum, story points were designed to capture complexity plus uncertainty plus effort, all properties of the work itself, largely independent of who does it (within the team's velocity baseline). That was already a stretch, but it held because human engineers are roughly fungible at the skill level a team shares. An 8 for you was an 8 for me. Same hours, same coffee supply, same limited typing speed.
AI-augmented engineering breaks that assumption. The same ticket now has very different effort profiles depending on: how well the developer can prompt, which model they're using (Sonnet vs Opus vs Haiku), whether the repo has a CLAUDE.md or any reasonable context setup, whether the task fits in a context window, and whether the agent can run autonomously or needs hand-holding every third step.
Ron Jeffries, who had a hand in inventing story points, has since said he wishes he hadn't.[1] His argument: the moment you start comparing velocity between teams or treating points as a promise, you've lost the plot. The quote I keep coming back to: "A team that is focusing on velocity is not focusing on value." That stings, because I've been on teams that lived and died by the burndown chart.
So what exactly are we estimating? Nobody is quite sure, and that's the awkward silence at the center of most AI sprint planning sessions.
Estimation is really just disguised experience
Here's the part that took me embarrassingly long to notice. Most of what we call estimation isn't estimation at all. It's scoping. An engineer who has already built the same endpoint three times isn't guessing how long a fourth one will take. They're remembering. A standard REST endpoint with a SQL query behind it has a known shape, a known footprint, a known set of surprises. That's not a forecast. That's a recipe.
The uncertainty only really lives in the things you haven't done before. A new algorithm that doesn't exist yet. A performance optimization where you genuinely don't know which layer is slow. A migration across two systems that have never spoken to each other. Those are the tickets where the number on the card is a polite fiction wrapped around "we'll find out."
Then there's the other question nobody wants to sit with: am I estimating the complexity of the task, or the complexity of navigating somebody else's mess? Because the answer changes everything. "Add a field to the user model" is a 1 in a clean codebase and a week in one where the user model has sixteen indirect dependencies, three of which are in a repo you don't have access to yet. Same ticket, same words, two different planets.
AI is supposed to help here, and in refinement it sometimes does. A decent agent can read a ticket from a few angles and flag missing requirements, edge cases, questions the humans in the room forgot to ask. I've seen it work. What I'm less sure about is whether clearer requirements actually help you estimate the effort, or whether they just expose how much you didn't know you didn't know. Both outcomes are useful. Only one makes the planning meeting feel better.
The scale wars
If you've worked in agile for longer than a year, you've probably sat through at least one scale debate. The candidates usually go like this.
Numbers (1 to 10). Feels natural, but people anchor on "5 is average" and you end up with a cloud of fives. Useless.
Fibonacci (1, 2, 3, 5, 8, 13, 21). Sounds smart. The gaps force you to commit to a rough bucket. I like it, but I've also seen people spend ten minutes deciding whether something is a 13 or a 21, which defeats the whole point.
T-shirts (S, M, L, XL). Lower ceremony, great for early discovery. Falls apart the moment finance asks "how many L's are in a month." Also, an XL at H&M is not the same XL at Ralph Lauren, and your team will find this out.
Then there's the creative tier.
Fruits. Legend has it a team somewhere used fruits. A banana was small. A pineapple was big. I've never been able to get a straight answer on what a mango was supposed to mean, and I suspect the people who used it couldn't either.
Pizza sizes. Imagine a team that picked personal, medium, large, catering order. A catering order is obviously supposed to be a red flag for splitting the ticket. I imagine they kept shipping catering orders anyway, because of course they did.
Dog breeds. Chihuahua to Great Dane. I'm going to be honest, I made this one up just now, but I'm confident it has happened somewhere, and I'm equally confident that two people on that team had very different lived experiences of what a Labrador actually is.
None of these scales are wrong. They're all different costumes for the same conversation: roughly how scared should we be of this ticket. Pick the one your team uses without complaining, and move on.
The temporal ceremony parade
Scrum has its liturgy. You plan at the start of the sprint. You stand up every morning to report what you're doing. You refine the backlog somewhere in the middle. You demo at the end. You retrospect right after. In theory, it's a closed loop. In practice, most teams I've worked with treat three of those as optional and the other two as mandatory but rushed.
The ceremony that actually earns its keep for estimation is refinement, and of course it's the one people tend to skip. Refinement is where you break tickets down small enough to estimate honestly. Skip it and planning becomes a guessing contest. Do it properly and estimation almost takes care of itself, because everyone in the room already shares context.
The connection to what comes next should be obvious: once AI enters the conversation, you need a shared vocabulary for the thing you're estimating. And most teams don't have one.
Six units, six ways to be wrong
Once AI entered the picture, the estimation unit conversation got genuinely complicated. Here are the candidates, with a note on where each one breaks.
Story points (complexity). The classic. Complexity is no longer correlated with developer effort when an agent does the heavy lifting. A "13" in complexity might take a skilled prompter twenty minutes to set up and review. The same ticket might take a less experienced prompter two days of iteration. Same complexity, completely different effort. The number lies.
Developer hours. What management actually wants. Time collapses with AI assistance. A feature that used to take ten hours of careful coding might need twenty minutes of prompting and two hours of review. Your historical data is now historical fiction. Your planning poker habits, built on years of developer-hour intuition, are calibrated to a world that no longer exists.
Prompt effort. How skilled does the prompting need to be? Is this a single-shot request or a twenty-turn negotiation with three model restarts? Sounds reasonable until you try to estimate it before running the agent. You can't. The only way to know how hard something is to prompt is to start prompting it. Which is exactly what estimation was supposed to let you avoid doing first.
Token budget. The compute cost of the agent run. Useful for cost forecasting. Completely disconnected from business value and sprint velocity. "This feature costs 1.2M tokens" means nothing to a product manager, and rightfully so.
Agent autonomy score. How much hand-holding does the agent need on this ticket? Low autonomy means the developer is effectively the agent, back to hours. High autonomy means the developer reviews output and ships. A useful dimension to think about, but not a planning unit. You can't put "medium autonomy" on a Gantt chart and have anyone know what it means.
Verification effort. Arguably the real cost of AI-assisted engineering: the time spent reviewing, testing, and correcting agent output. Especially high for auth systems, payment logic, or anything where the agent's confident wrong answer causes actual damage. The problem is you can't estimate this without running the agent first and seeing what it produces. This is also the cost that gets quietly forgotten in every "5x faster" pitch, and the one that bites you.
The cone that nobody looks at
Steve McConnell's Cone of Uncertainty[2] is one of those ideas that sounds obvious once you've seen it, and is still weirdly absent from most sprint planning rooms. The short version: early in a project, your estimates are allowed to be wrong by a factor of two to four. As you learn more, the cone narrows. By the time you're halfway through, a decent team should be within plus-or-minus twenty percent of the truth.
The shape repeats at every scale. It shows up across an entire product lifecycle, across a single quarter, and across a single sprint, where on Monday morning I have no idea how the sprint will end, by Wednesday I have a much better idea, and by Friday if I still don't know, something is wrong.
The part McConnell is careful about, which gets lost when people retell it, is that the cone is a best case. Not a guarantee. It's roughly the smallest error a skilled team can expect, and you can absolutely do worse.
The useful thing about the cone isn't that it makes you precise. It gives you permission to be imprecise on purpose. When a stakeholder asks for a firm number in week one of a six-month project, "plus or minus three months" is the honest answer, and you can point at the cone to defend it without sounding like you're dodging the question.
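The cone turns into a practical habit once you express it as arithmetic. A minimal sketch, using multipliers adapted loosely from McConnell's published figures (the exact phase names and values here are approximations, not the canonical table):

```python
# Rough Cone of Uncertainty multipliers: (low, high) factors to apply
# to a point estimate, depending on how far along the project is.
# Values are approximate, adapted from McConnell's published figures.
CONE = {
    "initial concept": (0.25, 4.0),
    "requirements complete": (0.67, 1.5),
    "design complete": (0.8, 1.25),
}

def estimate_range(point_estimate_weeks: float, phase: str) -> tuple[float, float]:
    """Turn a single point estimate into an honest (low, high) range."""
    low_mult, high_mult = CONE[phase]
    return (point_estimate_weeks * low_mult, point_estimate_weeks * high_mult)

# "Eight weeks" at kickoff is honestly "two to thirty-two weeks":
print(estimate_range(8, "initial concept"))   # (2.0, 32.0)
# The same estimate after design is pinned down much tighter:
print(estimate_range(8, "design complete"))
```

The point isn't the specific multipliers. It's that quoting a range forces the conversation the single number was letting everyone avoid.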
Then AI showed up, and things got worse
For years, agile writers have been telling teams that velocity is a trap, that story points are a communication tool rather than a contract, that estimation isn't prediction. Teams mostly nodded and kept doing it the old way because management liked the numbers.
AI was supposed to help. Copilot, Claude, Cursor, the whole cast. "We can ship features 5x faster" was basically the slogan of 2024. And sometimes it's true. I've used Claude Code to knock out a migration in an afternoon that I had mentally budgeted for three days. I've written about what that workflow actually looks like in practice if you want the details. The short version: faster, yes, but not magic, and the estimation problem was still there waiting at the end.
But the tools didn't fix estimation. They broke it a little more.
Because now the question isn't just "how complex is this ticket." It's also: which engineer is picking it up and how comfortable are they with the stack? Which model are they using? Does the repo have a CLAUDE.md or any context setup at all? Does the task fit in a context window, or will the agent start hallucinating file paths halfway through? Will it run autonomously or need guidance every third step? How much review time will the AI-generated code actually need?
The 2025 DORA report[3] puts AI tool usage among developers at 95%. Over 80% say it makes them more productive. Delivery performance, meanwhile, hasn't moved.[4] Nobody has figured out how to plan work around that gap.
We're not estimating work anymore
Here's the thesis I've landed on, and I'm not certain it holds: we're not estimating work anymore. We're estimating unknowns.
In AI-augmented teams, the question shifts from "how hard is this to build" to four different things:
Context readiness. Is the codebase agent-friendly? Clear module boundaries, reasonable test coverage, enough documentation that an agent can navigate without hitting dead ends every other turn. Poor context readiness turns a 3 into a 13 without warning. Making this an explicit refinement question is, I think, one of the higher-leverage habits a team can pick up.
Verification complexity. How risky is it to trust the agent's output without deep review? A UI copy change is low. Payment logic is high. Auth is high. Anything touching user data is high. This maps to "how bad is it if the agent is confidently wrong?" The answer should directly influence how you estimate the ticket.
Prompt iteration budget. Single-shot or multi-turn negotiation? Features that sit in the agent's strengths (refactoring, test generation, well-specified migrations, boilerplate) tend toward single-shot. Features requiring nuanced judgment or unclear requirements tend toward negotiation. You usually know which one you're dealing with after proper refinement. If you don't, refinement wasn't done.
Human judgment gates. Which decisions can't be delegated? Architecture choices, product direction, edge case definition, anything where the stakes are high and the agent's confident answer is the wrong one. These cost real human time regardless of how good the tooling is, and they're the first thing to disappear from estimates under deadline pressure.
And then there's the quiet one nobody wants to open: code quality. If the agent produces something that compiles, passes the tests, and reads like a cursed funhouse on the inside, you've shipped a ticket and quietly made the next one harder. The estimate for whoever touches this code next is now invisible debt, paid by whoever gets unlucky in the next refinement. I'm not going to pretend I have an answer for this. I'm just flagging that the problem exists, and that every team that doesn't look at it is going to meet it eventually.
A rough heuristic: green, yellow, red
Since nobody has a clean framework for this yet, here's one worth trying. Not a methodology. A conversation starter.
Green. The agent handles this autonomously. The developer sets up context, runs the prompt, reviews the output, and ships. Single-shot or close to it. Low verification complexity. Examples: adding a field to an existing form, writing tests for a well-defined function, updating dependencies, generating boilerplate from a clear spec.
Yellow. Multi-turn prompting. The agent does the heavy lifting but needs human direction at key decision points. Verification effort is moderate. The developer needs to understand what was built, not just whether it compiles. Examples: a new API endpoint with non-trivial business logic, a UI component requiring design judgment, migrations touching shared state.
Red. Human-first. The agent is support, not driver. Architecture decisions, security-sensitive code, features with unclear requirements, anything where a confident wrong answer causes real damage. The agent helps with research, scaffolding, and code generation within a human-designed structure. The human makes the calls.
When you estimate a ticket, say out loud which color it is. "I think this is Green, Claude can probably handle it autonomously and I'll review." "This feels Yellow, I'll need to guide it through the tricky part." "This is Red, I need to sit with it." That's more useful than a number, and it forces a conversation about risk and knowledge rather than just effort.
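If you want the color to be more than a vibe, you can encode the rule of thumb. A toy sketch, with hypothetical inputs I've invented for illustration (verification risk and requirement clarity, the two dimensions from the sections above):

```python
from enum import Enum

class Mode(Enum):
    GREEN = "agent-autonomous: set up context, review output, ship"
    YELLOW = "multi-turn: agent drives, human directs key decisions"
    RED = "human-first: agent assists with research and scaffolding only"

def triage(verification_risk: str, requirements_clear: bool) -> Mode:
    """Crude rule of thumb: the cost of a confidently wrong answer dominates.

    verification_risk: "low" | "medium" | "high"
    """
    if verification_risk == "high":        # auth, payments, user data
        return Mode.RED
    if not requirements_clear or verification_risk == "medium":
        return Mode.YELLOW
    return Mode.GREEN

# "Add a field to an existing form", spec is clear, low blast radius:
print(triage("low", True).name)    # GREEN
# "Payment retry logic", even with a perfectly clear spec:
print(triage("high", True).name)   # RED
```

Notice that requirement clarity never upgrades a high-risk ticket: a clear spec makes the agent faster, not safer.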
The unit conversion that never ends
No matter what unit you pick, someone eventually wants a date. This is the part of estimation that nobody enjoys and that every scale pretends it can avoid.
Story points become ideal days become calendar days become "so, end of Q2, right?" The conversion is never clean, because half the time gets spent on things nobody estimated in the first place. Meetings. Reviews. The PR that sat for four days waiting on a reviewer who was on vacation. The production incident nobody saw coming.
AI doesn't save you from any of this. It compresses some parts of the cycle (typing code) and leaves the others untouched (figuring out what to build, reviewing it, deploying it safely, talking to humans). The ratio of "thinking time" to "typing time" quietly shifts, and your old point-to-day conversion no longer holds.
The "it depends" problem
Every estimation framework has a caveat buried in the footnotes: "this works if your team is stable, your backlog is refined, your product owner is present, and your organization actually respects the retrospective." It works if everything else is already working.
When it doesn't work, somebody always stands up in the retro and blames "the process." There's a specific archetype here, and every team has one. The person with strong opinions about why the current setup is broken, no concrete alternative, and a vibe where a proposal should be. They want the numbers to feel better without changing how the work actually happens. They're not wrong that something is off. They're just not the right person to be holding the microphone about it.
The uncomfortable part is that estimation is a group skill, not a framework. Two teams can run the same Scrum playbook by the book and end up with totally different estimation cultures, because one has a shared sense of what "done" looks like and the other just argues about it. The framework is scaffolding. The conversation is the thing that actually ships work.
Five hints from somebody who has been burned
Take these with the appropriate seasoning. What worked for me has failed for other people.
- Refine more, plan less. The quality of your estimates is mostly decided before planning starts. If you're arguing about whether something is a 5 or an 8 during planning, you skipped refinement. Go back and break the ticket in half.
- Pick a scale and stop debating it. Fibonacci, t-shirts, pizza sizes, whatever. The scale matters less than the shared meaning your team has built around it. If a 5 means the same thing to everyone in the room, you've already won. Switching from Fibonacci to bananas won't fix a shared-meaning problem.
- Track throughput, not velocity. Count what actually shipped to production in the last four weeks. That's your forecast input. Throw out the graphs that compare sprints or teams. They aren't telling you what you think they are.
- Say the AI part out loud. When you estimate a ticket, say whether it's Green, Yellow, or Red. "I'll use Claude for the boilerplate and do the tricky logic by hand" is a much better input than "uh, 3." You're giving the room a realistic picture of where the time is going.
- Be honest about the cone. Early in a project, your estimate is wrong. That's fine. Put a range on it instead of a number, and update the range as you learn. "Somewhere between six and twelve weeks" is more professional than "eight weeks," even though it sounds less confident. The people who end up trusting you are the ones who watched your early ranges hold up.
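The throughput tip combines naturally with the range tip: resample what actually shipped and read off percentiles instead of promising a date. A minimal Monte Carlo sketch (the sample numbers are made up):

```python
import random

def forecast_weeks(weekly_throughput: list[int], backlog_size: int,
                   trials: int = 10_000, seed: int = 0) -> tuple[int, int]:
    """Resample observed weekly throughput until the backlog empties.

    Returns the 50th and 85th percentile number of weeks. Assumes every
    observed week shipped at least one item, otherwise a trial never ends.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(trials):
        remaining, weeks = backlog_size, 0
        while remaining > 0:
            remaining -= rng.choice(weekly_throughput)
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    return outcomes[trials // 2], outcomes[int(trials * 0.85)]

# Last four weeks shipped 3, 5, 2, 4 items; 20 items left in the backlog.
p50, p85 = forecast_weeks([3, 5, 2, 4], 20)
print(f"50% chance by week {p50}, 85% chance by week {p85}")
```

"85% chance by week seven" is a sentence a stakeholder can actually plan around, and it stays honest as the input data changes.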
None of this is solved
Estimation is a conversation about risk and knowledge, not output. That has been true forever. AI just made it impossible to hide.
The frameworks aren't wrong. The scales aren't wrong. The rituals, done properly, aren't wrong. What breaks is applying them without questioning whether the underlying assumptions still hold. The assumption that complexity maps predictably to effort was one of those. It broke quietly, and most of the discourse about AI productivity still hasn't caught up.
The tools don't determine the culture. The conversations do. Whether your team can say "I don't know how long this will take, but here's what I know and here's what I'll know by Friday," that's the thing. Not the scale. Not the velocity dashboard.
If a framework helps you have that conversation, use it. If it gets in the way, change it. And if someone at the back of the room wants to track velocity in tokens per day, let them run the experiment. Just don't put it in the quarterly report.
We're all guessing. The best we can do is guess out loud, together, and be wrong in useful ways.
Footnotes
1. Ron Jeffries, Story Points Revisited (2019). The co-inventor of story points explains, in his own words, why teams have largely misused them, and why he wishes he could take them back. The velocity quote is from this piece.
2. Steve McConnell, Software Estimation's Cone of Uncertainty. The original formalization of how estimate accuracy improves as a project progresses, and why early-phase estimates should carry explicit ranges rather than false precision.
3. Google Cloud / DORA, State of AI-assisted Software Development (2025). Survey of nearly 5,000 developers. The report's central finding: AI amplifies existing organizational strengths and weaknesses rather than fixing them.
4. Scrum.org community thread, How to approach story point estimation with AI dev acceleration tools. Practitioners debate whether to adjust points for AI-assisted work, drop them, or treat AI as a velocity multiplier. No consensus emerges. This, frankly, is the honest answer.