You cannot budget an AI agent the way you budget a server
The same agent task can cost 30x more on one run than another, spending more tokens does not buy more accuracy, and the model itself cannot tell you the bill before it runs. A 2026 study across 8 frontier models on SWE-bench Verified put numbers on all three. The cost is not high so much as it is unpredictable.
tsukumo
Short version: a server costs what it costs. You size it, you know the bill. An AI agent does not work like that, and most teams are still budgeting it like it does. The same task can cost 30x more on one run than the next. Spending more tokens does not buy you more correct answers. And if you ask the model what a job will cost before it runs, it cannot tell you, and it guesses low. A 2026 study put numbers on all three. The problem with agent cost is not that it is high. It is that it does not hold still.
Into input, on every single call, far more than into output.
The headline first. Agentic coding tasks consume on the order of 1000x more tokens than a code-reasoning or code-chat prompt, and the cost is driven by input tokens, not the answer the model writes back. Every step, the agent resends its context: the task, the history, the files it has looked at. That resend is the bill. We have written before about why coding agents cost so much and what rereading your docs costs, so I will not relitigate the drivers here. This piece is about a harder problem that sits on top of them: even once you know where the tokens go, you still cannot predict how many there will be.
“8 frontier LLMs analyzed on SWE-bench Verified. Finds agentic tasks consume ~1000x more tokens than code chat, input-driven; that the same task varies up to 30x between runs; that accuracy peaks at intermediate cost and saturates; and that models cannot predict their own token usage (correlation up to 0.39), underestimating systematically.”
Because the agent takes a different path every time, and you pay for the path, not the answer.
This is the number that should change how you budget. On the same SWE-bench Verified task, runs differed by up to 30x in total tokens. Not 30 percent. Thirty times. The agent that wanders, backtracks, rereads a file it already read, and tries three approaches before the one that works costs an order of magnitude more than the agent that walks straight to the fix, on the identical task.
So a per-task cost estimate is a lie of omission. It is one sample from a wide, heavy-tailed distribution. If you budget on the average, the tail runs eat your month. If you budget on the worst case, you over-provision by 30x. The honest unit of agent cost is not a number. It is a distribution, and you have to treat it like one.
up to 30x
variation in total tokens between runs of the SAME task
The agent's path, not the task's difficulty, sets the cost.
No. And this is the part that should stop the "just give it a bigger budget" reflex cold.
Higher token usage did not translate into higher accuracy. Accuracy often peaked at intermediate cost and then saturated. Past that point, the extra tokens bought more exploration and more rereading, not more correct answers. The expensive runs were not the smart ones. They were frequently the lost ones, circling.
That inverts the intuition most budgets are built on. We tend to assume cost tracks effort and effort tracks quality, so a pricier run must be a better run. Here the opposite shows up: beyond a knee in the curve, cost and correctness come apart. This is the same shape as the reliability story, where more capability does not mean more reliability. Spending more is not a quality lever. It is just a bigger invoice with the same answer at the bottom.
“The expensive runs are not the smart ones. Past the knee, cost and correctness come apart, and you are paying for the agent to circle.”
No. It cannot, and it thinks the job is cheaper than it is.
The study asked the models to predict their own token usage before running. They were bad at it: correlations with actual spend no better than 0.39, and a systematic bias toward underestimating. The agent that is about to spend a fortune does not know it is about to, and if you ask, it lowballs you.
There is a second crack in the forecast. Task difficulty as rated by human experts only weakly aligned with what the agents actually spent. The job that looks hard to a senior engineer is not reliably the job that costs the most tokens. So neither the model's self-estimate nor your team's gut read of complexity gives you a number you can put in a plan. The forecast has to come from somewhere else.
You stop budgeting a point and start governing a distribution. Four moves.
Budgeting a server vs. governing an agent
Server budgeting reflex
What an agent needs instead
Estimate one cost per task
Budget the distribution, plan for the tail
Bigger budget for better results
Cap spend; past the knee it buys nothing
Trust the spec / the quote
Trust measured logs; the self-estimate is 0.39
Watch the monthly invoice
Watch tokens against accuracy, per run
First, budget the distribution, not the average. Provision for the tail you can tolerate and decide in advance what happens beyond it. Second, set a hard token cap per run, because the 30x tail is where a single job quietly eats the month, and the model will not warn you. Third, attack the input side, since that is where the tokens are: the context you resend every call is the dominant cost, and resolving a question to one canonical answer instead of rereading is the lever that moves it. Fourth, measure tokens against accuracy, not against the invoice, so the moment spend climbs without correctness climbing with it, you can see it and stop paying for it.
None of this is possible if you cannot see per-run spend next to per-run outcome. That is the actual gap on most teams: the invoice is monthly and aggregate, the waste is per-run and individual, and the two never get put side by side. We built yoru to put them side by side, so the runaway run and the spend-without-accuracy run show up while they are happening, not at the end of the billing cycle.
Most cost conversations we walk into start at the model price, which is the small, stable, knowable number. The big, volatile, unknowable number is everything the agent does between the prompt and the answer, and that is the one nobody is watching. The fix is rarely a cheaper model. It is a token cap so the tail cannot hurt you, input discipline so the dominant cost stops growing, and observability so you are measuring spend against correctness instead of against the invoice.
One caveat I will put my name to. This is one study, on coding agents, on one benchmark, and the exact multiples will move as others run it. The shape will not. Agent spend is stochastic, it is non-monotonic with quality, and the model cannot forecast it. Those three hold regardless of which model tops next month's board, and they are enough to break a budget built like a server's.
If your agent bill is climbing and nobody can tell you which runs are buying correctness and which are just circling, that is the work we do.
Part of our proof index, the studies and benchmarks behind what we claim.
We make per-run spend visible next to per-run accuracy, cap the runaway tail, and cut the input cost underneath it.
Because agent spend is stochastic, not deterministic. A 2026 study of 8 frontier models on SWE-bench Verified found the same task can vary up to 30x in total tokens between runs. The agent explores, backtracks, and rereads differently each time, so a single per-task cost estimate is a point on a wide distribution, not a number you can put in a budget.
Do more tokens make an AI agent more accurate?
No. The same study found accuracy often peaks at intermediate cost and saturates at higher cost. Past a point, extra tokens buy more exploration and rereading, not more correct answers. Spending more is not a reliability strategy; it is just a bigger bill.
Can an AI agent predict its own token cost before running?
Poorly. Frontier models asked to forecast their own token usage managed correlations no better than 0.39 with what they actually spent, and systematically underestimated. You cannot ask the agent for a quote and trust it. The number has to come from measured logs.
How do you control AI agent costs in production?
Budget for the distribution rather than an average, set a hard token cap so a single runaway run cannot blow the month, attack the input side (context resent every call is the dominant cost), and track tokens against accuracy so you stop paying for spend that no longer buys correctness.