AI agents in production: the four operating problems that decide it
Agents rarely stall on the model. They stall on four operating problems that compound: reliability, observability, context, and cost. Fix them as a set or none of them holds.
tsukumo
Short version: the reason agents stall in production is almost never the model. It is four operating problems that compound: reliability, observability, context, and cost. Can you trust what the agent does? Can you see what it did? Can you feed it what is actually true? And can you afford to run it at fleet scale? They are one problem wearing four faces. Fix them as a set or none of them holds. This post is the map: the connective tissue between them, and where each one goes deep.
The seductive read is that the next model release closes the gap. It does not. A better model is a better actor inside the same broken operating model, and a better actor with no review gate, no audit trail, and stale context just makes confident mistakes faster.
They are not a checklist of separate fixes. They feed each other.
Bad context makes an agent both unreliable and expensive: it acts on stale truth, so the output is wrong, and it burns tokens rediscovering what it should have been handed. Without observability you cannot even diagnose the reliability problem, because you cannot reconstruct what the agent did or why. And cost is the tax you pay for the other three being unsolved: every reread, every reworked PR, every untraceable incident shows up on the token bill, multiplied across the fleet.
Picture the most ordinary version of this. An agent opens a PR against a module whose interface changed last week. The change is not in what the agent was handed, so it writes against the old signature. The output looks plausible, passes a shallow check, and merges. Three things just happened at once: the agent was unreliable (wrong code), it was expensive (the tokens to write it, plus the tokens to fix it later), and unless you had a trace you will spend an afternoon arguing about whether the model is dumb. The model was fine. The system around it was missing.
So you do not get to pick one. The table below is the whole argument on one page; the sections after it go deep on each problem.
The four operating problems on one page
| Problem | Symptom you see | Why a bigger model won't fix it | The lever that does | Proof tool |
| --- | --- | --- | --- | --- |
| Reliability | Confident, wrong merges | A smarter actor still has nothing checking it | Scoped permissions and review gates | WRAI.TH relay |
| Observability | You can't say what the agent did | A smarter actor is still a black box | Decision, action, cost, and gate traces | yoru |
| Context | Agent acts on stale truth | A bigger window holds more, not the right thing | Serve the canonical answer on demand | trovex |
| Cost | Token bill climbs, output doesn't | Cheaper tokens, same wasted work | Stop rereading the whole repo every session | trovex |
Reliability: it is the engineering around the model, not the model
Reliability is not a bigger model. It is the scaffolding you build so a non-deterministic actor cannot do damage: scoped permissions so an agent can only touch what its task needs, review gates that a human or a check must pass before anything merges, and context the agent can actually trust. Take those away and a smarter model just produces a more convincing wrong answer.
Here is the connective part. Reliability is downstream of the other three. An agent fed stale context is unreliable by construction, no matter how good the model is. An agent you cannot observe is unreliable in practice, because you cannot prove it did the right thing or catch when it did not. Reliability is the symptom everyone notices first, but its causes live in context and observability.
The right mental model is blast radius. A single agent with broad write access and no gate is one bad inference away from a wide mess; the same agent scoped to its task, with a check it can fail, is boring even when it is wrong. Reliability is not the absence of mistakes. It is making mistakes cheap and catchable.
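To make the blast-radius idea concrete, here is a minimal sketch of a task scope plus a merge gate. Everything in it is an assumption for illustration: the names `TaskScope`, `GateResult`, and `review_gate`, the glob-based scoping, and the example paths are hypothetical, and this is not how WRAI.TH relay or any particular tool implements it.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class TaskScope:
    """Paths one task is allowed to write to (hypothetical shape)."""
    task_id: str
    writable_globs: list[str] = field(default_factory=list)

    def allows(self, path: str) -> bool:
        # Note: `*` in fnmatch patterns also matches `/`,
        # so "billing/*.py" covers nested files as well.
        return any(fnmatchcase(path, pattern) for pattern in self.writable_globs)


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def review_gate(scope: TaskScope, changed_paths: list[str], checks_passed: bool) -> GateResult:
    """Block the merge unless every change is in scope and the checks passed."""
    reasons = [f"out of scope: {p}" for p in changed_paths if not scope.allows(p)]
    if not checks_passed:
        reasons.append("required checks failed")
    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical task: the agent may only touch billing code and its tests.
scope = TaskScope(
    task_id="fix-billing-rounding",
    writable_globs=["billing/*.py", "tests/billing/*.py"],
)
ok = review_gate(scope, ["billing/rounding.py"], checks_passed=True)
blocked = review_gate(scope, ["billing/rounding.py", "auth/session.py"], checks_passed=True)
```

In this sketch the second change set is rejected with a reason the agent or a human can read, which is the point: the mistake is caught at the gate and costs one failed check instead of a wide mess.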
Observability: you cannot trust what you cannot reconstruct
Application monitoring tells you the service is up, the latency is fine, and the error rate is green. It tells you nothing about whether a non-deterministic actor made a good decision. It does not record what the agent decided, which actions it took, what those actions cost, or whether it passed the gates it was supposed to. That is a different category of thing to watch.
This is the problem that makes the other three diagnosable or not. Without a record of what each agent did and why, the reliability problem is unfalsifiable: you cannot tell a model failure from a context failure from a permissions failure, so you guess, and you guess wrong. Observability is the microscope you point at the rest.
The question a trace has to answer is specific: what did this agent decide, what did it actually do, which tools did it call, what did the run cost, and did it pass the gate it was supposed to pass? Service dashboards answer none of those, because they were built for code that does the same thing every time. An agent does not. The first incident you cannot replay is the one that ends the rollout, and you find out which incident that is at the worst possible moment.
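One way to picture a trace that answers those questions is a single record per run. The sketch below is an assumption, not yoru's schema: `AgentTrace`, `ToolCall`, the field names, and the use of token counts as a stand-in for cost are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    tool: str
    arguments: dict
    tokens_in: int
    tokens_out: int


@dataclass
class AgentTrace:
    run_id: str
    decision: str                                          # what the agent decided
    actions: list[str] = field(default_factory=list)       # what it actually did
    tool_calls: list[ToolCall] = field(default_factory=list)
    gate: str = "not_run"                                  # "passed", "failed", or "not_run"

    def total_tokens(self) -> int:
        # Token count is used here as a rough proxy for what the run cost.
        return sum(c.tokens_in + c.tokens_out for c in self.tool_calls)

    def to_json(self) -> str:
        record = asdict(self)  # recursively converts nested ToolCall entries
        record["total_tokens"] = self.total_tokens()
        return json.dumps(record, sort_keys=True)


# Hypothetical run, with made-up tool names and token counts.
trace = AgentTrace(run_id="run-001", decision="update caller to the new parse() signature")
trace.actions.append("edited billing/rounding.py")
trace.tool_calls.append(
    ToolCall("read_file", {"path": "billing/rounding.py"}, tokens_in=1200, tokens_out=40)
)
trace.tool_calls.append(
    ToolCall("apply_patch", {"path": "billing/rounding.py"}, tokens_in=300, tokens_out=150)
)
trace.gate = "passed"
print(trace.to_json())
```

The design choice worth noting is that decision, actions, cost, and gate outcome live in one record keyed by run, so an incident can be replayed from a single line of JSON instead of being stitched together from service logs.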
We built yoru because we needed to see what our own agents did, and application dashboards could not show us. The condensed version lives at AI agent observability.
Agents lose the thread for a structural reason: a real repository does not fit in a context window, and a bigger window does not fix it. Stuffing more in does not mean the agent has the one thing that is true right now; it means it has more to wade through and more ways to anchor on something stale. The fix is not size. It is serving the current canonical answer on demand, so the agent works from what is true today instead of reconstructing it from fragments.
Context is the root that the other three grow from. Stale context is the most common cause of unreliable output. It is the single largest line on the cost bill. And when context is wrong, the observability trace just shows you, in high resolution, an agent confidently doing the wrong thing.
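As a rough illustration of "serve the current canonical answer on demand", the sketch below stores a short answer next to a fingerprint of the source it was derived from, and refuses to serve it once the source has changed. This is an assumption for illustration only, not trovex's API: `CanonicalContext`, `publish`, `lookup`, and the example signatures are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(text: str) -> str:
    """Content hash used to detect that a source changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class CanonicalEntry:
    answer: str        # the short, task-sized answer handed to the agent
    source_hash: str   # fingerprint of the source the answer was derived from


class CanonicalContext:
    def __init__(self) -> None:
        self._entries: dict[str, CanonicalEntry] = {}

    def publish(self, topic: str, answer: str, source_text: str) -> None:
        self._entries[topic] = CanonicalEntry(answer, fingerprint(source_text))

    def lookup(self, topic: str, current_source_text: str) -> Optional[str]:
        """Return the answer only if the source is unchanged since publish."""
        entry = self._entries.get(topic)
        if entry is None or entry.source_hash != fingerprint(current_source_text):
            return None  # missing or stale: the caller must refresh, not guess
        return entry.answer


# Hypothetical module whose interface changed after the answer was published.
old_source = "def parse(raw: str) -> dict: ..."
new_source = "def parse(raw: bytes, strict: bool = True) -> dict: ..."

ctx = CanonicalContext()
ctx.publish("billing.parse signature", "parse(raw: str) -> dict", old_source)

fresh = ctx.lookup("billing.parse signature", old_source)  # source unchanged
stale = ctx.lookup("billing.parse signature", new_source)  # source changed
```

Here the stale lookup returns `None` rather than the old signature, so the agent in the earlier PR example would be forced to refresh instead of writing against an interface that no longer exists.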
This is the one place we have a hard first-party number. We measured about 60% fewer tokens per doc lookup with trovex by serving the currently-correct slice for the task instead of letting every agent rediscover it. The short answer: how AI agents remember context.
Cost: the bill is mostly context, not the model price
The token bill rarely balloons because the per-token price went up. It balloons because the agent rereads the repository every session, carries a bloated context it does not need, and redoes work it got wrong the first time because it was acting on stale truth. The model price is the part you stare at. The waste is everywhere else, and it is multiplied across every agent in the fleet.
So cost is not really its own problem. It is the meter on the other three. Unsolved context shows up as rereads and rework. Unsolved reliability shows up as discarded PRs. Unsolved observability shows up as the hours you spend guessing instead of the minutes you would spend reading a trace. Drive the token bill down and what you are actually doing is fixing the upstream three.
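The arithmetic behind "cost is the meter" can be sketched in a few lines. The model below is a deliberate simplification, and every number in it is a placeholder chosen for easy arithmetic, not a measurement: it assumes each session pays for its context plus its work, and that a fraction of sessions are redone once at full price.

```python
def session_tokens(context_tokens: int, work_tokens: int, rework_rate: float) -> float:
    """Expected tokens for one session, counting redone work.

    rework_rate is the fraction of sessions whose output is discarded and
    redone once; a redo pays for context and work again.
    """
    base = context_tokens + work_tokens
    return base * (1 + rework_rate)


def fleet_tokens(agents: int, sessions_per_agent: int, per_session: float) -> float:
    """Scale one session's expected tokens across the whole fleet."""
    return agents * sessions_per_agent * per_session


# Placeholder inputs (hypothetical, not measured).
AGENTS = 20
SESSIONS_PER_AGENT = 10

# Agent rereads a large slice of the repo and redoes a quarter of its work.
before = fleet_tokens(
    AGENTS, SESSIONS_PER_AGENT,
    session_tokens(context_tokens=40_000, work_tokens=5_000, rework_rate=0.25),
)

# Same work, smaller served context, less rework because the context is current.
after = fleet_tokens(
    AGENTS, SESSIONS_PER_AGENT,
    session_tokens(context_tokens=8_000, work_tokens=5_000, rework_rate=0.125),
)

print(f"before: {before:,.0f} tokens, after: {after:,.0f} tokens")
```

Note that the per-token price never appears in this model. The work tokens are identical in both cases; the whole difference comes from context size and rework, then gets multiplied by the fleet.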
Now watch the four compound into one failure, because this is why pilots stall and rollouts quietly die. It runs as a loop.
No canonical context, so the agent acts on stale truth. Acting on stale truth, it produces unreliable output and burns tokens redoing the work. With no observability, no one can tell whether the failure was the model, the context, or the permissions, so it does not get fixed. Trust erodes, costs climb, and the rollout waits for a model upgrade that was never the problem. The pilot does not fail loudly. It just never earns its way onto the critical path.
The cruel part is that each problem hides the next. Fix reliability with more review and the cost problem gets worse, because now humans are in every loop. Fix cost by cutting context and reliability drops, because the agent knows less. The only move that compounds in your favor is fixing the root, context, while you instrument the rest so you can see what is happening. Pull one lever in isolation and you just move the failure somewhere you are not looking.
You break the loop by treating the four as one system, not four tickets. That is the shape of the work in AI in production consulting, and the failure pattern itself is the subject of why AI pilots stall.
We build and ship our own software with agent fleets, so we have hit every one of these in production: the agent that merged a confident mistake, the incident we could not reconstruct, the context that was stale before the run started, the token bill that climbed while output did not. The model is the part we change least. The grounding, the gates, the traces, and the canonical context are the work, and they are what turn an agent from a demo into something you trust on real systems.
If your agents work in the demo and stall in production, the gap is almost certainly in these four, and it is almost certainly more than one of them.
We map where your agents will help and where they will quietly stall, across reliability, observability, context, and cost.
What are the main problems with running AI agents in production?
Four, and they compound: reliability (can you trust what the agent does), observability (can you reconstruct what it did), context (can you feed it the currently-correct truth), and cost (can you afford it across a fleet). They are one problem wearing four faces. A bigger model does not fix any of them, because the failures sit in the operating layer around the model, not in the model itself.
Why do AI agent pilots stall?
They stall in a loop. With no canonical context the agent acts on stale truth, so it produces unreliable output and burns tokens redoing work. With no observability no one can say why it went wrong, so trust erodes and the bill climbs while the rollout waits for a fix that never lands. The pilot dies of compounding operating problems, not of a weak model.
What makes an AI agent production-ready?
The engineering around it, not the model. Production-ready means scoped permissions so the agent cannot touch what it should not, review gates a human or a check must pass before merge, observability so every action and its cost are reconstructable, and context the agent can trust. Skip any one and the others degrade, because the four operating problems feed each other.
Can you run AI agents in production without observability?
You can start, but you cannot keep it. Application monitoring watches service health, not a non-deterministic actor's decisions, actions, cost, and whether it passed its gates. Without that record you run agents on faith, and the first incident you cannot reconstruct ends the rollout. Observability is what lets you diagnose the reliability problem instead of guessing at it.
What is the biggest cost driver for AI coding agents?
Not the model price. The bill balloons from context: rereading the repository every session, bloated prompts, and rework when the agent acted on stale truth. Multiply that across a fleet and it dominates. We measured about 60% fewer tokens per doc lookup with trovex by serving the current canonical answer on demand instead of letting agents rediscover it every time.