What are the operating problems of AI agents in production?

The four operating problems on one page

Problem	Symptom you see	Why a bigger model won't fix it	The lever that does	Proof tool
Reliability	Confident, wrong merges	A smarter actor still has nothing checking it	Scoped permissions and review gates	WRAI.TH relay
Observability	You can't say what the agent did	A smarter actor is still a black box	Decision, action, cost, and gate traces	yoru
Context	Agent acts on stale truth	A bigger window holds more, not the right thing	Serve the canonical answer on demand	trovex
Cost	Token bill climbs, output doesn't	Cheaper tokens, same wasted work	Stop rereading the whole repo every session	trovex

Common questions

What are the main problems with running AI agents in production?

Four, and they compound: reliability (can you trust what the agent does), observability (can you reconstruct what it did), context (can you feed it the currently correct truth), and cost (can you afford it across a fleet). They are one problem wearing four faces. A bigger model does not fix any of them, because the failures sit in the operating layer around the model, not in the model itself.

Why do AI agent pilots stall?

They stall in a loop. With no canonical context, the agent acts on stale truth, so it produces unreliable output and burns tokens redoing work. With no observability, no one can say why it went wrong, so trust erodes and the bill climbs while the rollout waits for a fix that never lands. The pilot dies of compounding operating problems, not of a weak model.

What makes an AI agent production-ready?

The engineering around it, not the model. Production-ready means four things: scoped permissions, so the agent cannot touch what it should not; review gates, so every change must be cleared by a human or an automated check before merge; observability, so every action and its cost can be reconstructed; and context the agent can trust. Skip any one and the others degrade, because the four operating problems feed each other.
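To make the first two of those concrete, here is a minimal sketch of a scope check and a review gate. Every name in it (`Scope`, `Change`, `can_merge`) is hypothetical and invented for illustration; it is not the WRAI.TH relay's API, only one way the idea could be expressed.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Scope:
    """Glob patterns for the paths an agent may write to (hypothetical shape)."""
    writable: tuple[str, ...]

    def allows(self, path: str) -> bool:
        # fnmatch's "*" also matches "/", so "docs/*" covers nested paths too.
        return any(fnmatch(path, pattern) for pattern in self.writable)


@dataclass
class Change:
    """A proposed change from an agent, plus the state of its review gate."""
    paths: list[str]
    checks_passed: bool = False
    human_approved: bool = False


def can_merge(change: Change, scope: Scope) -> tuple[bool, str]:
    """Refuse anything out of scope, then require a human or a check to clear it."""
    out_of_scope = [p for p in change.paths if not scope.allows(p)]
    if out_of_scope:
        return False, f"out of scope: {out_of_scope}"
    if not (change.checks_passed or change.human_approved):
        return False, "review gate not passed"
    return True, "ok"
```

The scope check runs first on purpose: a change that touches files outside the agent's remit is rejected even if its checks are green, so a passing test suite cannot widen what the agent is allowed to do.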

Can you run AI agents in production without observability?

You can start, but you cannot keep it going. Application monitoring watches service health. It does not record a non-deterministic actor's decisions, actions, and cost, or whether it passed its gates. Without that record you run agents on faith, and the first incident you cannot reconstruct ends the rollout. Observability is what lets you diagnose the reliability problem instead of guessing at it.
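As a rough illustration of what "that record" might contain, the sketch below logs one event per agent step with the four fields named above: decision, action, cost, and gate result. The `TraceEvent` and `Trace` names and the field layout are assumptions made for this example, not yoru's actual schema.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class TraceEvent:
    agent: str
    decision: str      # why the agent chose this step
    action: str        # what it actually did
    tokens: int        # cost of the step, in tokens
    gate_passed: bool  # whether the step cleared its gate
    ts: float          # wall-clock time the event was recorded


class Trace:
    """Append-only, in-memory log of agent steps (hypothetical)."""

    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, agent: str, decision: str, action: str,
               tokens: int, gate_passed: bool) -> TraceEvent:
        event = TraceEvent(agent, decision, action, tokens, gate_passed, time.time())
        self.events.append(event)
        return event

    def total_tokens(self) -> int:
        return sum(e.tokens for e in self.events)

    def failed_gates(self) -> list[TraceEvent]:
        return [e for e in self.events if not e.gate_passed]

    def to_jsonl(self) -> str:
        # One JSON object per line, so the log can be stored and replayed later.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)
```

With a log like this, "why did it go wrong" becomes a query (`failed_gates()`) and "what did it cost" becomes a sum (`total_tokens()`), rather than a guess.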

What is the biggest cost driver for AI coding agents?

Not the model price. The bill balloons from context: rereading the repository every session, bloated prompts, and rework when the agent acted on stale truth. Multiply that across a fleet and it dominates the bill. We measured about 60% fewer tokens per doc lookup with trovex by serving the current canonical answer on demand instead of letting agents rediscover it every time.
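The difference between rereading and looking up can be sketched in a few lines. This is a toy model under loud assumptions: `CanonicalStore` is a hypothetical stand-in and says nothing about how trovex works internally, and `rough_tokens` counts whitespace-separated words as a crude proxy for a real tokenizer. It illustrates the shape of the saving, not the measured figure.

```python
def rough_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words, not a tokenizer."""
    return len(text.split())


class CanonicalStore:
    """Hypothetical store holding one current answer per topic."""

    def __init__(self) -> None:
        self._answers: dict[str, str] = {}

    def publish(self, topic: str, answer: str) -> None:
        # Latest write wins, so a lookup always returns the current answer.
        self._answers[topic] = answer

    def lookup(self, topic: str) -> str:
        return self._answers[topic]


def cost_of_rereading(docs: list[str]) -> int:
    """Tokens spent when the agent reads every document to rediscover an answer."""
    return sum(rough_tokens(doc) for doc in docs)


def cost_of_lookup(store: CanonicalStore, topic: str) -> int:
    """Tokens spent when the agent is served only the canonical answer."""
    return rough_tokens(store.lookup(topic))
```

The rereading cost grows with the size of the repository and is paid again every session, while the lookup cost depends only on the length of the answer. That is why the saving compounds across a fleet.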

AI agents in production: the four operating problems that decide it

How the four fit together

Reliability: it is the engineering around the model, not the model

Observability: you cannot trust what you cannot reconstruct

Context: a bigger window is not the fix

Cost: the bill is mostly context, not the model price

The loop that kills pilots

How we think about it
