Reliability isn't a model property you buy; it's an operating discipline you build. Demo success rates predict almost nothing: agents self-correct poorly, report false success, and degrade under load. The fix is gating on verified signals, not the agent's own word. Here's the discipline, with the evidence.
tsukumo
Short version: Reliability is not a property of the model you picked. It is a discipline you build around the agent. The uncomfortable part is that almost nothing you see in a demo tells you whether an agent will be reliable in production. Agents are bad at catching their own mistakes, they report success on work that failed, they collapse as you add tools, and they finish only a slice of genuinely hard real tasks. None of that shows up in a clean demo run. What makes the difference in production is an operating model: gate on verified outcomes, not the agent's word; keep the doer and the checker separate; measure with evals; and observe what actually happened. This page is the map of that discipline, with the evidence for each piece linked where it lives.
An operating discipline, not a model upgrade. The shift that buys you the most is to stop treating the agent's own report as a signal you can ship on. A reliable system gates on something the agent cannot fake: the verified state of the thing it was supposed to change, checked by something other than the agent that did the work. Everything below is a consequence of taking that seriously.
The reason this matters so much is that the obvious shortcuts all route the check back through the agent, and the agent is the one component you already know is unreliable.
Why doesn't a demo's success rate predict production reliability?
Because a demo measures capability on one run, and reliability is about what happens across many runs, under variation and load. They are different axes. A model can be capable and unreliable at the same time, and a demo only ever shows you the capable half.
Start with the metric itself. A single-run pass rate, the number most demos and benchmarks lead with, measures whether the agent can do the task once, not whether it does it consistently. "pass@1 measures capability, not agent reliability" walks through why a high single-run score and a reliable agent are not the same claim.
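To make the two axes concrete, here is a minimal sketch that scores the same set of runs two ways: once by whether the first run passed, and once by whether every repeated run passed. The task names and results are invented for illustration, and the "all runs pass" rate is just one simple way to express consistency, not a metric taken from the linked write-up.

```python
from typing import Dict, List


def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose first recorded run passed (capability)."""
    return sum(1 for results in runs.values() if results[0]) / len(runs)


def all_runs_pass_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on every recorded run (consistency)."""
    return sum(1 for results in runs.values() if all(results)) / len(runs)


# Hypothetical data: task id -> pass/fail for four repeated runs.
runs = {
    "refund-order": [True, True, True, True],
    "rotate-key": [True, False, True, True],
    "migrate-table": [True, True, False, False],
    "close-ticket": [False, False, False, False],
}

capability = pass_at_1(runs)
consistency = all_runs_pass_rate(runs)
print(capability)   # 0.75
print(consistency)  # 0.25
```

On this made-up data the agent looks 75% capable on a single pass, yet only one task in four succeeds every time. A demo shows the first number; production lives with the second.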
Then look at what happens on genuinely hard work. When agents were given real operations-research tasks with briefs, data, and constraints, the best setup finished a minority of them, and far fewer of the hard ones. The demo quietly skips that part. "AI agents finished a third of the real work" has the benchmark and the numbers.
Can an agent or an LLM judge verify its own work?
No, and this is the failure that quietly breaks most agent systems. There are two versions of the mistake, and both feel responsible.
The first is self-review. Ask an agent to check its own output and it mostly waves it through, because it treats a claim wearing its own role as already vetted. "Why your AI agent can't correct its own mistakes" shows that the same wrong claim gets corrected far more often the moment it is relabeled as someone else's, so the failure is structural, not a reasoning gap.
The second is the false-success problem. Agents assert that a task is complete when the environment says otherwise, and they do it often: up to 75.8% of self-assessing coding-agent runs in one 2026 study. The natural fix, an LLM judge reading the trajectory, scored near a coin flip at catching it. "Your AI agent says it succeeded, don't believe it" has the rates and explains why judges miss it.
What you can and can't trust as a reliability signal

| Signal | Trustworthy? | Why |
| --- | --- | --- |
| Agent's own success claim | No | It over-trusts its own role; reports false success |
| LLM judge reading the trajectory | Weak | Keys on confident language, near coin-flip at catching silent failure |
| A single demo or pass@1 run | No | Measures capability once, not reliability across runs |
| Eval on a golden task set, run repeatedly | Yes | Gradeable outcomes, surfaces variance |
| Verified environment state change | Yes | The agent cannot fake the thing it was supposed to change |
The pattern across both failures is the same. Any check that lives inside the agent's own role, or that grades the story instead of the outcome, inherits the unreliability it was meant to catch.
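A check that grades the outcome instead of the story can be very small. The sketch below assumes a toy environment where the agent was asked to set a timeout in a JSON config file. The `AgentReport` type, the `timeout_seconds` key, and the file name are all hypothetical; the point is only that the gate reads the file and never consults the agent's claim when deciding whether to ship.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentReport:
    claimed_success: bool
    summary: str


def verify_state(config_path: Path, expected_timeout: int) -> bool:
    """Checker: inspect the environment directly, ignoring the agent's report."""
    if not config_path.exists():
        return False
    try:
        config = json.loads(config_path.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(config, dict) and config.get("timeout_seconds") == expected_timeout


def gate(report: AgentReport, config_path: Path, expected_timeout: int) -> str:
    """Ship only on verified state; the report is used for labelling, not for the decision."""
    if verify_state(config_path, expected_timeout):
        return "ship"
    if report.claimed_success:
        return "block: false success"
    return "block: reported failure"


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "service.json"
    report = AgentReport(claimed_success=True, summary="Updated timeout to 30s.")
    # The agent claims success, but the file was never written.
    before = gate(report, config_path, expected_timeout=30)
    # Once the environment really changes, the same gate passes.
    config_path.write_text(json.dumps({"timeout_seconds": 30}))
    after = gate(report, config_path, expected_timeout=30)

print(before)  # block: false success
print(after)   # ship
```

In a real system the checker would query whatever the agent was meant to change (a database row, a deployed version, a ticket status), and it would run under a different role than the agent that did the work.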
How do you measure AI agent reliability?
With evals and observability, not vibes. Shipping an agent change because "it felt better in the demo" is how silent regressions reach production. An eval is a golden set of real tasks with gradeable outcomes, run repeatedly so you see the variance, not one lucky pass. "Evaluating AI agents in production" covers how to build that set and what to grade.
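As a rough illustration of "run repeatedly so you see the variance", the following sketch runs each golden task several times and reports a per-task pass rate. Everything here is an assumption made for the example: `GoldenTask`, `run_eval`, and the `StubAgent` (a deterministic stand-in that gets one task wrong on every third call) are hypothetical names, not part of any particular eval framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GoldenTask:
    task_id: str
    prompt: str
    grade: Callable[[str], bool]  # grades the outcome, not the agent's narrative


def run_eval(agent: Callable[[str], str], tasks: List[GoldenTask], trials: int) -> Dict[str, float]:
    """Run every task `trials` times and return the per-task pass rate."""
    rates: Dict[str, float] = {}
    for task in tasks:
        passes = sum(1 for _ in range(trials) if task.grade(agent(task.prompt)))
        rates[task.task_id] = passes / trials
    return rates


class StubAgent:
    """Deterministic stand-in for a real agent, wrong on one task every third call."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if prompt == "add 2 and 2":
            return "4"
        return "9" if self.calls % 3 else "8"


tasks = [
    GoldenTask("add", "add 2 and 2", lambda out: out == "4"),
    GoldenTask("multiply", "multiply 3 by 3", lambda out: out == "9"),
]
rates = run_eval(StubAgent(), tasks, trials=3)
# Tasks that pass sometimes but not always are the ones a single demo run hides.
flaky = sorted(task_id for task_id, rate in rates.items() if 0.0 < rate < 1.0)
print(rates)
print(flaky)  # ['multiply']
```

A single run of the "multiply" task would most likely have passed. Only the repeated runs show that it fails one time in three, which is the kind of regression worth tracking between agent changes.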
Measurement only helps if you can also reconstruct what happened. When agents run unsupervised and you wake up to merged changes and a token bill, observability is what lets you answer "what did they do, and why." AI agent observability is the other half of the reliability loop.
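One minimal shape for that record is an append-only log of structured events that can be filtered and replayed per run. The sketch below is an assumption about what such a trail might look like, not a description of any specific observability product. The `AuditTrail` class, its field names, and the example actions are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Dict, List


class AuditTrail:
    """Append-only JSON Lines log of what an agent did, replayable per run."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def record(self, run_id: str, step: int, action: str, detail: Dict[str, Any]) -> None:
        event = {"ts": time.time(), "run_id": run_id, "step": step,
                 "action": action, "detail": detail}
        # Open in append mode so earlier events are never rewritten.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def replay(self, run_id: str) -> List[Dict[str, Any]]:
        """Return one run's events in step order."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            events = [json.loads(line) for line in f if line.strip()]
        return sorted((e for e in events if e["run_id"] == run_id),
                      key=lambda e: e["step"])


with tempfile.TemporaryDirectory() as tmp:
    trail = AuditTrail(Path(tmp) / "audit.jsonl")
    trail.record("run-1", 1, "tool_call", {"tool": "git", "args": ["checkout", "-b", "fix"]})
    trail.record("run-2", 1, "tool_call", {"tool": "pytest", "args": []})
    trail.record("run-1", 2, "claimed_success", {"summary": "Tests pass."})
    actions = [event["action"] for event in trail.replay("run-1")]

print(actions)  # ['tool_call', 'claimed_success']
```

Recording the agent's success claim as just another event, next to the tool calls that preceded it, makes it possible to compare what the agent said against what it actually did.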
Put together, the discipline is short to state and hard to skip:
Gate on verified state, not self-report. Define success as a concrete change in the environment, and confirm it by inspecting that environment, not by asking the agent or a judge.
Separate the doer from the checker. The agent that did the work is the worst-placed thing to confirm it. Give confirmation to a different role: a separate agent, a test, or an external eval.
Measure reliability, not capability. Run evals on real tasks repeatedly. Track consistency and regressions, not a single demo pass.
Observe everything. Keep an audit trail you can replay, so a bad run is diagnosable instead of mysterious.
This is the operating model we build with teams running agents in production, and the one we run on our own fleet: doers and checkers as separate roles, gates on real outcomes, evals on actual tasks, and an audit trail for every run. We did not arrive at it from a whitepaper. We arrived at it by watching confident agents report success on work that had not happened, and deciding "the agent said it was fine" was not something we would ship on.
If your agents pass their own checks and still break things in production, you do not have a model to upgrade. You have an operating model to build. That's the work we do with teams.
We map where your agent system trusts itself instead of a verified signal, what that's shipping past you, and the reliability operating model that fixes it, on your stack.
Agents that pass their own checks and still break prod?
Not a bigger model. Reliability comes from an operating discipline around the agent: gating releases on verified outcomes rather than the agent's self-report, separating the doer from an independent checker, running evals on real tasks, and keeping an audit trail of what the agent did. A capable model with no operating discipline still fails in production.
Why doesn't a demo's success rate predict production reliability?
Because a demo measures single-run capability, not reliability under repetition and stress. Agents that look strong in a demo self-correct poorly, report false success, collapse as tool counts grow, and finish only a fraction of genuinely hard real-world tasks. Single-run pass rates and reliability are different axes, and only one of them shows up in a demo.
Can an agent or an LLM judge verify its own work?
Not reliably. An agent tends to trust claims wearing its own role, so self-review misses its own errors. And an LLM judge scoring a trajectory keys on surface proxies like confident language, scoring near a coin flip at catching false success. The reliable check is an external one against the actual environment state.
How do you measure AI agent reliability?
With evals: a golden set of real tasks with gradeable outcomes, run repeatedly so you see variance, not a single happy-path run. Pair that with observability so you can reconstruct what an agent did and why, and gate releases on verified state changes rather than the agent's own success claim.