Does a demo's success rate predict an agent's production reliability?

No. A demo measures one run on a clean task. A 2026 benchmark applied semantically equivalent task perturbations and watched success drop from 96.9% to 88.1% before injecting any tool faults. The headline number and the production number are different measurements, and only the stressed one tells you about reliability.
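A quick arithmetic check makes that gap more concrete. The sketch below uses only the two success rates quoted above; the variable names are illustrative, and nothing here comes from the benchmark beyond those two figures.

```python
# Success rates quoted in the text, in percent.
clean_success = 96.9
perturbed_success = 88.1

# Absolute drop in percentage points.
drop_points = round(clean_success - perturbed_success, 1)

# The same change seen from the failure side.
clean_failure = round(100 - clean_success, 1)
perturbed_failure = round(100 - perturbed_success, 1)

# How many times more often the agent fails once tasks are reworded.
failure_ratio = round(perturbed_failure / clean_failure, 1)

print(f"Drop: {drop_points} points")
print(f"Failure rate: {clean_failure}% -> {perturbed_failure}%")
print(f"Failures multiply by about {failure_ratio}x")
```

Read as failures rather than successes, the 8.8-point drop means the agent fails roughly 3.8 times as often on reworded tasks, from 3.1% of tasks to 11.9%.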

What does it mean to measure agent reliability?

It means measuring a surface, not a point. The 2026 ReliabilityBench framework scores three things: consistency across repeated runs, robustness to task wording changes, and fault tolerance under injected tool failures like timeouts and rate limits. Correctness is judged by the end state, not by whether the output text looks right.
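As a rough illustration of scoring those three axes, here is a minimal sketch. It is not the ReliabilityBench implementation: the `Agent` and `Checker` types, the function names, and the toy agents are all assumptions made so the example runs. The one idea taken from the text is that the check inspects the end state, not the output text.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types: an agent takes a task prompt and returns the final
# environment state; a checker inspects that end state, not the output text.
Agent = Callable[[str], dict]
Checker = Callable[[dict], bool]


def success_rate(agent: Agent, prompts: List[str], check: Checker, runs: int) -> float:
    """Fraction of (prompt, run) pairs whose end state passes the check."""
    results = [check(agent(p)) for p in prompts for _ in range(runs)]
    return mean(1.0 if ok else 0.0 for ok in results)


def reliability_report(
    agent: Agent,
    faulty_agent: Agent,
    prompt: str,
    paraphrases: List[str],
    check: Checker,
    runs: int = 3,
) -> Dict[str, float]:
    """Score one task on the three axes named in the text."""
    return {
        "consistency": success_rate(agent, [prompt], check, runs),
        "robustness": success_rate(agent, paraphrases, check, runs),
        "fault_tolerance": success_rate(faulty_agent, [prompt], check, runs),
    }


# Toy stand-ins so the sketch runs; a real agent would call a model and tools.
def toy_agent(prompt: str) -> dict:
    # Only succeeds when the wording happens to contain "close".
    return {"ticket_closed": "close" in prompt.lower()}


def toy_faulty_agent(prompt: str) -> dict:
    # Pretends every tool call failed, so the ticket never closes.
    return {"ticket_closed": False}


report = reliability_report(
    agent=toy_agent,
    faulty_agent=toy_faulty_agent,
    prompt="Close ticket 42",
    paraphrases=["Please close ticket 42", "Mark ticket 42 as resolved"],
    check=lambda state: state["ticket_closed"],
)
print(report)
```

The toy agent is perfect on the original wording and falls to half on the paraphrases, which is the kind of gap a single clean run cannot show.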

What breaks AI agents under production stress?

Tool and API faults break them, and the benchmark found rate limiting to be the most damaging. Real environments produce timeouts, rate limits, partial responses, and schema drift, none of which appear in a clean demo. An agent that never met a 429 in testing meets plenty in production.
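One way to make an agent meet a 429 before production is to wrap its tools in a fault injector. The sketch below is an assumption about how that could look, not the benchmark's harness: `RateLimitError`, `with_injected_faults`, `lookup_order`, and the probabilities are hypothetical names and values chosen for illustration.

```python
import random
from typing import Any, Callable


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from a tool."""


def with_injected_faults(
    tool: Callable[..., Any],
    rate_limit_prob: float,
    timeout_prob: float,
    seed: int = 0,
) -> Callable[..., Any]:
    """Wrap a tool so some calls fail the way production calls do."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed

    def faulty_tool(*args: Any, **kwargs: Any) -> Any:
        roll = rng.random()
        if roll < rate_limit_prob:
            raise RateLimitError("429 Too Many Requests (injected)")
        if roll < rate_limit_prob + timeout_prob:
            raise TimeoutError("tool call timed out (injected)")
        return tool(*args, **kwargs)

    return faulty_tool


def lookup_order(order_id: str) -> dict:
    # Toy tool; a real one would call an API.
    return {"order_id": order_id, "status": "shipped"}


flaky_lookup = with_injected_faults(lookup_order, rate_limit_prob=0.2, timeout_prob=0.1)

# Count how the wrapped tool behaves over many calls.
outcomes = {"ok": 0, "rate_limited": 0, "timed_out": 0}
for _ in range(1000):
    try:
        flaky_lookup("A-1001")
        outcomes["ok"] += 1
    except RateLimitError:
        outcomes["rate_limited"] += 1
    except TimeoutError:
        outcomes["timed_out"] += 1

print(outcomes)
```

Handing the agent `flaky_lookup` instead of `lookup_order` shows whether it retries, backs off, or quietly gives up. The fixed seed keeps a bad run reproducible.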

Does a bigger model make an agent more reliable?

Not automatically. In the benchmark, Gemini 2.0 Flash reached reliability comparable to GPT-4o at much lower cost, and the agent architecture mattered too: ReAct held up better than Reflexion under combined stress. Reliability came from how the agent was built and tested, not only from the size of the model.

tsukumo

Measuring AI agent reliability under stress (demo rates lie) · tsukumo


Agency · 1 July 2026 · 4 min read

Your agent's success rate is one lucky run

The success number from a demo is a single lucky run. A 2026 benchmark stressed agents the way production does, with task perturbations and injected tool failures, and watched success fall from 96.9% to 88.1% on perturbation alone. Reliability is a surface you measure, not a number you quote.


Short version: The success rate you saw in the demo is one run on a clean task with nobody throwing wrenches. A 2026 benchmark threw the wrenches. It perturbed the tasks and injected the kind of tool failures production actually produces, and success fell from 96.9% to 88.1% on the perturbations alone, before a single fault was added. That gap is the part the demo never shows you. Reliability is not a number you quote off one run. It is a surface you measure across repeats, task variations, and failures, and it only appears when you stress the agent the way production will.

You have shipped on a number like 96.9% before. It looked production-ready. Then real traffic showed up with rephrased requests, a rate-limited API, and a half-returned response, and the agent that aced the demo started missing. Nothing about the model changed. The conditions did.

The demo environment vs the one production runs in

Condition	Demo / benchmark default	Production reality
Task phrasing	One clean wording	Every paraphrase users send
Tool latency	Instant	Timeouts, slow responses
Tool availability	Always up	Rate limits (429), the most damaging fault in the benchmark
Tool responses	Complete and well-formed	Partial responses, schema drift
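The table above can be read as a test matrix. Here is a minimal sketch, assuming one axis per row and a handful of illustrative levels per axis; the axis and level names are hypothetical, not taken from the benchmark.

```python
from itertools import product

# One axis per row of the table. The level names are illustrative.
STRESS_AXES = {
    "phrasing": ["original", "paraphrase"],
    "latency": ["instant", "timeout"],
    "availability": ["up", "rate_limited_429"],
    "responses": ["complete", "partial", "schema_drift"],
}

# Every combination of levels is one test condition.
conditions = [
    dict(zip(STRESS_AXES, levels)) for levels in product(*STRESS_AXES.values())
]

# The demo corresponds to exactly one cell: the first level on every axis.
demo_condition = {axis: levels[0] for axis, levels in STRESS_AXES.items()}

print(f"Conditions to test: {len(conditions)}")
print(f"Covered by the demo: {conditions.count(demo_condition)}")
```

With these assumed levels there are 2 × 2 × 2 × 3 = 24 conditions, and the demo covers one of them. Running the agent several times in each cell is what turns a point into a surface.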


Does a demo's success rate predict production reliability?

What does it actually mean to measure reliability?

What breaks agents under production stress?

Does a bigger model make an agent more reliable?

How do you stress-test an agent before production?
