Agency · 1 July 2026 · 4 min read
Your agent's success rate is one lucky run
The success number from a demo is a single lucky run. A 2026 benchmark stressed agents the way production does, with task perturbations and injected tool failures, and watched success fall from 96.9% to 88.1% on perturbation alone. Reliability is a surface you measure, not a number you quote.
Short version: The success rate you saw in the demo is one run on a clean task with nobody throwing wrenches. A 2026 benchmark threw the wrenches. It perturbed the tasks and injected the kind of tool failures production actually produces, and success fell from 96.9% to 88.1% on the perturbations alone, before a single fault was added. That gap is the part the demo never shows you. Reliability is not a number you quote off one run. It is a surface you measure across repeats, task variations, and failures, and it only appears when you stress the agent the way production will.
You have shipped on a number like 96.9% before. It looked production-ready. Then real traffic showed up: rephrased requests, a rate-limited API, a half-returned response. The agent that aced the demo started missing. Nothing about the model changed. The conditions did.
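One way to picture "a surface you measure" is a grid: every task variation crossed with every failure mode, each cell run many times. The sketch below is a minimal illustration of that idea, not the benchmark's harness. The names `measure_surface`, `worst_cell`, and `toy_agent` are hypothetical, and the probabilities inside `toy_agent` are invented for the example; they are not the benchmark's figures.

```python
import random
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # (task variant, injected fault)


def measure_surface(
    run_agent: Callable[[str, str, random.Random], bool],
    variants: List[str],
    faults: List[str],
    repeats: int = 20,
    seed: int = 0,
) -> Dict[Cell, float]:
    """Run the agent `repeats` times per (variant, fault) cell and
    return the observed success rate for each cell."""
    rng = random.Random(seed)  # fixed seed so the measurement is repeatable
    surface: Dict[Cell, float] = {}
    for variant in variants:
        for fault in faults:
            wins = sum(
                1 for _ in range(repeats) if run_agent(variant, fault, rng)
            )
            surface[(variant, fault)] = wins / repeats
    return surface


def worst_cell(surface: Dict[Cell, float]) -> Tuple[Cell, float]:
    """Return the cell with the lowest success rate and that rate."""
    return min(surface.items(), key=lambda item: item[1])


def toy_agent(variant: str, fault: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for a real agent call.
    The probabilities are made up for illustration only."""
    p = 0.95
    if variant != "original":
        p -= 0.10  # assumed penalty for a perturbed task
    if fault != "none":
        p -= 0.20  # assumed penalty for an injected tool failure
    return rng.random() < p


if __name__ == "__main__":
    surface = measure_surface(
        toy_agent,
        variants=["original", "rephrased"],
        faults=["none", "rate_limit", "partial_response"],
        repeats=200,
    )
    for (variant, fault), rate in sorted(surface.items()):
        print(f"{variant:10s} {fault:17s} {rate:.1%}")
    print("worst cell:", worst_cell(surface))
```

The demo number corresponds to a single draw from the `("original", "none")` cell. Reporting the whole grid, and the worst cell in particular, is what separates a measured surface from a quoted number.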

