21 June 2026 · 5 min read
AI agents finished a third of the real work. The demo skips that part.
ORAgentBench gave 14 frontier agent setups 107 expert-reviewed operations-research tasks with real briefs, data, and constraints. The best one finished 35.5% of them overall, and 20.6% of the hard tasks. The failures weren't failures of reasoning. They were failures of the operational discipline a demo never has to show.
Short version: the agent demo always works. It picks a clean task, runs to a green checkmark, and the room nods. The question that demo never answers is what happens on real work, with real constraints, graded by someone who knows the domain. A 2026 benchmark called ORAgentBench went and checked, on 107 expert-reviewed operations-research tasks, across 14 frontier agent setups. The best one finished about a third. On the hard tasks, a fifth. Read that before you put an agent on your critical path.
What ORAgentBench measured
ORAgentBench (Li, Cai, Li et al., 2026) is a benchmark for a plain question its title asks directly: can LLM agents solve challenging operations-research tasks end to end? Operations research is a good stress test, because the work is unforgiving. A solution is either feasible against the constraints or it isn't, and "looks right" earns nothing.
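To make "feasible or it isn't" concrete, here is a minimal sketch of a feasibility check in Python. It is not ORAgentBench's grader. The `Constraint` class, the `is_feasible` function, and the two-product toy problem are all hypothetical, chosen only to show that a proposed solution either satisfies every constraint or fails outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One linear constraint: sum(coeffs[i] * x[i]) <= bound. (Hypothetical helper.)"""
    coeffs: tuple
    bound: float


def is_feasible(x, constraints, tol=1e-9):
    """Return True only if x is non-negative and satisfies every constraint."""
    if any(v < -tol for v in x):
        return False
    for c in constraints:
        lhs = sum(a * v for a, v in zip(c.coeffs, x))
        if lhs > c.bound + tol:
            return False  # a single violated constraint makes the plan infeasible
    return True


# Toy problem (assumed for illustration): two products share two resources.
constraints = [
    Constraint(coeffs=(2.0, 1.0), bound=100.0),  # machine hours available
    Constraint(coeffs=(1.0, 3.0), bound=90.0),   # labour hours available
]

plan_a = (40.0, 15.0)  # uses 95 machine hours and 85 labour hours
plan_b = (45.0, 15.0)  # uses 105 machine hours, over the limit of 100

print(is_feasible(plan_a, constraints))  # True
print(is_feasible(plan_b, constraints))  # False
```

Plan B looks almost identical to plan A, which is the point: near-misses get no partial credit under a check like this.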