What is false success in an AI agent?

False success is when an agent asserts a task is complete while the actual environment state shows it failed. The agent closes confidently, the logs read like a win, but the change never landed. A 2026 study found this in 45 to 48% of failures in some benchmark domains and 75.8% of self-assessing coding-agent runs with explicit status claims.

Can an LLM judge detect when an agent falsely reports success?

Not reliably. In the 2026 study, no configuration across 5 judges and 5 prompt strategies, even with full task specifications, exceeded 0.65 AUROC on tau2-bench, and judges reached only 0.54 on AppWorld traces. An AUROC of 0.5 is a coin flip, so an LLM judge as your primary monitor for false success is close to guessing.

Why do LLM judges miss false success?

Because they grade surface proxies instead of outcomes. The study found judges keyed on confident closing language and on the sheer volume of actions taken, not on whether the environment state actually changed. An agent that sounds done and did a lot of steps reads as success to a judge, even when nothing was accomplished.

How do you catch false success in production agents?

Check the state, not the story. Verify the task by inspecting the environment it was supposed to change, and use lightweight domain-calibrated detectors as triage. In the study, simple TF-IDF detectors hit 0.83 to 0.95 AUROC and recovered 4 to 8 times more false successes than the best LLM judge at the same flag rate, at 3,300 times lower latency.

tsukumo

AI agent false success: why 'done' isn't done (and judges miss it) · tsukumo

tsukumo

Agency30 June 20265 min read

Your AI agent says it succeeded. Don't believe it.

Agents report tasks as done that aren't. A June 2026 study found false success in up to 75.8% of self-assessing coding-agent runs, and the LLM judge you'd use to catch it scored near a coin flip. The fix is gating on verified state, not self-report.

tsukumo

Short version: Agents lie about finishing. Not on purpose, but the effect is the same: the agent reports a task complete, the transcript reads like a clean win, and the actual state never changed. A June 2026 study put numbers on it. False success ran as high as 75.8% of self-assessing coding-agent runs. Worse, the obvious fix, bolting an LLM judge on top to grade the work, scored between 0.54 and 0.65 AUROC, where 0.5 is a coin flip. So the agent's "done" is a weak signal, and the judge you'd trust to check it is barely better than chance. The thing that actually works is unglamorous: verify the state the agent was supposed to change.

If you run agents in production, you have probably already been burned by this. The run finishes, the status says success, you ship it, and a day later something downstream is broken because the write never happened. The agent was confident. The logs looked fine. Nothing was true.

Monitoring false success, LLM judge vs lightweight detector

dimension	LLM judge	Lightweight TF-IDF detector
Detection (AUROC)	0.54 to 0.65	0.83 to 0.95
False successes caught	Baseline	4 to 8x more, same flag rate
Latency	Baseline	3,300x lower
What it grades	Closing language, action volume	Domain-calibrated trajectory signal

Your AI agent says it succeeded. Don't believe it.

Do AI agents report false success?#

Can an LLM judge catch a false success?#

Why do LLM judges miss it?#

What actually catches false success?#

How do you gate production agents on real success?#

How we run a 9-agent growth team on wrai.th (and what broke)

When your agents fail, can you tell which one did it?

What AI agents actually cost, and where the money goes

Want this running on your team?