Is pass@1 enough to evaluate AI agents?

No. pass@1 measures whether an agent completes a task in a single attempt, on a short task. It says nothing about whether the agent stays reliable as tasks get longer. Khanal, Tao and Zhou (2026), reporting on 23,392 episodes, call pass@1 structurally blind to long-horizon degradation. For production you need reliability measured over the horizon, not a single-attempt capability score.

How do you measure agent reliability?

Over the horizon, not at one point. The Beyond pass@1 framework (2026) proposes four metrics: a Reliability Decay Curve (reliability against task length), a Variance Amplification Factor (how outcome spread widens), a Graceful Degradation Score (soft versus catastrophic failure), and a Meltdown Onset Point (the point where the agent stops coping). You also need observability on the running agent, so you see the slide before a user does.
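As a rough illustration of measuring over the horizon, the sketch below buckets episodes by step count and reports the success rate per bucket, which is one simple way to plot reliability against task length. The `Episode` record, the bucket width, and the 0.5 threshold used to flag a drop-off point are all assumptions made for this example. The framework's own definitions of the Reliability Decay Curve and Meltdown Onset Point may well differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run. Hypothetical record shape, not taken from the study."""
    steps: int      # task length in agent steps
    success: bool   # whether the run completed the task


def decay_curve(episodes, bucket_width=10):
    """Return {bucket_start: success_rate}, ordered by task length."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for ep in episodes:
        bucket = (ep.steps // bucket_width) * bucket_width
        totals[bucket] += 1
        wins[bucket] += ep.success  # True counts as 1, False as 0
    return {b: wins[b] / totals[b] for b in sorted(totals)}


def first_bucket_below(curve, threshold=0.5):
    """First bucket whose success rate falls below the threshold, else None.

    A crude stand-in for a drop-off point; the threshold is an assumption.
    """
    for bucket, rate in curve.items():
        if rate < threshold:
            return bucket
    return None


# Illustrative data only, invented for this example.
episodes = (
    [Episode(5, True)] * 9 + [Episode(5, False)] * 1
    + [Episode(15, True)] * 7 + [Episode(15, False)] * 3
    + [Episode(32, True)] * 2 + [Episode(32, False)] * 8
)
curve = decay_curve(episodes)
onset = first_bucket_below(curve)
```

On this invented data, a score taken only from the shortest bucket would look healthy, while the per-bucket view shows the success rate falling as runs get longer.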

Do memory scaffolds help long-horizon agents?

In the study, no. Khanal, Tao and Zhou (2026) found that naive memory scaffolds harmed long-horizon performance in every model tested. Piling history into the context window is context bloat, and it drags the model toward stale answers. The fix is a different shape of memory: one canonical answer retrieved on demand, a pointer to the current doc, instead of a transcript that grows with each step.
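The two shapes of memory can be contrasted in a few lines. This is a minimal sketch, not the scaffolds used in the study: `TranscriptMemory`, `CanonicalMemory`, and the `deploy` notes are hypothetical names invented for illustration.

```python
class TranscriptMemory:
    """Naive scaffold: every note is appended and replayed in full."""

    def __init__(self):
        self.entries = []

    def write(self, topic, text):
        self.entries.append((topic, text))

    def context_for(self, topic):
        # The whole history goes back into the prompt, superseded notes included.
        return "\n".join(text for _, text in self.entries)


class CanonicalMemory:
    """One current answer per topic; a new write replaces the old one."""

    def __init__(self):
        self.current = {}

    def write(self, topic, text):
        self.current[topic] = text

    def context_for(self, topic):
        # Only the current answer for the requested topic is retrieved.
        return self.current.get(topic, "")


transcript = TranscriptMemory()
canonical = CanonicalMemory()
for memory in (transcript, canonical):
    memory.write("deploy", "deploy with script v1")
    memory.write("billing", "invoices go out on the 1st")
    memory.write("deploy", "deploy with script v2")  # supersedes v1

transcript_context = transcript.context_for("deploy")
canonical_context = canonical.context_for("deploy")
```

In this toy run the transcript still carries the superseded v1 note plus an unrelated billing note, while the canonical store returns only the current deploy answer, and its size does not grow with the number of steps.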

What is the difference between agent capability and reliability?

Capability is whether an agent can do a task at all. Reliability is whether it keeps succeeding as the task runs long and repeats. They are separate axes. A model that tops a capability benchmark can rank low on reliability, because long tasks compound errors and fill the context window with stale history. pass@1 reads only capability, which is why the 2026 reliability study argues for measuring reliability on its own.
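The compounding effect can be made concrete under a simplifying assumption that is not from the study: each step succeeds independently with the same probability, so an n-step task succeeds with that probability raised to the power n. The function name and the 0.99 per-step figure are illustrative only.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that an n-step task finishes with no failed step.

    Assumes independent steps with identical success probability,
    a simplification made for this example.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Print how end-to-end success shrinks as the task gets longer.
for n in (1, 10, 30, 100):
    print(n, round(task_success_probability(0.99, n), 3))
```

Under that assumption, an agent that is right 99% of the time per step finishes a 30-step task only about 74% of the time, which a single short-task score would not reveal.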

Why is pass@1 the wrong metric for AI agents?

tsukumo

What each metric actually sees

Question	pass@1 / capability	Reliability over the horizon
Did it work once	yes	yes
Does it hold at 30 steps	invisible	measured (decay curve)
Is the failure soft or catastrophic	invisible	measured (degradation score)
Where does it break	invisible	measured (meltdown onset)

pass@1 measures capability, not agent reliability

Why is pass@1 a bad metric for AI agents?

What's the difference between capability and reliability?

Why do agents degrade on long tasks?

How do you measure agent reliability?

Does more memory help?

How we think about it
