What is an eval for an AI agent?

An eval is a repeatable test of an agent against a fixed set of representative tasks with a way to grade each outcome as pass or fail (or scored). You run it on every change to the agent's prompt, model, or tools, and compare the score to the last run. It turns "does this feel better" into "did the number go up or down."

Why isn't testing in the demo enough for an AI agent?

Because the demo is a handful of cases you chose, and an agent's behavior shifts across thousands you didn't. A change that improves your three demo prompts can quietly break a category you never look at. Without a fixed, representative set scored every time, you only learn about regressions when a user hits one.

Can you use an LLM to grade an AI agent's output?

Yes, LLM-as-judge is useful for outputs that are hard to grade with a script, but only if you validate the judge against human labels first. An unchecked judge can be biased, inconsistent, or wrong in the same direction as the agent. If you never verify the grader, you've automated your vibes, not replaced them.

How often should you run agent evals?

On every change that can move behavior: a prompt edit, a model upgrade, a new or changed tool. Treat the eval like a test suite in CI. The whole value is catching a regression at the moment you introduce it, not discovering it in production after it has affected real work.

Evaluating AI agents in production (beyond vibes)

tsukumo

Evaluating AI agents in production (beyond vibes) · tsukumo

Vibes vs evals

Question	Shipping on vibes	Shipping on evals
How you judge a change	It felt better in the demo	A score on a fixed golden set
When you learn about a regression	A user complains, weeks later	The run right after you cause it
Can you attribute it	No, too many changes since	Yes, to the change that dropped the score
Model-judge	Trusted blindly, if used at all	Validated against human labels
Confidence to change the agent	Low, you hold your breath	High, the number tells you

Evaluating AI agents in production: getting past vibes

Why "it seems better" is the trap#

What an eval actually is#

Grade what you can; judge what you can't#

The trap inside the fix: an unchecked judge#

Vibes vs evals#

What this means for your team#

AI agents in production: the four operating problems that decide it

Your AI agent didn't fail the deploy. It stopped itself.

How we run a 9-agent growth team on wrai.th (and what broke)

Want this running on your team?