How do I evaluate or measure an AI agent's quality?

Stop judging on vibes and build an eval: a frozen golden set of real tasks, a way to grade each outcome (a deterministic check where the answer is well-defined, a validated model-judge where it isn't), and the discipline to run it on every prompt, model, or tool change. That turns "it felt better" into "the score went up or down," so a regression is a number that dropped, not a complaint two weeks later.

Updated 19 June 2026

Why "it seems better" is the trap

A demo is a few cases you chose; an agent fails on the distribution you didn't. A change that helps your three favorites can quietly degrade a category you never sample, with no error, just drift nobody is measuring.
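
One way to make that drift visible is to score each category separately instead of reporting a single overall number. The sketch below is a minimal illustration, not a prescribed harness: the `(category, passed)` result format, the function names, and the sample data are all assumptions made up for this example.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Map each category to its pass rate, given (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Return categories whose pass rate dropped by more than `tolerance`.

    Values are (baseline_rate, candidate_rate). A category missing from the
    candidate run is treated as a pass rate of 0.0.
    """
    return {
        c: (baseline[c], candidate.get(c, 0.0))
        for c in baseline
        if baseline[c] - candidate.get(c, 0.0) > tolerance
    }


# Hypothetical runs: the overall pass rate is 3/4 in both, so an aggregate
# score would show no change even though "billing" got worse.
baseline_run = [("refund", False), ("refund", True),
                ("billing", True), ("billing", True)]
candidate_run = [("refund", True), ("refund", True),
                 ("billing", True), ("billing", False)]

dropped = regressions(
    pass_rate_by_category(baseline_run),
    pass_rate_by_category(candidate_run),
)
print(dropped)
```

In this made-up data the change fixes a refund case and breaks a billing case, so only the per-category view shows that anything regressed.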

Grade what you can, judge what you can't

Use a deterministic check where the answer is well-defined (right tool called, output parses, test passes), because it never has an off day. Use an LLM-as-judge only for judgement calls, and only after you've validated it against human labels.
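
As a rough sketch of what deterministic checks can look like, the two functions below cover "right tool called" and "output parses". The trace format (a list of dicts with a `"tool"` key) and the function names are assumptions for illustration; a real agent framework will have its own trace structure.

```python
import json


def called_expected_tool(trace, expected_tool):
    """True if any step in the (hypothetical) trace calls the expected tool."""
    return any(step.get("tool") == expected_tool for step in trace)


def output_parses_as_json(output, required_keys=()):
    """True if `output` is a JSON object that contains every required key."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON, or not a string/bytes value at all.
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)
```

Checks like these return the same verdict for the same input on every run, which is what makes a score comparable across changes.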

The trap inside the fix

An unchecked judge is automated vibes. A grader you don't validate can be biased in the same direction as the agent and hand you a green dashboard over a degrading system. Validate it against human labels, and re-validate whenever the judge itself changes.
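
Validating a judge can be as simple as comparing its verdicts with human labels on the same items. The sketch below assumes "pass"/"fail" labels and computes raw agreement, Cohen's kappa (agreement corrected for chance), and the share of human-failed items the judge passed. The function names and the sample labels are illustrative assumptions, and what counts as an acceptable value is left to you.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label: undefined
    return (observed - expected) / (1 - expected)


def false_pass_rate(human, judge, fail="fail", passed="pass"):
    """Among items humans failed, the fraction the judge passed."""
    failed = [j for h, j in zip(human, judge) if h == fail]
    if not failed:
        return 0.0
    return sum(j == passed for j in failed) / len(failed)


# Made-up labels for four items.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
```

A high false-pass rate is the pattern to watch for: it means the judge is lenient exactly where the agent is failing.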

Straight answers.

How often should I run agent evals? On every change that can move behavior (a prompt edit, a model upgrade, a new tool), the same way you run a test suite in CI. The value is in catching a regression when you cause it, not in production later.
Can I use an LLM to grade the outputs? Yes, for outputs that are hard to check with a script, but validate the judge against human labels first. An unverified judge replaces your vibes with confident, wrong vibes.
What goes in the golden set? Real, representative tasks, including the hard and weird ones, frozen so every run tests the same thing. Not your three demo prompts.
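
To make "frozen" and "run it like a test suite in CI" concrete, here is one possible sketch: fingerprint the golden set so any edit to it is detectable, and turn the score comparison into a process exit code. The task format, the `max_drop` parameter, and both function names are assumptions for illustration, not a standard API.

```python
import hashlib
import json


def golden_set_fingerprint(tasks):
    """SHA-256 over a canonical JSON dump of the tasks.

    Dict key order does not affect the result; task order does.
    """
    canonical = json.dumps(tasks, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def gate(score, baseline_score, fingerprint, expected_fingerprint, max_drop=0.0):
    """Return an exit code for CI: 0 to pass, 1 to fail."""
    if fingerprint != expected_fingerprint:
        # The golden set changed, so scores are no longer comparable.
        return 1
    if baseline_score - score > max_drop:
        # The score fell by more than the allowed amount.
        return 1
    return 0
```

In a CI job, the result of `gate(...)` would typically be passed to `sys.exit`, so a drop in score fails the build the same way a failing unit test does.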
