Agency17 June 20264 min read
Evaluating AI agents in production: getting past vibes
Most teams ship agent changes on vibes: it felt better in the demo. But an agent silently regresses, and "it seems good" isn't a measurement. Evals are a golden set of real tasks with gradeable outcomes, run on every change, with the judge itself checked.
The short answer
Most teams change an agent and judge it on vibes: it felt better in the demo. That is not a measurement, and an agent silently regresses on the cases you didn't look at. Evaluating an agent in production means a golden set of real tasks with gradeable outcomes, run on every prompt, model, or tool change, so a regression shows up as a number that dropped instead of a complaint two weeks later. And if you use a model to grade, you have to check the grader, or you've just automated the vibes.

Short version: most teams change an agent and decide it's better because it felt better in the demo. That is not a measurement, and it is exactly how agents silently regress. An agent's behavior shifts across thousands of cases you never look at, so "it seems good" tells you about the three you did. Evaluating an agent means a golden set of real tasks with gradeable outcomes, run on every prompt, model, or tool change, so a regression shows up as a number that dropped instead of a support ticket two weeks later. We run agents in production to ship our own software, and the evals are what let us change them without holding our breath.