loading
Loading.loading
Loading.Stop judging on vibes and build an eval: a frozen golden set of real tasks, a way to grade each outcome (a deterministic check where the answer is well-defined, a validated model-judge where it isn't), and the discipline to run it on every prompt, model, or tool change. That turns "it felt better" into "the score went up or down," so a regression is a number that dropped, not a complaint two weeks later.
A demo is a few cases you chose; an agent fails on the distribution you didn't. A change that helps your three favorites can quietly degrade a category you never sample, with no error — just drift nobody is measuring.
Use a deterministic check where the answer is well-defined (right tool called, output parses, test passes), because it never has an off day. Use an LLM-as-judge only for judgement calls, and only after you've validated it against human labels.
An unchecked judge is automated vibes. A grader you don't validate can be biased the same direction as the agent and hand you a green dashboard over a degrading system. Validate it against human labels and re-check when you change it.
or have us build it — same capability, the other door