Your developers feel faster with AI. The clock disagrees.
A 2025 randomized controlled trial from METR found experienced developers were about 19% slower with AI tools, while they believed they were faster. The gap between the feeling and the measurement is the real risk, and it's a management problem before it's a tooling one.
tsukumo
Short version: the most rigorous study we have on AI and developer speed found the tools made experienced engineers slower, and those same engineers were sure they'd been sped up. If you're greenlighting AI spend off the back of "the team loves it and feels faster," you're buying the feeling, not the result. That's a manageable problem. But you can't manage a number you're not looking at.
In 2025 the independent lab METR ran the experiment most people skip: a randomized controlled trial on real software work. Sixteen experienced open-source developers, 246 real issues on repositories they maintain and know cold. Each issue was randomly assigned to one of two conditions: AI tools allowed, or AI tools not allowed. The headline:
Developers were about 19% slower with AI than without it, on their own codebases.
They predicted a 24% speedup going in, and after living through the slowdown, still believed AI had made them about 20% faster.
The tooling was current for the time: mostly Cursor Pro with frontier Claude models.
Sit with the second point, because it's the one that matters for your org. The measured effect and the felt effect didn't just differ in size. They pointed in opposite directions. The people doing the work were confidently, sincerely wrong about whether it was helping.
This isn't developers being dishonest or dumb. It's the shape of the work. Typing is rarely the bottleneck for a senior engineer; deciding what's correct is. AI moves the cost around: you do less blank-page typing and more reading, judging, correcting, and re-prompting a draft that looks right and sometimes isn't. Producing a plausible draft feels like progress, so the review-and-repair tax that follows gets quietly discounted.
On a codebase you know well, with standards you actually enforce, that tax is highest. The suggestion has to clear a high bar, and checking whether it does can cost more than writing the thing yourself. So you get the worst combination for a manager: a real slowdown that everyone on the team experiences as a speedup.
This felt-versus-measured gap is exactly what an agent-ops assessment is built to surface before you scale spend on a vibe.
Here's where most rollouts go wrong. The success metric becomes adoption and sentiment. Seats activated. "Percent of code written by AI." A survey where everyone says they're faster. Every one of those goes up while the thing you care about, shipping good software sooner, can go sideways or down. METR isn't an isolated reading, either: Google's 2024 DORA report found that rising AI adoption was associated with lower delivery stability, mostly because it's so easy to generate bigger, riskier change sets.
Two independent, serious measurements, pointing the same way: AI makes it trivial to produce more, and producing more is not the same as delivering better. If your dashboard only tracks production and sentiment, it will look like a triumph the whole time the delivery metrics rot.
One caveat, and it cuts against us too: METR's study is sixteen expert developers on mature repos they know intimately. That's the hardest case for AI to win, and the easiest place for a careless suggestion to hurt. A junior on a greenfield service, or a senior pointed at boilerplate, can land very different numbers. That's the point, not a loophole. The result isn't "AI is slow." It's "the result depends entirely on the work and the operating model, so measure it instead of assuming."
The study quietly tells you where the wins are. Among the slowdown factors METR named: prompts that were too thin, unfamiliarity with the tools, and standards the suggestions couldn't meet. None of those is a property of the model. All of them are operating choices.
Point AI at the work where it pays. Boilerplate, tests, migrations, the unglamorous high-volume stuff, not the gnarly core a senior already holds in their head. Task selection is most of the effect.
Measure the outcome, not the activity. Watch throughput on real work at held quality: cycle time, change-failure rate, rework. Self-reported speed is the one signal METR showed you can't trust. We wrote the longer version of this in measuring AI's impact honestly.
Fix the layer under the model. A lot of the slowdown is the agent working from the wrong context. This is the one place we have a hard number: trovex cuts roughly 60% of the tokens per lookup by serving the currently-correct context instead of stuffing the window. Less to read wrong, less to repair.
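The outcome metrics named above (cycle time, change-failure rate, rework) can usually be computed from data a team already has. Here is a minimal Python sketch, assuming you can export one record per merged change. The `Change` type and all of its field names are hypothetical, and how you decide that a change "caused a failure" or "needed rework" is your call, not something the study defines.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    """One merged change. Every field name here is a hypothetical placeholder."""
    opened: datetime        # when work started or the PR was opened
    merged: datetime        # when the PR was merged
    caused_failure: bool    # led to an incident, rollback, or hotfix
    rework_commits: int     # follow-up commits needed after first review


def delivery_metrics(changes: list[Change]) -> dict[str, float]:
    """Summarize outcome metrics for a batch of merged changes."""
    if not changes:
        raise ValueError("need at least one change")
    cycle_hours = [
        (c.merged - c.opened).total_seconds() / 3600 for c in changes
    ]
    n = len(changes)
    return {
        # Median is less sensitive than the mean to one stuck ticket.
        "median_cycle_hours": median(cycle_hours),
        # Share of changes that led to a failure in production.
        "change_failure_rate": sum(c.caused_failure for c in changes) / n,
        # Share of changes that needed at least one follow-up commit.
        "rework_rate": sum(c.rework_commits > 0 for c in changes) / n,
    }
```

Run it on the same window before and after a rollout, or per team, and compare the three numbers side by side. None of them can be moved by enthusiasm alone, which is the point.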
Stop reporting adoption as a win. Seats and "percent AI code" measure activity. Replace them with delivery metrics you already collect.
Run the cheap version of METR's experiment. Take a batch of real tickets, let half use AI and half not, and compare time-to-merge and rework. The method is reproducible on your own repo in a sprint, and the answer is often uncomfortable.
Believe the clock over the survey. When the feeling and the metric disagree, the metric is the one you ship.
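The cheap version of the experiment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not METR's method: it assumes the tickets are roughly comparable in size, that you record hours-to-merge and a yes/no rework flag per ticket, and the function names `assign_arms` and `compare_arms` are made up for this example. METR's own analysis was more careful than a comparison of means.

```python
import random
from statistics import mean


def assign_arms(ticket_ids: list[str], seed: int = 0) -> dict[str, str]:
    """Randomly split tickets into 'ai' and 'no_ai' arms of near-equal size."""
    rng = random.Random(seed)  # fixed seed so the assignment can be audited
    shuffled = ticket_ids[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {t: ("ai" if i < half else "no_ai") for i, t in enumerate(shuffled)}


def compare_arms(
    arms: dict[str, str],
    hours_to_merge: dict[str, float],
    reworked: dict[str, bool],
) -> dict:
    """Per-arm mean time-to-merge and rework share, plus the AI/no-AI ratio."""
    summary: dict = {}
    for arm in ("ai", "no_ai"):
        ids = [t for t, a in arms.items() if a == arm]
        if not ids:
            raise ValueError(f"no tickets in arm {arm!r}")
        summary[arm] = {
            "tickets": len(ids),
            "mean_hours_to_merge": mean(hours_to_merge[t] for t in ids),
            "rework_share": sum(reworked[t] for t in ids) / len(ids),
        }
    # Above 1.0 means the AI arm took longer; 1.19 would mean 19% longer.
    summary["time_ratio_ai_vs_no_ai"] = (
        summary["ai"]["mean_hours_to_merge"]
        / summary["no_ai"]["mean_hours_to_merge"]
    )
    return summary
```

Assign the arms before anyone looks at the tickets, so nobody can steer the easy ones toward AI. A sprint-sized batch is small and noisy, so treat the ratio as a direction to investigate rather than a verdict, and collect the team's own speed estimates alongside it to see whether the feeling and the clock agree.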
We run agent fleets in production to build our own software, so we've felt the trap from the inside: the draft lands, it looks done, and the hour you spend making it actually correct disappears from memory. The teams that get a real speedup aren't the ones with the most enthusiasm. They're the ones who measured, found the slow spots, and fixed the operating model around the tools. The model is maybe 10% of a working setup. The other 90% is the part METR just put a number on.
If your team is sure AI is helping and you'd like to actually know, that's the engagement. Talk to us about your setup.
Does AI make developers faster?
Not by default. In METR's 2025 randomized trial, experienced developers were about 19% slower with AI on repositories they knew well, even though they felt faster. AI can produce real speedups, but they depend on the task, the codebase, and how the team operates the tools, not on adoption alone.
What did the METR study find about AI and developer productivity?
METR ran a randomized controlled trial with 16 experienced open-source developers across 246 real issues. Allowed to use AI tools, they took about 19% longer to finish, yet expected a 24% speedup beforehand and still believed AI had sped them up by ~20% afterward. The measured result and the perceived result pointed in opposite directions.
Why do developers feel faster with AI when they're slower?
Generating a plausible-looking draft feels like progress, so the effort of reading, correcting, and re-prompting it gets discounted. On a codebase you know well, with high quality standards, that review-and-repair loop can cost more time than it saves, while still feeling productive moment to moment.
How do you actually make AI speed up your developers?
Point it at the work where it pays off, measure throughput on real outcomes at held quality instead of self-reported speed, and fix the operating layer: trusted context, review gates, and observability. The study's slowdown is an operating result, not a verdict on the tools.