What the research actually says about AI coding agents (2026)
Five independent studies, from METR, Stanford, Google's DORA, GitClear, and ETH Zurich, converge on one finding: AI coding tools don't improve software on their own. Adoption is not improvement. The gains are real but conditional on the operating model around the tools. Here's the evidence, with sources.
tsukumo
Short version: strip out the vendor decks and the LinkedIn euphoria, and look at what independent researchers actually measured in 2024 and 2025. Five serious studies, five different angles, one uncomfortable agreement: adopting AI coding tools did not, on average, make software teams faster or more stable. It usually made them a little worse, in measurable ways. Not because the tools are bad. Because volume without discipline is a tax, and AI is very good at volume.
Here's the evidence, with sources, and the pattern underneath it.
This is a living page. We add each new independent study as it lands, so the URL stays current instead of going stale. Tracking five studies as of June 2026.
Each study looks at a different layer (productivity, delivery, code quality, context, codebase fit) and lands in the same place. AI is a multiplier on your operating model. With good discipline, it compounds the good. With weak discipline, it compounds the weak, faster. The tool doesn't set the sign. You do.
That reframes the whole "is AI worth it" question. The honest answer the data supports is: it depends entirely on what you do around it, so measure, don't assume. Below, each study and what it actually shows.
Productivity: METR found experienced devs slower, yet sure they were faster
The independent lab METR ran a randomized controlled trial in 2025: 16 experienced open-source developers, 246 real issues on repos they know well, each task randomly assigned AI or no AI. Result: about 19% slower with AI. The twist that matters for managers: the same developers predicted a 24% speedup and, after the slowdown, still felt AI had sped them up by ~20%. Felt speed and measured speed disagreed.
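To make the gap between felt and measured speed concrete, here is a small worked example in Python. It is only illustrative arithmetic: the 60-minute baseline is a made-up number, and it assumes each percentage is read as a change in task completion time (19% slower meaning 19% more time, a 24% speedup meaning 24% less time).

```python
# Illustrative arithmetic only. The 60-minute baseline is hypothetical, and the
# percentages are assumed to describe changes in task completion time.
BASELINE_MINUTES = 60.0


def minutes_with_change(baseline: float, pct_change_in_time: float) -> float:
    """Return task time after a percentage change in completion time."""
    return round(baseline * (1 + pct_change_in_time / 100), 1)


measured = minutes_with_change(BASELINE_MINUTES, +19)   # measured: about 19% slower
predicted = minutes_with_change(BASELINE_MINUTES, -24)  # forecast: 24% speedup
felt = minutes_with_change(BASELINE_MINUTES, -20)       # recalled: about 20% speedup

perception_gap = round(measured - felt, 1)

print(f"measured {measured} min, predicted {predicted} min, felt {felt} min")
print(f"gap between felt and measured: {perception_gap} min")
```

Under those assumptions, a developer who believes the hour-long task got shorter is off by more than a third of the original task time, which is why self-reported speed is a poor stand-in for a stopwatch.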
Codebase fit: Stanford found AI helps clean code most
Stanford's study of over 100,000 developers across 600+ companies split the gains by where the work happens. On greenfield, low-complexity code, productivity gains ran 35-40%. On complex, brownfield code, where most teams actually live, they fell to single digits: 0-10%. Some teams went net-negative as rework ate the speedup, and one enterprise case saw a 2.5x rise in rework. What separated the winners was codebase cleanliness: test coverage, types, docs, modularity.
Delivery: DORA tied more adoption to less stability
Google's 2024 DORA report, the largest ongoing study of software delivery, found that a 25% increase in AI adoption was associated with an estimated 7.2% decrease in delivery stability and a 1.5% drop in throughput. The mechanism DORA names is batch size: AI makes producing code cheap, change sets get bigger, and big batches have always been riskier.
The full breakdown, and why removing the friction that kept your PRs small is the actual problem, is in AI makes it easy to ship more.
GitClear analyzed 211 million lines of code across five years. Between 2020 and 2024, copy-pasted lines rose from 8.3% to 12.3% of changed code, while "moved" (refactored) lines fell from 24.1% to 9.5%. For the first time, copy-pasted code exceeded refactored code. Duplicated blocks rose sharply, and the share of code revised within two weeks of commit climbed too, which is a churn signal.
Read plainly: AI makes it easier to paste a new version than to refactor the old one, and teams are taking the easy path. That's a maintainability bill that arrives later, quietly.
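A rough way to watch for paste-over-refactor in your own repo is to look for identical blocks of lines that appear in more than one file. The sketch below is a minimal illustration in Python, not GitClear's methodology: the window size, the file names, and the file contents are all hypothetical, and real duplicate detectors normalize far more than whitespace.

```python
from collections import defaultdict

WINDOW = 4  # hypothetical block size, in consecutive non-blank lines


def normalized_lines(source: str) -> list[str]:
    """Strip whitespace and drop blank lines so formatting does not hide copies."""
    return [line.strip() for line in source.splitlines() if line.strip()]


def duplicated_blocks(files: dict[str, str], window: int = WINDOW) -> dict[tuple[str, ...], set[str]]:
    """Map each block of `window` lines to the files containing it,
    keeping only blocks that occur in more than one file."""
    seen: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for path, source in files.items():
        lines = normalized_lines(source)
        for start in range(len(lines) - window + 1):
            seen[tuple(lines[start:start + window])].add(path)
    return {block: paths for block, paths in seen.items() if len(paths) > 1}


# Toy input: two hypothetical files that share a pasted loader body.
ORDERS = """
def load(path):
    with open(path) as handle:
        data = handle.read()
    rows = data.splitlines()
    return [row.split(",") for row in rows]

def total(rows):
    return sum(float(r[1]) for r in rows)
"""

REFUNDS = """
def load_refunds(path):
    with open(path) as handle:
        data = handle.read()
    rows = data.splitlines()
    return [row.split(",") for row in rows]
"""

duplicates = duplicated_blocks({"orders.py": ORDERS, "refunds.py": REFUNDS})
for block, paths in duplicates.items():
    print(sorted(paths), "share a block starting with:", block[0])
```

Tracking the count of such blocks per month, rather than its absolute value, is what shows whether duplication is trending up since AI tools arrived.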
Context: ETH Zurich found naive context files backfire
The instinct when an agent struggles is to write it a longer AGENTS.md. ETH Zurich benchmarked that and found LLM-generated context files cut task success by about 3% and raised cost over 20%, versus no repo context at all. More context, worse and pricier behavior. Only short, human-written constraint files helped, and only a little.
The lesson, and why the fix is served context rather than stuffed context, is in more context isn't better.
The common thread: it's an operating-model problem
Line the five up and the shared cause is obvious. METR's slowdown came from review-and-repair on mismatched suggestions. Stanford's vanishing gains came from complex code the model couldn't fit. DORA's instability came from oversized batches. GitClear's duplication came from paste-over-refactor. ETH's failures came from drowning the model in context. None of those is a property of the model. Every one is a choice about how the team runs the tool.
So the teams getting real gains aren't using different models. They're running a different operating model:
Task selection. Point AI at the high-volume, low-context work, not the gnarly core a senior holds in their head.
Small batches and real gates. Cap change size, and make review mean understanding the change, not waving it through.
Outcome metrics. Track change-failure rate and rework, the numbers that survive AI, not "lines shipped" or "percent AI code."
Trusted context. Serve the currently-correct slice instead of stuffing the window. This is the one place we have a hard number: trovex cuts roughly 60% of the tokens per lookup by serving the right context, which means smaller, more correct changes out.
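The "small batches and real gates" item in the list above can be sketched as a pre-merge check. This is a minimal Python illustration, not a recommended policy: the `ChangeSet` type and both caps are hypothetical, and sensible thresholds would come from your own history of which change sizes tended to fail.

```python
from dataclasses import dataclass

MAX_CHANGED_LINES = 400  # hypothetical cap on added plus deleted lines
MAX_CHANGED_FILES = 15   # hypothetical cap on files touched


@dataclass
class ChangeSet:
    title: str
    lines_added: int
    lines_deleted: int
    files_changed: int


def gate(change: ChangeSet) -> list[str]:
    """Return the reasons a change set should be split; an empty list means it passes."""
    reasons = []
    changed = change.lines_added + change.lines_deleted
    if changed > MAX_CHANGED_LINES:
        reasons.append(f"{changed} changed lines exceeds the cap of {MAX_CHANGED_LINES}")
    if change.files_changed > MAX_CHANGED_FILES:
        reasons.append(f"{change.files_changed} files exceeds the cap of {MAX_CHANGED_FILES}")
    return reasons


small = ChangeSet("Fix rounding in invoice totals", 35, 12, 3)
large = ChangeSet("Regenerate billing module", 1800, 950, 42)

for change in (small, large):
    problems = gate(change)
    status = "pass" if not problems else "split it: " + "; ".join(problems)
    print(f"{change.title}: {status}")
```

A size cap only covers the batch half of that item. The review half, understanding the change rather than waving it through, still has to be done by a person.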
You don't need to run any of these studies to act on them. Look at your own PR sizes, your duplication trend, and your change-failure rate since AI landed. If the volume went up and the quality signals drifted down, you've reproduced the research on your own repo, and you know where the work is.
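One way to run that check is to summarize the same three signals for a window before AI adoption and a window after it, then compare. The Python sketch below assumes you can export one record per deploy; the `Deploy` fields, the two-week rework field, and every number in the toy data are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Deploy:
    lines_changed: int    # size of the change set that shipped
    caused_failure: bool  # needed a rollback, hotfix, or incident
    reworked_lines: int   # lines rewritten soon after shipping (hypothetical field)


def summarize(deploys: list[Deploy]) -> dict[str, float]:
    """Compute volume and quality signals for one time window."""
    total_lines = sum(d.lines_changed for d in deploys)
    return {
        "median_change_size": median(d.lines_changed for d in deploys),
        "change_failure_rate": sum(d.caused_failure for d in deploys) / len(deploys),
        "rework_share": sum(d.reworked_lines for d in deploys) / total_lines,
    }


def volume_up_quality_down(before: dict[str, float], after: dict[str, float]) -> bool:
    """True when change sets grew while failure rate or rework got worse."""
    volume_up = after["median_change_size"] > before["median_change_size"]
    quality_down = (
        after["change_failure_rate"] > before["change_failure_rate"]
        or after["rework_share"] > before["rework_share"]
    )
    return volume_up and quality_down


# Toy data for two windows; replace with exports from your own tooling.
before_ai = [Deploy(80, False, 4), Deploy(120, False, 6), Deploy(100, True, 10), Deploy(60, False, 0)]
after_ai = [Deploy(300, True, 45), Deploy(250, False, 30), Deploy(400, True, 60), Deploy(250, False, 15)]

before_stats = summarize(before_ai)
after_stats = summarize(after_ai)
drifted = volume_up_quality_down(before_stats, after_stats)

print("before:", before_stats)
print("after: ", after_stats)
print("volume up and quality down:", drifted)
```

The comparison is deliberately crude. Its purpose is to replace an assumption with a number you can track, not to prove causation.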
The pitch says AI replaces the discipline. The data says AI raises the stakes on it.
We run agent fleets in production to build our own software, so we paid for each of these lessons before there was a paper to cite. The model is maybe 10% of a working setup. The other 90% (task selection, batch discipline, gates, trusted context) is the part five independent teams of researchers just finished measuring. That 90% is also, not coincidentally, the work.
If your team adopted AI and the results don't match the promise, that gap is exactly what we fix. Talk to us about your setup.
Not by default. Across five independent 2024-2025 studies, AI adoption tracked with slower expert developers, lower delivery stability, more duplicated code, and shrinking gains on complex codebases. The studies don't say AI is useless; they say results depend on how teams operate the tools, so the improvement has to be engineered, not assumed.
What does the METR study say about AI and developer speed?
METR's 2025 randomized controlled trial found 16 experienced open-source developers were about 19% slower on their own repositories when using AI, while believing they were ~20% faster. The measured effect and the felt effect pointed in opposite directions.
Does AI hurt software delivery and code quality?
It can, when discipline slips. Google's 2024 DORA report associated a 25% rise in AI adoption with a 7.2% drop in delivery stability, driven by larger change sets. GitClear found copy-pasted lines rose from 8.3% to 12.3% of code (2020-2024) and refactoring fell sharply. Both effects come from shipping more, faster, without tighter review.
So should teams avoid AI coding tools?
No. The same research points to where AI pays off: the right tasks, small reviewable batches, real gates, and context the agent can trust. The evidence is an argument for a better operating model, not for avoiding the tools.