20 June 2026 · 3 min read

The AI operating-model scorecard: where your team is losing the gains

The research says AI helps or hurts depending on five operating levers, not the model. Here's a short self-scorecard for each one. Most teams are strong on a couple and quietly bleeding the gains on the rest. Your lowest score is where to fix first.

tsukumo

Short version: the research is consistent that AI helps or hurts based on how you run it, not which model you bought. We turned the five levers that decide it into a quick scorecard. Run each section honestly. Most teams find they're solid on one or two levers and bleeding the gains on the others, and the lowest score is exactly where to start.

Score each lever 0-2: 0 = not happening, 1 = partly, 2 = solid.

Lever 1: Task selection

Where you point AI decides most of the result. Stanford measured 35-40% gains on simple, greenfield work and near-zero on the complex code most teams live in.

We've named the work AI is good at here (boilerplate, tests, migrations) and the work it isn't.
Developers aren't reaching for an agent on the gnarly core that needs deep system context.
We can say, per team, where AI is actually pointed today.

Score 0-2. Low score: you're aiming AI at the hard cases and paying for it in rework.

Lever 2: Batch size and review

DORA tied rising AI adoption to lower delivery stability, because AI inflates change size and big batches are riskier.

There's a real cap on PR size, and AI-generated work gets split to fit it.
Review means a human understood the change, not that they clicked approve.
PR sizes have not quietly crept up since AI landed.

Score 0-2. Low score: AI is helping you ship bigger, riskier batches faster.
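One way to check the "PR sizes have not quietly crept up" item is to compare PR sizes before and after your AI rollout. The sketch below is a minimal version under stated assumptions: the `PullRequest` shape, the `size_creep` helper, and the 400-line default cap are all hypothetical, so pick a cap that matches your own review policy and feed in data from whatever your Git host exports.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Dict, List


@dataclass
class PullRequest:
    merged_on: date
    lines_changed: int  # additions + deletions


def size_creep(prs: List[PullRequest], ai_rollout: date, cap: int = 400) -> Dict[str, float]:
    """Compare PR sizes on either side of a rollout date.

    `cap` is an illustrative threshold, not a recommended value.
    """
    before = [pr.lines_changed for pr in prs if pr.merged_on < ai_rollout]
    after = [pr.lines_changed for pr in prs if pr.merged_on >= ai_rollout]
    if not before or not after:
        raise ValueError("need PRs on both sides of the rollout date")
    return {
        "median_before": median(before),
        "median_after": median(after),
        # Fraction of post-rollout PRs that exceed the cap.
        "share_over_cap_after": sum(1 for n in after if n > cap) / len(after),
    }
```

If the median after rollout is clearly higher, or a meaningful share of PRs blows past the cap, that is a sign this lever deserves a 0 or 1.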

Lever 3: Outcome metrics

METR found developers felt about 20% faster while the measurements showed them 19% slower. Activity metrics mislead in exactly the direction that flatters AI.

We track change-failure rate and rework, not "lines shipped" or "percent AI code".
Nobody on the team is reporting AI's value purely from how it feels.
If AI quietly made delivery worse, our dashboard would show it.

Score 0-2. Low score: you can't tell whether AI is helping, which usually means it isn't, yet.
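As a rough illustration of outcome metrics, the sketch below computes a change-failure rate (deployments that caused a failure divided by all deployments) and a rework rate. The `Deployment` record, the `is_rework` flag, and how you decide a change counts as rework are assumptions for the example; your own delivery data will need its own definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Deployment:
    caused_failure: bool  # needed a rollback, hotfix, or patch
    is_rework: bool       # hypothetical flag: mostly revises recently shipped code


def outcome_metrics(deploys: List[Deployment]) -> Dict[str, float]:
    """Return change-failure rate and rework rate as fractions of all deployments."""
    if not deploys:
        raise ValueError("no deployments to measure")
    total = len(deploys)
    return {
        "change_failure_rate": sum(d.caused_failure for d in deploys) / total,
        "rework_rate": sum(d.is_rework for d in deploys) / total,
    }
```

Tracking these two numbers over time, before and after AI adoption, is what lets a dashboard show it if delivery quietly got worse.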

Lever 4: Context quality

ETH Zurich found stuffing an agent with a big context file made it worse and pricier; GitClear found AI driving duplication, partly because the agent can't see what to reuse.

The agent gets scoped, current context for the task, not the whole repo or a stale overview.
Our AGENTS.md (or equivalent) is short and about constraints, and it's kept current.
The agent reuses existing code instead of generating near-duplicates of it.

Score 0-2. Low score: you're paying for the agent to read the wrong things and repeat itself.

Lever 5: Codebase cleanliness

Stanford found net AI productivity rises with test coverage, types, docs, and modularity. The cleaner the environment, the more of AI's speed survives.

Test coverage is good enough that the agent's mistakes get caught before they cost a day.
Modules and types are clear enough to give the agent a target it can hit.
We'd be comfortable pointing an agent at this codebase without wincing.

Score 0-2. Low score: your codebase is the ceiling on every AI gain, and it's set low.

Reading your score

Add it up (max 10).

8-10: you're operating AI, not merely running it. Your gains are probably real. Tighten the weakest lever and keep measuring.
4-7: typical. Real value in places, leaks in others. Your lowest-scoring lever is the highest-return fix, usually batch size or context.
0-3: you've adopted the tool and skipped the operating model. This is the common case behind "AI isn't delivering for us", and it's fixable, one lever at a time, starting with the lowest score.

The point isn't the number. It's that the lever you scored lowest is where your AI spend is leaking right now, and it's almost always cheaper to fix than to wait for a smarter model.
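The tally can be sketched in a few lines of Python. This is an illustration of the arithmetic only: the lever keys, the `read_scorecard` helper, and the band labels are shorthand invented for the example, while the 0-2 scale and the 8-10, 4-7, and 0-3 bands come from the scorecard itself.

```python
from typing import Dict, List, Tuple

# Hypothetical keys for the five levers in this scorecard.
LEVERS = [
    "task_selection",
    "batch_size_and_review",
    "outcome_metrics",
    "context_quality",
    "codebase_cleanliness",
]


def read_scorecard(scores: Dict[str, int]) -> Tuple[int, str, List[str]]:
    """Return (total out of 10, band label, lowest-scoring levers)."""
    missing = [lever for lever in LEVERS if lever not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for lever in LEVERS:
        if scores[lever] not in (0, 1, 2):
            raise ValueError(f"{lever} must be 0, 1, or 2")

    total = sum(scores[lever] for lever in LEVERS)
    if total >= 8:
        band = "operating"             # 8-10
    elif total >= 4:
        band = "typical"               # 4-7
    else:
        band = "adopted-not-operated"  # 0-3

    # The lowest score marks where to fix first; ties return every tied lever.
    lowest = min(scores[lever] for lever in LEVERS)
    weakest = [lever for lever in LEVERS if scores[lever] == lowest]
    return total, band, weakest
```

Calling `read_scorecard` with a score for each lever returns the total, the band, and the lever or levers to fix first.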

Want the measured version?

This scorecard is the five-minute read. A paid agent-ops assessment runs the same five levers against your real repository and delivery data, then ranks them by what each one is actually costing you, so you fix in order of payoff rather than guess. Same framework, real numbers. Talk to us and we'll map where your gains are leaking.

Common questions

How do I know if my team is using AI well?

Score yourself on five levers: are you pointing AI at the right tasks, keeping batches small behind real review, measuring outcomes not output, serving trusted context, and keeping the codebase clean? Strength on all five is rare. The weakest one is where your AI investment is leaking.

What is an AI operating-model scorecard?

A short self-assessment across the five operating practices the independent research ties to AI success: tasks, batches and gates, metrics, context, and code quality. It's not a maturity badge. It's a way to find the single lever costing you the most so you fix that first.

Which AI lever should I fix first?

The one you scored lowest. For most teams it's batch size or context, because both are cheap to change and give an immediate signal. Fix the weakest lever, re-measure on outcomes, then move to the next.

How is this different from an AI readiness assessment?

The scorecard is the five-minute self-version. A full assessment measures the same levers on your real repository and delivery data, then ranks them by what each is costing you. The scorecard tells you roughly where you stand; the assessment tells you the number.
