Is more context always better for an AI agent?

No. On a 50-task tool-using benchmark, keeping the full conversation history scored 71%, while pruning to the last five tool calls and summarizing the rest scored 91.6%. The same configuration used roughly a third of the tokens. More context added noise the model had to re-read every turn, not accuracy.

Does a bigger context window make an agent more accurate?

Not by itself. A bigger window is capacity, not selection. In the 2026 benchmark the full-history run fit inside the window at 1,480,996 tokens and still lost to a pruned run a third its size. What raised accuracy was choosing which context to keep, not having room to keep all of it.

What is context engineering for agents?

Context engineering is deciding what an agent reads at each step instead of feeding it everything. In practice that means pruning stale tool output, keeping recent calls, summarizing the rest, and serving one canonical document per query. The 2026 benchmark shows it as the difference between 71% and 91.6% task completion.

How much can pruning and summarizing context save?

In the benchmark, full history cost 1,480,996 tokens and 14.56 hours. Pruning plus summarization cost 553,374 tokens and 5.79 hours, while scoring higher (91.6% vs 71%). That is roughly a third of the tokens and under half the runtime, with better accuracy, not a tradeoff against it.

tsukumo

More context is making your AI agent worse (measured) · tsukumo

tsukumo

Agency29 June 20265 min read

More context is making your AI agent worse

The reflex is to give an agent more context when it struggles. A June 2026 benchmark shows the opposite: keeping the full conversation history scored 71%, while pruning and summarizing hit 91.6%, on a third of the tokens. Selecting context beats hoarding it.

tsukumo

Short version: When an agent starts getting things wrong on a long task, the instinct is to give it more context. A June 2026 benchmark ran that instinct head-on. Keeping the full conversation history scored 71% on a 50-task workload. Pruning to the last five tool calls and summarizing the rest scored 91.6%, while spending a third of the tokens and under half the runtime. So more context did not buy accuracy. It bought a slower, more expensive agent that was also wrong more often. Choosing what to keep is the actual job.

You have probably felt this without naming it. An agent runs fine for the first few steps, then somewhere past the tenth tool call it starts repeating itself, re-reading output it already saw, or confidently acting on a number from twenty messages ago that no longer holds. The reflex is to widen the window and pour in more history so it "remembers everything." The new research says that reflex is backwards.

Full history vs prune-and-summarize, on the same 50 tasks

dimension	Full history	Prune + summarize
Task completion	71.0%	91.6%
Tokens consumed	1,480,996	553,374
Runtime	14.56 hours	5.79 hours
What you're paying for	Re-reading stale context	Acting on relevant context

More context is making your AI agent worse

Does keeping full context improve an agent's accuracy?#

Why does more context make an agent worse?#

Is a bigger context window the fix?#

What does the wrong default cost?#

How do you actually manage context for long-horizon agents?#

How this changes the way you build an agent system#

How we run a 9-agent growth team on wrai.th (and what broke)

Why your AI agent can't correct its own mistakes

Our agents are our first users. So we interviewed them, and only believed the logs

Want this running on your team?