Context engineering for AI agents: less, but the right context
Context engineering is deciding what an agent reads at each step instead of feeding it everything. Across 2026 studies the pattern holds: more context usually makes agents slower, costlier, and less accurate. A bigger window doesn't fix it. Here's the discipline, and where each piece fits.
tsukumo
Short version: Context engineering is deciding what an agent reads at each step instead of handing it everything. The 2026 evidence keeps landing on the same counterintuitive result. More context usually makes an agent slower, costlier, and worse, and a bigger window does not rescue it. The job is to select the right context and serve the single current answer per query, then treat that selection as part of your architecture. This page is the map: what the discipline is, why the obvious fixes fail, what the wrong default costs, and how to actually serve context, with the evidence for each piece linked where it lives.
Most teams discover this backwards. An agent starts missing things on a long task, so you give it more memory, more files, a bigger window. It gets slower and no more reliable. The instinct that feels responsible is the one quietly degrading the system. Below is the discipline that replaces it.
Context engineering is the practice of choosing, at every step, what an agent reads, instead of defaulting to "give it everything and hope." It covers what you preload, what you carry forward from earlier turns, what you summarize, what you drop, and how you fetch a fact when the agent needs one. The unifying idea is simple. Context is a budget you spend on each step, not a log you keep appending to. The agent's job at every step is to read the least context that still answers the question.
That reframing is the whole shift. Treat context as something the system decides on, and most of the common failure modes turn from mysterious into predictable.
Does giving an AI agent more context make it better?
Usually not, and now there is direct measurement. On a 50-task tool-using benchmark, keeping the full conversation history finished 71% of tasks. Pruning to the last five tool calls and summarizing the rest reached 91.6%, on roughly a third of the tokens. More context scored lower, not higher.
71% vs 91.6%: task completion with full conversation history versus a pruned-and-summarized context, on the same 50-task benchmark. More context scored lower while costing about 2.7x the tokens: full history used 1,480,996 tokens, prune+summarize used 553,374.
Source: Lodha et al., Less Context Better Agents, arXiv:2606.10209 (2026)
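As a quick check on the arithmetic, the ratios can be recomputed from the four figures quoted above. This is only a sanity-check sketch; the constant names are made up here, and the numbers are copied from the text rather than from the paper's raw data.

```python
# Figures quoted in the text for the 50-task benchmark.
FULL_HISTORY_TOKENS = 1_480_996
PRUNED_TOKENS = 553_374
FULL_HISTORY_COMPLETION = 71.0  # percent of tasks completed
PRUNED_COMPLETION = 91.6        # percent of tasks completed

# How many times more tokens the full-history run consumed.
token_ratio = FULL_HISTORY_TOKENS / PRUNED_TOKENS
# Share of the full-history token spend that the pruned run needed.
token_fraction = PRUNED_TOKENS / FULL_HISTORY_TOKENS
# Completion difference in percentage points.
point_gain = PRUNED_COMPLETION - FULL_HISTORY_COMPLETION

print(f"full history used {token_ratio:.2f}x the tokens")
print(f"pruned run used {token_fraction:.1%} of the tokens")
print(f"completion gain: {point_gain:.1f} points")
```

The ratio works out to roughly 2.7x, or a little over a third of the tokens, for a gain of about 20.6 percentage points.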
It is not one benchmark, either. A separate 2026 ETH Zürich study found that adding repository context files to coding agents often makes them worse and more expensive, not better. Two different setups, two different research groups, same direction. The detail of each result lives in its own write-up.
A bigger window does not fix this, because window size is capacity and the problem is selection. A larger window changes how much you can fit. It says nothing about what belongs there. In the benchmark above, the full-history run was not failing for lack of room. It fit, at 1,480,996 tokens, and still lost to a context roughly a third its size. Buying a bigger window to fix a context problem is buying a bigger box for a packing problem.
This is also why agents "lose the thread" on a large repository. The repo does not fit a window, and widening the window does not make the agent better at finding the one file that matters. The fix is to serve the canonical answer on demand instead of asking the agent to scan everything and guess. That argument, with the mechanics, is in managing context for AI coding agents.
The wrong default costs you twice: once on the token bill and once on the clock, before you even reach the accuracy gap. In the benchmark, full history burned 1,480,996 tokens and 14.56 hours; the pruned-and-summarized run did the same work in 553,374 tokens and 5.79 hours, and scored higher. At any real volume, that token line is your margin and that runtime line is your latency.
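One way to put both costs on a single line is to normalize completion by spend. The sketch below uses the figures quoted above; the `RunResult` type and the "completion points per million tokens" metric are illustrative choices made here, not something the benchmark itself reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    name: str
    completion_pct: float  # percent of tasks completed
    tokens: int            # total tokens consumed by the run
    hours: float           # total wall-clock runtime

    def points_per_million_tokens(self) -> float:
        # Illustrative efficiency metric: completion bought per 1M tokens.
        return self.completion_pct / (self.tokens / 1_000_000)


full = RunResult("full history", 71.0, 1_480_996, 14.56)
pruned = RunResult("prune + summarize", 91.6, 553_374, 5.79)

for run in (full, pruned):
    print(
        f"{run.name}: {run.points_per_million_tokens():.1f} "
        f"completion points per 1M tokens, {run.hours} h"
    )

runtime_ratio = full.hours / pruned.hours
print(f"runtime ratio: {runtime_ratio:.2f}x")
```

On these numbers the pruned run buys more than three times as much completion per token and finishes in about 40% of the time.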
There is a quieter cost too: the tokens an agent spends rereading the same files every session just to work out which one is current. You pay it on every session, every agent, every teammate, and most teams never measure it. The token cost of agents rereading your docs shows where that cost comes from and how to measure it on your own repo.
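A rough way to size that reread cost is to count, per file, how many estimated tokens agents spend on it across sessions. This is a minimal sketch under stated assumptions: the four-characters-per-token rule is a heuristic rather than a real tokenizer, and the file names and session log are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed heuristic."""
    return int(len(text) / chars_per_token)


def reread_cost(docs: dict[str, str], sessions: list[list[str]]) -> dict[str, int]:
    """Estimated tokens spent per file, summed over every session that read it."""
    spent: dict[str, int] = {}
    for files_read in sessions:
        for name in files_read:
            spent[name] = spent.get(name, 0) + estimate_tokens(docs[name])
    return spent


# Hypothetical repo: three overlapping deploy docs, only one of them current.
docs = {
    "deploy_v1.md": "x" * 4000,
    "deploy_v2.md": "x" * 6000,
    "deploy_final.md": "x" * 8000,
}
# Hypothetical log: three sessions, each reading all three files to find the current one.
sessions = [list(docs)] * 3

spent = reread_cost(docs, sessions)
total = sum(spent.values())
needed = spent["deploy_final.md"]  # what reading only the current doc would cost
print(f"total: {total} tokens, needed: {needed}, reread overhead: {total - needed}")
```

In practice the session log would come from your agent's tool-call traces, and the estimate would be replaced by real token counts from your provider.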
When you wire knowledge into an agent, usually over MCP, you are really choosing one of three patterns: dump everything in, return candidate chunks for the agent to rank, or serve the single canonical doc that answers the query. They trade off cost, accuracy, and how much work the agent has to do.
Three ways to serve context, and what they cost
| Pattern | What the agent gets | Token cost | Accuracy risk |
| --- | --- | --- | --- |
| Dump everything | The whole pile, every call | Highest | Noise drowns the answer |
| Retrieve and rank | Candidate chunks to sift | Medium | Agent re-derives the answer |
| Serve the canonical answer | One current doc per query | Lowest | Depends on freshness, not volume |
The full decision, with when each pattern is the right call, is in MCP context patterns for coding agents. The short version: the more you make the agent reconstruct the answer from raw material, the more you pay and the more it guesses. Serving one current answer per query is the cheapest and the most accurate, as long as that answer stays fresh.
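The three patterns can be compared on a toy corpus. Everything here is hypothetical: the documents, the `CANONICAL` index, the word-overlap ranking, and the four-characters-per-token estimate are stand-ins chosen for illustration, not how any particular MCP server works.

```python
def estimate_tokens(text: str) -> int:
    # Assumed heuristic: about four characters per token.
    return len(text) // 4


# Hypothetical knowledge base: two stale deploy docs, one current, one unrelated.
DOCS = {
    "deploy-2024.md": "deploy with the old script " * 40,
    "deploy-draft.md": "deploy draft notes, unverified " * 40,
    "deploy.md": "deploy with the release pipeline " * 40,
    "onboarding.md": "welcome to the team " * 40,
}
# Hypothetical canonical index: exactly one current doc per topic.
CANONICAL = {"deploy": "deploy.md", "onboarding": "onboarding.md"}


def dump_everything(query: str) -> list[str]:
    # Pattern 1: hand over the whole pile regardless of the query.
    return list(DOCS.values())


def retrieve_and_rank(query: str, k: int = 3) -> list[str]:
    # Pattern 2: return the top-k docs by naive word overlap; the agent sifts them.
    words = set(query.lower().split())

    def score(text: str) -> int:
        return sum(1 for word in text.lower().split() if word in words)

    ranked = sorted(DOCS.values(), key=score, reverse=True)
    return ranked[:k]


def serve_canonical(query: str) -> list[str]:
    # Pattern 3: look up the single current doc for the query's topic.
    for topic, name in CANONICAL.items():
        if topic in query.lower():
            return [DOCS[name]]
    return []


query = "how do I deploy"
costs = {
    fn.__name__: sum(estimate_tokens(text) for text in fn(query))
    for fn in (dump_everything, retrieve_and_rank, serve_canonical)
}
for name, tokens in costs.items():
    print(f"{name}: ~{tokens} tokens")
```

Note that the ranked pattern still returns both stale deploy docs alongside the current one, which is the "agent re-derives the answer" risk in the table. The canonical pattern is only as good as the index behind it, which is where freshness comes in.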
This is the principle our own tooling runs on. trovex exists to hand an agent one canonical document per query instead of the whole repository, so the model reads a clean answer rather than re-deriving it from everything it has ever seen. We built it because our own agent fleet got slower and less reliable the more context we let pile up, which is the same failure these studies measured under controlled conditions. The measured token savings are in the trovex token-savings benchmark.
How do you build a context policy for a production agent?
You make context a thing the system decides on, the same way you own retries and timeouts. Concretely:
Set a per-agent context policy. Decide what gets kept, what gets summarized, and what gets served fresh from a canonical layer, and at which step boundaries. Write it down. A policy you never wrote is the default the framework picks for you, which is usually "keep everything."
Prune, then summarize; don't just append. Keep the recent tool calls in working context and replace older history with a compact summary instead of carrying it verbatim. That is the move that took the benchmark agent from 71% to 91.6%.
Serve one canonical answer per query. When the agent needs a fact, give it the single authoritative, current answer to read as fresh input, rather than a transcript or document dump to reconstruct it from.
Measure completion against token spend. Track task completion versus tokens, not prompt length. A fuller prompt is not a safer one, and your dashboard should be able to prove it either way.
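The first two steps can be sketched as a small, explicit policy object plus a function that applies it. This is a minimal sketch, not a framework API: `ContextPolicy`, `ToolCall`, `build_context`, and the placeholder summarizer are names invented here, and the default of five recent tool calls simply mirrors the benchmark setup described earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ContextPolicy:
    keep_recent_tool_calls: int = 5  # mirrors the benchmark's "last five tool calls"
    summarize_older: bool = True     # if False, older history is dropped outright


@dataclass(frozen=True)
class ToolCall:
    name: str
    output: str


def naive_summary(calls: list[ToolCall]) -> str:
    # Placeholder summarizer (an assumption); a real one would likely call a model.
    names = ", ".join(call.name for call in calls)
    return f"Summary of {len(calls)} earlier tool calls: {names}"


def build_context(
    history: list[ToolCall],
    policy: ContextPolicy,
    summarize: Callable[[list[ToolCall]], str] = naive_summary,
) -> list[str]:
    """Keep the most recent tool calls verbatim and compact everything older."""
    n = policy.keep_recent_tool_calls
    if n > 0:
        older, recent = history[:-n], history[-n:]
    else:
        older, recent = history, []
    context: list[str] = []
    if older and policy.summarize_older:
        context.append(summarize(older))
    context.extend(f"{call.name}: {call.output}" for call in recent)
    return context


history = [ToolCall(f"tool{i}", f"output {i}") for i in range(8)]
context = build_context(history, ContextPolicy())
for line in context:
    print(line)
```

Writing the policy as data rather than burying it in prompt-assembly code makes it easy to vary per agent and to log alongside completion and token metrics.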
What is context engineering?
Context engineering is the practice of deciding, at each step, which information an agent reads, rather than feeding it the whole conversation, repo, or document pile. In practice that means pruning stale tool output, summarizing old history, and serving one canonical answer per query. It treats context as a budget to spend, not a log to append.
Does giving an AI agent more context make it better?
Usually not. A 2026 tool-using benchmark found full conversation history scored 71% versus 91.6% for a pruned-and-summarized context, on about a third of the tokens. A separate ETH Zürich study found adding repository context files to coding agents often makes them worse and more expensive. More context tends to add noise, not signal.
Will a bigger context window fix an agent's context problems?
No. A bigger window is capacity, not selection. It lets you fit more in; it does not decide what belongs there. In the 2026 benchmark the full-history run fit inside the window at 1,480,996 tokens and still lost to a pruned run a third its size. The lever is choosing what to keep, not the size of the container.
How do you build a context policy for a production agent?
Decide per agent what gets kept, what gets summarized, and what gets served fresh from a canonical layer, and at which step boundaries. Measure task completion against token spend instead of assuming a fuller prompt is safer. Treat a growing history as a liability to manage, not an asset to protect.