Past a point, adding tools makes an agent worse. The failure is not calling the tool; it is planning across a big tool ecosystem when tools fail silently. A 2026 benchmark across 1,665 tools watched one model fall from 51.90% to 11.36% as conditions got harder.
tsukumo
Short version: the instinct is to hand the agent more tools. Connect another MCP server, mount the whole catalogue. The instinct is wrong. Past a point, every tool you add is one more thing the agent has to plan around, and the planning is where it breaks. Not the calling. An agent that can call any one of 1,665 tools perfectly can still come apart deciding which to use, in what order, and what to do when one of them dies quietly mid-task. A 2026 benchmark put a number on the collapse. Every tool you mount is weight the planner has to carry on every turn, and most teams are still counting it as horsepower.
Up to a small ceiling, then no. After that they make it worse, and the curve bends down faster than most teams expect.
The reasoning is simple once you say it out loud. Every tool you expose is another branch the planner has to consider at every step. A handful of tools is a tractable decision. Sixteen hundred is a search problem the model has to solve before it does any real work, on every turn, while also reasoning about the actual task. Capability per tool is not the bottleneck. The bottleneck is planning over the set.
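To make the growth concrete, here is a toy count of the ordered tool sequences a planner could in principle consider. It is a back-of-the-envelope model and not something taken from the study: `candidate_plans` is a hypothetical helper, and it assumes every tool is eligible at every step with repeats allowed. Real planners prune heavily, so treat it as an upper bound on the shape of the problem.

```python
def candidate_plans(num_tools: int, steps: int) -> int:
    """Count ordered tool sequences of length `steps`, repeats allowed.

    Toy upper bound: assumes every tool is a candidate at every step.
    """
    return num_tools ** steps


# A three-step plan over a handful of tools versus the full catalogue.
small = candidate_plans(5, 3)
large = candidate_plans(1665, 3)

print(f"5 tools, 3 steps:     {small:,} sequences")
print(f"1,665 tools, 3 steps: {large:,} sequences")
print(f"ratio: {large // small:,}x")
```

Under this model the per-tool capability never changes. Only the size of the set the planner has to search does.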
This is the part that surprises people. The agents in the study were not failing to invoke functions. They failed to plan a coherent path through a large tool ecosystem, and the bigger the ecosystem, the worse it got.
“327 retail tasks, 1,665 tools, 10 leading models. Tests iterative tool retrieval and invocation toward a goal under realistic constraints including tool failures and missing functions. Finds performance deteriorates sharply as environmental unpredictability rises, and that agents are especially fragile when failures carry no explicit error signal or recovery needs a longer alternative path.”
Because the failure mode is planning under unpredictability, and a big tool set is mostly unpredictability you volunteered for.
In a clean world, more tools would be harmless. The agent picks the right one and proceeds. Production is not clean. Tools time out. They return partial data. A function the plan depended on is missing. Each of those is a fork the agent now has to handle, and the more tools in play, the more forks per task. The study is a measurement of exactly this: hold the model fixed, make the environment less predictable, and watch the accuracy fall off a cliff.
51.90% to 11.36%
GPT-5.4 task accuracy, from clean conditions to heavy tool blocking
10 models tested across 1,665 tools on 327 retail tasks; performance falls sharply as the environment grows unpredictable
Source: Liu et al. PlanBench-XL (arXiv:2606.22388)
Read that gap. The same model, the same tools, the same tasks. The only thing that changed was how often a tool refused to cooperate, and nearly four fifths of the performance evaporated (a drop of 40.54 points from a 51.90% baseline). That is not a capability problem you fix with a better model. It is a planning-under-failure problem, and it is the one a big tool ecosystem aggravates most. The general catalogue of why coding agents fail covers the broad shape; this is the specific tool-sprawl slice of it.
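The size of that gap can be checked directly from the two reported accuracies. The snippet below only does the arithmetic on the numbers quoted above; the variable names are ours.

```python
clean_accuracy = 51.90    # GPT-5.4, clean conditions (percent)
blocked_accuracy = 11.36  # GPT-5.4, heavy tool blocking (percent)

# Absolute drop in percentage points, and the share of the clean score lost.
absolute_drop = clean_accuracy - blocked_accuracy
relative_drop = absolute_drop / clean_accuracy

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"relative drop: {relative_drop:.1%}")
```

The relative drop comes out a little under four fifths of the clean-condition score.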
The agent keeps planning against a result that never came. This is the dangerous one.
A loud failure is recoverable. The tool throws, the agent sees the error, it tries another path. The study found agents handle the loud case far better than the quiet one. The quiet case is where a tool returns nothing, or returns a stale success-shaped response, and emits no explicit error signal. The planner has nothing to react to. So it treats the non-result as a result and builds the next three steps on top of it. By the time anything looks wrong, the agent is several moves past the actual fault, reasoning confidently off a hole in the data.
“The number of tools is a liability, not a feature. Agents do not break on the call. They break planning across a big tool set when a tool fails and says nothing.”
— the thesis
We have written about silent failures inside one production pipeline. Same disease at ecosystem scale. One silent tool in a set of five is a bug you find. One in a set of sixteen hundred hides, because the agent had a hundred other plausible paths and you have no idea which one it took.
Fewer than it could. Scoped to the task in front of it, not to everything it might ever do.
The number matters less than the discipline. Mount the whole MCP catalogue and you have handed the planner the worst version of the search problem on every turn. The better shape is to expose the smallest set the current goal needs, and retrieve tools on demand instead of standing them all up at once. Same principle we apply to context: one current answer beats the whole library. This is the MCP and context discipline for coding agents pointed at tools rather than docs.
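A rough illustration of scoping, assuming each tool carries a few descriptive tags and the current goal can be expressed the same way. `Tool`, `scope_tools`, and the catalogue entries are all hypothetical; a real system would more likely rank by embedding similarity or explicit task-to-tool mappings than by tag overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    tags: frozenset[str]


def scope_tools(catalogue: list[Tool], goal_tags: set[str], limit: int = 5) -> list[Tool]:
    """Return at most `limit` tools whose tags overlap the goal, best match first."""
    scored = [(len(tool.tags & goal_tags), tool) for tool in catalogue]
    relevant = [(score, tool) for score, tool in scored if score > 0]
    # Highest overlap first; break ties by name so the menu is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [tool for _, tool in relevant[:limit]]


CATALOGUE = [
    Tool("refund_order", frozenset({"orders", "refunds"})),
    Tool("lookup_order", frozenset({"orders"})),
    Tool("send_newsletter", frozenset({"marketing"})),
    Tool("update_inventory", frozenset({"inventory"})),
]

# The planner sees only the short menu for this goal, not the whole catalogue.
menu = scope_tools(CATALOGUE, {"orders", "refunds"}, limit=2)
print([tool.name for tool in menu])
```

The `limit` parameter is the point of the exercise: the menu has a hard ceiling regardless of how large the catalogue grows.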
Two ways to give an agent its tools

| Property | Huge ecosystem, silent failures | Scoped set, explicit errors |
| --- | --- | --- |
| Tools in scope per task | Everything mounted | Smallest set the goal needs |
| When a tool fails | Returns nothing, agent plans on | Throws an explicit error signal |
| Recovery path | Long, often never found | Short, the failure is visible |
| What you can review | A confident wrong answer | Which tool failed and why |
Same move trovex is built around: resolve to the one current answer instead of re-searching the whole set. For tools, that means the agent reasons over a short relevant menu, not the entire pantry.
Scope the tool set, so the planner is reasoning over a tractable menu instead of a catalogue. Demand explicit error signals, so the silent-failure case from above becomes a loud one the agent can actually recover from. And put observability on the tool layer, so when a path does break you can see which tool failed and why, instead of inferring it from a wrong answer days later. That third move answers the study's worst finding directly. The "no explicit error signal" failure is exactly what observability surfaces, so we built yoru for the tool boundary: a dead tool leaves a trail. Across many agents the same logic moves up to the fleet, where wrai.th orchestrates them and which-tool-failed-in-which-agent has to stay answerable.
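As a generic illustration of the third move, the sketch below records every tool call with the agent that made it, whether it succeeded, and why it failed. It is not how yoru or wrai.th work internally; `observed`, `failures`, and the in-memory `TOOL_LOG` are hypothetical stand-ins for whatever tracing backend you already run.

```python
import time
from typing import Any, Callable

# In-memory stand-in for a real tracing or logging backend.
TOOL_LOG: list[dict[str, Any]] = []


def observed(agent: str, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `tool` so each call leaves a record, including failed ones."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        started = time.monotonic()
        record: dict[str, Any] = {"agent": agent, "tool": name, "ok": False, "reason": None}
        try:
            result = tool(*args, **kwargs)
            record["ok"] = True
            return result
        except Exception as exc:
            record["reason"] = f"{type(exc).__name__}: {exc}"
            raise  # keep the failure loud for the planner as well
        finally:
            record["seconds"] = time.monotonic() - started
            TOOL_LOG.append(record)

    return wrapped


def failures(log: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Answer 'which tool failed, in which agent, and why'."""
    return [(r["agent"], r["tool"], r["reason"]) for r in log if not r["ok"]]


def check_stock(sku: str) -> int:
    """Hypothetical tool that fails with an explicit error."""
    raise RuntimeError("upstream returned no data")


watched = observed("agent-1", "check_stock", check_stock)
try:
    watched("SKU-42")
except RuntimeError:
    pass

print(failures(TOOL_LOG))
```

Keying each record by both agent and tool is what keeps the question answerable once several agents share the same tool layer.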
This is the tool-ecosystem axis of a thesis we keep returning to: agent reliability is not pass@1. A demo with five tools and a clean network is a pass@1. Production is 1,665 tools and a flaky afternoon, and that is the run that decides whether you ship.
Most of the tool-sprawl work we see starts with a team proud of how many integrations their agent has. The same agent behaves strangely under load and nobody can say why, because the failures are quiet and the search space is enormous. The fix is almost always subtractive. Cut the mounted tools to what the task needs, make every tool fail loud, and watch the tool layer so a silent death shows up the moment it happens instead of weeks later in a support ticket.
One honest boundary. This is one benchmark, on retail tasks, and the decimals will move as others run it. The 51.90% and the 11.36% are GPT-5.4 on this suite, not a law of nature. The operating lesson does not depend on the decimals. Scope the tool set. Demand explicit errors. Observe which tool failed and why. Those hold whether the collapse is four fifths or one third, because the mechanism, planning over a big set under silent failure, is the same either way.
If your agent is connected to everything and you cannot say which tool it leaned on when it went wrong, that is the work we do.
Part of our proof index, the studies and benchmarks behind what we claim.
We scope the tool set, force explicit error signals, and make which-tool-failed visible at the tool boundary.
Do more tools make an agent better?
No. More tools widen the planning space the agent has to reason over, and past a point that hurts. PlanBench-XL (Liu et al., 2026) tested 10 models across 1,665 tools on 327 retail tasks and found performance falls sharply as the environment gets harder to predict. GPT-5.4 dropped from 51.90% in clean conditions to 11.36% under heavy tool blocking. The tool count is a liability, not a feature.
Why do agents fail with many tools?
Not because they cannot call the tool. They fail at planning a path through a large tool set when tools break mid-task. PlanBench-XL (2026) found agents are especially fragile when a failure carries no explicit error signal, or when recovery needs a longer alternative tool-use path. The agent cannot recover from a failure it cannot see, so it keeps planning against a tool that is already dead.
How many tools should an agent have?
Fewer than you think, and scoped to the task in front of it. The exact number is less important than the discipline: expose the smallest tool set the current goal needs, retrieve tools on demand rather than mounting all of them, and prefer one current answer over the whole catalogue. PlanBench-XL's 1,665-tool runs degrade hard, which is the case against mounting everything an agent might ever use.
What happens when a tool fails silently?
The agent keeps going as if it succeeded. A silent failure returns no error signal, so the planner has nothing to react to and builds the next steps on a result that never came. PlanBench-XL (2026) names this one of the conditions agents handle worst. The fix is to force explicit error signals and to log which tool failed and why, so the agent and you can both see it.