How many tools should an AI agent have?

Fewer than you can mount. Past a small working set, adding tools lowers task success, because the agent has to pick the right one, fill its arguments, and sequence the call correctly, and every extra option is a new way to get that wrong. PlanBench-XL (arXiv 2606.22388) ran 10 models across 1,665 tools and watched accuracy fall from 51.9% to 11.4% as the tool space grew. Give the agent the few tools the task needs.

Updated 19 June 2026

Why a big toolbox backfires

Every tool you mount widens the space the agent has to plan over. It reads more options and gets more chances to call the wrong one or mis-fill an argument. The work the agent does to choose grows faster than the value of having one more tool available, so past a small set the extra reach costs you accuracy.
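To get a feel for how fast that space widens, here is a rough counting sketch. It assumes a deliberately simple model that is not taken from PlanBench-XL: a plan is an ordered sequence of tool calls, repeats are allowed, and arguments are ignored. The function name candidate_plans is made up for this illustration.

```python
def candidate_plans(num_tools: int, num_steps: int) -> int:
    """Count ordered tool sequences of a given length, repeats allowed.

    Simplified model (an assumption for illustration): each step can pick
    any mounted tool, and argument choices are not counted at all.
    """
    return num_tools ** num_steps


if __name__ == "__main__":
    # Same 3-step task, three different toolset sizes.
    for n in (5, 20, 100):
        print(f"{n} tools -> {candidate_plans(n, 3)} possible 3-step plans")
```

With 5 tools there are 125 three-step sequences; with 100 there are 1,000,000. Real agents do not enumerate plans like this, but the sketch shows why each extra tool costs more than one extra option.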

What PlanBench-XL found

PlanBench-XL (Liu et al., 2026; arXiv 2606.22388) tested 10 models across 1,665 tools on 327 retail tasks. Accuracy fell from 51.9% to 11.4% as the tool space grew and tools failed without a clear error. Agents are worst in two cases: when a failure carries no explicit signal, so the agent builds its next step on a result that never came, and when recovery needs a longer alternative path.

Give it the tools the task needs

Scope the mounted toolset to the task instead of exposing the whole catalogue, force tools to return explicit errors, and log which tool failed and why so the agent can react. It is the same discipline that fixes context bloat: serve the right slice, not everything. trovex does that for your docs (one current answer, ~60% fewer tokens per lookup); the tool equivalent is a scoped per-task toolset. tsukumo builds both with your team.
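The three steps above (scope per task, explicit errors, logged failures) can be sketched in a few lines. This is a minimal illustration, not a real framework's API: the tool names, the TASK_TOOLS mapping, and the mount and call helpers are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict

logger = logging.getLogger("toolset")


class ToolError(Exception):
    """Explicit failure signal: names the tool and the reason."""

    def __init__(self, tool: str, reason: str):
        super().__init__(f"{tool} failed: {reason}")
        self.tool = tool
        self.reason = reason


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., Any]


# Hypothetical retail tools, standing in for a much larger catalogue.
def lookup_order(order_id: str) -> dict:
    if not order_id:
        raise ValueError("order_id is empty")
    return {"order_id": order_id, "status": "shipped"}


def issue_refund(order_id: str, amount: float) -> dict:
    return {"order_id": order_id, "refunded": amount}


CATALOGUE = {
    t.name: t
    for t in (Tool("lookup_order", lookup_order), Tool("issue_refund", issue_refund))
}

# Hypothetical mapping from task type to the few tools that task needs.
TASK_TOOLS = {
    "order_status": ["lookup_order"],
    "refund": ["lookup_order", "issue_refund"],
}


def mount(task: str) -> Dict[str, Tool]:
    """Return only the tools scoped to this task, not the whole catalogue."""
    return {name: CATALOGUE[name] for name in TASK_TOOLS[task]}


def call(toolset: Dict[str, Tool], name: str, **kwargs: Any) -> Any:
    """Run a mounted tool; log and raise ToolError on any failure."""
    if name not in toolset:
        logger.warning("tool %s is not mounted for this task", name)
        raise ToolError(name, "not mounted for this task")
    try:
        return toolset[name].run(**kwargs)
    except Exception as exc:
        # Log which tool failed and why, then surface it to the agent.
        logger.warning("tool %s failed: %s", name, exc)
        raise ToolError(name, str(exc)) from exc
```

An agent handling an "order_status" task would see only lookup_order. A call to issue_refund, or a lookup with an empty order_id, raises a ToolError that names the tool and the reason instead of returning nothing.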

Straight answers.

Do more tools make an AI agent better? No, past a small working set. Each tool you mount widens the space the agent has to plan over, so it picks wrong or stalls more often. PlanBench-XL saw accuracy fall from 51.9% to 11.4% across 1,665 tools. A few well-chosen tools beat a big catalogue.
How many tools is too many? There is no hard number: the limit depends on the size of the planning space and on whether failures are visible. Mount the few a task needs. Treat wrong-tool picks and stalls as the signal that you have over-mounted, and remember that the benchmark's 1,665-tool runs degraded hard.
What's the fix for tool sprawl? Scope the toolset per task, make tools return explicit errors, and log which tool failed and why. Same principle as context: give the agent only the tools the task needs.
