Scaling an agent fleet breaks on discovery, not difficulty
A 2026 enterprise benchmark of fleets up to 200 agents found that agent count, not task complexity, dominates performance, and that the bottleneck at scale is agents finding the right peer and context. Here's what breaks at each tier and the orchestration move for each.
tsukumo
Short version: you stand up a few agents, they work, the payoff is obvious, so you add more and expect the line to keep going up. It doesn't. The wall you hit has almost nothing to do with how hard the tasks are. A 2026 enterprise benchmark ran agent fleets up to 200 strong and found that the dominant factor in performance was agent count, not task complexity. The bottleneck was agents spending their effort finding the right peer and the right context to act on. That is an orchestration problem, and no bigger model fixes it. This piece is the playbook for what breaks at each size and what to build before it does.
arXiv:2606.20058 (Dhanyamraju, Raghav, Lee, 2026) makes a point most agent evaluations skip. Existing multi-agent systems assume discrete request-response workflows and almost never get tested at enterprise scale under continuous event monitoring. So the authors built that test: 208 production-derived enterprise scenarios, run across three sizes, with two orchestration approaches under the hood, DAG Plan-and-Execute and ReAct.
The three scales are the spine of everything below:
Persona: under 10 agents.
Department: 20 to 80 agents.
Enterprise: 200 agents.
Both orchestration approaches held up at small scale. Both degraded at enterprise scale. The variable that moved the result was how many agents were in the room, not how gnarly any single task was. And the specific thing that broke was discovery: the overhead of an agent locating the right peer and the right context climbed until it became the primary bottleneck.
This sits inside a wider research picture; if you want the map of how the multi-agent studies fit together, see the full research picture. What follows is the operational read: you're at N agents heading to 3N, and you want to know what gives way and what to build for it.
The instinct, when a fleet stalls, is to assume the tasks got too hard and reach for a stronger model. The benchmark says that is mostly a misread. Task difficulty was not what separated the runs. Fleet size was.
That lines up with how coordination failures actually behave. The MAST taxonomy of why multi-agent systems fail catalogs failures that live in specification and coordination, not raw capability. Scale doesn't introduce a new failure mode so much as it multiplies those. With eight agents, a missed handoff is a hiccup. With two hundred, the same coordination gap is happening constantly, and the noise of everyone trying to find everyone else swamps the useful work.
What actually drives the wall

| Lever | Task complexity | Fleet scale |
| --- | --- | --- |
| Was it the dominant factor in the benchmark? | no | yes |
| What it stresses | a single agent's reasoning | coordination between agents |
| The fix it points to | a bigger or better model | the orchestration layer |
| Gets worse as you add agents | roughly flat | compounds |
So the planning question isn't "are my tasks too hard for the agents." It's "how much of my fleet's effort is going into finding each other instead of doing the work." That ratio is what degrades, and it degrades on a schedule you can predict by tier.
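One way to put a number on that ratio is to tag each logged agent action as coordination or work and compare the time spent on each. The sketch below is a minimal illustration under stated assumptions: the event kinds (`peer_lookup`, `context_search`, `rebroadcast`, `task_step`) and the `Event` record are hypothetical labels, not terms from the benchmark, and your own logs will need their own mapping.

```python
from dataclasses import dataclass

# Hypothetical event kinds; these labels are assumptions, not benchmark terms.
COORDINATION = {"peer_lookup", "context_search", "rebroadcast"}
WORK = {"task_step"}


@dataclass(frozen=True)
class Event:
    agent: str
    kind: str
    seconds: float


def coordination_ratio(events):
    """Share of logged time spent finding peers or context instead of working."""
    coord = sum(e.seconds for e in events if e.kind in COORDINATION)
    work = sum(e.seconds for e in events if e.kind in WORK)
    total = coord + work
    return coord / total if total else 0.0


events = [
    Event("a1", "task_step", 6.0),
    Event("a1", "peer_lookup", 1.0),
    Event("a2", "context_search", 3.0),
    Event("a2", "task_step", 10.0),
]
ratio = coordination_ratio(events)
print(f"{ratio:.0%} of logged effort went to coordination")
```

Tracked per week as the fleet grows, a rising ratio is the early signal that the next tier's problem has already started.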
The by-tier playbook: what breaks and what to build
This is the part the research picture doesn't give you. The benchmark's three scales map cleanly onto the decision a growing team actually faces. Here is what each tier feels like and the orchestration move it demands.
Persona tier (under 10 agents): build the habits while they're cheap
At this size everything works, which is the trap. Discovery is trivial because there's barely anyone to discover. A direct address or a broadcast costs nothing. You can run on prompt-passing and a shared scratchpad and feel fine.
The failure here is one of timing, not performance: teams build no orchestration discipline because they don't need it yet, then carry that absence into the Department tier where it bites. The move is to decide now whether you should even be scaling, and to lay the primitives early. Our note on when to scale your agent setup is the gate to clear before you grow, because a second agent you don't need is pure coordination cost with no payoff.
Department tier (20-80 agents): discovery noise starts billing you
This is where the bill arrives. The benchmark's degradation isn't a switch that flips at 200; the Department band is where coordination overhead stops being free. Broadcast patterns that were harmless at eight agents now mean every message lands in dozens of inboxes. Agents start acting on stale or wrong context because the right context was harder to locate than the nearest plausible one.
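The broadcast cost is simple arithmetic, and a rough sketch makes the growth visible. The numbers below are illustrative assumptions, not benchmark measurements: each agent is assumed to send 10 messages, a broadcast reaches every other agent, and a scoped message reaches one named peer.

```python
def broadcast_deliveries(agents, messages):
    """Every message lands in every other agent's inbox."""
    return messages * (agents - 1)


def scoped_deliveries(messages, recipients=1):
    """Every message goes only to the peers it was addressed to."""
    return messages * recipients


MESSAGES_PER_AGENT = 10  # assumption for illustration only

for tier, agents in [("Persona", 8), ("Department", 50), ("Enterprise", 200)]:
    messages = agents * MESSAGES_PER_AGENT
    print(
        f"{tier:<10} {agents:>3} agents: "
        f"broadcast={broadcast_deliveries(agents, messages):>7,} "
        f"scoped={scoped_deliveries(messages):>5,}"
    )
```

Under these assumptions, broadcast deliveries go from 560 at 8 agents to 24,500 at 50 and 398,000 at 200, while scoped deliveries stay at 80, 500, and 2,000. Broadcast grows with the square of fleet size; scoped delivery grows linearly.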
This is the tier to build the real orchestration layer, before the fleet stalls, not after. The mechanics that matter are the ones in orchestrating AI coding agent fleets: priority so the important work jumps the queue, scoped handoffs so an agent passes to the right peer instead of shouting into the room, and shared memory so context is fetched, not re-derived. The context half is the one place we have a first-party number: trovex serves the currently-correct slice for a task and cuts roughly 60% of the tokens per lookup, so an agent works from the real context instead of re-reading everything to find it. That is discovery noise paid down directly.
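Scoped handoffs and shared memory both come down to replacing a search with a lookup. The following is a minimal sketch of that idea, not trovex or any specific product: `Registry`, `ContextStore`, and the capability and task names are hypothetical, chosen only to show the shape.

```python
from collections import defaultdict


class Registry:
    """Maps a capability to the agents that own it, so a handoff is a lookup."""

    def __init__(self):
        self._owners = defaultdict(list)

    def register(self, agent, capability):
        self._owners[capability].append(agent)

    def route(self, capability):
        owners = self._owners.get(capability)
        if not owners:
            raise LookupError(f"no agent registered for {capability!r}")
        return owners[0]  # simplest policy: first registered owner


class ContextStore:
    """Shared memory: context is written once and fetched by task id."""

    def __init__(self):
        self._by_task = {}

    def put(self, task_id, context):
        self._by_task[task_id] = context

    def get(self, task_id):
        return self._by_task[task_id]


registry = Registry()
registry.register("billing-agent", "refund")
registry.register("search-agent", "lookup")

store = ContextStore()
store.put("T-1", {"customer": "acme", "order": 42})

peer = registry.route("refund")   # one lookup, no broadcast
context = store.get("T-1")        # fetched, not re-derived
print(peer, context)
```

The design choice worth noting is that a missing capability raises an error instead of falling back to a broadcast, so gaps in the registry surface as failures you can see.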
Enterprise tier (200 agents): orchestration is the system
At 200 agents the benchmark found discovery noise becomes the primary bottleneck outright. This is not a tuning problem anymore. The coordination layer is the system, and the agents are almost incidental to whether it holds.
The authors tested a Task Manager component for continuous operation: priority inference, related-event merging, and preemption. At enterprise scale it cut high-priority queue latency by 14-75% and improved related-event correctness by over 20 percentage points. To be precise about whose result that is: the Task Manager is the authors' component, not ours. We cite those numbers as independent evidence for one principle, that orchestration, not the model, is what recovers a fleet at scale. The thing that was failing was coordination, and a coordination layer fixed it, with the model held constant.
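To make the three mechanisms concrete, here is a toy sketch of priority inference, related-event merging, and preemption. It is not the authors' Task Manager and makes no claim about how theirs works: the `severity` rule, the use of an `entity` field to decide which events are related, and every name in the code are assumptions for illustration.

```python
import heapq
import itertools


def infer_priority(event):
    """Toy rule standing in for priority inference (lower = more urgent)."""
    return 0 if event.get("severity") == "critical" else 1


class TaskManager:
    def __init__(self):
        self._heap = []               # entries: (priority, sequence, key)
        self._tasks = {}              # key -> queued task
        self._seq = itertools.count()
        self.running = None

    def _push(self, task):
        heapq.heappush(self._heap, (task["priority"], next(self._seq), task["key"]))

    def submit(self, event):
        """Queue an event; return True if it preempted the running task."""
        key = event["entity"]         # assumption: same entity means related
        priority = infer_priority(event)
        if self.running is not None and self.running["key"] == key:
            self.running["events"].append(event)   # merge into the running task
            return False
        task = self._tasks.get(key)
        if task is None:
            task = {"key": key, "priority": priority, "events": []}
            self._tasks[key] = task
            self._push(task)
        elif priority < task["priority"]:
            task["priority"] = priority            # merged event raises urgency
            self._push(task)
        task["events"].append(event)               # related-event merging
        if self.running is not None and priority < self.running["priority"]:
            paused, self.running = self.running, None
            self._tasks[paused["key"]] = paused    # preempted task goes back in line
            self._push(paused)
            return True
        return False

    def next_task(self):
        while self._heap:
            priority, _, key = heapq.heappop(self._heap)
            task = self._tasks.get(key)
            if task is None or task["priority"] != priority:
                continue                           # stale heap entry, skip it
            del self._tasks[key]
            self.running = task
            return task
        self.running = None
        return None


tm = TaskManager()
tm.submit({"entity": "db-1", "severity": "warning"})
tm.submit({"entity": "db-1", "severity": "warning"})   # merged, not re-queued
first = tm.next_task()
preempted = tm.submit({"entity": "api-7", "severity": "critical"})
second = tm.next_task()
third = tm.next_task()
print(first["key"], preempted, second["key"], third["key"])
```

In this run the two `db-1` warnings become one task, the critical `api-7` event interrupts it, and `db-1` resumes afterward with both of its events intact.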
We run our own growth fleet on a relay we call wrai.th: agents claim tasks, hand off to a named peer, and message over a shared log instead of broadcasting. That shape is deliberate, and it's the same shape the benchmark's result argues for. A shared log means context is there to be read rather than re-derived, which is the discovery cost paid down. Claiming a task means two agents don't both grab it, which is the kind of concurrent-agent collision that gets noisier with every agent you add. Named handoffs mean an agent routes to the right peer instead of paying the find-everyone tax the benchmark measured.
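The claim, handoff, and shared-log shape can be sketched in a few lines. This is an illustrative in-memory model, not the wrai.th implementation: the `Relay` class, its method names, and the agent and task identifiers are all hypothetical.

```python
import threading


class Relay:
    """In-memory sketch: exclusive claims, named handoffs, append-only log."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}   # task_id -> agent currently holding the task
        self.log = []      # shared log every agent can read

    def claim(self, task_id, agent):
        """Return True only for the first agent to claim a task."""
        with self._lock:
            if task_id in self._owner:
                return False
            self._owner[task_id] = agent
            self.log.append(("claim", task_id, agent))
            return True

    def hand_off(self, task_id, from_agent, to_agent, note):
        """Pass a task to one named peer, recording why."""
        with self._lock:
            if self._owner.get(task_id) != from_agent:
                raise PermissionError(f"{from_agent} does not hold {task_id}")
            self._owner[task_id] = to_agent
            self.log.append(("handoff", task_id, from_agent, to_agent, note))

    def history(self, task_id):
        """Context for a task is read from the log, not re-derived."""
        with self._lock:
            return [entry for entry in self.log if entry[1] == task_id]


relay = Relay()
writer_got_it = relay.claim("T-1", "writer")
editor_got_it = relay.claim("T-1", "editor")   # second claim is refused
relay.hand_off("T-1", "writer", "editor", "draft ready for review")
history = relay.history("T-1")
print(writer_got_it, editor_got_it, len(history))
```

The lock makes check-and-claim a single step, which is what prevents two agents from both taking the same task. A real relay would need the same guarantee from its storage layer rather than from an in-process lock.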
None of this is a single tool you install. It's a layer, and it sits inside the broader operating map: context, orchestration, observability, and the operating model around the fleet. If you want that whole picture, running agent fleets in production lays out all four. The benchmark's contribution is to tell you which of those four is load-bearing as you scale: it's orchestration, and it's load-bearing earlier than the demo suggests.
We've hit this wall ourselves. The fleet that's smooth at six agents and starts thrashing at thirty is not running into harder tasks; it's running into itself. Every time, the fix was the coordination layer, not the model. Build priority before you need it, scope your handoffs so agents talk to the right peer instead of the whole room, and put context in a shared log so finding it isn't the job.
If you're at N agents and planning for 3N, the question worth answering before you scale is how much of your fleet's effort already goes into finding each other. That ratio is what breaks. We can help you read it.
We map where your fleet spends effort coordinating instead of working, and what to build before the next tier.
Does adding more AI agents make a system perform worse?
It can, and scale matters more than task difficulty. A 2026 enterprise benchmark (arXiv:2606.20058) tested two orchestration approaches across 208 scenarios at up to 200 agents and found scale, not complexity, is the dominant performance factor. Both architectures held at small scale and degraded at enterprise scale. The cause was discovery noise: the overhead of each agent finding the right peer and context.
What is the main bottleneck in large multi-agent systems?
Agent-discovery noise, the overhead of agents finding the right peer and the right context to act on. The 2026 enterprise benchmark in arXiv:2606.20058 found this becomes the primary bottleneck at the 200-agent scale, while smaller fleets held up. It is a coordination problem, not a reasoning one, which is why a larger model does not fix it and a better orchestration layer does.
How do I fix multi-agent orchestration at scale?
Build the orchestration layer rather than swapping the model. The same benchmark showed a task-manager component (priority inference, related-event merging, preemption) cut high-priority queue latency by 14-75% and improved related-event correctness by over 20 percentage points at enterprise scale. The transferable moves are priority, scoped handoffs so agents do not broadcast to everyone, and a shared log for context.
How many agents can a fleet run before it needs orchestration?
Earlier than most teams expect. The 2026 benchmark grouped scales as Persona (under 10 agents), Department (20-80), and Enterprise (200). Both orchestration approaches held under 10 and degraded toward 200. Coordination overhead starts mattering in the Department tier, so the orchestration layer should be built before you cross 20 agents, not retrofitted once it stalls.