If you've built an LLM agent that does anything beyond a narrow task, you've probably watched this happen. The first version is sharp. You add a few more tools. Performance improves. You add a few more skills. Still better. You give it the full toolkit it should need to handle the real workflow, and somewhere in there, quietly, the agent gets worse. Wrong tool calls. Forgotten constraints. Loops where it corrects its own mistakes by making new ones.
We've been watching this pattern across deployments at Strongly.AI, and it has a name in the recent literature, though no one calls it the same thing twice. We've been calling it agent saturation: the point where adding more capability to a single agent stops paying off and starts compounding into failure.
This post lays out what the research is now showing, the approaches we already ship at Strongly to push the saturation point further out, and the next step we're building (Fission) for cases where pushing isn't enough.
The pattern is real, and the research is starting to nail it down
Three converging threads in the recent literature describe what we're seeing in production.
The most direct evidence comes from Xiaoxiao Li at UBC, whose January 2026 paper tested skill libraries of varying sizes against LLM agents [1]. The finding: selection accuracy doesn't degrade gracefully as you add skills. It follows a phase transition. Accuracy stays high until you cross a critical library size, then drops sharply. And the trigger isn't raw skill count. It's semantic confusability between skills. Two tools that do similar things hurt the agent more than ten unrelated ones do. The paper frames this as a bounded capacity analogous to human decision-making, and shows that hierarchical organization of skills mitigates the collapse.
A separate study from UC Santa Barbara and MIT, published in April 2026 [2], pushed this further by testing skill use under realistic conditions where the agent has to retrieve from a library of 34,000 real-world skills rather than being handed the right one. The benefit of skills, they found, is fragile. Pass rates degraded steadily as conditions became more realistic, approaching the no-skill baseline in the hardest settings.
In parallel, the MCP and tool-use community has been quantifying tool overload. Engineering teams now routinely report that large MCP tool sets can consume 20 to 30 percent of the context window in metadata alone, before the user has even sent a message [3]. Some clients have started capping the number of tools that can be exposed at once for exactly this reason.
Underneath all of this is the phenomenon Liu et al. documented in their TACL 2024 paper Lost in the Middle [4]: LLMs follow a U-shaped attention curve, performing strongly when relevant information sits at the start or end of the context and poorly when it sits in the middle. Practitioners have since taken to calling the degradation that results from this context rot. It's an architectural property of attention, not something bigger context windows fix. The million-token model isn't reading every token with equal care.
What makes these failure modes especially nasty is that they're self-reinforcing. A saturated agent makes more mistakes. Mistakes trigger corrections. Corrections add more tokens, more tool calls, more noise. By the time a human notices the agent is off the rails, the context is already polluted past recovery.
“A saturated agent makes more mistakes. Mistakes trigger corrections. Corrections add more tokens, more tool calls, more noise.
What works today, and what we already ship
There are three approaches the field has converged on for delaying saturation, and we've built each of them into Strongly because they each pay real dividends before any agent hits its ceiling.
Hierarchical Skill Routing
Group capabilities by domain and intent. Select coarse-to-fine, so the agent narrows the candidate set before it ever sees the full skill list.
ShippedSub-Agent Orchestration
Designer-defined specialized agents with shared state and structured handoffs. Bounded contexts at the cost of inter-agent coordination overhead.
ShippedContext Compaction
Distill older turns into structured summaries, offload bulky tool outputs to retrievable references. Keep the live context lean as sessions run long.
ShippedHierarchical skill routing
Flat selection across hundreds of skills is where the phase transition bites hardest. The fix, supported by the cognitive-science analogy in [1] and commercial systems like Writer's MCP Gateway [5], is to organize skills into a hierarchy and select coarse-to-fine. Strongly's skill router groups capabilities by domain and intent, so the agent narrows the candidate set before it ever sees the full list. In our internal benchmarks this alone pushes the saturation threshold meaningfully higher.
Sub-agent orchestration
Splitting work into specialized agents with bounded contexts is the most common production fix, and frameworks like Anthropic's, Google ADK [6], and others all support it natively. We support designer-defined sub-agents in Strongly with shared state and structured handoffs. The catch, which the research is increasingly explicit about, is cost. Anthropic has reported multi-agent systems use roughly 15 times the tokens of single-agent chat [7], and the MAST failure taxonomy by Cemri et al. at UC Berkeley found that inter-agent misalignment accounts for roughly a third of observed failures across seven open-source multi-agent frameworks [8]. Splitting works, but indiscriminate splitting trades one problem for a more expensive one.
Context compaction and memory management
As sessions run long, the context fills with intermediate output, tool results, and dead branches. Compaction, summarization, and external memory all push the saturation point further out by keeping the live context lean. Strongly's memory layer handles this automatically, distilling older turns into structured summaries and offloading bulky tool outputs to retrievable references rather than keeping them inline. Recent work on context engineering, including context folding and pointer-based outputs, points in the same direction [9].
These three together get you a long way. Most teams don't need anything more sophisticated until they're operating at real scale, and we'd rather a customer ship with a well-routed, well-orchestrated, well-compacted single agent than over-engineer from day one.
These three together get you a long way. Most teams don't need anything more sophisticated until they're operating at real scale, and we'd rather a customer ship with a well-routed, well-orchestrated, well-compacted single agent than over-engineer from day one.
Where these approaches run out
Each of the three has a ceiling that comes from the same root: they're static or reactive.
Hierarchical routing assumes the hierarchy holds. When a task cuts across your categories, the agent still has to reason across them, and the saturation problem reappears at the routing layer.
Sub-agent orchestration assumes you know the right split at design time. Real workloads don't sit still. The right decomposition for last quarter's task mix isn't necessarily the right one for this quarter's.
Compaction buys time but doesn't address the underlying skill or tool overload that caused saturation in the first place. You can summarize the history all you want; if the live agent is still trying to choose between 80 semantically similar skills, the wall is still there.
What none of the current approaches do is ask the question we think actually matters: can the agent itself recognize, in flight, that it's about to saturate, and restructure before it does?
“Can the agent itself recognize, in flight, that it's about to saturate, and restructure before it does?
Fission: detect, split, coordinate
Fission is the next layer of Strongly. It treats agent saturation as a control problem rather than a design problem.
Detect
Fission continuously scores an agent's saturation risk from a composite of signals: skill-library size and pairwise semantic similarity, tool-selection entropy, repeated-correction patterns, context occupancy, and a complexity estimate of the current task trajectory. The goal is to catch saturation before the correction-loop spiral starts, not after. The signal design draws on the diagnostic recipes emerging in the context-rot and skill-routing literature, extended with skill-graph metrics specific to our deployment data.
Split
When risk crosses a threshold and projected post-split performance, net of coordination overhead, beats the current trajectory, Fission generates a split plan. It identifies natural cleavage lines in the skill and tool graph, partitions responsibilities along those lines, and provisions each resulting agent with the minimum context it needs. The split is computed at runtime against the actual task, not against a designer's guess from months ago.
Coordinate
The handoff is where most multi-agent systems fail, which is why inter-agent misalignment dominates the failure taxonomy [8]. Fission generates explicit coordination contracts: shared state schemas, message protocols, and failure-recovery paths. The resulting agents behave as a system rather than as two competing soloists.
The bet underneath all of this is that saturation isn't a fixed property of an agent's design. It's a trajectory property that emerges from the interaction between an agent's capabilities and the task in front of it. Detection has to be runtime. Restructuring has to be automatic. Anything less just pushes the wall a little further out, which is exactly what the existing three approaches already do well.
The bet underneath all of this is that saturation isn't a fixed property of an agent's design. It's a trajectory property that emerges from the interaction between an agent's capabilities and the task in front of it.
Where this is going
The next year of agent engineering will look less like "build a smarter agent" and more like "build agents that know how to reorganize themselves." The single-agent-with-skills paradigm has a ceiling, and the literature is increasingly clear about where it sits. The multi-agent-from-day-one paradigm has its own ceiling, coordination overhead, and most teams aren't ready to pay it.
There's a middle path: start with a single agent, route and compact aggressively, split on demand when the trajectory demands it, and coordinate cleanly when you do. That's the path Strongly is built for, and that's what Fission completes.
If you're hitting the wall on your own agent, or you're not sure whether you are but performance feels like it's plateauing for reasons you can't pin down, we'd love to talk. Saturation is easier to measure than most teams realize, and once you can measure it, you can do something about it.