Designing multi-agent systems that don't fall apart
Subagents, handoffs, and shared memory. A practical guide to orchestration patterns that stay reliable as complexity grows.
A single agent with good tools and a clear prompt will take you surprisingly far. The trouble starts the day you split that agent into several. Suddenly a task that worked on Tuesday hangs forever on Wednesday, and nobody can say which of the six agents in the chain is to blame. The model didn't get worse — your system got harder to reason about, and you didn't change anything to keep up.
Multi-agent systems fail in a specific way. Not with a crash, but with a slow erosion of trust: outputs that are subtly wrong, runs that cost ten times what you expected, behavior that nobody can reproduce. We've built and rescued enough of these to have strong opinions about what holds up. Almost none of it is about picking a cleverer framework. It's about treating the seams between agents as the actual product.
Key takeaways
- Add agents only when a single agent is provably struggling — every additional agent multiplies handoff surface area and debugging complexity.
- Every handoff between agents must be a typed, validated contract, not free text; malformed handoffs that travel unchecked are the leading cause of "confidently wrong" outputs.
- Shared memory in a multi-agent system is a concurrent database; scope it, assign ownership per field, and stamp every write with provenance.
- The supervisor/worker pattern is almost always the right starting topology: one explicit decision-maker, debuggable by design.
- Anthropic's internal research system — using an orchestrator with parallel subagents, each in an isolated context window — outperformed a single Claude Opus 4 agent by 90.2% on their internal evaluation, while consuming roughly 15× more tokens than a standard chat interaction. Token efficiency is a first-class concern.
- Every production multi-agent run needs a single trace ID that threads through every agent, handoff, memory write, and tool call — without it, debugging is archaeology.
First, earn the second agent
The most reliable multi-agent system is the one you didn't build. Every agent you add multiplies the number of handoffs, the surface area for context to get lost, and the places a run can stall. So the first discipline is restraint: don't decompose a task into subagents until a single agent is provably struggling with it.
Anthropic has been direct about this cost: teams have "invested months building elaborate multi-agent architectures only to discover that improved prompting on a single agent achieved equivalent results," and multi-agent implementations "typically use 3–10× more tokens than single-agent approaches for equivalent tasks" (Anthropic, 2025). The overhead is real and compounds quickly.
There are real reasons to split. A subtask needs a tool the main agent shouldn't have. A stage genuinely runs in parallel and you want the wall-clock win. One step needs a different model, a tighter prompt, or its own evaluation. Those are good reasons. "It feels more organized" is not: splitting work across agents to make a diagram look tidy is how you turn one prompt you can debug into five you can't. The evidence backs the distinction. A 2025 study from Google Research, Google DeepMind, and MIT, "Towards a Science of Scaling Agent Systems" (Kim et al., 2025), ran 260 configurations across six benchmarks and found that performance ranged from +80.8% on decomposable financial reasoning tasks to −70.0% on sequential planning tasks depending on architecture choice. Parallelizable, independent work benefits from multi-agent coordination; sequential tasks with shared state often suffer from it.
Every agent you add is a new place for the system to lie to you about why it failed.
Handoffs are contracts, not vibes
When one agent hands work to another, what actually crosses the boundary? In the systems that fall apart, the answer is "a blob of free text and a hope." The downstream agent re-derives intent from prose, guesses at what the upstream agent meant, and the error compounds quietly down the chain.
Treat every handoff as a contract with a schema. The upstream agent emits a structured object — the task, the inputs it used, the result, a confidence signal, and explicitly what it did not handle. The downstream agent validates that object before it does anything. If the contract isn't met, you fail loudly at the boundary instead of letting a malformed handoff travel three more steps before it surfaces as a nonsense answer.
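The contract described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's API: the `Handoff` class, its field names, `HandoffError`, and `receive` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


class HandoffError(ValueError):
    """Raised at the boundary when a handoff does not meet the contract."""


@dataclass(frozen=True)
class Handoff:
    # Field names are illustrative assumptions, not taken from any framework.
    task: str          # what the upstream agent was asked to do
    inputs: dict       # the inputs it actually used
    result: str        # what it produced
    confidence: float  # self-reported signal in [0, 1]
    not_handled: list = field(default_factory=list)  # explicit gaps

    def validate(self) -> "Handoff":
        if not isinstance(self.task, str) or not self.task.strip():
            raise HandoffError("task must be a non-empty string")
        if not isinstance(self.inputs, dict):
            raise HandoffError("inputs must be a dict")
        if not isinstance(self.result, str) or not self.result.strip():
            raise HandoffError("result must be a non-empty string")
        if not isinstance(self.confidence, (int, float)):
            raise HandoffError("confidence must be a number")
        if not 0.0 <= self.confidence <= 1.0:
            raise HandoffError(f"confidence out of range: {self.confidence}")
        return self


def receive(raw: dict) -> Handoff:
    """Downstream entry point: build and validate before doing any work."""
    try:
        return Handoff(**raw).validate()
    except TypeError as exc:  # missing or unexpected keys
        raise HandoffError(f"malformed handoff: {exc}") from exc
```

A downstream agent would call `receive(raw)` as its first step, so a missing field or an out-of-range confidence raises `HandoffError` at the boundary rather than three steps later.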
OpenAI's Agents SDK formalizes this intuition directly: handoffs in their framework are surfaced as callable tools to the LLM, and support an input_type schema for model-generated metadata — reason, priority, language — that transfers structured context at delegation time. The framework also provides input_filter functions that control exactly which conversation history the receiving agent sees, making context boundaries explicit rather than implicit.
A 2025 study by Cemri et al. — "Why Do Multi-Agent LLM Systems Fail?" — analyzed over 1,600 annotated execution traces across seven multi-agent frameworks. It identified 14 distinct failure modes clustered into three categories: system design issues (including improper task routing), inter-agent misalignment (communication and coordination failures), and task verification failures (inadequate output validation). The research concluded that "performance gains on popular benchmarks are often minimal" and flagged inadequate error handling and missing output validation as primary systemic issues.
This sounds bureaucratic until the first time a typed handoff catches a problem at the source. A good contract also makes the system legible: you can read the object that crossed any boundary and know exactly what each agent was working with. That single property is worth more than any orchestration framework.
Whether you choose a supervisor or a swarm, the seams between agents are what you actually operate and debug.
Shared memory is a database, so treat it like one
Most non-trivial systems end up with shared state — a place agents read context from and write results to. The mistake is treating it as a convenient scratchpad instead of what it really is: a concurrent database that several non-deterministic writers are hitting at once.
Be deliberate about three things. First, scope: give each agent the narrowest slice of memory it needs, not the whole context window. An agent that can see everything will use everything, and your token bill and your error rate both climb. Second, ownership: decide who is allowed to write each field, so two agents don't quietly clobber each other's work. Third, provenance: stamp every write with which agent produced it and when, so that when the output is wrong you can trace it back to a source rather than guessing.
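One way to picture scope, ownership, and provenance together is a small in-memory store. This is a sketch under stated assumptions: `SharedMemory`, `Entry`, and the agent and field names are hypothetical, and a production store would also need locking or transactions, which are left out here.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    value: object
    written_by: str    # provenance: which agent produced this value
    written_at: float  # provenance: when (Unix timestamp)


class SharedMemory:
    """Illustrative store; not thread-safe and not persistent."""

    def __init__(self, owners: dict, readers: dict):
        self._owners = owners    # field -> the one agent allowed to write it
        self._readers = readers  # agent -> set of fields it may read
        self._data = {}

    def write(self, agent: str, key: str, value: object) -> None:
        # Ownership: only the designated owner may write a field.
        if self._owners.get(key) != agent:
            raise PermissionError(f"{agent} does not own {key!r}")
        self._data[key] = Entry(value, agent, time.time())

    def read(self, agent: str, key: str) -> Entry:
        # Scope: an agent sees only the fields it was granted.
        if key not in self._readers.get(agent, set()):
            raise PermissionError(f"{agent} may not read {key!r}")
        return self._data[key]


mem = SharedMemory(
    owners={"findings": "researcher", "draft": "writer"},
    readers={"writer": {"findings"}, "reviewer": {"draft"}},
)
mem.write("researcher", "findings", ["source A", "source B"])
entry = mem.read("writer", "findings")
```

Because every `Entry` carries `written_by` and `written_at`, a wrong value can be traced to the agent that wrote it instead of guessed at.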
Anthropic's own multi-agent research system, described in their engineering blog, deliberately isolates each subagent in its own context window: the subagent's full chain-of-thought and intermediate tool results stay local, and only a compressed, relevant output travels back to the orchestrator. This is scoped memory by design, not accident. Token usage in that system runs roughly 15× that of a standard chat interaction, and token count alone explained 80% of performance variance in their BrowseComp evaluation. Tokens are the main lever on both quality and cost, so decide deliberately where each agent spends them.
The teams that skip scope, ownership, and provenance get the agent equivalent of a race condition — intermittent, unreproducible failures that depend on the order things happened to run. Those are the worst bugs to chase, and they're entirely self-inflicted.
Supervisor or swarm?
There are two honest answers to "how should the agents relate to each other," and the right one depends on how much you value control versus flexibility.
A supervisor model puts one agent in charge of the plan. It decides what runs next, hands work to specialist workers, and assembles the result. The enormous advantage is that there is always one place where the next decision is made — which means there's always one place to look when something goes wrong. The cost is that the supervisor is a bottleneck and a single point of failure, so its own logic has to stay simple and well-tested.
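The "one place where the next decision is made" property can be shown with a toy loop. This is a simplified sketch, not LangGraph's or any other library's implementation: `run_supervisor`, `decide`, and the worker names are hypothetical, and in a real system `decide` would typically be a model call rather than a fixed plan.

```python
def run_supervisor(decide, workers, max_steps=10):
    """Run workers one step at a time; `decide` is the only routing point."""
    results = []    # worker outputs gathered so far, in order
    decisions = []  # every routing decision, kept for debugging
    for _ in range(max_steps):
        decision = decide(results)
        if decision is None:  # the supervisor says the plan is complete
            break
        name, subtask = decision
        decisions.append((name, subtask))
        results.append(workers[name](subtask))
    return results, decisions


# Toy stand-ins: a fixed two-step plan and two string-returning workers.
PLAN = [("research", "find sources"), ("write", "draft summary")]


def decide(results):
    return PLAN[len(results)] if len(results) < len(PLAN) else None


workers = {
    "research": lambda task: f"researched: {task}",
    "write": lambda task: f"wrote: {task}",
}

results, decisions = run_supervisor(decide, workers)
```

The `decisions` list is the payoff: when a run goes wrong, it records exactly which worker was chosen for which subtask, and in what order.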
A swarm lets peer agents coordinate more directly, picking up work and passing handoffs without a central conductor. It can be more flexible and more parallel. It is also dramatically harder to debug, because behavior emerges from interactions rather than from one explicit plan. Research on resilience in LLM-based multi-agent collaboration (Zhao et al., ICML 2025) found that hierarchical structures suffered only a 5.5% performance drop when faulty agents were introduced, significantly outperforming linear (10.5% drop) and fully bidirectional/swarm (23.7% drop) configurations. The same study showed that adding an Inspector verification agent could recover up to 96.4% of errors introduced by faulty agents — a strong argument for keeping a verification layer even in more distributed topologies. Our default is to start with a supervisor and only move toward a swarm when a specific bottleneck forces it — and even then, to keep a supervised layer around the swarm so there's still somewhere to put the error recovery.
LangGraph formalized the supervisor pattern with a dedicated library released in February 2025, where a single orchestrator handles all routing and workers communicate exclusively through it — making the decision graph explicit and auditable.
Pattern comparison
| Pattern | Control | Parallelism | Debug ease | Best for |
|---|---|---|---|---|
| Supervisor / worker | High | Medium | High | Most production systems |
| Sequential handoff | High | Low | High | Well-defined linear pipelines |
| Shared memory / blackboard | Medium | High | Low | Loosely coupled parallel work |
| Peer swarm | Low | High | Low | Exploratory or creative tasks |
| Error recovery layer | — | — | — | Everywhere, as a wrapper |
Supervisor / worker. One agent owns the plan and routes subtasks to specialists. Easy to reason about and to debug, because there is always a single place where the next decision is made. The risk is the supervisor becoming a bottleneck and a single point of failure — keep its job narrow.
Sequential handoff. A pipeline where each agent does one stage and passes a typed result to the next: research, then draft, then review. Predictable and cheap to trace. It only works when the flow is genuinely one-way, with each stage needing nothing beyond the previous stage's output; loops and backtracking are where it gets ugly.
Shared memory / blackboard. Agents read and write to a common store instead of messaging each other directly. Great for loosely coupled work and parallelism, but you pay for it in concurrency bugs — two agents overwriting the same field is the new race condition.
Error recovery / supervisor retry. A layer whose only job is to catch a failed step, decide whether to retry, reroute, or escalate, and keep the rest of the run alive. The pattern most teams add last and wish they had added first.
Isolate failure, then recover from it
In a single-agent system, a failure is obvious: the one thing failed. In a multi-agent system, a failure in one worker can corrupt shared memory, stall a supervisor waiting on a result that will never come, or send a malformed handoff downstream that produces a confidently wrong answer. The failure isn't contained unless you contain it.
Two principles do most of the work. Isolate: a worker that fails should fail within its own boundary — it shouldn't be able to write garbage into shared state or block the whole run. Give every agent a timeout, a token ceiling, and a clear failure path. Recover: have an explicit layer that decides what happens when a step fails. Retry with a tweak? Route to a different worker? Degrade gracefully and return a partial result? Escalate to a human? Those are decisions you want to make once, in code, not improvise inside a prompt at 2am.
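The isolate and recover steps might look like the following with `asyncio`. Treat it as a sketch under assumptions: `StepResult`, `run_isolated`, and `run_with_recovery` are hypothetical names, workers are assumed to be async callables that take a task string, and the token ceiling is omitted because it depends on how your model client reports usage.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    ok: bool
    output: Optional[str] = None
    error: Optional[str] = None


async def run_isolated(worker, task: str, timeout_s: float) -> StepResult:
    """Isolate: a failing worker returns a StepResult instead of raising."""
    try:
        output = await asyncio.wait_for(worker(task), timeout=timeout_s)
        return StepResult(ok=True, output=output)
    except asyncio.TimeoutError:
        return StepResult(ok=False, error="timeout")
    except Exception as exc:
        return StepResult(ok=False, error=f"{type(exc).__name__}: {exc}")


async def run_with_recovery(primary, fallback, task: str,
                            timeout_s: float = 5.0, retries: int = 1) -> StepResult:
    """Recover: retry the primary, then reroute, then escalate."""
    for _ in range(retries + 1):
        result = await run_isolated(primary, task, timeout_s)
        if result.ok:
            return result
    if fallback is not None:
        result = await run_isolated(fallback, task, timeout_s)
        if result.ok:
            return result
    # Nothing worked: hand a clear failure to whoever handles escalation.
    return StepResult(ok=False, error=f"escalate: {result.error}")
```

The policy here (retry, then fallback, then escalate) is one arbitrary ordering; the point is that it lives in code, in one function, where it can be tested.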
Anthropic's guidance on agent reliability puts it plainly: "The autonomous nature of agents means higher costs, and the potential for compounding errors" and recommends "extensive testing in sandboxed environments, along with the appropriate guardrails" (Anthropic, "Building Effective AI Agents"). The BAIR blog's 2024 framing of compound AI systems is useful here too: a key advantage of multi-component systems is that they allow output verification and filtering that a single model cannot do for itself — but only if you wire that verification in deliberately.
Anti-pattern — The most expensive failure mode we see is the silent infinite loop: agent A asks agent B for help, B asks A, and the run burns tokens until something times out — if anything does. Always cap the depth of delegation and the total step count of a run. An agent system without a hard budget isn't autonomous, it's unbounded.
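A hard budget can be as small as a counter that every delegation must charge. The sketch below is illustrative: `RunBudget`, `BudgetExceeded`, and the toy `delegate` function are hypothetical, and the limits are arbitrary defaults rather than recommendations.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run hits its delegation-depth or step limit."""


@dataclass
class RunBudget:
    max_steps: int = 20
    max_depth: int = 3
    steps_used: int = 0

    def charge(self, depth: int) -> None:
        if depth > self.max_depth:
            raise BudgetExceeded(f"delegation depth {depth} exceeds {self.max_depth}")
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        self.steps_used += 1


def delegate(agent: str, budget: RunBudget, depth: int = 1) -> None:
    """Toy ping-pong: A always asks B for help, and B always asks A."""
    budget.charge(depth)
    other = "B" if agent == "A" else "A"
    delegate(other, budget, depth + 1)


budget = RunBudget(max_steps=20, max_depth=3)
try:
    delegate("A", budget)
except BudgetExceeded as exc:
    stopped_because = str(exc)
```

Without the `charge` call the ping-pong would recurse until the interpreter gave up; with it, the run stops after three delegations and says why.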
Make the whole run debuggable
The single biggest predictor of whether a multi-agent system survives contact with production is whether you can answer one question quickly: what happened on this run? If the answer takes an afternoon of grepping logs, the system will rot, because nobody will be able to fix anything with confidence.
Give every run a trace. One identifier that threads through every agent, every handoff, every memory write, and every tool call, so you can replay the whole thing as a single timeline. Record the actual inputs and outputs at each boundary, not just "step 3 completed." When you can open a run and see the exact object that crossed every seam, debugging a six-agent system becomes about as hard as debugging one — which is the entire point.
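A per-run trace needs very little machinery to start. The following is a minimal sketch using Python's `contextvars` so the run ID follows the call chain without being passed by hand; `start_run`, `record`, `timeline`, and the in-memory `EVENTS` list are hypothetical stand-ins for a real tracing backend.

```python
import contextvars
import uuid

_run_id = contextvars.ContextVar("run_id", default=None)
EVENTS = []  # stand-in for a real trace sink (file, database, tracing service)


def start_run() -> str:
    """Create one identifier for the whole run and make it ambient."""
    run_id = uuid.uuid4().hex
    _run_id.set(run_id)
    return run_id


def record(kind: str, agent: str, payload: dict) -> None:
    """Log the actual object that crossed a boundary, tagged with the run ID."""
    EVENTS.append({
        "run_id": _run_id.get(),
        "seq": len(EVENTS),
        "kind": kind,        # e.g. "handoff", "memory_write", "tool_call"
        "agent": agent,
        "payload": payload,  # real inputs and outputs, not just "step done"
    })


def timeline(run_id: str) -> list:
    """Replay one run as a single ordered list of events."""
    return [e for e in EVENTS if e["run_id"] == run_id]


run_id = start_run()
record("handoff", "researcher", {"to": "writer", "result": "3 sources"})
record("tool_call", "writer", {"tool": "search", "query": "agent tracing"})
run_events = timeline(run_id)
```

`timeline(run_id)` is the single view the text argues for: every handoff, write, and tool call for one run, in order, with the payloads attached.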
LangChain's observability team articulates the production reality well: "most agent debugging workflows still assume engineers will sift through logs to find root causes, but production trace volume makes this approach unsustainable" (LangChain, 2025). Their recommended pattern — capture production traces, build test datasets from real usage, run evaluations, and drive targeted improvements — depends entirely on having clean per-run traces in the first place.
This is also where evaluation lives. You can't meaningfully evaluate a multi-agent system end-to-end only; you have to be able to score individual stages, because a 90%-reliable pipeline of five 98%-reliable agents is a real and brutal piece of arithmetic. Traceable stages are what make per-stage evaluation possible at all.
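The arithmetic is worth seeing once. Assuming each stage fails independently and the run succeeds only if every stage does (both simplifications), end-to-end reliability is the product of the per-stage reliabilities. The function name below is illustrative.

```python
import math


def pipeline_reliability(stage_reliabilities):
    """Probability that every stage succeeds, assuming independent failures."""
    return math.prod(stage_reliabilities)


end_to_end = pipeline_reliability([0.98] * 5)
print(f"{end_to_end:.3f}")  # prints 0.904
```

Five stages at 98% each leave you just above 90% overall, which is why scoring only the final output hides where the missing ten points went.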
A multi-agent system you can't trace isn't a system. It's a rumor about what your agents might be doing.
Reliability is a property you design in
None of this is exotic. Typed handoffs, scoped and owned shared state, a supervisor you keep simple, hard budgets, failure isolation, and a trace per run — these are the unglamorous decisions that decide whether your system holds together at ten agents or quietly comes apart at four.
The teams that ship durable multi-agent systems aren't the ones with the most agents or the cleverest topology. They're the ones who treated the boundaries between agents as the real engineering problem, kept the system legible as it grew, and resisted adding the second agent until the first one had earned it. Build it that way and complexity stops being something that happens to you — and starts being something you actually chose.
Frequently asked questions
What is the supervisor/worker pattern in multi-agent systems? The supervisor/worker pattern places a single orchestrator agent in charge of planning and routing. It receives the top-level task, decomposes it into subtasks, delegates each to a specialist worker agent, and assembles the final result. The key advantage is that there is always one explicit place where the next decision is made, which makes failures locatable and recoverable. LangGraph's supervisor library and OpenAI's Agents SDK both implement this pattern as a first-class primitive.
How should agents pass context to each other during handoffs? Through a typed, validated schema — not free-text prose. The upstream agent should emit a structured object specifying what task was handled, what inputs were used, what the result is, what was explicitly left unhandled, and a confidence signal. The downstream agent validates that schema before acting. This prevents malformed outputs from silently propagating down the chain and is the single most effective way to contain errors at their source.
When do multi-agent systems hurt rather than help? When tasks are sequential rather than parallelizable, when a single agent's context window isn't actually the bottleneck, or when coordination overhead exceeds the benefit of parallel execution. A 2025 Google/MIT/DeepMind study (Kim et al., arXiv:2512.08296) found multi-agent setups dropped performance by up to 70% on sequential planning tasks while boosting it by up to 80.8% on decomposable financial reasoning. The difference is whether the subtasks are genuinely independent.
How do you debug a multi-agent run that produced a wrong answer? Start with the trace. Every production multi-agent system should assign a single run ID that propagates through every agent call, tool use, memory write, and handoff. With that trace, you can isolate exactly which agent received what input and what it emitted — turning a six-agent debugging session into a binary search. Without it, you're reconstructing intent from timestamps and guesswork. Per-stage evaluation (not just end-to-end scoring) is the next layer: it lets you pinpoint which stage degraded rather than hunting through the whole pipeline.
Sources
- How we built our multi-agent research system — Anthropic Engineering (2025) — 90.2% performance improvement over single agent, 15× token overhead, 80% performance variance explained by token count
- When to use multi-agent systems (and when not to) — Anthropic / Claude Blog (2025) — 3–10× token overhead vs. single-agent; teams investing months in multi-agent architectures only to find single-agent prompting equivalent
- Building Effective AI Agents — Anthropic Research (2024) — guidance on orchestrator-workers pattern, compounding error risk, sandboxed testing requirement
- Why Do Multi-Agent LLM Systems Fail? — Cemri et al., arXiv:2503.13657 (2025) — 14 failure modes across 3 categories, 1,600+ annotated traces, 7 MAS frameworks
- On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents — Zhao et al., ICML 2025, arXiv:2408.00989 — hierarchical topology 5.5% drop vs. swarm 23.7% drop under faulty agents; Inspector mechanism recovers 96.4% of errors
- Towards a Science of Scaling Agent Systems — Kim et al., arXiv:2512.08296 (2025) — 260 configurations, +80.8% on decomposable tasks vs. −70.0% on sequential planning tasks
- The Shift from Models to Compound AI Systems — Berkeley AI Research Blog (February 2024) — definition of compound AI systems; verification and control advantages of multi-component architectures
- Handoffs — OpenAI Agents SDK Documentation (2025) — structured handoff via tool calls, input_type schema, input_filter for context control
- LangGraph Supervisor: A Library for Hierarchical Multi-Agent Systems — LangChain Changelog (February 26, 2025) — supervisor/worker pattern formalized; single supervisor routes to isolated workers
- AI Agent Observability: Tracing, Testing, and Improving Agents — LangChain (2025) — per-run trace as foundation for production debugging; capture → analyze → evaluate improvement loop