Why one agent stops being enough · The Builder's Stack

Decompose the work

It looked like one prompt: take the list of integration partners, read each partner's API changelog, and write a migration note for every breaking change. Ninety minutes in, you scroll the transcript. The early notes are sharp. The later ones blur, the format the early notes followed has quietly drifted, and the note for partner thirty-one contradicts the rule you locked in at partner four. You ask the session to review its own output, and it passes everything, including the contradiction on your screen. Nothing crashed; the work outgrew the window, and you watched the quality fall in real time without a name for what was happening.

This chapter gives it a name, and it opens the part of The Frontier where you stop running one agent and start running several.

Where a single session breaks down

The decline is not random, and it shows up reliably once the work runs long. It comes from three pressures you can name.

Context runs out. Quality drops well before the window is full. Everything the session reads and writes shares one window, so the rule you set in the first minute stops being followed somewhere in the middle, and in the end new material no longer fits at all. Context engineering makes the window go further, but it does not remove the limit.
It runs one thing at a time. When the job is forty independent reads, doing them one after another spends real calendar time, and the longer the job, the longer you wait.
It cannot check its own work. A session cannot give its own output a hard, independent review, because the review runs in the same context that made the mistake. Whatever pulled the first answer wrong pulls the review wrong too, which is why the changelog audit signed off on its own contradiction.

The first pressure has a cure, and it is not a bigger window. Split the work and every piece fits in its own session, which also fixes the second pressure, since independent pieces can run at the same time. The third pressure is the deepest reason to run more than one agent.

A check is worth something only when it is independent of the thing it checks, and a fresh session is independent of your first one in a way a longer transcript can never be.

Whose agents this part is about

The Practice level was about the one agent your users touch: the session inside your product, the harm it can reach, and the human you keep in charge of it, argued in "Supervision: keep a human in charge of the agent." This part changes whose agent we mean. The fleets here work for you, the builder: research sweeps, review panels, background coding runs, work you fan out in the morning and judge in the afternoon. The designs are general and carry over to the day agents cooperate inside your product, but from here on you are the one running them.

When not to orchestrate

Most work should not be orchestrated. A single session plus the kill rule wins whenever the task fits one window or is so tightly coupled that every step needs to see every other step, and those two kinds cover most of a working backlog. Orchestration gives you scale and independence, and it charges for them in coordination, debugging, and tokens: the work has to be split and briefed, the results have to be merged, and a failure no longer sits in one transcript you can scroll but hides in whichever session produced it. When one session with a clean restart can do the job, it wins on speed, cost, and ease of debugging.

The strongest case against fleets

Take the counterargument seriously before you spend a token on coordination. Cognition, the company behind the Devin coding agent, published the essay "Don't Build Multi-Agents" in 2025, and it is the best statement of the case. Every action an agent takes carries implicit decisions, and when you split a tightly coupled task across agents, those decisions conflict: two agents building halves of the same feature each answer questions nobody wrote down, an error format here, a naming scheme there, and merging them becomes painful. For work like that, the essay argues, everything should stay in one thread where every decision is made in view of every other.

The argument is correct, but it only covers one kind of work, so it does not settle the question. What settles it is whether the pieces are independent of each other.

Tightly coupled pieces belong in one thread. If the halves keep needing each other's half-made decisions, splitting them into separate sessions creates conflicts, and Cognition is right.
Independent pieces split apart cleanly. Anthropic's 2025 engineering account of its multi-agent research system is the evidence: one orchestrator breaks a research question into parts, several subagents search at the same time without talking to each other, and a final pass combines the findings. Anthropic reports a large gain over a single agent on its internal research evals, at a much higher token cost. Search is independent work, which is why the design holds.
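The shape of that design (split, run in parallel, combine) can be sketched in a few lines. This is a minimal illustration and not Anthropic's implementation: `run_subagent`, `fan_out`, and `combine` are hypothetical names, and `run_subagent` is a stub standing in for whatever call starts a fresh agent session in your own tooling.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(brief: str) -> str:
    """Hypothetical stand-in for one fresh agent session.

    Replace the body with a real call; here it only echoes the brief.
    """
    return f"findings for: {brief}"


def fan_out(briefs: list[str], max_workers: int = 4) -> list[str]:
    """Run every brief in its own worker; workers never talk to each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input briefs.
        return list(pool.map(run_subagent, briefs))


def combine(findings: list[str]) -> str:
    """Final pass: here a plain join, in practice another session."""
    return "\n".join(findings)


briefs = [
    "partner A changelog",
    "partner B changelog",
    "partner C changelog",
]
report = combine(fan_out(briefs))
print(report)
```

Threads are used on the assumption that each real subagent call spends its time waiting on a network response. The sketch only works because no brief depends on another brief's result, which is the independence test this section describes.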

The math also shows why a long chain of steps was never the answer, whichever side of the debate you take. A chain of twenty steps, each 95 percent reliable, succeeds end to end only about 36 percent of the time (0.95 multiplied by itself twenty times), because the small failure rate compounds at every link. One long session runs that math inside a single window, and agents chained one after another run the same math with handoff losses on top. The way out is to run pieces in parallel: cutting the work into independent pieces keeps every chain short, and a failed piece can be retried on its own instead of spoiling everything after it.
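The arithmetic is short enough to check yourself. The sketch below assumes steps fail independently of each other and that a failed piece is noticed and rerun from scratch; both function names are illustrative.

```python
def chain_success(step_reliability: float, steps: int) -> float:
    """Chance that every step of a sequential chain succeeds,
    assuming each step fails independently of the others."""
    return step_reliability ** steps


def with_retries(piece_success: float, attempts: int) -> float:
    """Chance that a piece succeeds at least once in `attempts` tries,
    assuming a failure is detected and the piece is rerun from scratch."""
    return 1 - (1 - piece_success) ** attempts


long_chain = chain_success(0.95, 20)     # one twenty-step chain
short_piece = chain_success(0.95, 5)     # one five-step piece
retried_piece = with_retries(short_piece, 2)

print(f"{long_chain:.3f}")     # 0.358
print(f"{short_piece:.3f}")    # 0.774
print(f"{retried_piece:.3f}")  # 0.949
```

A five-step piece succeeds about three times in four, and one retry lifts it to roughly 95 percent. The retry is cheap only because the piece is independent: rerunning it costs five steps, not twenty.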

What this part covers

The chapters come in the order the decisions arrive. "Decomposition: split work so agents never collide" teaches the cut itself, and "Patterns: fan-outs, pipelines, and judge panels" gives the patterns to build once the cut is made. "Verification: make the fleet check its own work" builds independent checking in, and "Economics: what a fleet costs and when it pays" prices the idea honestly. "Failure modes: how fleets go wrong together" covers the ways coordinated agents fail at once, "Supervision: stay in command of many agents" scales your attention to match, and the part closes when you write your Orchestration Plan and run your first fleet.

Try it now

This drill takes about 15 minutes and produces the decision card you will reuse across the part.

Pull three tasks from your backlog. Choose real candidates you would hand to an agent this month: one small, one substantial, and one you keep postponing because it feels too big for one session.

Ask each task the same questions. Could two strangers each take half and finish without talking to each other? Is any piece irreversible, such as a migration, a customer-facing send, or a production deploy? Would one bad early step poison everything that runs after it?

Write the verdict, solo or fleet, with one sentence of why. Halves that would need to talk, an irreversible piece, or a bad early step that poisons the rest all argue for a single supervised session, while clean halves made of reversible pieces argue for a fleet. The sentence of why matters more than the verdict, because it says which of those reasons your decision actually rests on.
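If you would rather keep the card as data than as prose, the three questions map onto a small record. This is one possible encoding and not a required format: the field names and the `verdict` function are made up for illustration, and the reason string is a starting point for your own sentence of why.

```python
from dataclasses import dataclass


@dataclass
class TaskAnswers:
    """Answers to the three drill questions for one backlog task."""
    name: str
    halves_need_to_talk: bool
    has_irreversible_piece: bool
    bad_early_step_poisons_rest: bool


def verdict(task: TaskAnswers) -> tuple[str, str]:
    """Return ("solo" or "fleet", the reasons the verdict rests on)."""
    reasons = []
    if task.halves_need_to_talk:
        reasons.append("the halves would need to talk")
    if task.has_irreversible_piece:
        reasons.append("a piece is irreversible")
    if task.bad_early_step_poisons_rest:
        reasons.append("a bad early step poisons the rest")
    if reasons:
        # Any one of these argues for a single supervised session.
        return "solo", "; ".join(reasons)
    return "fleet", "clean halves made of reversible pieces"


card = [
    TaskAnswers("read forty partner changelogs", False, False, False),
    TaskAnswers("run the schema migration", True, True, True),
]
for task in card:
    decision, why = verdict(task)
    print(f"{task.name}: {decision} ({why})")
```

The two example tasks are invented. Fill the card with your own three and rewrite each reason in your own words before you rely on it.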

Keep the card where you plan. Each verdict is a prediction, and the chapters ahead give you the tools to test it, starting with where to make the cut.

Chapter Summary

A single session has a ceiling, and on long work its output gets worse in ways you can watch happen.
Three pressures set that ceiling: the context window fills, the work runs one step at a time, and the session cannot give its own work an honest review.
A bigger window does not fix this. Splitting the work into pieces that each fit their own session does, and it lets the pieces run at the same time.
A check only counts when it is independent of what it checks, and a fresh session gives you that independence. This is the deepest reason to run more than one agent.
Most work should stay with one agent. Reach for a fleet only when the work is too big for one window, splits into pieces that never needed each other, or needs an independent check.
Tightly coupled work belongs in one thread, where splitting it would create conflicts. Independent work, like a wide search, splits apart cleanly and runs in parallel.
For the tasks on your card that need a fleet, the next question is where exactly to cut, which is the work of "Decomposition: split work so agents never collide."

Sources

Cognition (2025). Don't Build Multi-Agents. Cognition engineering blog.
Anthropic (2025). How we built our multi-agent research system. Anthropic engineering blog.
