Failure modes: how fleets go wrong together · The Builder's Stack


The sweep you launched at lunch reports back within the hour: two hundred files audited for calls to the deprecated payments client, a tidy findings table, a recommendation to proceed. It reads like finished work, right up until you reconcile the counts. One worker hit a directory listing too large for its window, sampled twenty files as representative, and the merge inherited the cut without recording it. The report covers a tenth of what it claims, formatted exactly like a report that covers everything.

A single session fails in front of you, in the one transcript you are already reading.

A fleet fails between the transcripts, at the joints: in the merge, during a restart, in the handoff from one worker to the next, and the result usually arrives dressed as success.

These failures repeat often enough to have names, and each one comes with a tell you can watch for and a cure you build in before the run.

Silent truncation: the fleet covers less than it claims

The opening run shows the pattern. An agent hits work too large for its window or its budget, quietly shrinks the job until it fits, then reports the smaller job as if it had done the whole thing. Every finding it returned was real; it just stopped telling you how much it skipped.

The tell. The report never states its coverage. If it cannot say how many items were assigned, completed, and dropped, treat every claim of completion inside it as unverified.
The cure. Make coverage an explicit output. Each worker reports what it completed and what it skipped, with reasons; the merge checks assigned against completed before it summarizes; and dropped work prints at the top of the report. When a drop is written down you can look at it and decide whether it matters, but a drop nobody recorded stays invisible until whatever it hid finally breaks.
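One way to make coverage an explicit output is to have the merge run a count check before it summarizes anything. The sketch below is a minimal illustration, not a prescribed format: the `WorkerReport` shape and the `reconcile` function are hypothetical names invented here, and item ids are assumed to be plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerReport:
    """What one worker says it did (hypothetical shape, not a real API)."""
    completed: set[str] = field(default_factory=set)
    skipped: dict[str, str] = field(default_factory=dict)  # item id -> reason


def reconcile(assigned: set[str], reports: list[WorkerReport]) -> dict:
    """Compare the assigned items against what the workers reported back."""
    completed: set[str] = set()
    skipped: dict[str, str] = {}
    for report in reports:
        completed |= report.completed
        skipped.update(report.skipped)
    # A skip only counts if no worker completed the item.
    skipped = {item: why for item, why in skipped.items() if item not in completed}
    # Items nobody mentioned at all are the silent drops.
    unaccounted = assigned - completed - set(skipped)
    return {
        "assigned": len(assigned),
        "completed": len(completed & assigned),
        "skipped": skipped,
        "unaccounted": sorted(unaccounted),
    }
```

Run against the opening sweep, two hundred assigned files and one worker reporting twenty completed would leave 180 ids in `unaccounted`, which is the number that belongs at the top of the report.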

Compounding errors: the longer the chain, the lower the odds

The more steps you put in a row, the worse the math gets. A chain of twenty steps, each 95 percent reliable, succeeds end to end only about a third of the time (0.95 multiplied by itself twenty times is roughly 0.36). Every step also feeds on the output of the one before it, so an early mistake gets built on and confirmed by everything downstream. Nothing looks risky while the chain runs, because each agent does a competent job on the input it was given.

The tell. Quality drops the further down the chain you look, with late outputs contradicting early decisions, and when the final result is wrong, no single transcript holds the mistake.
The cure. Keep chains short and verify at the joints. Cut long pipelines into short gated runs, so an error costs you once, where it appears, instead of at every step downstream. The gates come from the chapter Verification: make the fleet check its own work. A joint that nothing checks forwards errors at full speed.
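The arithmetic behind the twenty-step figure is a plain product of per-step reliabilities. The sketch below assumes every step is equally reliable and that steps fail independently; `chain_success` is a name invented for this illustration.

```python
def chain_success(step_reliability: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent failures."""
    return step_reliability ** steps


for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.0%}")
# The 20-step line prints 36%.
```

The same function shows why short runs help: five steps at 95 percent still succeed about three times in four, so a gate after every few steps catches an error while the odds are still in your favor.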

Duplicate findings: the loop that keeps reopening settled cases

Parallel finders, whether a research sweep, a bug hunt, or a lead search, keep rediscovering the same items, because the strongest evidence is obvious from every seat. Duplicates are expected, and the gather step folds them together. The real trouble hides in what you compare new candidates against. If your ledger holds only the findings the judge confirmed, every item the judge rejected is invisible to the next round, so the finders bring it back in fresh wording, the judge rejects it again, and the loop keeps re-arguing closed cases instead of settling down.

The tell. Rounds keep running while the accepted list stays the same length, and the same finding comes back in slightly different words.
The cure. Compare new candidates against everything the fleet has already seen, not just the items it kept. The seen-ledger holds accepted, rejected, and pending items alike, and every new candidate is checked against all three first. Once you record a rejection, the fleet stops paying to re-argue it, because the next round recognizes the item and drops it instead of sending it back to the judge.
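A seen-ledger can be as small as a dictionary keyed on a normalized form of each finding. The sketch below is an assumption-heavy illustration: `SeenLedger`, `Status`, and `normalize` are hypothetical names, and the normalization shown (lowercase, collapsed whitespace) only catches trivial variations. Findings that return in genuinely different wording need a stronger key, for example the file and symbol a finding points at.

```python
from enum import Enum


class Status(Enum):
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    PENDING = "pending"


def normalize(finding: str) -> str:
    """Crude dedup key: lowercase and collapse whitespace. A stand-in for real matching."""
    return " ".join(finding.lower().split())


class SeenLedger:
    """Remembers every finding the fleet has seen, whatever its verdict."""

    def __init__(self) -> None:
        self._seen: dict[str, Status] = {}

    def record(self, finding: str, status: Status) -> None:
        self._seen[normalize(finding)] = status

    def filter_new(self, candidates: list[str]) -> list[str]:
        """Keep only never-seen candidates and mark them pending for the judge."""
        fresh = []
        for candidate in candidates:
            if normalize(candidate) not in self._seen:
                self.record(candidate, Status.PENDING)
                fresh.append(candidate)
        return fresh
```

Only what `filter_new` returns goes to the judge, and the judge's verdict is written back with `record`, so a rejected item is recognized and dropped the next time a finder brings it in.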

Convergent wrong answers: the whole fleet agrees and is still wrong

Condorcet's jury theorem, from 1785, is the math behind majority votes: when each voter is right more often than chance and the votes are independent, the majority verdict gets closer to certain as the group gets larger. The catch is that word independent. Agents that share a model, a prompt template, and the same sources do not vote independently, so whatever pushes one of them into a mistake pushes all of them the same way. The errors line up instead of cancelling out, and the fleet can be unanimously wrong.

The tell. The agreement comes too fast and too clean. Every judge in the panel you built in the chapter Patterns: fan-outs, pipelines, and judge panels cites the same evidence in the same order, and nobody ever dissents. If a panel never disagrees, you do not have independent judges; you have one judge copied several times.
The cure. Build in the independence the theorem assumes. Give each judge a different angle to look from (correctness, security, the user's experience), feed them different sources, or word the question differently for each one. A panel only adds accuracy when its members can actually disagree, so agreement reached down separate paths is real evidence, while agreement among copies of one judge tells you nothing new.
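The gap between independent judges and copies shows up directly in the numbers. The sketch below computes the majority's accuracy under the theorem's independence assumption and contrasts it with the fully correlated extreme, where every judge makes the same call. Both function names are invented for this illustration, and real panels fall somewhere between the two cases.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Chance that a majority of n independent judges is right (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd panel size to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def copies_correct(p: float, n: int) -> float:
    """n copies of one judge always vote together, so the panel is as good as one judge."""
    return p


for n in (1, 3, 11):
    print(n, round(majority_correct(0.7, n), 3), copies_correct(0.7, n))
```

With judges who are each right 70 percent of the time, three independent judges reach about 78 percent and the figure keeps rising as the panel grows, while any number of copies stays at 70.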

Stranded work: items left half-done when a run is cut off

Fleets run long, and long runs get interrupted: the laptop sleeps, a rate limit lands, or you kill the run to get your machine back. The interruption itself is routine, and the damage is in what it leaves half-finished. Items still in progress at the cut have no saved state and outputs that stop partway through a file, so a resumed run cannot tell finished work from abandoned work, and it either redoes the first or builds on the second.

The tell. A restart begins from the top, output files stop mid-sentence, or duplicates appear for items that ran twice.
The cure. Build the run to survive its own interruption before you launch it. Make each work item idempotent, meaning a repeat lands the same output in the same place, so running an item twice is safe. Pair that with a journal of completions, one line appended the moment an item's output lands and never before. A resumed run reads the journal, replays finished items as instant skips, and re-runs only the items that have no journal line.
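The two halves of the cure fit in a few lines when the work is a plain function. The sketch below is one possible shape under stated assumptions: the journal is a JSON-lines file, item ids are safe to use as file names, and `work` is a hypothetical callable that turns an item id into output text. The names `load_done`, `run_item`, and `resume` are invented here.

```python
import json
import os
from pathlib import Path


def load_done(journal: Path) -> set[str]:
    """Item ids that already have a complete journal line."""
    if not journal.exists():
        return set()
    done = set()
    for line in journal.read_text().splitlines():
        try:
            done.add(json.loads(line)["item"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # a torn final line from a kill counts as not done
    return done


def run_item(item: str, out_dir: Path, journal: Path, work) -> None:
    """Land the output first, then journal it. Never the other way round."""
    final = out_dir / f"{item}.txt"
    tmp = out_dir / f"{item}.txt.tmp"
    tmp.write_text(work(item))  # a kill here leaves only a .tmp file
    os.replace(tmp, final)      # rename into place: same item, same path
    with journal.open("a") as fh:
        fh.write(json.dumps({"item": item, "output": str(final)}) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def resume(items: list[str], out_dir: Path, journal: Path, work) -> list[str]:
    """Skip journaled items and run the rest; returns the ids actually run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = load_done(journal)
    ran = []
    for item in items:
        if item in done:
            continue
        run_item(item, out_dir, journal, work)
        ran.append(item)
    return ran
```

Writing to a temporary file and renaming it means an interrupted item never leaves a half-written file under its final name, and because the journal line comes last, an item killed between the two steps simply runs again and overwrites its own output.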

Try it now

This drill takes about twenty minutes and proves the stranded-work cures on a run you control.

Stage a fan-out with a journal. Pick about thirty small, repeatable items you already trust an agent with: files to summarize, tickets to label, pages to check. Write the cut-list first (the item list you produced in the chapter Decomposition: split work so agents never collide) and save it where the run cannot modify it. Then brief the fan-out in Claude Code with standing rules: every item writes its output to a file named after the item, and the moment an output lands, a journal line records the item id and output path.

Kill it midway. Watch lines land in the journal, and somewhere past the halfway mark, deliberately kill the process with no warning and no cleanup pass. You are staging the rude interruption where it costs nothing.

Resume it cold. Start a fresh session with the same brief plus an added rule: read the journal first, skip every item recorded there, and run only the rest. Finished items should replay as instant skips, and any item that was mid-flight at the kill re-runs from scratch.

Prove the ledger. Count outputs against the cut-list: one output file and one journal line per item, nothing lost, nothing duplicated. An item with two outputs means your items are not idempotent yet, and a missing item means the journal line was written before the output landed. Fix the rule that failed and stage the kill again until the counts reconcile.
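The count can be scripted so that each staged kill ends with the same check. The sketch below assumes a layout that the drill itself does not prescribe: outputs are `.txt` files named after their item, and the journal is a JSON-lines file with an `item` field. `prove_ledger` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def prove_ledger(cut_list: list[str], out_dir: Path, journal: Path) -> dict[str, list[str]]:
    """Check for one output file and one journal line per item on the cut-list."""
    outputs = {p.stem for p in out_dir.glob("*.txt")}
    lines: Counter[str] = Counter()
    for line in journal.read_text().splitlines():
        try:
            lines[json.loads(line)["item"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # ignore a torn line left by the kill
    return {
        "missing_output": sorted(i for i in cut_list if i not in outputs),
        "missing_journal": sorted(i for i in cut_list if lines[i] == 0),
        "journaled_twice": sorted(i for i in cut_list if lines[i] > 1),
        "journal_without_output": sorted(i for i in lines if i not in outputs),
        "unexpected_output": sorted(outputs - set(cut_list)),
    }
```

The counts reconcile when every list comes back empty. An id under `journal_without_output` points at a journal line written before its output landed, and an id under `unexpected_output` or `journaled_twice` points at an item that is not yet safe to repeat.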

Scale it down: three items instead of thirty, killed right after the first completion. The proof counts the same at any size, and a journal that survives the small version survives the real one.

Chapter Summary

A single agent fails in the transcript you are reading; a fleet fails between transcripts, in the merges and restarts, and the failure usually arrives looking like success.
Silent truncation: an agent shrinks a job to fit its window and reports the small job as the whole one. Make every worker state what it completed and skipped, and reconcile the counts.
The more steps you chain in a row, the lower the odds the whole chain succeeds. Keep chains short and check the work at each joint.
Parallel finders keep rediscovering the same items. Compare every new candidate against everything seen, rejections included.
A fleet sharing one model, prompt, and sources can agree unanimously and still be wrong. Give each judge a different angle or different sources so they can disagree.
An interrupted run leaves work half-done. Make each item safe to repeat and log it the moment its output lands, so a resumed run reruns only the rest.
Every cure here is built before the run starts; none of them replaces your judgment while it is live.
Next is the chapter Supervision: stay in command of many agents, which covers watching, sampling, and stepping in while the fleet runs.

Sources

Condorcet (1785). Essay on the Application of Analysis to the Probability of Majority Decisions. The jury theorem.
