Verification: make the fleet check its own work · The Builder's Stack

Fan-outs and pipelinesWhat a fleet costs

The fan-out finishes at 4:40 in the afternoon, the six writer agents each hand back a competitor brief, and the orchestrator reports every task complete. You open the first brief and find a pricing model that company retired a year ago, described with the same even confidence as everything else. If one brief carries a confident error, any of them can, so you settle in to read all six, and somewhere in the fourth it sinks in that checking the batch by hand is taking longer than one careful session would have taken to produce it.

The fleet multiplied how much got produced but left all the checking with you. The fix is not to read faster, it is to make the fleet check its own work, and that rests on one fact about checking that is worth stating first.

Checking work is cheaper than producing it

Producing an artifact and checking one are different sizes of job: checking an answer costs far less than producing it, so a fleet that generates more can afford to verify more.

The two jobs are different sizes because they do different work.

Producing means searching for an answer. The writer weighs which sources to pull, which structure to use, and which of many adequate phrasings to keep, and most of that effort never reaches the final draft.
Checking means confirming an answer. The checker holds one finished artifact against a fixed list of questions: do the claims match the sources, does the format match the spec, does the code run.

This is the same reason reviewing a colleague's pull request is faster than writing the change yourself, and it is what pays for everything in this chapter. The standard you hold the work to has not changed. The verify-first rule still holds, and Review what the AI built is still the human version of that check. What changes at fleet scale is only who runs the check first.

Give verifiers the authority to fix, not just report

The working unit is a writer paired with a verifier that receives the artifact, the original task, and a checklist. Background coding agents made the same bet by delivering pull requests, where nothing lands without a review. The design choice that decides whether your verifier saves any time is what it may do when a check fails.

A verifier that only reports problems hands you a second job: its findings still need someone to read each one, send the valid ones back to the writer, and confirm the fixes landed, and that someone is you. So give the verifier permission to fix things instead. Small mechanical defects, the dead link, the unsourced claim, the format violation, get fixed in place and logged in one line each, and only the bigger structural problems, a wrong approach or a missed requirement, go back to the writer. What the pair hands you is a corrected artifact and a short change log.

Scope that authority like any other write access, to the artifact in front of the verifier and nothing else.

Tell verifiers to refute, not to review

When you ask a model to review finished work, it mostly comes back with approval and a few minor notes. When you ask it instead to build the strongest case that the work is wrong and report which parts held up anyway, the same model gives you the adversarial check you actually need.

One model arguing against the work is just a single opinion, so run several against the same artifact and keep the parts that a majority could not knock down. Work that survives several independent attempts to take it apart is sturdier than work one reviewer waved through, and the effect is well documented.

Self-consistency. Sampling several reasoning paths and keeping the majority answer outperforms a single pass through the problem. The effect carries over when the question is "does this hold" rather than "what is the answer".
Condorcet's jury theorem. When each voter does better than chance, the majority verdict approaches certainty as the panel gets larger, but only while votes stay independent, and correlated voters break it.

Independence is the part fleets quietly break. Failure modes: how fleets go wrong together shows how that happens, and the cheapest way to protect against it comes next.

Give each checker a different question to ask

Running the same check three times does not give you three opinions. Identical checkers share a model and a prompt, so they share the same blind spots, and their matching votes are really just one verdict repeated. Vary the question each one asks instead.

Is it correct? Read the artifact against the task and its sources, searching for claims that do not follow.
Is it safe? Check for leaked secrets, destructive commands, and data crossing a boundary you set.
Does it actually run? Execute it. Run the code, resolve the links, render the page, click the flow.

Because these three questions fail in different ways, the cases where they all agree tell you more than three copies of the same check ever could, and the only cost is writing the three prompts.

Remove duplicate findings before you verify them

Search-style fan-outs, the bug hunts and research sweeps, have a quirk worth planning for. Several finders running in parallel keep rediscovering the same item, so five agents sweeping a codebase hand you the same flaky test described five different ways. If you send that raw pile on to be verified, you pay over and over to confirm one finding, and the duplicates read like five separate confirmations when they are really one finding echoed five times. So put a cheap deduplication step between finding and verifying: collapse the exact matches, group the near-duplicates with a small model, and verify just one example from each group.

Scripts for mechanical checks, models for judgment

Before a model verifies anything, ask what a script could check instead. A few lines that grep for banned patterns, run the tests, validate the schema, and fail on dead links cost almost nothing and hold the last artifact of a long batch to the same standard as the first. Reserve model verification for checks that genuinely require reading, such as whether a summary is faithful to its source or a conclusion follows from the evidence.

Then check the verifier itself, because its own pass is work that nobody has verified yet. On a regular schedule, pull one artifact it marked as passing and review it end to end yourself, using the verification habits as the checklist for that read.

Try it now

Add a verifier stage. Reopen the fan-out you built in Patterns: fan-outs, pipelines, and judge panels and, in Claude Code, add one verifier agent that runs after the writers finish, taking each artifact in turn.

Arm it. Give it five mechanical checks (every claim carries a source, the format matches the spec, no banned terms from your style notes, every link and path resolves, length sits inside the band) plus one judgment call, such as building the strongest case that the central conclusion is wrong.

Grant edit authority. Tell it to fix failures in place and append a one-line log per fix. Run the fleet, read the log instead of the artifacts, then check one approved artifact end to end to calibrate how much trust the green marks have earned.

Swap one check for a script. Pick the most mechanical item, banned terms is the easy one, and have Claude Code write a script that performs the same check deterministically. Run both on the same batch and compare cost, agreement, and whether each returns identical verdicts on a second run. The script should win that check outright; keep the model on the judgment call.

Scale it down: run everything above against a single artifact instead of the whole set, the full checklist and the script swap included, end to end.

Chapter Summary

Checking an answer costs much less than producing one, so a fleet that produces more can afford to check more. Make it check its own output first, rather than leaving all the reading on you.
Pair every writer with a verifier, let it fix the small mechanical problems in place, and send only the bigger structural ones back to the writer, so you get back a corrected artifact and a short change log.
Ask each verifier to argue that the work is wrong rather than to review it, because a review mostly gets you a rubber stamp.
Run several checkers and keep what a majority cannot knock down, but only while their votes stay independent of each other.
Give each checker a different question, is it correct, is it safe, does it actually run, so they fail in different ways and their agreement means something.
When several agents keep finding the same item, remove the duplicates before you verify, so you do not pay to confirm one finding many times.
Use a script for any mechanical check a script can run, save model checking for things that truly need reading, and spot-check the verifier's own approvals on a schedule.
Next, Economics: what a fleet costs and when it pays works out whether all this checking pays off at your batch sizes.

Sources

Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS Datasets and Benchmarks Track.
Condorcet (1785). Essay on the Application of Analysis to the Probability of Majority Decisions.

Marks this chapter complete on your course map. Reaching the end does this for you.