Your first production fan-out comes back in minutes, every slice merged and the diff clean, the kind of result a solo session needs days to match. Then the usage dashboard loads, and those few minutes have spent roughly ten solo sessions' worth of tokens. Nothing misfired, because each of the ten agents read the shared brief, pulled the files for its slice, and paid full price for every token. A fleet's bill is tokens times agents, the multiplication is the point, and this chapter makes that multiplication a decision instead of a discovery.
Speed and cost are two separate budgets
A fleet's bill is tokens times agents, because every agent pays full price for every token it reads, so running ten agents costs about ten times what one agent costs.
Running agents in parallel saves time and nothing else. Ten agents on ten independent slices finish close to ten times sooner, but each one separately reads the system prompt, the shared brief, and the files its slice touches, so the meter runs about ten times faster too. Parallel work buys you a shorter calendar, and it never buys you cheaper tokens.
Coordination then adds charges of its own. Every handoff makes an agent re-read context: the orchestrator reads each worker's report, a reviewer re-reads what a worker already read, and a back-and-forth between two agents bills both ends. Brooks's law described the calendar version of this overhead half a century ago: the number of communication paths rises roughly with the square of the number of people, which is why adding people to a late project makes it later. In a fleet every one of those paths is metered, so the overhead that once cost weeks now shows up itemized in tokens.
None of this is an argument against fleets; it just puts a price on them. One published account of a multi-agent research system, where an orchestrator splits a research question into parallel searches and then synthesizes the results, reported a large improvement over a single agent on internal evals and was just as clear that those runs cost a lot more tokens. That is the real trade you are making: better results on a shorter calendar, bought with a bigger bill. The run is worth its cost only when the output justifies it, which is the cost question from the playbook asked again at fleet scale.
Set a spending cap that can stop the run
A cap only protects you if it can stop the run while the run is still going. A number you check after the fact is a postmortem, not a control. Set the cap in a unit your tooling can act on, tokens or currency, and wire it somewhere that can actually halt work: a hard limit in the orchestrator, a spend ceiling at the provider, or an alert that interrupts you mid-run. This is the money discipline from Guardrails: keep secrets, money, and data safe pointed at the meter that moves fastest.
Decide before the run what the cap should cut, and have it cut new work, never the checking of work that already exists. Spawning new work, meaning another search branch, another candidate fix, another draft, scales with appetite and will soak up any budget you give it. Checking scales with what you have already produced, and cutting it turns tokens you already spent into risk you ship, because the fleet checking its own work is what turns a pile of output into something you can use. A pile nobody checked is worse than a smaller pile that got checked. Write the rule down before the run starts, in the same breath as the cap: when the cap is hit, no new work starts and the checking finishes on everything that already exists.
Use cheap models for the simple jobs
After the number of agents, the biggest lever on the bill is which model you put in which role, because there is no reason for every seat in a fleet to run the same model. Match each role to how much judgment it actually needs.
- Simple jobs take a small, fast model. Formatting output, scanning a repository for matches, and pulling fields into a schema all produce results you can check at a glance and redo cheaply, so the cheapest model that passes your spot checks earns the seat.
- Drafting jobs keep your everyday model. The model you already trust on your own is the one that should implement and write inside the fleet.
- The judge job takes the strongest model you have. A judge ranks candidates, reconciles conflicting reports, and passes or fails results against the brief, so the judge panels funnel every later consequence into one choice. A weak judge quietly turns good work into bad picks, and you may never see where it went wrong, so the judge is the last seat we ever try to save money on.
Your provider's pricing page keeps the exact ratios current; the part that stays stable is that small models run at a small fraction of flagship rates, so moving the simple jobs to a cheap model is the first cut you can make that costs nothing in quality.
Buy more checking only where a mistake is expensive
Extra checking is the other line item you control directly. Running rival drafts and separate agents whose job is to find the flaw raises your odds of catching a wrong answer, and each one adds the cost of re-doing the slice it checks. So treat extra checking the way you treat insurance, and buy it against the cost of the mistake. A migration that deletes data, an announcement going out to customers, anything you cannot take back or that the public will see, earns three separate checkers. A throwaway draft you plan to rewrite by hand earns none, because your own read is the only review it needs. Most work falls in between and gets one checker by default. If you are spending the same on checking across every slice, you are over-insuring some of them, and the riskiest slice is probably under-insured.
Send work that can wait to the batch tier
Everything above prices the fleet at real-time rates, and real time is the most expensive way to buy tokens. The major providers run batch tiers where you submit a pile of requests and collect the results within a window measured in hours, at a discount on the order of half; some discount off-peak hours on top of that. The numbers move often enough that the only place to read them is your provider's current pricing page, but the basic pattern stays stable.
This fits any fan-out work nobody is waiting on: re-tagging a support archive, refreshing summaries across a document base, regenerating evaluation fixtures overnight. A parallel fleet spends extra tokens to finish sooner, and the batch tier lets you trade that speed back for a discount whenever a slice has no deadline. Splitting your queue by deadline is often the difference between a fleet you can afford every week and one you can only afford at launch.
Try it now
This drill spends no tokens; run it before your next big fleet rather than after.
Get your unit. Open your provider's pricing page for current rates on your flagship model and its smallest capable sibling, then open your usage dashboard and note what a typical solo session costs you. That session is your unit.
Price the solo baseline. Take the cut list you wrote in Decomposition: split work so agents never collide and estimate one agent working it end to end. Keep every estimate to an order of magnitude, the nearest power of ten, because the comparison needs scale rather than precision.
Price the fleet. Estimate one agent per slice, each re-reading the shared context, plus an orchestrator that reads every report. Write the total and the wall-clock next to the solo numbers, and notice which one moved in which direction.
Price the fleet plus verifiers. Add a verification pass per slice and a strongest-model judge across the results, with extra refuters only on the slices where a miss is expensive.
Set the cap and write the ceiling rule. Pick the number this work is worth to you, set it where your tooling can enforce it, and write the rule in one sentence: when the cap hits, spawning stops and checking finishes on whatever exists. Keep the estimates and the rule where your next run can see them.
Chapter Summary
- A fleet's bill is tokens times agents: every agent pays full price for every token, so ten agents cost about ten times one agent.
- Running agents in parallel buys you a shorter calendar, not cheaper tokens.
- Coordination costs extra on top, because every handoff makes an agent re-read context the work already covered.
- A spending cap only helps if it can stop the run while it is happening, so wire it where it can actually halt work.
- When the cap is hit, stop starting new work and let the checking finish on everything you already have.
- Put cheap, fast models on the simple jobs, keep your everyday model for drafting, and give the judge job your strongest model.
- Buy extra checking against the cost of a mistake: heavy on anything you cannot take back, none on throwaway drafts, one checker for most work.
- Send any work with no deadline to the provider's batch tier and take the discount.
- A priced fleet is not yet a safe fleet, because agents that share a model and a brief can be wrong in the same way at once, which is where Failure modes: how fleets go wrong together picks up.
Sources
- Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley.
- Anthropic engineering blog (2025). How we built our multi-agent research system.
- Anthropic and OpenAI batch processing documentation and pricing pages (last verified June 2026).