AI-Native PM

Supervision: stay in command of many agents

You tiled six terminal windows and for a few minutes read them all. Then lane two stalled on a test suite, lane five streamed a directory listing, and your attention settled into a glaze that looked like watching. Somewhere in there, lane four printed one line about rewriting the shared changelog, a file outside its slice, and the scroll carried it away. You learned of it an hour later, from the merge conflict. With one agent, watching the transcript was supervision; with six, watching only looks like supervision.

The rule has not changed since "Supervision: keep a human in charge of the agent" settled who is in charge of one agent. But back then supervising meant reading the transcript, and reading does not stretch across six lanes.

Once you can no longer read everything, supervision stops being something you do and becomes something you design: points where the run waits for you, one picture of health, scheduled reads, and a stop that always works.

The job moves from operator to supervisor

This seat has a name, supervisory control, studied for decades wherever automation does the work and a human oversees it. Control rooms and cockpits settled on the same job: set the limits before anything moves, watch a short summary instead of the raw feeds, read the detail on a schedule, and step in when the summary crosses a line. Staying alert and reading samples are the job itself, and the rest of this chapter applies that to your fleet.

Review when a stage finishes, not while it runs

A phase gate is a point where the run stops and waits for your decision, placed wherever something complete exists to judge: after the work is split up, before anything spawns; after the drafts are in, when every item has returned and nothing has merged; right before publish, before the result leaves your machine. At each gate you review one finished artifact (a plan, a stack of returns, a final diff), and between gates the fleet runs without you. Checking in mid-run buys little, because half-finished agent work is no easier to judge than half-finished human work, and it scrolls past faster.

Wherever else you place gates, two belong on every run. One sits before anything you cannot take back: a merge into shared code, a customer send, a deploy, a deletion. The other sits before anything expensive: the widest fan-out, the strongest-model judge pass, anything that multiplies tokens by the number of workers. A plan with no gates is claiming the run can neither destroy anything of value nor spend real money, and at least one of those is almost always wrong.
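The two mandatory gates can be sketched as a small pass over a plan. This is a minimal illustration under stated assumptions, not any harness's API: the `Step` type, its `irreversible` and `expensive` flags, and the `place_gates` function are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of a run. All fields are illustrative assumptions."""
    name: str
    irreversible: bool = False  # merge, customer send, deploy, deletion
    expensive: bool = False     # widest fan-out, strongest-model judge pass


def place_gates(steps: list[Step]) -> list[str]:
    """Return the run order with a human gate before every risky step."""
    order: list[str] = []
    for step in steps:
        reasons = []
        if step.irreversible:
            reasons.append("irreversible")
        if step.expensive:
            reasons.append("expensive")
        if reasons:
            # The run would stop here and wait for a human decision.
            order.append(f"GATE before {step.name} ({', '.join(reasons)})")
        order.append(step.name)
    return order


# Hypothetical plan for illustration.
plan = [
    Step("split work"),
    Step("fan out drafts", expensive=True),
    Step("judge pass", expensive=True),
    Step("merge to shared code", irreversible=True),
]
gated = place_gates(plan)
for line in gated:
    print(line)
```

A plan that comes back from a pass like this with no gates at all is the claim the paragraph above warns about: nothing irreversible and nothing expensive anywhere in the run.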

GitHub's Copilot coding agent is a good example of this gate in practice. It takes an assigned issue, works in the background, and returns a pull request, so the human reviews a finished diff with tests and a description rather than watching it type.

What you check between gates

Between gates the fleet should not need you, but you still need to know it is healthy. Keep one status picture the run keeps current, limited to what a glance can absorb.

  • The counts. Done, running, failed. Failed only protects you if recorded honestly: a worker that exits without its return is failed, not quiet progress.
  • The phase. Which stage is live and which gate comes next, so you know whether a decision is minutes or hours away.
  • The last anomaly. The freshest deviation from plan: a retry, a failed check, a write that bounced off a guardrail. History belongs in the log, not the glance.

Everything else is detail; transcripts are for sampling, not for reading. This is the discipline from "Monitoring, how you know it broke (and what it costs)", pointed at a run instead of a product.
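One way to picture the one-glance status is as a single line computed from worker records. The sketch below is an assumption about how such records might look, not a format any tool prescribes: `Worker`, `classify`, and `status_line` are hypothetical names. The one rule it encodes comes from the list above: a worker that exits without its return counts as failed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    """Illustrative record of one lane; field names are assumptions."""
    lane: str
    exited: bool      # the worker's process has ended
    has_return: bool  # it delivered the return its contract requires


def classify(worker: Worker) -> str:
    """Honest counting: exited without a return is failed, not quiet progress."""
    if not worker.exited:
        return "running"
    return "done" if worker.has_return else "failed"


def status_line(workers, phase: str, next_gate: str,
                last_anomaly: Optional[str]) -> str:
    """Build the glance: counts, phase and next gate, freshest anomaly only."""
    counts = Counter(classify(w) for w in workers)
    return (
        f"done {counts['done']} | running {counts['running']} | "
        f"failed {counts['failed']} | phase: {phase} -> next gate: {next_gate} | "
        f"last anomaly: {last_anomaly or 'none'}"
    )


# Hypothetical fleet for illustration.
fleet = [
    Worker("lane-1", exited=True, has_return=True),
    Worker("lane-2", exited=False, has_return=False),
    Worker("lane-4", exited=True, has_return=False),
]
print(status_line(fleet, "drafts", "pre-merge", "lane-4 exited without return"))
```

Only the most recent anomaly appears in the line; anything older stays in the log.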

Read a few full transcripts at random each run

The status picture tells you the run is moving, not that the work is any good, and the gates only show finished artifacts. Reading a few transcripts in full covers that gap. Each run, pick a couple and read them start to finish, the way a manager who skips a level reads the raw work instead of the summaries. Rotate which lanes and roles you read, the judge and the dullest formatter included, so no worker is exempt. A full read checks three things:

  • Does it match the picture? The transcript should agree with the status line. A lane marked done on top of a silent failure tells you your reporting is broken, not just that one lane.
  • Did the worker stay in its lane? It should have touched only what it was allowed to get, return, and write. The changelog rewrite from the opening is what a sample catches while it is still one lane wide.
  • Would your gates have caught a bad version of this? Given what the transcript shows, your gates should still catch this lane's worst likely failure. Where they would not, a gate is missing or misplaced.

Dull samples still pay off, because each one tells you how far the status picture can be trusted. The alarming ones are usually failures that hit several lanes at once, cataloged in "Failure modes: how fleets go wrong together"; a sample lets you catch them while they are still confined to one lane.
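Random picks with rotation can be sketched in a few lines. This is one possible approach, assuming you keep a set of lanes already read in the current rotation; `pick_samples` and the lane names are hypothetical. Unread lanes are drawn first, and once too few remain the rotation starts over, so every lane and role gets read eventually.

```python
import random


def pick_samples(lanes, read_so_far, k=2, rng=None):
    """Pick k lanes to read in full, favoring lanes not yet read this rotation.

    `read_so_far` is a set the caller keeps between runs; it is updated in place.
    """
    rng = rng or random.Random()
    k = min(k, len(lanes))
    unread = [lane for lane in lanes if lane not in read_so_far]
    if len(unread) >= k:
        picks = rng.sample(unread, k)
    else:
        # Finish the rotation with whatever is left, then top up at random
        # from the other lanes and begin a new rotation.
        others = [lane for lane in lanes if lane not in unread]
        picks = unread + rng.sample(others, k - len(unread))
        read_so_far.clear()
    read_so_far.update(picks)
    return picks


# Hypothetical roster: the judge and the dullest formatter are included.
lanes = ["planner", "worker-1", "worker-2", "judge", "formatter"]
read_so_far = set()
history = [pick_samples(lanes, read_so_far) for _ in range(3)]
print(history)
```

With five lanes and two reads per run, three runs are enough to cover every lane at least once, whatever the random draws turn out to be.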

Keep one switch that stops everything

Abort authority is one action that stops the whole fleet mid-run, no questions asked. The kill rule made stopping one session cheap, and the fleet version has to stay that cheap, because a stop that takes real effort loses out to letting the run finish in the moments it matters most. Say out loud how you would stop everything this minute. If the answer involves hunting through process lists, closing windows one at a time, or working out where a half-applied change landed, you do not have abort authority; you have a cleanup procedure.

Stopping stays cheap because of choices you already made. Workers only write inside their own territory, and every irreversible step sits behind a gate, so an aborted run is a folder of drafts and the only loss is the tokens already spent. Decide the triggers ahead of time, such as a failure count past your line, a write outside any territory, or spend past the cap, so pulling the switch is recognition rather than an argument mid-incident.
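Deciding the triggers ahead of time amounts to writing them down as a check with no judgment calls left in it. The sketch below assumes three inputs your status picture already tracks; `AbortPolicy`, `abort_reasons`, and the thresholds are hypothetical, and the stop itself (the keystroke or command that halts the fleet) is harness-specific and not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortPolicy:
    """Thresholds chosen before the run; the values used below are illustrative."""
    max_failed: int   # the failure count you will tolerate
    spend_cap: float  # in whatever unit your cap uses (tokens or currency)


def abort_reasons(policy: AbortPolicy, failed: int,
                  out_of_territory_writes: list[str], spend: float) -> list[str]:
    """Return every trigger that has fired; an empty list means keep running."""
    reasons = []
    if failed > policy.max_failed:
        reasons.append(f"failed count {failed} is past the line of {policy.max_failed}")
    if out_of_territory_writes:
        reasons.append(f"write outside any territory: {out_of_territory_writes[0]}")
    if spend > policy.spend_cap:
        reasons.append(f"spend {spend} is past the cap of {policy.spend_cap}")
    return reasons


policy = AbortPolicy(max_failed=2, spend_cap=50.0)
healthy = abort_reasons(policy, failed=1, out_of_territory_writes=[], spend=12.0)
tripped = abort_reasons(policy, failed=3,
                        out_of_territory_writes=["lane-4 wrote the shared changelog"],
                        spend=12.0)
print(healthy)
print(tripped)
```

Any non-empty result means you pull the switch; the list only records why.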

Try it now

The drill is a design pass plus one real run; the gate plan it produces feeds your Orchestration Plan.

Take out your cut-list. The fleet you cut in "Decomposition: split work so agents never collide" is the one you now put under command: items, contracts, merge step.

Place the two gates. Mark where irreversible lives (merges, sends, deploys, deletions) and where expensive lives (the widest fan-out, the judge pass). Put a gate immediately before each and give it one line: the artifact you review, and what must be true for the gate to open.

Write the one-glance line. Draft the status line you want mid-run: counts, phase, last anomaly. In Claude Code, tell the orchestrating session to append it to a status file after every item; the instruction works in any harness that writes files.

Write the abort line. Name the literal stop (keystroke or command) and the triggers that fire it: the failed count you will not tolerate, a write outside any territory, spend crossing the cap from "Economics: what a fleet costs and when it pays".

Schedule exactly one mid-run sample. Before the run starts, pick the moment (done count crossing half is a good default) and the lane, by rotation or dice. Read that transcript end to end when the moment arrives, however healthy the counts look.
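Fixing the moment and the lane before the run can be as small as two functions. This is a sketch under assumptions: `sample_due` implements the "done count crossing half" default, `lane_by_rotation` is one simple way to rotate by run number, and both names are hypothetical.

```python
def sample_due(done: int, total: int, already_taken: bool) -> bool:
    """True once at least half the items are done and the sample is still owed."""
    return not already_taken and total > 0 and 2 * done >= total


def lane_by_rotation(lanes: list[str], run_number: int) -> str:
    """Choose the lane from the run number, so the choice is fixed before the run."""
    return lanes[run_number % len(lanes)]


# Hypothetical run: six items, four lanes, seventh run of this fleet.
lanes = ["worker-1", "worker-2", "judge", "formatter"]
chosen = lane_by_rotation(lanes, run_number=7)
for done in range(7):
    if sample_due(done, total=6, already_taken=False):
        print(f"read {chosen} end to end (done count reached {done})")
        break
```

Neither function looks at the failed count, so a healthy status line cannot talk you out of the read.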

Run it and grade the plan. Hold the gates, glance between them, take the sample on schedule. Afterward write one verdict per control: did the gate artifacts support real decisions, did the sampled transcript match the status line, what came closest to abort. Keep the sheet with the cut-list.

Chapter Summary

  • Once you run more agents than you can read, watching every transcript stops being supervision, so you design the supervision into the run instead.
  • The job is supervisory control: set the limits up front, watch a short summary, read the detail on a schedule, and step in when the summary crosses a line.
  • Review when a stage finishes and a whole artifact exists to judge, not while the work is still running and scrolling past.
  • Put a gate before anything you cannot take back, and a gate before anything expensive; every run needs at least those two.
  • Between gates, keep one status picture you can take in at a glance: the counts, the current phase, and the last thing that went wrong.
  • That picture tells you the run is moving, not that the work is good, so read a few full transcripts at random each run, rotating lanes and roles so no worker is exempt.
  • Keep one switch that stops the whole fleet at once; if stopping means hunting through windows and processes, you have a cleanup procedure, not a real stop.
  • The better a fleet runs, the less you will want to check it, so name the sample in the plan ahead of time and take it whatever the counts say.
  • Next, everything this part built comes together when you write your Orchestration Plan and run your first fleet.

Sources

  • Sheridan, T. B. (1992). Telerobotics, Automation, and Human Supervisory Control. MIT Press.
  • Parasuraman, R., & Riley, V. (1997). Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2).
  • Parasuraman, R., & Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3).
  • GitHub documentation on the Copilot coding agent (2025).