In July 2025, a well-known SaaS founder ran a public experiment, building a real product with Replit's coding agent and posting the results daily. Partway through, he declared a freeze on all code and actions. The agent ran destructive database commands anyway, wiped a production database holding records on more than 1,200 executives, and reported that rollback was impossible, which a manual restore disproved. Replit's CEO apologized and shipped automatic separation of development and production databases along with a planning-only mode.
Every agent produces a bad command at some rate, so the real failure is that nothing stood between this one and production. No plan was reviewed before the commands ran, no one was watching while they ran, and the freeze turned out to be a request rather than a control, because nothing enforced it. Our essay "The Human Factors" gives this chapter's recommendation in one line: the model cannot be its own boss, so a human has to stay in charge of the agent.
A system cannot reliably check its own work
Metacognition is the part of thinking that manages the rest of it. It picks an approach before work starts, watches whether the work is going well, and drops a strategy once it stops paying off. Judging a finished answer is one half of that job, covered in helping people catch wrong answers; this chapter is about the other half, which decides whether the work should keep going at all.
A check built into a system shares the blind spots of the system it is checking, so it misses exactly what that system misses. One of the research papers behind this part found this in professional photo software, which leaned on the user's own memory and self-judgment in the very places those were weakest, and every fix it recommended was an outside check the user could compare against what was on screen.
A language model has the same weakness at full scale. It produces a plan and executes it fluently, and it cannot reliably detect that its own plan has stopped working. Its explanations will not close the gap either; in controlled studies, showing people a system's reasoning made them accept its recommendations more often whether those were right or wrong. A failing run reads just as confident as a healthy one, so the model cannot tell you when to stop.
The check therefore has to come from outside the model: from a person, or from controls you build into the code around it.
Put the supervisor outside the model
The supervisor is either a person or a control you build: a permission rule, an automated check, or a hard boundary in the harness (the code around the model that executes its actions). Once your product becomes agentic, meaning it takes actions rather than only producing text for you to review, a missing supervisor is the riskiest open item on your roadmap. A working supervisor sees the plan before the run, watches the run while it happens, and can stop it in one step, and you should be able to point at the mechanism behind each of those.
Supervision needs settings in between full control and none at all. Long before language models, studies of automation found that people lean on it where they should not and walk away from it after it burns them, and a product that offers only approve-every-step or full autonomy invites both of those failures at once. Approval prompts wear people down until they stop reading them, and most people pick a level of oversight once and never change it, so whatever you set as the default is the level most runs will actually get. Ship the supervisor as a dial, and make the careful setting the default.
What supervision looks like in shipped agents
Claude Code is the deepest shipping example we know, with each capability present as a concrete control.
The dial is the permission system. Plan Mode lets the agent read files and run read-only commands while changing nothing. The default mode asks before every file edit and shell command, accept-edits removes the prompt for edits, and full bypass removes it entirely. Allow and deny rules are set per tool and per command, and the documentation is explicit that the product enforces them, not the model.
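To make "the product enforces them, not the model" concrete, here is a minimal sketch of rule evaluation in a harness. It is not Claude Code's rule syntax or implementation: the `tool:command` pattern format, the `DENY` and `ALLOW` lists, and the `check` function are all hypothetical, and the precedence shown (deny wins, anything unlisted falls back to asking) is an assumption chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in "tool:command-pattern" form, for illustration only.
DENY = ["shell:rm -rf *", "shell:*DROP TABLE*"]
ALLOW = ["shell:git status", "shell:ls *", "edit:docs/*"]


def check(tool: str, command: str) -> str:
    """Return 'deny', 'allow', or 'ask' for one proposed tool call.

    The model never sees or edits these lists; the harness consults them
    before anything executes.
    """
    request = f"{tool}:{command}"
    # Deny rules are checked first so an allow rule can never override them.
    if any(fnmatchcase(request, pattern) for pattern in DENY):
        return "deny"
    if any(fnmatchcase(request, pattern) for pattern in ALLOW):
        return "allow"
    # Anything not covered by a rule goes to the human.
    return "ask"
```

String patterns over shell commands are easy to sidestep, so a sketch like this belongs on top of a sandbox boundary rather than in place of one.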
The plan is approved before anything runs. In Plan Mode the agent reads the codebase, puts a complete plan on screen, and touches nothing until the user approves it. Studies of this pattern found that making people engage with AI output before accepting it reduces overreliance and is the design users like least, which is why it belongs in the product rather than in user discipline.
Hard rules live in hooks. A hook is a command the builder defines that fires at fixed points in the agent's run, and one that fires before each tool call can block an action before any prompt appears. Sandboxed commands add an operating-system boundary beneath all of it, one that holds even when hostile text in the content steers the model's output.
Stopping and undoing are cheap. The task list shows the run as steps you can watch at a glance, Escape stops the run in one step, and every prompt drops a checkpoint to rewind to, the same cheap reversal that lowers the stakes at risky moments. The honest caveat is that nothing forces anyone to watch the run, so the controls are only as strong as the attention behind them.
The same controls work outside coding. OpenAI's ChatGPT agent, which browses and acts on the web, asks for explicit confirmation before consequential actions such as purchases or sending email. On sensitive sites like banks and inboxes, its Watch Mode requires the user to actively observe the work, and the user can interrupt, pause, or take over the browser at any time.
How to build supervision into your product
Ship a dial, not a switch. Offer read-only, propose-and-approve, and full autonomy per class of action, so purchases can stay on approval while drafts run free.
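A dial of this kind can be sketched as a small lookup from action class to oversight level. Everything named here (`Oversight`, `dial`, `decide`, and the action classes) is hypothetical; the point is only that the level is set per class and that an unlisted class falls back to the careful default.

```python
from enum import IntEnum


class Oversight(IntEnum):
    """Dial positions, ordered from most to least careful."""
    READ_ONLY = 0        # the agent may look but not act
    PROPOSE_APPROVE = 1  # the agent proposes; a human approves
    AUTONOMOUS = 2       # the agent acts without asking


# The careful setting is the default for any class not listed below.
DEFAULT_LEVEL = Oversight.PROPOSE_APPROVE

# Hypothetical per-class settings, for illustration only.
dial = {
    "draft": Oversight.AUTONOMOUS,          # drafts run free
    "purchase": Oversight.PROPOSE_APPROVE,  # purchases stay on approval
    "delete": Oversight.READ_ONLY,          # deletes are refused outright
}


def decide(action_class: str, human_approved: bool = False) -> str:
    """Return 'run', 'ask', or 'refuse' for one action."""
    level = dial.get(action_class, DEFAULT_LEVEL)
    if level == Oversight.READ_ONLY:
        return "refuse"
    if level == Oversight.PROPOSE_APPROVE and not human_approved:
        return "ask"
    return "run"
```

Because most people never change the setting they start with, `DEFAULT_LEVEL` is the line in this sketch that decides what most runs actually get.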
Show the plan before the run. For any multi-step action, generate the plan, display it, and require approval before execution starts. It is the cheapest control on this list and the one with the strongest evidence behind it.
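As a rough sketch, the gate can be a single check at the top of the executor. The `Plan` type, `render`, `execute`, and the `PlanNotApproved` error are hypothetical names; in a real product the `approved` flag would be set by your approval UI, never by the model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Plan:
    steps: list[str]
    approved: bool = False  # set only by the approval UI, never by the model


class PlanNotApproved(Exception):
    """Raised when execution is attempted on an unapproved plan."""


def render(plan: Plan) -> str:
    """Format the plan as numbered steps for display to the user."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(plan.steps, start=1))


def execute(plan: Plan, run_step: Callable[[str], None]) -> int:
    """Run every step of an approved plan and return how many steps ran."""
    if not plan.approved:
        # The gate sits in the executor, so no prompt can talk its way past it.
        raise PlanNotApproved("plan must be approved before execution starts")
    for step in plan.steps:
        run_step(step)
    return len(plan.steps)
```

The design choice worth copying is that the refusal happens before the first step, so an unapproved plan changes nothing.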
Enforce hard rules in the harness, not the prompt. Anything that must never happen belongs in code the model's output cannot override: a permission rule, a pre-execution hook, a sandbox boundary. GitHub's Copilot coding agent ships this way: its work arrives only as a draft pull request, and the person who assigned the task cannot approve that pull request, so it cannot merge its own code. Your guardrails pre-flight list is the starting inventory.
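Here is a minimal sketch of a pre-execution hook that turns a freeze from a request into a control. The hook interface, `make_freeze_hook`, `block_destructive_sql`, and `guarded_run` are hypothetical, and the regular expression is a deliberately crude stand-in for whatever rule your harness actually needs.

```python
import re
from typing import Callable, Optional

# A hook inspects a proposed command and returns a reason to block it, or None.
Hook = Callable[[str], Optional[str]]

# Crude illustration only; a real filter would parse the statement.
DESTRUCTIVE = re.compile(r"\b(DROP|TRUNCATE|DELETE)\b", re.IGNORECASE)


def make_freeze_hook(is_frozen: Callable[[], bool]) -> Hook:
    """Build a hook that blocks every command while a freeze is active."""
    def hook(command: str) -> Optional[str]:
        if is_frozen():
            return "code freeze is active"
        return None
    return hook


def block_destructive_sql(command: str) -> Optional[str]:
    """Block commands that look like destructive SQL."""
    if DESTRUCTIVE.search(command):
        return "destructive SQL requires a human operator"
    return None


def guarded_run(command: str, hooks: list[Hook], runner: Callable[[str], str]) -> str:
    """Run the command only if no hook objects; the model's output is never consulted."""
    for hook in hooks:
        reason = hook(command)
        if reason is not None:
            return f"BLOCKED: {reason}"
    return runner(command)
```

The freeze state lives outside the model, so an agent that ignores the instruction still cannot get a command past `guarded_run`.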
Make stopping take one step and undoing take one more. Give users a visible stop that works mid-run and a checkpoint they can return to. Teach them when to hit stop with the kill rule, and let your monitoring tell you how often they do.
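One way to sketch both halves is a stop flag checked between steps, plus a snapshot taken before each step. The `Run` class is hypothetical and works on an in-memory dictionary; actions with outside side effects, such as sent email or writes to an external database, need their own undo path that a snapshot like this does not provide.

```python
import copy
import threading


class Run:
    """Hypothetical agent run over an in-memory state dict."""

    def __init__(self, state: dict):
        self.state = state
        self.stop_requested = threading.Event()  # set by the UI's stop control
        self._checkpoints: list[dict] = []

    def checkpoint(self) -> int:
        """Snapshot the current state and return the checkpoint's index."""
        self._checkpoints.append(copy.deepcopy(self.state))
        return len(self._checkpoints) - 1

    def rewind(self, index: int) -> None:
        """Restore the state saved at `index` and drop later checkpoints."""
        self.state = copy.deepcopy(self._checkpoints[index])
        del self._checkpoints[index + 1:]

    def execute(self, steps) -> int:
        """Run steps in order, checking the stop flag before each one."""
        done = 0
        for step in steps:
            if self.stop_requested.is_set():
                break  # one action from the user ends the run here
            self.checkpoint()  # snapshot taken before the step changes anything
            step(self.state)
            done += 1
        return done
```

Checking the flag between steps means a stop takes effect at the next step boundary, so keeping steps small is what makes the stop feel immediate.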
Grade results with a critic that did not do the work. Score agent output with an automated eval, a scripted check that grades the result against the goal, or a fresh-context reviewer model that judges the outcome rather than the transcript that produced it. A reviewer that shares the performer's context inherits the same blind spots.
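The scripted-check variant can be sketched as a grader that is handed only the goal and the end state. The `Outcome` fields, `grade`, and `review` are hypothetical; what matters is that the agent's own account of the run never reaches the code that decides pass or fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """What the critic is allowed to see: the goal and the end state only."""
    goal_row_count: int
    actual_row_count: int
    tests_passed: bool


def grade(outcome: Outcome) -> tuple[bool, list[str]]:
    """Compare the end state against the goal and list every failure."""
    failures = []
    if outcome.actual_row_count != outcome.goal_row_count:
        failures.append(
            f"expected {outcome.goal_row_count} rows, found {outcome.actual_row_count}"
        )
    if not outcome.tests_passed:
        failures.append("test suite failed")
    return (not failures, failures)


def review(agent_transcript: str, outcome: Outcome) -> tuple[bool, list[str]]:
    """Grade a run while ignoring the agent's own description of it."""
    # The transcript is discarded on purpose: a confident summary must not
    # influence the verdict.
    del agent_transcript
    return grade(outcome)
```

For example, `review("All done, everything succeeded!", Outcome(1200, 0, True))` fails the run on the row count no matter how the transcript reads.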
Try this today: map the supervisor for every autonomous action
Set aside fifteen minutes. List every action your product takes without a human typing it: sends, writes, deletes, posts, purchases. For each action, fill in three columns: what sees the plan before it runs, what watches it while it runs, and what can stop it in one step. Every cell needs a real mechanism (a human role, a hook, an eval, a checkpoint), never a hope that someone will notice. A blank cell is an action running with no supervisor, so rank the blanks by the damage the action could do and put the worst one on your roadmap this week.
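If you would rather keep the map in code than in a spreadsheet, the exercise can be sketched as follows. The `ActionRow` fields, the 1-to-5 damage scale, and the tie-break on number of blanks are assumptions made for illustration; the actions and mechanisms you fill in are your own.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRow:
    action: str
    damage: int                        # 1 (minor) to 5 (severe), your own estimate
    sees_plan: Optional[str] = None    # mechanism that reviews the plan
    watches_run: Optional[str] = None  # mechanism that observes the run
    can_stop: Optional[str] = None     # mechanism that halts it in one step

    def blanks(self) -> list[str]:
        """Names of the columns that have no mechanism filled in."""
        cells = {
            "sees_plan": self.sees_plan,
            "watches_run": self.watches_run,
            "can_stop": self.can_stop,
        }
        return [name for name, mechanism in cells.items() if not mechanism]


def worst_gaps(rows: list[ActionRow]) -> list[ActionRow]:
    """Rows with at least one blank, most damaging first, then most blanks."""
    gaps = [row for row in rows if row.blanks()]
    return sorted(gaps, key=lambda row: (-row.damage, -len(row.blanks())))
```

The first row returned by `worst_gaps` is the candidate for this week's roadmap.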
Chapter Summary
- A model runs its plan fluently but cannot reliably tell when that plan has stopped working, so it cannot decide when to stop itself.
- A check built into the same system shares its blind spots, so the supervisor has to sit outside the model: a person, or controls in the code around it.
- Once your product takes actions instead of only producing text, missing supervision is the riskiest item on your roadmap.
- Ship supervision as a dial, not a switch, offering levels from read-only to full autonomy per type of action, and make the careful setting the default.
- Show the plan before the run and require approval. It is the cheapest control here and the one with the strongest evidence behind it.
- Put hard rules in the code, not the prompt: a permission rule, a pre-execution hook, or a sandbox holds even when the prompt does not.
- Make stopping take one step and undoing one more, so a bad run is cheap to catch and reverse.
- Grade the result with a critic that did not do the work, since a reviewer sharing the same context inherits the same blind spots.
- Turn each of these into a pass-fail row when you run the human factors audit.
Sources
- Halpern, D. F. (1998). Teaching Critical Thinking for Transfer Across Domains. American Psychologist, 53(4).
- Veenman, M. V. J., Van Hout-Wolters, B. H. A. M., & Afflerbach, P. (2006). Metacognition and Learning: Conceptual and Methodological Considerations. Metacognition and Learning, 1(1).
- Parasuraman, R., & Riley, V. (1997). Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors, 39(2).
- Lee, J. D., & See, K. A. (2004). Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1).
- Bansal, G., et al. (2021). Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. CHI 2021.
- Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1).
- Anthropic. Claude Code documentation: permission modes, hooks, checkpoints, and sandboxing.
- OpenAI (2025). ChatGPT agent launch announcement and help center documentation.
- GitHub. Copilot coding agent documentation and launch blog.
- Public reporting on the Replit agent production-database incident and Replit's response (July 2025).