Receipts and recovery: design for the failed run · The Builder's Stack

When inputs attackJudge the whole run

Your assistant was asked to close out a renewal: file the signed invoice, schedule the review call, draft the reply, send the follow-up, and update the account record. The user comes back to a red toast that says the run failed. It does not say that the review call is already on the customer's calendar and the follow-up is already in their inbox, or that the record was never touched. The user learns those facts the slow way, by opening the calendar, the sent folder, and the CRM to audit their own product by hand. Then they press the one button the toast offers, retry, and the follow-up goes out a second time.

Every run ends in one of three states: it finished, it failed before changing anything, or it failed partway and left real changes in the world. Demos show the first state, ordinary error handling covers the second, and the third state is where a user decides whether your product can be trusted to take actions on their behalf. This chapter is about designing for that third state deliberately, because it is the one that decides trust.

Record every action where the user can see it

A product that answers owes the user an answer, and a product that acts owes the user a ledger: a running list of what it did.

Every action gets recorded as a line a person can read: what was done, where, when, and whether it can be taken back. The lines are written in plain user language, like "filed invoice.pdf" or "scheduled the review call for next week," never as tool names and JSON arguments. The ledger is visible without the user asking for it, attached to the run where they already are, never something they export on request.

Working memory: keep the session on the screen established that the product carries the session state because the human cannot. Once the product acts, the state worth carrying is no longer context and drafts but changes to the world, and the session ledger becomes an action ledger.

A receipt matters most on the worst day, when the ledger is the only honest account of how far the run got, and on good days it is what a reviewer reads before approving. GitHub's Copilot coding agent shows this at full size: you assign it an issue, it works in the background, and it returns a pull request, so the record of what it did and the place where you approve it arrive as one object.

A retry must never double-send

The most pressed button after a failure is retry, so what a second attempt does to the first attempt's work is a design decision, not an accident. The property you want has a name, idempotency: an action built so that running it twice has the same effect as running it once. You get this by giving each send a key tied to the attempt instead of the click, so a repeat with the same key folds back into the original instead of firing again. Guard the booking, the charge, and the record write the same way.

Retries also arrive with nobody pressing anything. Timeouts fire, queues redeliver, and the agent's own loop retries failed steps by design, which is the recovery behavior you want from it. Without idempotent actions, every one of those retries is a chance to double-book or double-charge, and the user in the cold open mailed their customer twice for pressing the button you gave them.

A half-finished run reports exactly what finished

Treat partial completion as its own valid state, not as something to round down to failed. A run that completed three of its five actions reports exactly that: these three are done, each with its ledger line open to inspection, and these two never started. If you collapse that into a single failed flag, you erase the difference between nothing happened and almost everything happened, and those two situations need opposite cleanup work.

Interruption falls under the same contract. Whether the user hit stop, the process crashed, or a timeout fired, the run resolves to one of two ends.

Resume. The run picks up from the last completed line, and the ledger says what remains.
Roll back. The completed lines are compensated in reverse, and the ledger says what was unwound.

What a run never does is stop in limbo, half-applied and silent.

Anxiety: lower the stakes at risky moments showed that anxiety peaks at maximum uncertainty and spends from the same pool of attention the cleanup is about to need. A user discovering a half-finished run with no account of it is that case exactly, and the receipt is the intervention, because it converts unknown changes somewhere into a short list with an undo column, which is a problem a person can work.

Build undo as a real action, not just a button

For every write tool on The action surface: every tool is delegated authority, decide in advance what it takes to put the world back. Database engineers named this pattern sagas decades ago: you split long work into steps, and for each step you write a matching action that undoes it. Undo in an agent is that same pattern with a user interface on top, one undo action per write tool, designed when you build the tool rather than improvised after an incident. These undo actions come in three grades.

Exact. The filed document is unfiled, the record is restored to its prior value, and the world reads as before.
With residue. The cancelled event leaves every calendar, but the invitees saw it, so the undo ships with an acknowledgment of what lingers.
None. The sent email and the completed payout have no compensating action, and the ledger row says so.

The failure at the last grade is undo theater: a button that sits there even though no real undo action exists behind it, tidying up the interface while the email still sits in the customer's inbox. That is worse than offering no button at all, because it teaches the user that the stakes were lower than they really were. Gmail's send feature shows the honest version, where undo is just a short delay before the message goes out and the button disappears once that window closes. When an action is final, the receipt labels that row final, and that label should come from the reversibility inventory you wrote in When your product starts doing things, not from a user finding out the hard way mid-failure.

This label matters beyond the receipt. The middle rungs of The autonomy ladder: place every action deliberately, where the agent acts first but keeps an undo within reach, only make sense for actions whose undo is real, so mislabeling a row as undoable lets the agent act at a level of autonomy it should not have.

Try it now

The drill runs on your own agent feature, real or planned, and produces the worst screen your product can show, on your schedule instead of a customer's.

Kill a run mid-task yourself. In a staging or test environment, start a multi-step job and kill it right after the first write lands: stop the process, cut the network, or use your own stop control if one exists. If the feature lives in a repo, have Claude Code add a fault flag that aborts the run after the first successful write. If the feature is still a plan, walk the flow on paper and place the death one step after the first write.

Write down exactly what the user would see. Open the product as the user at that moment and record its account of events next to the truth. Make two columns, what actually changed in the world and what the product admits, and every row missing from the second column is state the user can only discover by opening other apps.

Fix the worst gap first. Rank the missing rows by the cost of not knowing, and repair the most expensive one. When we run this drill on our own builds, the winner is usually a run reported as cleanly failed while its writes sit live in the world, and the fix is a ledger line and a partial-completion state rather than a rebuild.

Chapter Summary

An agent that takes actions owes the user a ledger: a running list of what it did, in plain user language, visible without anyone asking for it.
Label every line on that ledger as either undoable or final, so the user always knows what they can take back.
Plan for the run that fails partway through and leaves real changes behind, because that is the case demos skip and the one that decides whether users trust your product.
Make actions idempotent so a retry folds into the original instead of running again. Without this, every retry can double-book or double-charge.
Report a partial run honestly: say which actions finished and which never started, instead of rounding the whole thing down to failed.
Build a real undo action for each write tool, designed when you build the tool, and do not show an undo button where no real undo exists.
When an interrupted run cannot finish, resolve it to either resume or roll back, and never leave it stuck halfway with no account of what happened.
The same ledger that calms the user's worst moment is also a record of what the agent did, step by step. That record gets put to work in Agent evals: judge the trajectory, not just the answer.

Sources

Garcia-Molina, H., & Salem, K. (1987). Sagas. Proceedings of the ACM SIGMOD International Conference on Management of Data.
GitHub, announcement and documentation of the Copilot coding agent (2025).
Product documentation: Gmail Help on Undo Send.

Marks this chapter complete on your course map. Reaching the end does this for you.