When your product starts doing things · The Builder's Stack

Build your evalWhat your agent can do

For the past year your support assistant has drafted replies. It reads the ticket, pulls the order history, writes the response, and leaves it in a review queue where a person approves, edits, or discards before a customer sees a word. On its worst day the drafts are clumsy, the reviewers grumble, and each failure costs a few minutes of someone's attention.

This quarter the assistant gets the send button. Same model, same retrieval, one new permission, and now it looks up the order, issues the refund, sends the reply, and closes the ticket on its own. On its worst day it refunds the wrong order, mails one customer's details to another, and closes tickets it never resolved, at whatever pace the queue arrives. The model did not get worse between those two days; what changed is how much a single mistake is now allowed to touch.

Granting that permission is the loudest project in the industry right now, and most roadmaps decide it with less care than a pricing change. This part of The Practice level is about deciding it deliberately.

The four parts that make a feature an agent

The vocabulary around agents sounds more complicated than the machinery actually is. Read the loop above against your own feature and the moving parts come down to these.

A goal instead of a prompt. You stop specifying the next output and start specifying a finished state, not "draft a reply to this ticket" but "resolve this ticket within policy." Every turn of the loop exists to close the distance to that state, which is why a sloppy goal costs more than a sloppy prompt.
A loop of plan, act, observe. The model proposes a step, a tool runs it, the result comes back, and the next step is proposed with that result in view, until the goal is met or the run stops. The same loop that lets an agent recover from a failed step lets a wrong turn compound quietly for many steps.
Tools as the hands. A model by itself only produces text. Every ability to do something, query the order system, issue the refund, send the email, is a tool you chose to wire in, so the tool list, not the prompt, is where the product's power and risk live.
Receipts for the human. Every pass through the loop leaves a record of what was tried, what came back, and what happened next. Those records are how a person supervises, audits, and unwinds an agent's work, and Receipts and recovery: design for the failed run treats them as a design object of their own.

The cost of a mistake changes when the product acts

A product that answers keeps a person between the model and the world. The model produces text, a human reads it, and only human hands touch anything real, so a wrong answer costs a correction: an edit, a retry, an apology in the next message. The error never leaves the review lane.

A product that acts moves the person out of the middle, and a wrong output becomes a wrong event: money moved, a message delivered, a record overwritten. The cost stops being the correction and becomes whatever the action touched, plus the cleanup and the lost confidence of whoever it touched. Speed multiplies it, since the loop runs as fast as the queue feeds it, and an error rate you shrugged at in drafts becomes a steady stream of wrong refunds. So the question to ask before shipping an action is not how often the system gets it wrong, but how much each wrong action costs once you multiply it by how often they happen.

Agency is authority you hand to the system

Agency is not a property of the model; it is the authority you choose to hand over.

Agency is the set of actions you have decided the system may take on a user's behalf, and each one carries its own cost of being wrong.

That makes it a product decision, and you make it one action at a time rather than once for the whole feature. You cannot design an answer to "should the assistant be agentic," but you can answer "may it send the reply without review, may it issue the refund, may it close the ticket," and those questions get different answers.

Between a system that only proposes and one that acts with nobody watching sits a ladder of positions, asking first, acting with an undo window, acting and notifying afterward, and every action deserves a chosen rung rather than an inherited default. The autonomy ladder: place every action deliberately spends a full chapter on that placement and how an action earns promotion.

Most features are better off assisting than acting

The pressure to ship the acting version is real. Agents that operate real interfaces the way a person does, browsing, filling forms, completing purchases, are shipped products now rather than research previews, with Anthropic's computer use arriving in late 2024 and OpenAI's Operator following in early 2025. What the system can technically do is no longer the limit, but the cost of a mistake has not changed.

Meanwhile the agent-building guidance the major labs publish is strikingly conservative: start with the simplest thing that works and add autonomy only where it earns its keep, which makes the full goal-driven loop a last resort. Most features deliver most of their value just by assisting, drafting the reply, proposing the step, prefilling the form, while a person stays in place to press send.

Letting the product act earns its place in narrower territory: work where reviewing the output costs as much as doing the job, volumes no reviewer could keep up with, or value that only exists at machine speed. Even there, the system earns autonomy rather than starting with it, the way a new hire does, by building a track record someone has reviewed. Your eval vocabulary already describes that promotion: a quality bar written for the action, test cases that include the rare and ugly inputs, and The regression gate: no change ships blind applied to permissions instead of answers.

Everything in this part is a single agent acting inside your product on your user's behalf. Fleets of agents working for you as the builder are a different discipline, and they live in The Frontier, beginning with Why one agent stops being enough.

Try it now

The drill takes about fifteen minutes, runs on your agent feature, real or planned, and produces the working document for this part.

Inventory every action. List everything your product could do on a user's behalf: every send, charge, refund, post, write, delete, booking, and update, including those hidden inside integrations. If the feature lives in a repo, ask Claude Code to enumerate every tool definition, API call, and webhook that writes, sends, or spends, then add what it missed. If the feature is still a plan, list the actions the plan implies.

Mark each action reversible or irreversible. For every row, ask whether another action can fully undo it, who can do that, and how fast. A draft is reversible, a delivered email is not, and a refund that depends on someone else's support queue is irreversible for design purposes. Mark borderline cases irreversible and note why.

Keep the list. Save it next to your quality bar. The next chapter reads every row as authority you are delegating, the ladder chapter assigns each row a rung, and the capstone turns the inventory into the charter your agent ships under.

Chapter Summary

An agent is four parts: a goal to reach, a loop of plan, act, and observe, tools that let it do things, and records a person can read afterward.
The model does not get more dangerous when you make it an agent. What changes is how much a single mistake is allowed to touch.
When a product only answers, a wrong output costs a correction. When it acts, a wrong output costs whatever the action touched, plus the cleanup and lost trust, and speed multiplies that because an agent acts as fast as work arrives.
So judge an action by what each mistake costs once you multiply it by how often mistakes happen, not by the error rate alone.
Agency is not something the model has; it is authority you choose to hand over, one action at a time.
Most features deliver most of their value by assisting, drafting and proposing while a person presses send. Letting the product act on its own is the exception you justify, not the default.
An agent demo almost always shows the happy path, so treat a run that worked in front of a room as a rehearsal, not a track record.
Keep the action inventory from the drill, because every chapter ahead builds on it.
Up next, The action surface: every tool is delegated authority takes that inventory first and treats each tool as a contract about what your product may do.

Sources

Published agent-building guidance from major AI labs: start with the simplest approach that works, add autonomy only where it earns its keep (2024).
Anthropic, announcement and documentation of computer use (late 2024).
OpenAI, announcement of Operator (early 2025).

Marks this chapter complete on your course map. Reaching the end does this for you.