Run the human factors audit · The Builder's Stack

In a randomized trial, experienced developers worked in their own codebases with AI tools and came away convinced the tools had made them about 20 percent faster, while the measured result was 19 percent slower. A product can feel helpful while it works against its user, and neither building it nor using it daily reveals the difference, because the difference only shows up under inspection.

Each chapter in this part ended with a recommendation you could ship. This chapter assembles them into that inspection, an audit you can run in about an hour on any AI product, yours or a competitor's, and finish with a written list of where the product works against its user. The recommendations come from our essay The Human Factors, and "why AI products need human factors" covers the science underneath them.

Audit the product, not the model underneath it

The same model now sits under many products, and each product puts it in front of a different person doing a different job. The human factors show up where that person meets that product, so there is nothing to audit at the model level.

You cannot audit a model. You can only audit a product in the hands of its user, because that is where the human factors live.

The Claude family shows the separation. One model runs the chat product, where a wrong answer costs a re-prompt and the controls stay light, and the same model runs Claude Code, where a wrong command can cost real work, so the product asks permission before it edits files or runs shell commands. The model is identical and the stakes are not, so each product needs its own audit.

What you need before you start

A one-sentence scope. Name the user, the task, and the moment, for example "a support lead clearing the overnight queue before standup." If the product serves three distinct jobs, plan three audits.
About an hour, around ten minutes per station.
The product in a real state. A live session with real data, not a demo account or screenshots.
Two people if you can get them. One drives while the other records, because the driver adapts to flaws too quickly to notice them.
A written output. Every gap gets one sentence naming the failure and one naming the harm to the user.
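If the recorder prefers a structured file to free notes, the scope and the written output fit in a few small records. The sketch below is one way to hold them, not part of the audit itself: the names Scope, Gap, and Audit are hypothetical, and the sample finding is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scope:
    """One-sentence scope: the user, the task, and the moment."""
    user: str
    task: str
    moment: str

    def sentence(self) -> str:
        return f"{self.user} {self.task} {self.moment}"


@dataclass(frozen=True)
class Gap:
    """One finding: a sentence for the failure and a sentence for the harm."""
    station: str
    failure: str
    harm: str

    def __post_init__(self) -> None:
        # A gap with no stated harm cannot be ranked later, so reject it early.
        if not self.failure.strip() or not self.harm.strip():
            raise ValueError("a gap needs both a failure sentence and a harm sentence")


@dataclass
class Audit:
    scope: Scope
    gaps: list[Gap] = field(default_factory=list)

    def record(self, station: str, failure: str, harm: str) -> Gap:
        gap = Gap(station, failure, harm)
        self.gaps.append(gap)
        return gap


# Scope taken from the example above; the finding itself is made up.
audit = Audit(Scope("a support lead", "clearing the overnight queue", "before standup"))
audit.record(
    "perception",
    "The data-loss caveat is set in the same gray as the surrounding text.",
    "The user can delete records without registering the warning.",
)
print(audit.scope.sentence())
```

Rejecting a gap that lacks a harm sentence is a design choice of this sketch: it forces the recorder to write the harm while the observation is fresh.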

Walk the six stations

Each station has a fixed question, a place to look, and a picture of what failing looks like, in the order of the chapters of this part.

Station 1, perception. The question is whether the most important thing on the screen is also the most visible thing. The eye commits to color, size, and motion before reading begins, a scan called pre-attentive processing. Pull up the riskiest screen the product can show, give a colleague a one-second look, and ask what they saw. Failing looks like a caveat set in the same gray as everything around it while a success badge glows green. The full case is in "make the warning impossible to miss."

Station 2, working memory. The question is whether a user ten turns into a session can read its state in one pass. Working memory holds Miller's seven items, plus or minus two, while a session generates hundreds. Open a real session at turn ten and check where the plan lives, where changes are listed, and what survives a restart. Failing looks like state that exists only in scrollback and notes kept in a second window. The full case is in "keep the session on the screen."

Station 3, anxiety. The question is whether the product steadies the user or rattles them just before an action they cannot take back. Worry and the task draw on the same limited attention, so a frightened user has less of it left for the decision in front of them. List every action that cannot be undone, such as send, delete, publish, spend, and deploy, and check each for a preview, a confirmation, and an undo. Failing looks like a bare irreversible action, or the opposite, confirmations on everything until people click yes by reflex. The full case is in "lower the stakes at risky moments."

Station 4, mental models. The question is what the product, opened cold, teaches a new user about what it can and cannot do. The user's internal picture of how the system works, their mental model, comes from older software whenever the product teaches nothing, and that picture is usually wrong for AI. Open a fresh account, write down what the first screen literally communicates, and note whether limits appear next to highlights. Failing looks like capability discovered by accident weeks in. The full case is in "show people what the system can do."

Station 5, metacognition. The question is what on the screen lets this user catch a wrong answer. Metacognition is the judgment you run on your own knowledge, and the sensemaking paradox holds that the people who most need an answer are the least equipped to evaluate it. Pick the three most consequential outputs and time how long each takes to verify using only what is on screen, whether a source, a diff, or a test. Failing looks like every answer arriving with the same flat confidence and no check short of redoing the work. The full case is in "help people catch wrong answers."

Station 6, supervision. The question is where the outside check sits on every agentic path, meaning any flow where the model acts on its own instead of only answering. The model cannot reliably detect its own failures, so a check has to come from outside it, whether a human approval, a guardrail, or monitoring that pages a person. Find the most autonomy the product ever grants, then ask whether someone can see the plan before it runs and stop it in one step while it runs. Failing looks like the model being left to grade its own work. The full case is in "keep a human in charge of the agent."
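For a builder, the two things Station 6 looks for can be pictured in a few lines. This is a minimal sketch under stated assumptions, not how any particular product implements supervision: SupervisedRun, approve, and execute are hypothetical names, the approver is a plain function standing in for a person or a guardrail, and a stop only takes effect between steps.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SupervisedRun:
    """Runs planned steps only after an outside approval, and halts on request."""
    steps: list[str]
    approve: Callable[[list[str]], bool]  # the outside check: sees the plan first
    execute: Callable[[str], None]        # performs one step
    stopped: bool = False
    done: list[str] = field(default_factory=list)

    def stop(self) -> None:
        # One call halts the run before the next step starts.
        self.stopped = True

    def run(self) -> list[str]:
        # The whole plan goes to the approver before anything executes.
        if not self.approve(list(self.steps)):
            return self.done
        for step in self.steps:
            if self.stopped:
                break
            self.execute(step)
            self.done.append(step)
        return self.done


log: list[str] = []


def execute(step: str) -> None:
    log.append(step)
    if step == "draft reply":
        run.stop()  # stand-in for a person pressing stop mid-run


run = SupervisedRun(
    steps=["draft reply", "send reply", "close ticket"],
    approve=lambda plan: True,  # stand-in for a human approving the plan
    execute=execute,
)
completed = run.run()

# A plan the approver rejects never executes a single step.
rejected = SupervisedRun(["delete account"], lambda plan: False, log.append).run()
```

The point of the sketch is where the decisions sit: approval and stop both live outside the thing doing the work, which is what the station asks you to look for.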

Rank the gaps by user harm, not by a score

The output is a list of gaps ranked by the harm each does to the user, never a single score, because a score becomes the goal and gaps get hidden to improve it. A warning that does not register at a glance outranks an empty state that could teach more, because the first can cost someone their data and the second costs them a slower first week. Rank by what the failure does to the user, not by what the fix costs you, and negotiate the build order later, on the roadmap.

Trust what you watched over what anyone reports, including yourself, since the developers in the trial that opened this chapter were sincere and still wrong about their own sessions. Write down what you can actually see a user miss, hold in their head, or accept without checking.
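One way to keep the ranking honest is to sort on a harm tier and leave fix cost out of the sort key entirely. The sketch below assumes three hypothetical tiers. The names and the sample cost figures are illustrative, and the tier exists only to order the list, not to be summed into a total.

```python
from enum import IntEnum
from typing import NamedTuple


class Harm(IntEnum):
    """Hypothetical harm tiers; a higher value means worse harm to the user."""
    SLOWER_WORK = 1     # costs the user time, such as a slower first week
    WRONG_DECISION = 2  # the user acts on an answer they could not check
    LOST_WORK = 3       # the user loses data or work they cannot get back


class RankedGap(NamedTuple):
    failure: str
    harm: Harm
    fix_cost_days: int  # kept for the roadmap, deliberately not used to rank


def rank_by_harm(gaps: list[RankedGap]) -> list[RankedGap]:
    # sorted() is stable, so gaps in the same tier keep the order they were observed in.
    return sorted(gaps, key=lambda gap: gap.harm, reverse=True)


observed = [
    RankedGap("The empty state teaches nothing about limits.", Harm.SLOWER_WORK, 2),
    RankedGap("The warning does not register at a glance.", Harm.LOST_WORK, 10),
]
ranked = rank_by_harm(observed)
for gap in ranked:
    print(gap.harm.name, "-", gap.failure)
```

The cheaper fix lands second here even though it would be quicker to ship, which is the separation the ranking is meant to protect.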

Download the fillable audit and put it on the calendar

The whole review ships as the Human Factors audit, a downloadable PDF you fill in directly, typing into its fields and checking its boxes as you walk. Each station carries its question, its inspection steps, checkboxes for the common failure patterns, and room for the gaps you find, so the completed file is the deliverable you hand your team.

An audit run once decays, so give it a standing slot.

Quarterly on your own product, because products drift toward whatever each sprint optimized.
On each serious competitor, because it turns vague envy into a precise read of the gaps they leave open.
Before every major launch, because the cheapest time to find a bare irreversible action is before your users do.

Try this today: the 15-minute version

If the full hour is not on the calendar yet, run this slice.

Write the one-sentence scope naming the user, the task, and the moment.
Pull up the riskiest screen the product can show, give a colleague a one-second look, and write down what they saw and what they missed.
List every irreversible action and mark whether each has a preview, a confirmation, and an undo.

The slice usually finds the gap that justifies booking the full hour.
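The third step produces a small table, and a few lines of code can hold it. This is a minimal sketch, assuming you record each safeguard as a yes or no. ActionCheck and bare_actions are hypothetical names, and the observed values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionCheck:
    action: str
    preview: bool
    confirmation: bool
    undo: bool

    def missing(self) -> list[str]:
        """Names of the safeguards this action lacks."""
        checks = {
            "preview": self.preview,
            "confirmation": self.confirmation,
            "undo": self.undo,
        }
        return [name for name, present in checks.items() if not present]


def bare_actions(checks: list[ActionCheck]) -> list[str]:
    """Actions with no safeguard at all, the bare irreversible action."""
    return [check.action for check in checks if len(check.missing()) == 3]


observed = [
    ActionCheck("send", preview=True, confirmation=True, undo=False),
    ActionCheck("delete", preview=False, confirmation=False, undo=False),
    ActionCheck("publish", preview=True, confirmation=False, undo=False),
]
for check in observed:
    print(check.action, "is missing:", ", ".join(check.missing()) or "nothing")
print("bare:", bare_actions(observed))
```

The table does not catch the opposite failure, confirmations on everything, which still needs a person watching how users respond to them.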

What comes next in The Practice

More of The Practice is on the way, with parts on evals, the tests you run on model behavior, and on agentic systems. Station 6, supervision, leads straight into both. The builder fundamentals under this part, such as context files, iteration discipline, and the verify-first habit, live in "working with AI" in Foundations.

Chapter Summary

The audit turns this part's six recommendations into one review you can run in about an hour and repeat on any AI product.
Audit a product, not the model under it. The same model behaves differently in front of different users doing different jobs, so the human factors only show up at the product level.
Start with a one-sentence scope naming the user, the task, and the moment. If the product serves three jobs, that is three audits.
Walk the six stations in order: perception, working memory, anxiety, mental models, metacognition, and supervision. Each one has a fixed question, a place to look, and a picture of what failing looks like.
The output is a written list of gaps, each one stating the failure and the harm it does to the user.
Rank the gaps by how much they hurt the user, not by what the fix costs you. Sort out the build order later, on the roadmap.
Trust what you watched over what anyone reports, including yourself, since people are sincere and still wrong about their own sessions.
Keep the result a ranked list of harms, never a single score, or the number becomes the goal and the gaps get hidden to improve it.
Run it on a schedule: quarterly on your own product, on each serious competitor, and before every major launch. The fillable Human Factors audit is the file you fill in as you walk.
This audit checks the experience around the answers. Whether the answers themselves clear a bar you can state and defend is the next part's question, which starts with why the demo is not the product.

Sources

Our essay The Human Factors, whose recommendations the audit assembles.
Treisman, A. (1986). Features and objects in visual processing. Scientific American, 255(5).
Miller, G. A. (1956). The magical number seven, plus or minus two. Psychological Review, 63(2).
Pessoa, L. (2009). How do emotion and motivation direct executive control? Trends in Cognitive Sciences, 13(4).
Norman, D. A. (1988). The Design of Everyday Things. Basic Books.
Pirolli, P., & Russell, D. M. (2011). Introduction to this special issue on sensemaking. Human-Computer Interaction, 26(1-2).
Becker, J., Rush, N., Barnes, E., & Rein, D. (2025). Measuring the impact of early-2025 AI on experienced open-source developer productivity. METR, arXiv:2507.09089.
Claude and Claude Code behavior verified against Anthropic's product documentation.
