Living document · one manual, every level

The Playbook

Short, hard-won rules for working with AI: what to put in front of the model, how to verify what comes back, and when to stop and reset. Open it when a build fights back.


Working with the model

Getting good output by method, not luck: context, the right tool, and the loop.

Context engineering

The Fundamentals

Bad output is almost always a context problem, not a prompt problem. When a result disappoints, ask what context would have made the request answerable, then provide it at the right layer.

Project-level context is a brief file the tool reads before it acts. Claude Code reads CLAUDE.md at the project root, and the open AGENTS.md convention is read by other tools, including Codex. In claude.ai, ChatGPT, or Gemini, the same brief goes into the project instructions. What belongs in it:

What this is. What you are building and what a successful result does, in a paragraph.
Who it is for. The user, and what they must never experience: exposed data, surprise emails, lost work.
Hard constraints. The stack, what is out of scope, and what the tool may not change without asking.
Voice. The conventions that do not show up in the code itself: how copy sounds, which patterns you use, which you avoid, with an example of each.

Task-level context is the three sentences before the ask. State where we are (the relevant files and what currently happens), what done looks like (the observable behavior that ends the task), and what must not change (the constraint this task could break). Attach the file you mean and an example to match; "like this one, with these differences" produces better work than "make a new one."

Living context is the maintenance habit. When the tool gets something wrong twice, the fix belongs in the brief file, not in your patience. A repeated correction means the brief is missing a line; add the line, and every future session starts past the mistake. The brief from day one should not be the brief you are still running a month later. Keeping it current is the most skipped task in AI-assisted work and among the highest-leverage ones.

The rule behind every layer is the same: the model can only act on what is in front of it.

Read alongside: Working with AI Where AI fits

Tool selection

The Fundamentals

Pick the AI interface by how closely you need to watch the work, because there is no single best tool, only a right one for the step you are on. Much of the friction you will feel comes from using the wrong interface for the task.

Chat (claude.ai, ChatGPT, Gemini) for the thinking before code. Use it to shape requirements, compare approaches, draft specs, and make decisions before anything touches a repository. The conversation itself is the deliverable, and you are reading every word of it. Once the task needs to read or change real files, move on.

Editor-integrated AI (Cursor, Copilot) for in-place edits while you read. Use it to complete a function, rename something cleanly, or tighten the file you already have open. Every suggestion lands under your cursor, so oversight is continuous by default. It is the wrong tool once a change spans many files.

A terminal agent (Claude Code, Codex) for real multi-file work. Hand it tasks where the tool plans the change, edits across the project, and runs the result, while you review at checkpoints instead of keystrokes. Keep a CLAUDE.md (Claude Code) or AGENTS.md (the same convention for Codex and others) at the project root so every session starts with your conventions already loaded.

Background or cowork modes for long tasks you check in on. Use them for broad refactors, batch fixes, and other work you would willingly leave alone for twenty minutes at a stretch. Because you watch least here, reserve these modes for tasks with success criteria you can verify when you come back.

Choose by oversight, not by ranking. The question is how much you need to watch the work, not which tool is best, since the models behind these interfaces are largely the same. Too autonomous and you lose sight of what is changing; too hands-on and you give up the speed you came for.

Switch deliberately, one tool per task. Changing interfaces because the work changed form is good practice, while drifting between them mid-task and leaving context behind is how sessions go sideways. When a session starts to feel like a fight, ask whether you are in the wrong interface before you blame the model.

Read alongside: Working with AI Languages & frameworks

The iteration loop

The Fundamentals

AI work is iterative: the first output is a draft, and the quality of what ships depends on how you run the loop from there.

Specific feedback beats vague feedback. "Make it better" gives the model nothing to act on. Useful feedback names what is wrong, what right looks like, and one constraint:

What is wrong: "the table re-sorts on every refresh."
What right looks like: "the sort order holds until the user changes it."
One constraint: "without touching the export logic."

A concrete before and after beats adjectives like "cleaner" or "more professional," and the same logic covers new work: "like this one, but different in these ways" beats "make a new one," because an example plus specified differences carries more information than a description from scratch.

Verify before you trust. Checking output before accepting it is the difference between shipping with these tools and shipping bugs with them. Run the result and watch it do what was claimed. Read the diff and confirm the change touched nothing you did not expect. Check anything the output claims exists (a function, a library, a configuration option) against the real docs or source. And read the actual output, not the summary of the output.

Know your three exits when stuck. When the loop stalls, pick an exit deliberately instead of repeating the same correction in different words:

Push back with a sharper constraint. Choose this when you can name exactly what is off. Restate the target and add the constraint the last attempt violated.
Reset the thread with better context. Choose this when the session keeps circling an early wrong assumption. Start fresh, and move whatever was missing into your context brief (CLAUDE.md for Claude Code, AGENTS.md for Codex and other tools) so the new session starts past the confusion.
Do the two-minute piece yourself. Choose this when one small stubborn edit is blocking otherwise good work. Make the change by hand, then hand the task back and let the tool continue.

Each pass through the loop should add information; a turn that adds none is your signal to take an exit.

Read alongside: Working with AI

The kill rule

The Fundamentals · The Practice · The Frontier

If you have corrected the same thing several times in different words and the output is still wrong, the next attempt will not fix it; the thread is dead, and the right move is to end it. Correction loops do not converge, because every failed attempt stays in the context and keeps pulling new output toward the same wrong approach.

Recognize a dead thread.

The tool keeps fixing the same thing in different but equally wrong ways.
Each pass repairs one problem and introduces another.
You are restating the same correction in new words.
You are irritated, which is usually the signal you notice first.

Kill it cleanly.

Stop before you send another correction. Nothing in the current session improves by continuing it.
Name the constraint that kept being missed. Write it as a plain sentence ("never modify the schema", "match the existing error format"). If it is a standing rule, add it to CLAUDE.md or AGENTS.md so every future session starts past the mistake.
Open a clean session with that constraint up front. State it before the task, and attach an example of correct output if you have one.

Recovery moves when a clean restart is not enough.

Isolate the failing piece. Pull it out of the larger task and test it on its own; a small reproduction tells you whether the problem is the piece itself or everything around it.
Ask the tool to state its assumptions. Have it list what it is treating as true about the code, the data, and the goal, then correct the wrong entry directly instead of correcting the output again.
Fix the foundation, not the decoration. If an early decision in the work is wrong, change that decision; patches layered over a wrong base are what produced the loop you just escaped.

The kill rule is cost control, not failure. Every round in a dead thread spends time, tokens, and attention, and ending it early is the cheapest available outcome. If clean restarts keep failing too, the task may be faster to do by hand, which is also a legitimate exit.

Read alongside: Working with AI Guardrails Keep a human in charge Beyond one agent

Verifying the work

Catching what is wrong before you build on it or ship it.

The verify-first rule

The Fundamentals · The Practice · The Frontier

Make the tool prove the foundation works before features go on top: if the core method, library, or integration does not run in your environment, everything built on it is waste.

The failure this prevents. AI tools default to building. Asked for a feature, Claude Code, Codex, or a chat session produces one without checking whether the chosen environment can do the thing the whole build depends on. It picks an approach, builds the interface, adds features, and only then do you discover that the iframe cannot call your API, the local file is blocked by CORS, or the function times out before the core call completes. Everything stacked on that constraint is wasted work, often hours of it.

The move is the smallest runnable proof first. Add one extra prompt at the start of any new build: "Before you add any features, prove the foundation works. Build and run the smallest test that shows the core thing works here. If it fails, we change approach before adding anything else." Adapt it to the situation:

A proposed new environment (an artifact, a notebook, a serverless function): "What is the one call this whole thing depends on? Test that first."
An unfamiliar library: "Show me a five-line standalone test that the core function works in our environment."
An unproven third-party integration: "Make one real call and show me the response. Then we build."
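
As a sketch of what such a proof can look like, the snippet below wraps the one call a build depends on and reports pass or fail before any feature work starts. The names `prove_foundation` and `core_call` are hypothetical, and the stand-in call is only a JSON round-trip; in a real build you would substitute the actual API request, library function, or integration call.

```python
import json
import time
from typing import Any, Callable


def prove_foundation(core_call: Callable[[], Any], budget_s: float = 10.0) -> dict:
    """Run the one call the build depends on and report what happened.

    Hypothetical helper. It does not interrupt a slow call; it only
    flags a call that finished slower than the time budget.
    """
    start = time.monotonic()
    try:
        result = core_call()
    except Exception as exc:  # any failure means: change approach first
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        return {"ok": False, "error": f"took {elapsed:.1f}s, budget {budget_s}s"}
    return {"ok": True, "result": result}


# Stand-in for the real dependency (an assumption for illustration only).
def core_call() -> dict:
    return json.loads('{"status": "ready"}')


report = prove_foundation(core_call)
print(report)
```

If `report["ok"]` is false, that is the moment to change approach, before any interface or feature exists to be thrown away.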

You are violating the rule when:

Features are being added and the core has never run in front of you.
Every error is answered with more code instead of a rethink of the approach.
Half an hour has passed without a single runnable result.
Each fix introduces the next bug, the classic sign of a failing foundation.

The recovery. Stop the session, name the broken constraint ("this environment cannot reach the API"), test it in isolation, and change the foundation if the test fails. Only then resume building features. The time already spent is sunk; the next feature built on a base that cannot hold it is the waste you can still avoid.

Verify-first checks the approach before you build; checking each output after it lands is its own discipline, covered in Verification practices.

Read alongside: Working with AI Review the build Guardrails Help users catch errors Fleet verification

Verification practices

The Fundamentals · The Frontier

A change counts as verified when you have watched it work, not when the tool reports it finished; the habits below make that standard automatic.

Run the thing. Whatever was built, execute it: click the button, hit the endpoint, render the page. "It should work" is a prediction, and the standard is an observation, the feature doing its job in front of you.

Read the diff before accepting. Most coding tools show the change file by file before you apply it, and that view is the full statement of what happens to your project if you say yes. A recap of the change compresses and sometimes omits; the diff does not. The reading moves (filename first, minus is what was, plus is what is now, file count checked against the size of the task) are taught in Review what the AI built.

Never ship code you have not read. If you could not explain to a colleague what a change does, it is not ready to deploy, however clean the demo looked. Match the depth of the read to the stakes: a wording fix earns a skim, a billing change earns every line.

Add the cheap checks. Each costs a minute or two and covers ground the habits above can miss:

Click the actual flow once, like a user. Open the product rather than the code: sign up, run the feature, read what a stranger would read. Running the code proves it executes; walking the flow proves it works where users meet it, and it is the fastest way to catch a feature wired to sample data.
Ask the tool what it is least sure about. Before you accept, ask which part of the change is most likely to be wrong or incomplete, and read that part first. The answer points your attention at the riskiest file instead of spreading it evenly across the diff.

Read alongside: Working with AI Review the build Monitoring Fleet verification

Failure modes catalog

The Fundamentals · The Practice · The Frontier

Bad AI output falls into recognizable patterns, and each pattern has a tell you can spot during review and a counter that puts the work back on track.

Hallucination. The output references things that do not exist: imports that resolve to nothing, function calls with plausible signatures that fail at runtime, configuration options that appear in no documentation. It is most common with newer libraries, where training data is thin. Check imports and API calls against the real docs before you build on them, and bring those docs into context so the next pass works from the actual interface.
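
One cheap, partial check for hallucinated imports in generated Python is to ask the interpreter whether each imported module can be found at all. The sketch below is an illustration, not a complete defense: `unresolved_imports` is a hypothetical helper, it only looks at top-level module names, and it says nothing about whether a function or option inside a real module exists. Those still need the docs.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no installed module has this name
            if importlib.util.find_spec(top_level) is None:
                missing.add(top_level)
    return sorted(missing)


# Hypothetical generated code with one invented package name.
generated = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
print(unresolved_imports(generated))
```

Run it in the same environment the code will ship in, since a module that resolves on your laptop may be absent in production.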

Slop. The output hedges every claim, qualifies every option, and takes no position; if you deleted the whole answer, your decision would not change. Demand a position and a reason: "recommend one option and explain why, and I will push back if it is wrong."

Mock implementations. The work runs in a demo and fails on real input. The tells are TODO comments, sample data, hardcoded responses, and answers that never change no matter what you feed in. Ask what is real: have the tool separate working implementation from placeholder, then request the real version or supply the missing piece the placeholder was standing in for.

Over-engineering. The output contains abstractions the task never asked for: interface layers with a single implementation, configuration for cases that do not exist, a class hierarchy where a function would do. Ask for the boring version, and make that preference standing by adding it to CLAUDE.md or AGENTS.md: simple, direct code by default, abstraction only at the second use site.

Scope creep. The diff touches more files than the task warranted; you asked for a bug fix and received a refactor of everything nearby. Ask why each file changed, keep the changes with good answers, revert the rest, and scope the next prompt to the files that matter.
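
The file-count check can be made mechanical. The sketch below assumes you already have the list of changed paths (for example from `git diff --name-only`) and a list of path prefixes the task was allowed to touch; `files_outside_scope` and the example paths are hypothetical.

```python
def files_outside_scope(changed_files: list[str], allowed_prefixes: list[str]) -> list[str]:
    """Return the changed paths that fall outside every allowed prefix."""
    return [
        path
        for path in changed_files
        if not any(path.startswith(prefix) for prefix in allowed_prefixes)
    ]


# Hypothetical task: fix table sorting, so only src/table/ should change.
changed = ["src/table/sort.py", "src/export/csv.py", "README.md"]
surprises = files_outside_scope(changed, allowed_prefixes=["src/table/"])
print(surprises)
```

Every path in the result is a file to ask "why did this change?" about before you accept the diff.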

Confidently wrong. The output is fluent, specific, and false, and it is the expensive pattern because polish carries no information about accuracy. Only verification catches it, never tone: run the code, test the claim, check the reference. Every other pattern in this catalog shows up in the text if you read for it; this one is the reason reading alone is never enough.

Read alongside: Working with AI Review the build Guardrails Monitoring Help users catch errors How fleets break

Evals, the rules

The Fundamentals · The Practice

Write down what good output means before you change anything that shifts behavior, then hold every change to that bar. The rules below cover the bar, the cases, the graders, and the gate.

The bar is a product decision. Define it from the job the feature does, name the load-bearing dimension of quality (the one that decides whether an output is usable), and write it as what an output must never do, what it must always do, and the target you are raising. If the product owner cannot state this, no grader can test it.

Cases sample reality. Draw them from real usage first and include adversarial inputs always, because users supply both. Keep a golden set that stays fixed so scores stay comparable, alongside a living set that takes in new cases as usage shifts. Every case carries its expectation: the input plus what a correct output looks like, never the input alone.

Deterministic graders come first. Use exact matches, schema checks, and code that runs wherever the dimension allows. A model judge is for judgment dimensions only, and only when anchored to a written rubric, run with candidate order swapped, and spot-audited against human reads. Humans calibrate the stack, and when trained raters disagree on a case, the rubric is broken, not the raters.
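
The case and grader rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `EvalCase`, `exact_match`, and `has_required_keys` are hypothetical names, and the schema check only confirms that the output is a JSON object with the required keys.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str    # the input
    expected: str  # what a correct output looks like, never omitted


def exact_match(output: str, case: EvalCase) -> bool:
    """Deterministic grader: output must equal the expectation, ignoring outer whitespace."""
    return output.strip() == case.expected.strip()


def has_required_keys(output: str, required: set[str]) -> bool:
    """Deterministic schema check: output must be a JSON object with every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required.issubset(parsed.keys())


case = EvalCase(case_id="arith-001", prompt="What is 2 + 2?", expected="4")
print(exact_match(" 4\n", case))
print(has_required_keys('{"id": 1, "total": 2}', {"id", "total"}))
```

Graders like these run in milliseconds and never disagree with themselves, which is why they go first and the model judge is held for dimensions they cannot express.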

Every behavior-shifting change runs the set before merge. Prompt edits, model swaps, retrieval and tool changes count. Yesterday's bar is the floor: a change that violates a must-never or drops the golden-set score does not ship. Model upgrades are behavior changes too, so re-run the set on each one, including the ones the provider pushes unannounced.
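
A merge gate for this rule can be very small. The sketch below assumes each eval run has already been reduced to two numbers; `RunResult` and `may_ship` are hypothetical names, and a real gate would also record which cases regressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    golden_pass_rate: float     # fraction of golden-set cases passed, 0.0 to 1.0
    must_never_violations: int  # outputs that broke a must-never rule


def may_ship(baseline: RunResult, candidate: RunResult) -> bool:
    """Yesterday's bar is the floor: no must-never violation, no golden-set drop."""
    if candidate.must_never_violations > 0:
        return False
    return candidate.golden_pass_rate >= baseline.golden_pass_rate


yesterday = RunResult(golden_pass_rate=0.92, must_never_violations=0)
prompt_edit = RunResult(golden_pass_rate=0.90, must_never_violations=0)
print(may_ship(yesterday, prompt_edit))
```

The same function applies unchanged to a model swap or a retrieval change, because the gate looks at behavior, not at what kind of change produced it.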

Every production failure becomes a case. Within the week it appears, the failing input joins the living set with its expectation written down. A set with no route in from production measures last quarter's product.

Protect the metric from becoming the target. Hold out cases the team never tunes against, rotate cases so the set does not go stale, pair every score with a counter-metric that catches the obvious shortcut (brevity with completeness, helpfulness with accuracy), and keep a scheduled human read of raw outputs, because dashboards can hold steady while the product gets worse.

If you cannot say what good means, you cannot ship changes safely.

Read alongside: The demo isn't the product Define what good means Build your test cases Grade the outputs Block bad changes Signals after you ship When metrics get gamed Build your eval Monitoring

Operating in production

Keeping the bill, the risk, and the scope under control.

Cost awareness

The Fundamentals · The Frontier

Every model call spends real money, and cost failures report themselves as an invoice weeks after the mistake that caused them. The rule is to put limits in place before the feature meets real users.

The patterns that bite. Nearly every surprise bill traces back to one of these:

No per-user cap on anything that calls a model. One heavy user, or one bot that finds the endpoint, runs the meter alone, and your average-cost math says nothing about your worst user.
A background loop with no exit condition. A job that retries, reprocesses, or polls until something external stops it, when nothing external is watching overnight or over a weekend.
Long context on every call when a summary would do. You pay for every token you send, every time, and resending a full history or document on each request multiplies cost without improving the answer.
Premium models on trivial tasks. Classification, extraction, and short rewrites rarely need the flagship model your hardest feature does.
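
The background-loop pattern has a direct counter: give the loop its own exit conditions instead of relying on something external. The sketch below is one way to do that, with a hypothetical `poll_until` helper that stops on success, on an attempt limit, or on a time limit, whichever comes first.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], max_attempts: int = 5,
               max_seconds: float = 60.0, delay_s: float = 1.0) -> bool:
    """Poll `check` until it succeeds, the attempts run out, or the deadline passes."""
    deadline = time.monotonic() + max_seconds
    for attempt in range(max_attempts):
        if check():
            return True
        if time.monotonic() >= deadline:
            break
        if attempt < max_attempts - 1:
            time.sleep(delay_s)  # no sleep after the final attempt
    return False


# A check that never succeeds still ends after three calls.
calls = []
finished = poll_until(lambda: calls.append(1) or False, max_attempts=3, delay_s=0)
print(finished, len(calls))
```

If each attempt calls a model, the attempt limit is also a spending limit, which is the point.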

The fixes. Apply them in this order:

Per-user caps from day one. Requests per day, tokens per day, or dollars per day, enforced in code. Retrofitting a cap after launch means writing it during the incident it would have prevented.
A billing alert plus a hard cap. The alert tells you spending crossed a line; the cap stops the spending whether or not you are awake. Set the alert at a number that would annoy you and the cap at a number that would hurt. Most major providers' consoles offer both; where yours offers only the alert, enforce the cap in your own code.
The smallest model that passes your quality bar. Run the task on a cheaper tier, compare the output against the flagship, and pay the premium only where the difference shows up in the result.
Watch the cost dashboard in week one, not month three. The first days of real traffic are when usage looks nothing like your testing, and the watching habit is taught in Monitoring, how you know it broke (and what it costs).
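
A per-user cap enforced in code can start as small as the sketch below. It is an in-memory illustration only: `DailyTokenCap` is a hypothetical class, and a real deployment would keep the counters in shared storage so every server sees the same totals.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class DailyTokenCap:
    """In-memory per-user daily token budget (a sketch, not production storage)."""

    def __init__(self, max_tokens_per_day: int) -> None:
        self.max_tokens_per_day = max_tokens_per_day
        self._used: dict[tuple[str, date], int] = defaultdict(int)

    def try_spend(self, user_id: str, tokens: int, today: Optional[date] = None) -> bool:
        """Record the spend and return True, or return False if it would exceed the cap."""
        key = (user_id, today or date.today())
        if self._used[key] + tokens > self.max_tokens_per_day:
            return False
        self._used[key] += tokens
        return True


cap = DailyTokenCap(max_tokens_per_day=1000)
day = date(2024, 1, 1)
print(cap.try_spend("user-1", 600, day))  # within budget
print(cap.try_spend("user-1", 500, day))  # would exceed the cap, so it is refused
```

Call `try_spend` before the model call, not after it, so a refused request costs nothing.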

None of this takes long to set up, and every item on the list is cheaper than the failure it prevents.

Read alongside: Hosting Monitoring Guardrails What a fleet costs

When not to use AI

The Fundamentals · The Frontier

Reach for AI when the task needs judgment over ambiguous input, and reach for deterministic software the moment it does not. A model produces a plausible answer on every run, not the same correct answer every time, and that is the wrong property for the situations below.

The answer must be exact every time. Math and currency arithmetic, ID lookups, exact-match search, and anything a database query already answers belong in regular code. A model approximates these and fails in ways that look right, so use the actual tool rather than a model imitation of it. Where AI helps at all, have it write the query or the formula instead of producing the number.
A confident wrong answer is expensive and nobody checks it. In legal positions, medical guidance, and anything that moves money, AI can serve as a drafting aid for a qualified reviewer. If no human review stands between the output and the consequence, do not put a model there.
The task is smaller than the overhead. For a typo, one renamed variable, or a single find and replace, writing the prompt and checking the output costs more than the work itself, so do the work directly.
Users expect the system to be authoritative. People treat output as correct because your product produced it, and you own every error that ships under that assumption. If a feature's job is to be the answer (a balance, a deadline, a dosage), back it with deterministic logic, and reserve AI for the parts of the product that are allowed to be a helpful draft.
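
For the exact-every-time case, this is roughly what "use the actual tool" looks like for currency in Python. The sketch uses the standard `decimal` module; `invoice_total` is a hypothetical function, and the half-up rounding is an assumption, since the right rounding rule depends on your business and jurisdiction.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats approximate decimal fractions, so sums drift.
float_total = 0.1 + 0.2  # not exactly 0.3

# Decimals built from strings keep the exact values.
exact_total = Decimal("0.10") + Decimal("0.20")


def invoice_total(unit_price: str, quantity: int, tax_rate: str) -> Decimal:
    """Compute a total to the cent with explicit, deterministic rounding."""
    subtotal = Decimal(unit_price) * quantity
    total = subtotal * (Decimal("1") + Decimal(tax_rate))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(float_total == 0.3)
print(exact_total)
print(invoice_total("19.99", 3, "0.0825"))
```

The function returns the same answer on every run, and a unit test can pin it. A model asked for the same number offers neither property.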

The meta-skill is reaching for the boring solution without apology. Knowing when AI is the wrong tool separates builders who use it well from builders who use it everywhere. A validation rule, a database query, a dropdown, or a scheduled job is cheaper, faster, and more testable than a model doing the same work, and choosing it is good engineering rather than a lack of ambition. The same discipline applies at your own desk: have Claude Code write a script that computes the number instead of asking a chat session for the number. The plain rule is that if a spreadsheet formula solves it, the spreadsheet wins.

Read alongside: Where AI fits Guardrails Beyond one agent

Shipping agents, the rules

The Practice

When a product acts on a user's behalf, the design work moves from what it says to what it does. The rules below cover actions, tools, the autonomy ladder, blast radius, injection, receipts, and evals.

The unit of design is the action, not the prompt. Inventory every action the agent can take (every send, write, purchase, and delete) and mark each one reversible or irreversible. That inventory is the action surface these rules operate on.

Every tool is delegated authority. Flag each tool as read, write, or irreversible before the agent gets it. A tool description is not documentation; it is a behavioral spec the model acts on, so write it with the care you give a prompt. Default to least capability: the narrowest toolset the job requires.

Place every action on the autonomy ladder. The rungs run from suggest to draft to act with approval to act with undo to act silently. Placement is per action, not per agent, and it is set by reversibility times blast radius. Evidence promotes an action a rung; incidents demote it. Start each action one rung lower than you believe it has earned.
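
One way to hold the ladder in code is an ordered enum with placement stored per action. This is a sketch under assumptions: the `Rung` and `Action` names are hypothetical, and the ceiling that keeps irreversible actions at "act with approval" is a design choice made here for illustration, not a rule the ladder itself dictates.

```python
from dataclasses import dataclass
from enum import IntEnum


class Rung(IntEnum):
    SUGGEST = 0
    DRAFT = 1
    ACT_WITH_APPROVAL = 2
    ACT_WITH_UNDO = 3
    ACT_SILENTLY = 4


@dataclass
class Action:
    name: str
    reversible: bool
    rung: Rung

    def promote(self) -> None:
        """Evidence moves the action up one rung, up to its ceiling."""
        # Assumption: irreversible actions never pass the approval rung.
        ceiling = Rung.ACT_SILENTLY if self.reversible else Rung.ACT_WITH_APPROVAL
        self.rung = Rung(min(self.rung + 1, ceiling))

    def demote(self) -> None:
        """An incident moves the action down one rung, never below SUGGEST."""
        self.rung = Rung(max(self.rung - 1, Rung.SUGGEST))


send_email = Action(name="send_email", reversible=False, rung=Rung.DRAFT)
send_email.promote()
send_email.promote()  # stays at the ceiling for an irreversible action
print(send_email.rung.name)
```

Because the rung lives on the action, one agent can send silently labeled drafts to a folder while still asking before it emails anyone.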

Bound the blast radius. Give the agent scoped tokens instead of your credentials, allowlists for recipients and destinations, caps on spend and volume, sandboxes for risky work, and a kill switch that stops an action already in flight, not just the next one.
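
Two of these bounds, the allowlist and the volume cap, can be checked in a single guard that every send passes through. The sketch below uses a hypothetical `SendGuard` class. Its kill switch is checked before each send, so it covers only the "next action" half of the rule; stopping an action already in flight needs support from whatever executes the action.

```python
from dataclasses import dataclass, field


@dataclass
class SendGuard:
    allowed_domains: frozenset[str]
    max_sends_per_run: int
    kill_switch: bool = False
    sent: int = field(default=0, init=False)

    def authorize(self, recipient: str) -> bool:
        """Return True and count the send only if every bound allows it."""
        if self.kill_switch:
            return False
        domain = recipient.rsplit("@", 1)[-1].lower()
        if domain not in self.allowed_domains:
            return False
        if self.sent >= self.max_sends_per_run:
            return False
        self.sent += 1
        return True


guard = SendGuard(allowed_domains=frozenset({"example.com"}), max_sends_per_run=2)
print(guard.authorize("ana@example.com"))  # allowed domain, under the cap
print(guard.authorize("ana@evil.test"))    # domain not on the allowlist
```

The guard is deterministic code outside the model, so no prompt, however cleverly injected, can talk it into a different answer.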

Assume injection succeeds. Treat everything the agent reads as data, never as instructions. Never let private data, untrusted content, and an outbound channel coincide without a gate a human controls.

Keep receipts and plan recovery. Maintain a ledger of actions that a user can read. Make retries idempotent so a failure never doubles a payment or an email. Build undo as compensating actions (the refund, the reversal), and label the actions that have none as irreversible before they run.
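
Idempotent retries and the ledger can share one mechanism: record each action under a key, and return the recorded result when the same key arrives again. The sketch below is in-memory and hypothetical (`ActionLedger`, the key format, and the receipt string are all illustrative); a real system would persist the ledger and also pass the key to the payment or email provider where the provider supports it.

```python
from typing import Callable


class ActionLedger:
    """Runs each keyed action at most once and keeps a readable record."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}

    def run_once(self, idempotency_key: str, action: Callable[[], str]) -> str:
        """Run `action` unless this key already has a result; return the result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: no second side effect
        result = action()
        self._results[idempotency_key] = result
        return result

    def entries(self) -> list[tuple[str, str]]:
        """The ledger as (key, result) pairs, in the order actions first ran."""
        return list(self._results.items())


charges = []
ledger = ActionLedger()
charge = lambda: (charges.append(42), "receipt: charged 42")[1]

first = ledger.run_once("order-17-payment", charge)
retry = ledger.run_once("order-17-payment", charge)  # e.g. after a timeout
print(len(charges), ledger.entries())
```

The retry returns the original receipt and the charge runs once, which is the property that keeps a failure from doubling a payment.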

Eval the trajectory, not just the outcome. An agent can reach a correct end state through actions you would never approve, so grade the tool calls, their arguments, and their order against sandboxed tools rather than live ones, and let pass rates move actions up and down the ladder.
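
A trajectory grader can start as a check over the recorded tool calls. The sketch below is deliberately narrow and its names are hypothetical: `grade_trajectory` fails a run that used any forbidden tool and requires certain tools to appear in order. It does not yet inspect arguments, which a fuller grader would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


def grade_trajectory(calls: list[ToolCall], required_order: list[str],
                     forbidden_tools: set[str]) -> bool:
    """Pass only if no forbidden tool ran and the required tools ran in order."""
    names = [call.tool for call in calls]
    if any(name in forbidden_tools for name in names):
        return False
    # Subsequence check: each required step must appear after the previous one.
    remaining = iter(names)
    return all(step in remaining for step in required_order)


# Hypothetical run recorded against sandboxed tools.
run = [
    ToolCall("search_orders", {"order_id": 7}),
    ToolCall("draft_refund", {"order_id": 7}),
    ToolCall("request_approval", {}),
]
print(grade_trajectory(run, ["search_orders", "request_approval"], {"delete_account"}))
```

A run that reached the right end state by calling `delete_account` along the way would fail here, which is exactly the case outcome-only grading misses.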

Ship at the rungs you chose, not the rungs the demo made you feel you had earned.

Read alongside: When AI takes action What your agent can do The autonomy ladder Limit the damage When inputs attack Recover from failures Judge the whole run Write the Agent Charter Keep a human in charge

Securing AI products, the rules

The Fundamentals · The Practice

An AI feature adds doors to your product, and attackers try doors before users do. The rules below cover the threats, the keys, the data, the chain, the layers, the red team, and the page you sign.

Name the threats before you defend anything. A threat is an asset, a door, and a path between them, written in one plain sentence. Five lines beat a workshop. You cannot defend doors you have not listed.

Hand out the smallest keys you can. The product acts with whatever identity you give it, so the keys, not the model, set the blast radius. Scope to the one inbox rather than all of them, separate read from write, and rehearse revocation, because a key you cannot revoke in minutes is a breach waiting for its date.

Treat every copy of data as a leak path. Typed, stored, embedded, logged: each stop is a copy with its own readers and its own retention number. Filter retrieval by what this user may see, scrub secrets from the logs, and keep the shortest defensible retention at every stop.
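
Scrubbing secrets from logs is usually a small filter applied before a line is written. The sketch below is an assumption-heavy starting point: the three patterns are illustrative, not exhaustive, and `scrub` is a hypothetical helper. Extend the list to match the key and token formats your own stack uses.

```python
import re

# Each pattern captures the label in group 1 and replaces only the value.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)\b(api[_-]?key\s*[=:]\s*)\S+"),
    re.compile(r"(?i)\b(password\s*[=:]\s*)\S+"),
]


def scrub(line: str) -> str:
    """Replace secret values in a log line with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub(r"\1[REDACTED]", line)
    return line


print(scrub("POST /chat api_key=sk-12345 user=7"))
print(scrub("Authorization: Bearer abc.def.ghi"))
```

Keeping the label and dropping the value preserves the log's usefulness for debugging while removing the part an attacker wants.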

Verify the chain you import. A model file is a program, and loading it is running it. Take weights from named publishers in safe formats, pin every dependency, and read what a tool or MCP server requests before you adopt it.

Defend in layers, not in wording. The prompt is a request; the deterministic floor is a fact. Allowlists, scopes, caps, and egress rules stop what the prompt cannot, classifiers catch the middle, and a human gates the irreversible.

Cap what it can spend. Per-user and per-day limits with an alert and a hard stop turn a catastrophic abuse into a bounded one. The bill is part of the attack surface.

Attack your own product on a schedule. A red-team finding is finished only when it becomes an eval case in the regression gate. If your red team never wins, fix the red team, not the scoreboard.

Write the posture down. Fill the Security Posture from work that exists and runs, pull the kill switch once to prove it works, and decide now what re-opens the page.

Ship the product you already tried to break.

Read alongside: Why AI gets attacked Map your threats Whose keys it holds Where data leaks Parts you didn't build Defense in layers Attack it yourself The Security Posture Guardrails

High-stakes industries, the rules

The Fundamentals · The Frontier

When the product ships into a regulated industry, the craft does not change; what changes is that every practice becomes mandatory and every claim needs evidence. The rules below cover the envelope, the model, the context, the gates, fairness, and the record.

Draw the envelope before the first prompt. Decide what the product may do freely, what it must refuse, and what it hands to a licensed human. Ambiguity routes to people. If an output would bind the firm (a price, a promise, a coverage decision), it stays out of the may-do zone until a control exists.

Educate, never advise. Advice carries duties that attach to licensed processes, and the model cannot hold a license. The compliant forms are education, options, and a clean handoff, designed as product moments rather than legal apologies.

Choose a model you can defend. Answer the custody questions before the capability questions: data use, residency, retention, attestations, indemnification, exit. Keep an inventory. A fine-tuned model is a new model: it re-enters validation, and the refusal suite runs again.

Ground every answer in approved truth. Answers come from the corpus the firm stands behind, with citations, current versions, and the user's permissions traveling with the question into the index. A stale or leaked source is an incident, not a relevance bug.

Gate both directions. Screen inputs for personal data and injection; check outputs for advice language and groundedness; insert mandated disclosures deterministically. The gate that must never fail should not be probabilistic, and every gate decision gets logged.
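
The output side of the gate can be plain code. The sketch below blocks outputs containing listed advice phrases and appends a disclosure exactly once. Everything specific in it is hypothetical: the `DISCLOSURE` text, the `ADVICE_PHRASES` list, and the `output_gate` name are placeholders, and a phrase list is only the deterministic floor, not a full advice detector.

```python
# Hypothetical disclosure text and phrase list, for illustration only.
DISCLOSURE = "This information is educational and is not financial advice."
ADVICE_PHRASES = ("you should buy", "you should sell", "i recommend investing")


def output_gate(model_output: str) -> dict:
    """Block advice language; otherwise ensure the disclosure is present once.

    The returned dict doubles as the gate decision record to log.
    """
    lowered = model_output.lower()
    flagged = [phrase for phrase in ADVICE_PHRASES if phrase in lowered]
    if flagged:
        return {"allowed": False, "reason": f"advice language: {flagged[0]}", "text": None}
    body = model_output.rstrip()
    text = body if body.endswith(DISCLOSURE) else f"{body}\n\n{DISCLOSURE}"
    return {"allowed": True, "reason": "passed", "text": text}


decision = output_gate("Index funds hold many stocks at once.")
print(decision["allowed"], decision["text"].endswith(DISCLOSURE))
```

The disclosure is inserted by string handling, not requested in the prompt, so its presence does not depend on the model complying.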

Prove fairness and explain every no. Slice your evals by the groups your regulator cares about, watch for proxy features, and design specific adverse-action reasons up front. If the model cannot support real reasons, it is not ready for the decision.
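
Slicing an eval by group is a small aggregation over per-case results. The sketch below assumes each result is a `(group_label, passed)` pair; the function names and group labels are hypothetical, and which groups to slice by, and what gap is acceptable, are questions for your regulator and counsel, not for this code.

```python
from collections import defaultdict


def pass_rate_by_group(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate per group from (group_label, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for group, passed in results:
        totals[group] += 1
        passes[group] += int(passed)
    return {group: passes[group] / totals[group] for group in totals}


def largest_gap(rates: dict[str, float]) -> float:
    """Difference between the best-served and worst-served group."""
    return max(rates.values()) - min(rates.values())


results = [("group_a", True), ("group_a", True), ("group_b", True), ("group_b", False)]
rates = pass_rate_by_group(results)
print(rates, largest_gap(rates))
```

An overall pass rate of 75% hides the fact that one group here passes every case and the other passes half, which is the reason to slice at all.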

Keep records an auditor can replay. Log the model version, prompt version, retrieved sources, gate decisions, and human review for every interaction, under the same access controls as the product itself. Run the prohibited-behaviors suite in the regression gate on every change.
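
A replayable record is mostly a matter of deciding the fields up front and writing them every time. The sketch below mirrors the fields named above in a hypothetical `InteractionRecord`; the identifiers and version strings are made up, and access control on the log store is outside what the code shows.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class InteractionRecord:
    interaction_id: str
    model_version: str
    prompt_version: str
    retrieved_sources: tuple[str, ...]
    gate_decisions: tuple[str, ...]
    human_reviewer: Optional[str]  # None when no human review occurred


def to_log_line(record: InteractionRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical values for illustration.
record = InteractionRecord(
    interaction_id="int-0001",
    model_version="model-v3",
    prompt_version="prompt-v12",
    retrieved_sources=("policy-handbook@v7",),
    gate_decisions=("input:passed", "output:passed"),
    human_reviewer=None,
)
line = to_log_line(record)
print(line)
```

Making the fields required in the type means an interaction cannot be logged without its model and prompt versions, so the gap shows up at write time instead of during the audit.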

Make governance the permission. A prompt change is a model change: define what re-triggers review before the first urgent fix, not during it. Keep the kill switch tested, and fill the High-Stakes Clearance from artifacts that exist and run, never from intentions.

Ship the product that can answer for itself.

Read alongside: Why it's different What AI may never do Choosing the model The right context Security Fairness Testing & evidence The Clearance Legal basics