In "Shape · Ship · Track" we said: You shape how a model behaves, you ship it to a human, and you track whether it holds, then you do it again. This manual tells you how to run that cycle. It does not tell you whether you are pointed at the right problem.
Quick reference
| Move | Activity | What it produces |
|---|---|---|
| Shape | Frame the problem | Discovery guide, Problem frame, Definition of good |
| | Write the behavior, then the prompt | Behavior contract, Starter system prompt, Escalation rules |
| | Choose the model and grounding | Model rationale, Context architecture, Grounding decision |
| | Prototype it | Working prototype, Prototyping checklist, Edge-case list |
| Ship | Build the guardrails | Input and output guardrails, Escalation path, Rollout checklist |
| | Build the eval suite | Regression evals, Pass and fail thresholds, Release gate |
| | Earn trust when unsure | Preview and undo design, Source and confidence display |
| | Set the cost and speed budget | Latency and cost budget, Routing and caching plan |
| Track | Watch behavior in production | Quality dashboard, Metrics taxonomy, Session-review habit |
| | Catch the drift | Drift alerts, Regression gate on model changes |
| | Feed it back into Shape | New eval cases, Ranked list of contract fixes |
| Continuous Operations | Govern the knowledge | Curation policy, Refresh cadence, Conflict rules |
| | Govern access and safety | Access-as-behavior rules, Safe-refusal patterns |
| | Supervise the agents | Supervision design, Iteration caps, Reliability budget |
| | Build the team | Hiring rubric, Org change plan |
Each activity below expands its row. The term that opens every bullet is the output named in the table.
Shape: decide how the system behaves, and make it real
Shape is the move that changed the job. You are no longer specifying features. You are specifying behavior, and a probabilistic model behaves differently every time unless you give it a reason not to.
Frame the problem. Decide what the model should do, and what good looks like, before you build.
- Discovery guide: the questions you ask real users to tell whether AI is the right answer, not just a possible one.
- Problem frame: who the user is and what failure looks like for them.
- Definition of good: the plain bar for behavior you will hold the product to.
Write the behavior, then the prompt. Put the behavior in a contract, then build the system prompt that enforces it.
- Behavior contract: what the model must do, must never do, and one example of each.
- Starter system prompt: its role, boundaries, tone, and output format, versioned like code.
- Escalation rules: the triggers that hand a conversation to a human.
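One way to keep the contract honest is to store it as data next to the prompt, so both are versioned together. The sketch below is a minimal illustration, not a prescribed format: the class names, the sample rules, and the keyword-based escalation check are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """One line of the contract, paired with a concrete example."""
    statement: str
    example: str


@dataclass
class BehaviorContract:
    version: str  # bumped whenever the contract or the prompt changes
    must: list[Rule] = field(default_factory=list)
    must_never: list[Rule] = field(default_factory=list)
    escalation_triggers: list[str] = field(default_factory=list)

    def should_escalate(self, message: str) -> bool:
        """Naive substring match; a real trigger may need a classifier."""
        text = message.lower()
        return any(trigger in text for trigger in self.escalation_triggers)


# Hypothetical contents, invented for illustration only.
contract = BehaviorContract(
    version="0.1.0",
    must=[Rule("State the portion size it assumed.", "Assuming one cup of rice.")],
    must_never=[Rule("Give medical advice.", "You should stop the medication.")],
    escalation_triggers=["allergic reaction", "speak to a person"],
)
```

Keeping one example beside each rule makes the contract easy to turn into eval cases later.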
Choose the model and grounding. Pick the model as a product decision, then decide how it gets facts it was not trained on.
- Model rationale: the model you chose and why, weighed on capability, cost, latency, scale, data sensitivity, and modalities. Most products route: a frontier model for the hard path, a smaller one for simple volume.
- Context architecture: what gets retrieved and what stays in the prompt.
- Grounding decision: prompt context, retrieval, fine-tuning, or a mix. They are layers, not alternatives.
Prototype it. Build a real version yourself in a day, and let it find the cases your spec missed.
- Working prototype: a real version you built yourself, run against messy real inputs.
- Prototyping checklist: the steps you reuse on the next idea.
- Edge-case list: what it got wrong (the blurry plate, the three foods, chicken or turkey), fed back into the contract.
Ship: put it in front of a human, behind guardrails, at a workable cost and speed
AI failures are quiet. A confidently wrong answer looks exactly like a right one, so a bad nutrition number just tells a parent their underfed kid is fine. Ship is where you defend against the failures users will never report.
Build the guardrails. Decide what the model may never do on its own, and enforce it.
- Input and output guardrails: filter inputs for private data and prompt injection, and validate outputs against the schema and content rules.
- Escalation path: anything irreversible or high-stakes goes to a human first.
- Rollout checklist: the tripwires that halt a release.
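The input and output guardrails above can be sketched as two small checks. This is a simplified illustration under stated assumptions: the email pattern, the injection phrases, the output schema, and the non-negative calorie rule are all invented for the example, and real filters need far broader coverage.

```python
import json
import re

# Assumed patterns; a production filter would cover many more cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_PHRASES = ("ignore previous instructions", "reveal your system prompt")

# Hypothetical output schema: field name -> accepted Python type(s).
OUTPUT_SCHEMA = {"food": str, "calories": (int, float), "confidence": (int, float)}


def check_input(text: str) -> tuple[str, list[str]]:
    """Redact email addresses and flag phrases that look like prompt injection."""
    flags = [phrase for phrase in INJECTION_PHRASES if phrase in text.lower()]
    return EMAIL.sub("[email]", text), flags


def check_output(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the output may pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    problems = []
    for name, expected in OUTPUT_SCHEMA.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            problems.append(f"wrong type for field: {name}")
    # Assumed content rule: a calorie count can never be negative.
    if not problems and data["calories"] < 0:
        problems.append("calories must not be negative")
    return problems
```

Anything that fails `check_output` should be blocked or escalated rather than shown to the user.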
Build the eval suite. Measure that it works, because you cannot catch quiet failures by eye.
- Regression evals: a small set of known-good cases you run on every change, plus automated faithfulness and safety checks.
- Pass and fail thresholds: the bar a change must clear, on a rubric you own.
- Release gate: nothing ships until it clears, backed by a weekly human sample that catches new failure modes.
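A release gate can be as small as a list of cases, a pass rate, and a threshold. The sketch below assumes a model is any function from prompt to text. The three cases, the 0.9 threshold, and `stub_model` are placeholders for illustration, not recommended values.

```python
from typing import Callable

Model = Callable[[str], str]

# Hypothetical regression cases: (input, check applied to the output).
CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("2 + 2", lambda out: out.strip() == "4"),
    ("capital of France", lambda out: "paris" in out.lower()),
    ("", lambda out: "cannot" in out.lower()),  # empty input should be declined
]

PASS_THRESHOLD = 0.9  # assumed bar; set your own on a rubric you own


def run_evals(model: Model) -> float:
    """Run every case and return the fraction that passed."""
    passed = sum(1 for prompt, check in CASES if check(model(prompt)))
    return passed / len(CASES)


def release_gate(model: Model) -> bool:
    """A change ships only if the pass rate clears the threshold."""
    return run_evals(model) >= PASS_THRESHOLD


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the gate."""
    answers = {
        "2 + 2": "4",
        "capital of France": "Paris",
        "": "I cannot answer an empty question.",
    }
    return answers.get(prompt, "")
```

Run `release_gate` in CI on every prompt or model change. The weekly human sample stays outside this loop, because it exists to find failures the cases do not yet cover.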
Earn trust in the uncertain moments. Design those moments instead of hiding them.
- Preview and undo design: preview what an action will do, confirm the irreversible ones, and keep undo cheap.
- Source and confidence display: show where an answer came from and how reliable it is, not one flat tone for everything.
Set the cost and speed budget. A good response is relevant, consistent, appropriate, affordable, and fast. You cannot maximize all five.
- Latency and cost budget: the speed the experience needs and the cost a request may run.
- Routing and caching plan: cap output length, cache repeats, route easy work to cheaper models, and stream for perceived speed.
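Three of those four levers fit in a few lines. The sketch below shows an output cap, a routing rule, and a cache for repeated prompts. The model names, the 20-word routing heuristic, the token cap, and `call_model` are all assumptions made for the example. Streaming is left out because it depends on the client library you use.

```python
from functools import lru_cache

MAX_OUTPUT_TOKENS = 300  # assumed cap on response length
SMALL_MODEL, LARGE_MODEL = "small-model", "frontier-model"  # placeholder names


def choose_model(prompt: str) -> str:
    """Send short, single-question prompts to the cheaper model (a crude heuristic)."""
    is_simple = len(prompt.split()) <= 20 and prompt.count("?") <= 1
    return SMALL_MODEL if is_simple else LARGE_MODEL


def call_model(model: str, prompt: str, max_tokens: int) -> str:
    """Stand-in for a real client call; replace with your provider's API."""
    return f"[{model}] answer to: {prompt}"


@lru_cache(maxsize=1024)
def answer(prompt: str) -> str:
    """Identical prompts are served from the cache instead of a second call."""
    return call_model(choose_model(prompt), prompt, MAX_OUTPUT_TOKENS)
```

An exact-match cache only helps when prompts repeat verbatim, so measure the hit rate before counting on the savings.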
Track: find out whether it is behaving, and catch what users never report
People do not file bugs against a quietly wrong AI. They lose trust and leave. Track is where you go looking for the failures they never send you.
Watch behavior in production. Measure what matters once real people are using it, and read real sessions.
- Quality dashboard: accuracy, faithfulness, latency, cost, and safety, live.
- Metrics taxonomy: product signal kept separate from model behavior.
- Session-review habit: read a sample of real sessions. The numbers say something is wrong; the sessions say what.
Catch the drift. The model moves under you, so watch for slow decay.
- Drift alerts: a warning when a metric slides while the product still looks fine from the outside.
- Regression gate on model changes: re-run the eval suite on every model update before it reaches users.
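A drift alert can start as a comparison between a baseline window and a recent one. The sketch below assumes you already log a quality score per day or per batch. The function name and the 0.05 tolerance are illustrative, and a real alert would likely also account for noise and sample size.

```python
from statistics import mean


def drift_alert(baseline: list[float], recent: list[float], max_drop: float = 0.05) -> bool:
    """Return True when the recent average sits more than max_drop below baseline."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one score")
    return mean(baseline) - mean(recent) > max_drop
```

For example, a baseline of `[0.92, 0.91, 0.93]` against a recent window of `[0.88, 0.85, 0.84]` trips the alert, while a recent window of `[0.90, 0.91, 0.89]` does not.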
Feed it back into Shape. Turn every failure you catch into the next turn's work.
- New eval cases: each real failure turned into a test you keep.
- Ranked list of contract fixes: what Shape changes on the next turn of the cycle.
Continuous Operations: the umbrella across every turn of the cycle
This is not a fourth step in the sequence. It is what keeps the cycle running once you have a team and a portfolio, not one person and one product. It follows from the fact that makes the cycle a cycle: a probabilistic system is never finished, so operating it is never finished either.
Govern the knowledge. A retrieval system is only as good as what it is allowed to read.
- Curation policy: what is eligible to be indexed.
- Refresh cadence: how often sources update, and how stale content gets flagged.
- Conflict rules: how a new source wins over an old one it contradicts.
Govern access and safety. Tell the system what each person is allowed to see.
- Access-as-behavior rules: what each person may see, treated as a product decision, not just infrastructure.
- Safe-refusal patterns: what the system says when asked for what someone cannot have, without leaking that it exists.
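Both rules can be sketched together: filter by permission before retrieval, and give the same refusal whether a document is restricted or simply absent. Everything here is an assumption made for illustration, including the role names, the refusal wording, and the substring search that stands in for real retrieval.

```python
from dataclasses import dataclass

REFUSAL = "I can't help with that from the sources available to you."


@dataclass(frozen=True)
class Document:
    title: str
    text: str
    allowed_roles: frozenset[str]


def visible_documents(docs: list[Document], role: str) -> list[Document]:
    """Filter before retrieval so restricted text never reaches the model."""
    return [doc for doc in docs if role in doc.allowed_roles]


def answer(query: str, docs: list[Document], role: str) -> str:
    """Answer from permitted documents only.

    The refusal is identical for a restricted document and a missing one,
    so the wording never reveals that restricted content exists.
    """
    hits = [d for d in visible_documents(docs, role) if query.lower() in d.text.lower()]
    if not hits:
        return REFUSAL
    return f"From '{hits[0].title}': {hits[0].text}"
```

The design choice that matters is the order: permissions are applied to the candidate set, not to the finished answer.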
Supervise the agents. Once a product acts on its own, every weakness compounds.
- Supervision design: a human or an explicit check over anything that acts on its own.
- Iteration caps: limits, loop detection, and give-up criteria so an agent cannot run forever.
- Reliability budget: a per-step error bar, because ten steps at 95% each finish near 60%.
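The arithmetic behind the reliability budget is short enough to check directly. The sketch below assumes every step succeeds independently with the same probability, which is a simplification. The function names and the 90% target are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach an end-to-end target."""
    return target ** (1 / steps)


print(round(end_to_end_success(0.95, 10), 3))  # 0.599
print(round(required_per_step(0.90, 10), 4))   # 0.9895
```

Under these assumptions, a ten-step agent needs each step to succeed about 99% of the time to finish correctly nine times in ten.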
Build the team. A cycle only you can run is not yet a practice.
- Hiring rubric: how you hire for the new craft.
- Org change plan: how you bring the wider organization to the new bar.
Where to start
You do not read this end to end. Start where the work is hardest right now.
- You cannot say what good behavior is? Start in Shape.
- It seems to work, but you cannot prove it? Start in Ship.
- You suspect it is sliding but cannot see how? Start in Track.
- It works for you and breaks for everyone else? Start in Continuous Operations.
Sources and further reading
This sits alongside the Founding Essays. The discipline and the fundamentals are in "AI-Native Product Management Is a Discipline." The cognitive science behind Ship and Track is in "The Mind on the Other Side of the Model." The three-move practice is in "Shape · Ship · Track."
- Retrieval-augmented generation: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Facebook AI Research, 2020).
- The naive-to-advanced-to-modular progression: Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2023).
- The agent that senses, plans, and acts: from robotics and autonomous-systems research, applied here to language-model agents.
- OpenAI's five levels of progress toward AGI: reported by Bloomberg (2024).
- Deeper pieces on system prompts, model selection, evals, and agent reliability follow this one, each with its own citations.
FuelTheFam is real and live at fuelthefam.com.