The Operating Manual · Interactive

The Console

The cycle, broken all the way down. Pick a move, open any activity to see the artifacts it produces, and take the whole thing with you as a checklist you can run.

Shape · Ship · Track · Continuous Operations

Where to start

You do not read this end to end. Start where the work is hardest right now.


Shape

Decide how the system behaves, and make it real.

What it produces

Discovery guide: the questions you ask real users to tell whether AI is the right answer, not just a possible one.
Problem frame: who the user is and what failure looks like for them.
Definition of good: the plain bar for behavior you will hold the product to.

What it produces

Behavior contract: what the model must do, must never do, and one example of each.
Starter system prompt: the model's role, boundaries, tone, and output format, versioned like code.
Escalation rules: the triggers that hand a conversation to a human.
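
One way to keep the contract and its escalation rules versioned like code is to store them as plain data. The sketch below is illustrative only: the class names, the example rules, and the keyword-based trigger check are all assumptions, not something the manual prescribes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    statement: str  # what the model must (or must never) do
    example: str    # one concrete example of the rule in action


@dataclass(frozen=True)
class BehaviorContract:
    version: str                          # bumped on every change, like code
    must: Tuple[Rule, ...]
    must_never: Tuple[Rule, ...]
    escalation_triggers: Tuple[str, ...]  # hypothetical phrases that hand off to a human


def should_escalate(contract: BehaviorContract, user_message: str) -> bool:
    """Return True when any trigger phrase appears in the message (case-insensitive)."""
    text = user_message.lower()
    return any(trigger in text for trigger in contract.escalation_triggers)


# Hypothetical contract contents, for illustration only.
CONTRACT = BehaviorContract(
    version="0.1.0",
    must=(Rule("Cite the source for any factual claim.",
               "According to the returns policy, items can be returned within 30 days."),),
    must_never=(Rule("Never give a medical diagnosis.",
                     "I can't diagnose that, but I can help you find a clinician."),),
    escalation_triggers=("refund", "legal", "speak to a human"),
)
```

A substring match is the crudest possible trigger. A real escalation rule would more likely rely on a classifier or on structured signals, but the shape of the artifact stays the same.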

What it produces

Model rationale: the model you chose and why, weighed on capability, cost, latency, scale, data sensitivity, and modalities. Most products route: a frontier model for the hard path, a smaller one for simple volume.
Context architecture: what gets retrieved and what stays in the prompt.
Grounding decision: prompt context, retrieval, fine-tuning, or a mix. They are layers, not alternatives.
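
The routing idea in the model rationale can be sketched as a small function. The tier names, the request flags, and the length threshold below are hypothetical placeholders. The real decision would weigh the factors listed above (capability, cost, latency, scale, data sensitivity, and modalities).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    text: str
    needs_reasoning: bool = False  # hypothetical flag set by an upstream classifier
    has_image: bool = False        # stands in for "modalities"


# Hypothetical tier names; substitute the models you actually chose.
FRONTIER = "frontier-model"
SMALL = "small-model"


def choose_model(req: Request, long_input_chars: int = 2000) -> str:
    """Send the hard path to the frontier tier and simple volume to the small tier."""
    hard = req.needs_reasoning or req.has_image or len(req.text) > long_input_chars
    return FRONTIER if hard else SMALL
```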

What it produces

Working prototype: a real version you built yourself, run against messy real inputs.
Prototyping checklist: the steps you reuse on the next idea.
Edge-case list: what it got wrong (the blurry plate, the three foods, chicken or turkey), fed back into the contract.

Ship

Put it in front of a human, behind guardrails, at a workable cost and speed.

What it produces

Input and output guardrails: filter inputs for private data and prompt injection, and validate outputs against the schema and content rules.
Escalation path: anything irreversible or high-stakes goes to a human first.
Rollout checklist: the tripwires that halt a release.
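
As a rough illustration of the input and output guardrails, the sketch below redacts email addresses, flags a couple of obvious injection phrases, and checks model output against a tiny schema. The regular expression, the phrase list, and the `answer`/`sources` schema are assumptions made for the example. Production filters need far broader coverage.

```python
import json
import re

# Simplified email pattern; an assumption for this sketch, not a full PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Illustrative phrases only; real injection detection needs more than a phrase list.
INJECTION_HINTS = ("ignore previous instructions", "disregard the system prompt")

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def filter_input(text):
    """Redact emails and flag likely prompt injection before the model sees the text."""
    redacted = EMAIL.sub("[EMAIL]", text)
    flagged = any(hint in text.lower() for hint in INJECTION_HINTS)
    return redacted, flagged


def validate_output(raw):
    """Return the parsed output if it is a JSON object matching the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), kind):
            return None
    return data
```

A `None` from `validate_output` is a natural point to retry, fall back, or escalate rather than show the raw output to the user.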

What it produces

Regression evals: a small set of known-good cases you run on every change, plus automated faithfulness and safety checks.
Pass and fail thresholds: the bar a change must clear, on a rubric you own.
Release gate: nothing ships until it clears, backed by a weekly human sample that catches new failure modes.
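
The release gate above can be reduced to two small functions: one scores a change against the known-good cases, the other compares every metric to its threshold. This is a minimal sketch. The metric names and the exact-match scoring are assumptions, and real faithfulness or safety checks would need their own graders.

```python
def pass_rate(cases, run):
    """Fraction of known-good cases where run(input) equals the expected output."""
    passed = sum(1 for given, expected in cases if run(given) == expected)
    return passed / len(cases)


def release_gate(scores, thresholds):
    """Return the metrics that fall below their threshold; an empty list means ship.

    A metric missing from `scores` counts as 0.0, so an unmeasured check blocks release.
    """
    return [metric for metric, floor in thresholds.items()
            if scores.get(metric, 0.0) < floor]
```

For example, with hypothetical thresholds of 1.0 for regression, 0.95 for faithfulness, and 0.99 for safety, a change that scores 0.9 on faithfulness and reports no safety score is blocked on both.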

What it produces

Preview and undo design: preview what an action will do, confirm the irreversible ones, and keep undo cheap.
Source and confidence display: show where an answer came from and how sure the model really is, rather than delivering every answer in the same flat tone.

What it produces

Latency and cost budget: the speed the experience needs and the cost a request may run.
Routing and caching plan: cap output length, cache repeats, route easy work to cheaper models, and stream for perceived speed.
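
Two pieces of the plan, capping output length and caching repeats, fit in a thin wrapper around whatever client you use. In the sketch below, `call_model` is a hypothetical callable taking `(prompt, max_tokens)`, and the cap of 300 tokens is an arbitrary placeholder. Normalizing case and whitespace before hashing is also an assumption that only suits prompts where those differences do not matter.

```python
import hashlib

MAX_OUTPUT_TOKENS = 300  # hypothetical cap; set it from your cost budget


class CachedClient:
    """Wraps a model call with an output cap and an in-memory cache of repeats."""

    def __init__(self, call_model):
        self._call_model = call_model  # hypothetical: (prompt, max_tokens) -> str
        self._cache = {}
        self.hits = 0

    def complete(self, prompt):
        # Collapse case and whitespace so trivially different repeats share a key.
        normalized = " ".join(prompt.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        answer = self._call_model(prompt, MAX_OUTPUT_TOKENS)
        self._cache[key] = answer
        return answer
```

Tracking `hits` gives a cache hit rate to report against the cost budget.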

Track

Find out whether it is behaving, and catch what users never report.

What it produces

Quality dashboard: accuracy, faithfulness, latency, cost, and safety, live.
Metrics taxonomy: product signal kept separate from model behavior.
Session-review habit: read a sample of real sessions. The numbers say something is wrong; the sessions say what.

What it produces

Drift alerts: a warning when a metric slides while the product still looks fine from the outside.
Regression gate on model changes: re-run the eval suite on every model update before it reaches users.
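
A drift alert can be as simple as comparing a recent window of a metric against its earlier baseline. The window size and the allowed drop below are hypothetical defaults, and comparing means is only one of several reasonable tests.

```python
from statistics import mean


def drift_alert(history, recent_window=7, max_drop=0.05):
    """True when the recent mean falls more than max_drop below the earlier baseline.

    `history` is a list of daily metric values, oldest first.
    """
    if len(history) <= recent_window:
        return False  # not enough data to form a baseline
    baseline = mean(history[:-recent_window])
    recent = mean(history[-recent_window:])
    return baseline - recent > max_drop
```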

What it produces

New eval cases: each real failure turned into a test you keep.
Ranked list of contract fixes: what Shape changes on the next turn of the cycle.

Continuous Operations

The umbrella across every turn of the cycle.

What it produces

Curation policy: what is eligible to be indexed.
Refresh cadence: how often sources update, and how stale content gets flagged.
Conflict rules: how a new source wins over an old one it contradicts.
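
Refresh cadence and conflict rules both become easier to audit once they are written as code. The sketch below assumes a 90-day staleness limit and an integer `authority` rank per source. Both are illustrative choices, as is the rule that a newer source wins only when it is at least as authoritative as the one it contradicts.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Source:
    doc_id: str
    updated: date
    authority: int  # hypothetical rank; higher means more trusted


def is_stale(src, today, max_age_days=90):
    """Flag a source whose last update is older than the allowed age."""
    return today - src.updated > timedelta(days=max_age_days)


def resolve_conflict(old, new):
    """Pick the source to trust when `new` contradicts `old`.

    Assumed rule: the newer source wins unless it is less authoritative.
    """
    return new if new.authority >= old.authority else old
```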

What it produces

Access-as-behavior rules: what each person may see, treated as a product decision, not just infrastructure.
Safe-refusal patterns: what the system says when asked for what someone cannot have, without leaking that it exists.
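
The key property of a safe refusal is that a forbidden document and a nonexistent one produce the same reply. A minimal sketch, assuming a hypothetical in-memory document store with a per-document allow list:

```python
REFUSAL = "I can't help with that document."

# Hypothetical store: document id -> allowed users and body text.
DOCS = {
    "q3-plan": {"allowed": {"ana"}, "body": "Q3 plan text"},
}


def fetch(user, doc_id):
    """Return the document body, or one uniform refusal.

    The reply is identical whether the document is missing or forbidden,
    so the refusal does not reveal that the document exists.
    """
    doc = DOCS.get(doc_id)
    if doc is None or user not in doc["allowed"]:
        return REFUSAL
    return doc["body"]
```

The same principle applies to retrieval: filtering by access before ranking keeps forbidden content from shaping the answer at all.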

What it produces

Supervision design: a human or an explicit check over anything that acts on its own.
Iteration caps: limits, loop detection, and give-up criteria so an agent cannot run forever.
Reliability budget: a per-step error bar, because ten steps that each succeed 95% of the time finish end to end only about 60% of the time (0.95^10 ≈ 0.60).
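
The reliability arithmetic and the iteration caps can both be sketched in a few lines. The calculation assumes each step fails independently, which real agents only approximate. The `run_agent` loop is a hypothetical shape: `step` is any function returning `(new_state, done)`, and states must be hashable for the loop check to work.

```python
def end_to_end_success(per_step, steps):
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target, steps):
    """Per-step success rate needed to hit an end-to-end target."""
    return target ** (1 / steps)


def run_agent(step, state, max_steps=10):
    """Run step(state) -> (new_state, done) with a step cap and simple loop detection."""
    seen = {state}
    for _ in range(max_steps):
        state, done = step(state)
        if done:
            return "done", state
        if state in seen:
            return "loop_detected", state  # revisited an earlier state
        seen.add(state)
    return "gave_up", state  # hit the iteration cap


print(round(end_to_end_success(0.95, 10), 3))  # 0.599
```

Run in reverse, the same arithmetic sets the budget: to finish ten steps 95% of the time, each step has to succeed roughly 99.5% of the time.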

What it produces

Hiring rubric: how you hire for the new craft.
Org change plan: how you bring the wider organization to the new bar.