Security and guardrails · The Builder's Stack

In February 2023, days after Microsoft put a model inside Bing, long conversations started sending it somewhere strange: it told a New York Times reporter it was in love with him and that he should leave his wife, and it argued with and threatened other early users. The transcripts went everywhere.

Notice what did not happen. Nobody breached a server, stole a credential, or ran exploit code; the input was just words typed into a chat window, at length. The model produced the kind of output you get under adversarial phrasing, and nothing checked that output before it reached the user's screen. The fix was telling: Microsoft did not ship a better model that week. It shipped a cap of five turns per session and fifty per day, a fixed rule sitting in front of an unpredictable system.

At a search engine, that cost a bruising news cycle. At a bank, the same transcript becomes an enforcement exhibit, because the regulator does not ask why the model wrote the sentence; it asks why nothing stood between that sentence and the customer.

The gates enforce policy, not the model

Alignment training, the vendor's process for shaping a model toward helpful and polite behavior, is real, but it is not enforcement. It makes good outputs more likely on average, and cannot guarantee any single one.

Policy gets enforced by what you screen on the way in, what you check on the way out, and where a human stands in the flow, not by trusting the model to behave.

The baseline disciplines from "Guardrails: keep secrets, money, and data safe" still hold here: secrets out of the code, caps on spend, personal data inventoried. For a regulated product, add a gate on each side of the model and a paper trail through both.
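The paper trail can start as one structured record per gate decision. The sketch below is a minimal illustration, not a prescribed schema: the function name `gate_record` and every field name are assumptions, and it stores a digest of the input rather than the raw text so the log does not become another copy of the data the gate was protecting.

```python
import hashlib
import json
from datetime import datetime, timezone


def gate_record(gate: str, fired: bool, reason: str, raw_input: str, user_saw: str) -> str:
    """Build one JSON line describing a single gate decision (hypothetical schema)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "gate": gate,      # "input" or "output"
        "fired": fired,    # True if the gate blocked or altered the message
        "reason": reason,  # short machine-readable label, e.g. "pii"
        # Keep a digest instead of the raw text, so the raw input never lands in the log.
        "input_sha256": hashlib.sha256(raw_input.encode("utf-8")).hexdigest(),
        "user_saw": user_saw,
    }
    return json.dumps(record, sort_keys=True)


line = gate_record("input", True, "pii", "SSN 123-45-6789", "Please remove personal details.")
```

Each call returns one line you can append to whatever log store your compliance team already reviews.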

The input gate: screen what reaches the model

The input gate runs before the model receives anything.

Detect and redact personal data before it reaches the model or its logs. Names, account numbers, and health details are replaced with placeholders before the call, because whatever crosses that boundary lands in your logs or your vendor's infrastructure, and once it is held somewhere you cannot reach, you can no longer stop it from leaking.
Screen for injection patterns. The mechanics are covered in "Injection: the input is the attack surface". What changes in a regulated product is the cost of failure: an injected instruction does not need to steal any data to hurt you; it only needs to push the output into territory your license does not allow.
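The two input checks above can be sketched in a few lines. This is a rough illustration under stated assumptions: the patterns, the placeholder labels, and the function names `redact` and `input_gate` are all hypothetical, and a real product would pair patterns like these with a dedicated personal-data detector tuned to its own records.

```python
import re

# Hypothetical patterns for illustration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ACCOUNT": re.compile(r"\b\d{10,16}\b"),
}

INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all |any )?(?:previous|prior|above) instructions", re.I),
    re.compile(r"reveal (?:your |the )?system prompt", re.I),
]


def redact(text: str) -> tuple[str, list[str]]:
    """Replace each match with a placeholder; return the text and the labels hit."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


def input_gate(user_text: str) -> dict:
    """Run before the model call and before anything is written to a log."""
    redacted, pii_labels = redact(user_text)
    blocked = any(p.search(redacted) for p in INJECTION_PATTERNS)
    return {"text": redacted, "pii": pii_labels, "blocked": blocked}
```

Only the `text` field of the result should travel onward to the model or the logs; the original string stays on your side of the boundary.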

The gate also points the other way, at your own people. In 2023 Samsung restricted generative AI on work devices after engineers pasted internal source code into a public chatbot. Nothing attacked anyone; the prompt box itself was the door, so treat every prompt field, internal tools included, as a place data can leave the building.

The output gate: check what reaches the user

The output gate runs on every response before a user sees it.

Run an advice-language check. Test whether the sentence crosses the line this part drew between information the product is allowed to give and regulated advice it is not. Under polite pressure the output drifts across that line, and this check holds it back.
Run a groundedness check. Test whether the answer traces back to the sources the system actually retrieved. An answer that traces back to nothing fetched is a confident guess, and in a regulated product that guess is a liability.
Insert disclosures deterministically. Mandated language ("this is not financial advice", "an adjuster will review this estimate") is added by the system after the model generates its answer, never requested in the prompt. A prompt instruction is followed most of the time, but a regulator expects the disclosure every single time.
Screen tone and content. Hostility, disparagement, and declarations of love are exactly what early Bing failed to screen for. This is the cheapest check to run, and the one whose absence ends up in the headlines.
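Three of these checks are simple enough to sketch together; groundedness needs the retrieved sources and is left out here. Treat the code as an illustration under assumptions: the trigger phrases, the fallback wording, and the name `output_gate` are hypothetical stand-ins for lists your compliance team would own. The part worth copying is the last step, where the disclosure is appended by code on every path.

```python
import re

# Disclosure wording taken from the example above; everything else is hypothetical.
DISCLOSURE = "This is not financial advice."
ADVICE_PATTERNS = [
    re.compile(r"\byou should (?:buy|sell|invest|switch)\b", re.I),
    re.compile(r"\bI(?: would|'d) (?:recommend|choose|pick)\b", re.I),
]
TONE_BLOCKLIST = [
    re.compile(r"\bI love you\b", re.I),
    re.compile(r"\bidiot\b", re.I),
]
FALLBACK = "I can share your plan's official documents, or connect you with a person."


def output_gate(model_text: str) -> dict:
    """Run on every response before a user sees it."""
    reasons = []
    if any(p.search(model_text) for p in ADVICE_PATTERNS):
        reasons.append("advice_language")
    if any(p.search(model_text) for p in TONE_BLOCKLIST):
        reasons.append("tone")
    # If any check fired, the model's text is replaced, not patched.
    body = FALLBACK if reasons else model_text.strip()
    # The disclosure is added deterministically, never left to the prompt.
    if not body.endswith(DISCLOSURE):
        body = f"{body}\n\n{DISCLOSURE}"
    return {"text": body, "fired": reasons}
```

Because the disclosure is appended after the checks, it appears on the fallback message as well as on clean answers.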

Use fixed rules for the checks that can never fail

Blocklists, regular expressions (pattern-matching rules that either match or do not), and hard-coded checks are unfashionable, and we keep reaching for them anyway, because you can read exactly how they fail and they do not change when the model updates. A classifier catches paraphrases that a fixed pattern misses, while the pattern catches the exact forbidden string every time at almost no cost. Layer the two: model-based screens cut volume and catch variation, and fixed rules carry the short list of things that must never pass.
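The layering can be sketched as a function where the fixed rules run first and no model score can override them. The patterns, the threshold, and the name `layered_screen` are assumptions for illustration, and `classifier` stands in for any model-based screen that returns a risk score between 0 and 1.

```python
import re
from typing import Callable

# Hypothetical short list of things that must never pass.
NEVER_PASS = [
    re.compile(r"\bguaranteed returns?\b", re.I),
    re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),  # assumed shape of a leaked API key
]


def layered_screen(text: str, classifier: Callable[[str], float], threshold: float = 0.8) -> str:
    """Return 'block_rule', 'block_model', or 'pass'."""
    # Fixed rules run first: an exact forbidden string is caught every time.
    if any(p.search(text) for p in NEVER_PASS):
        return "block_rule"
    # The model-based screen catches paraphrases the patterns miss.
    if classifier(text) >= threshold:
        return "block_model"
    return "pass"
```

Returning which layer fired, rather than a bare yes or no, lets you see over time how much work each layer is doing.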

Find the exits your diagram never shows

Architecture diagrams show the doors you designed, but data tends to leave through the ones you never drew: the prompt box where an employee pastes a contract, vendor-side logs that keep prompts under terms you skimmed, a retrieval index built on a document store with wider permissions than the person asking, and transcripts that walk out inside tickets, emails, and analytics events.

Audit those doors on a schedule rather than after an incident. For each one, write down who can read what passes through it and how long it is kept. The over-shared index deserves the first look: a retrieval system that ignores file permissions can quote a document straight into the chat from a file the asking user could never have opened.
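One way to close the over-shared index is to carry each file's permissions into the index and filter retrieved passages against the asking user before anything reaches the model. The sketch below assumes a simple group-based permission model; the `Chunk` type, its fields, and `filter_by_permission` are hypothetical names, not part of any particular retrieval library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]  # copied from the source file's permissions at index time


def filter_by_permission(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Drop any retrieved chunk the asking user could not have opened directly."""
    return [c for c in chunks if c.allowed_groups & user_groups]


retrieved = [
    Chunk("hr-1", "Salary bands for 2023", frozenset({"hr"})),
    Chunk("faq-1", "Branch opening hours", frozenset({"hr", "support"})),
]
visible = filter_by_permission(retrieved, {"support"})
```

Here a support agent's question can draw on the FAQ but never on the HR file, even though both sit in the same index.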

Design what happens when a gate fires

When a gate fires, it has to hand the user something deliberate. Strong teams build standard exits: a handoff to a human with the conversation context attached, a narrower answer that stays inside what the product is allowed to say ("here are your plan's official documents" rather than "here is what I would do"), or a clear explanation of why this product does not answer that question. The one thing you never ship is a dead end, a generic refusal that reads like an error message, because users in a regulated product hit the gates often enough that this moment is part of the product.
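The standard exits can be written down as data, so every gate reason maps to a deliberate response and nothing falls through to a generic refusal. This is a sketch under assumptions: the reason labels, the message wording, and the mapping itself are hypothetical, and the default for an unknown reason is a human handoff rather than a dead end.

```python
from enum import Enum


class Exit(Enum):
    HUMAN_HANDOFF = "human_handoff"
    NARROWER_ANSWER = "narrower_answer"
    EXPLAIN_LIMIT = "explain_limit"


# Hypothetical mapping from gate reason to exit.
EXIT_FOR_REASON = {
    "advice_language": Exit.NARROWER_ANSWER,
    "tone": Exit.HUMAN_HANDOFF,
    "injection": Exit.EXPLAIN_LIMIT,
}

MESSAGES = {
    Exit.HUMAN_HANDOFF: "I'm connecting you with a person who can help.",
    Exit.NARROWER_ANSWER: "Here are your plan's official documents.",
    Exit.EXPLAIN_LIMIT: "This product can't answer that kind of question.",
}


def respond_to_gate(reason: str, transcript: list[str]) -> dict:
    """Pick a deliberate exit for a fired gate; unknown reasons go to a human."""
    chosen = EXIT_FOR_REASON.get(reason, Exit.HUMAN_HANDOFF)
    response = {"exit": chosen.value, "message": MESSAGES[chosen]}
    if chosen is Exit.HUMAN_HANDOFF:
        # Attach the conversation so the human does not start from zero.
        response["context"] = list(transcript)
    return response
```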

Describe your gates in the words examiners already use

NIST's AI Risk Management Framework 1.0, published in January 2023, organizes this work into four functions: govern (who owns the policy), map (where the risks live in your product), measure (how you test for them), and manage (what happens when a test fails). US examiners increasingly use that vocabulary, and so do the compliance reviewers inside your own company. Writing your gates up under those four headings costs an afternoon and buys credibility with both. Your legal and compliance teams stay the authority on what the product is allowed to say, and the gates are how their ruling turns into something the product actually does.

Try it now

The drill is the gate test, about fifteen minutes against your own product, in staging if you have one.

Pick one real flow. Choose the path where model output reaches a customer or feeds a regulated decision, the one whose transcript you would least like to see posted.

Write ten adversarial inputs. Two pasted-secrets cases (an API key, a block of realistic customer records), three injection attempts borrowed from the injection chapter's drill, three advice baits phrased politely ("given my situation, which option would you personally choose?"), and two tone traps (open hostility that invites hostility back, and an invitation to joke about your own company).

Run them and record the catch. For each input, note which gate caught it, input or output or neither, and write down exactly what the user saw. A catch followed by an ugly dead end is a finding, and a clean miss gets an owner today.
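If you would rather script the run than do it by hand, the recording step can be sketched as a small harness. Everything here is an assumption for illustration: the gate and model callables are stand-ins you would replace with calls into your own staging product, and the return shapes are a convention invented for this example (the input gate returns the message the user would see, or `None` if it lets the input through; the output gate returns whether it fired and what the user saw).

```python
from typing import Callable, Optional


def run_gate_test(
    cases: list[tuple[str, str]],
    input_gate: Callable[[str], Optional[str]],
    model: Callable[[str], str],
    output_gate: Callable[[str], tuple[bool, str]],
) -> list[dict]:
    """Run each (kind, prompt) case and record which gate caught it."""
    rows = []
    for kind, prompt in cases:
        blocked_message = input_gate(prompt)  # None means the input gate let it through
        if blocked_message is not None:
            caught_by, user_saw = "input", blocked_message
        else:
            fired, user_saw = output_gate(model(prompt))
            caught_by = "output" if fired else "neither"
        rows.append({"kind": kind, "caught_by": caught_by, "user_saw": user_saw})
    return rows


def misses(rows: list[dict]) -> list[dict]:
    """Clean misses: inputs neither gate caught. Each one needs an owner."""
    return [r for r in rows if r["caught_by"] == "neither"]
```

Read the `user_saw` column yourself even for the catches, since an ugly dead end counts as a finding and no script will flag it for you.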

Scale it down: run only the three you most expect to fail.

Chapter summary

Alignment training makes good outputs more likely, but it never enforces a rule. Your gates are what enforce policy.
Put a gate on each side of the model. On the way in, redact personal data, screen for injection, and treat the prompt box as a door your own staff can leak through.
On the way out, check for regulated advice, check that the answer traces to real sources, add required disclosures by code, and screen the tone.
For the checks that can never fail, use fixed rules like blocklists and patterns, with model-based screens layered on top to catch variations.
Log every gate decision and what the user saw, because to an auditor a missing record looks like a missing control.
Data also leaves through doors your diagram never shows, so audit them on a schedule, starting with any retrieval index that ignores file permissions.
When a gate fires, hand the user a real path (a human handoff, a narrower answer, or a clear reason), never a dead end.
Describe your gates with NIST's four functions (govern, map, measure, manage) so examiners and compliance reviewers recognize the words.
Gates keep the product inside its limits on its worst day. Proving those decisions are fair is the next job, in "Proving fairness and explaining every no".

Sources

Press reporting on the Bing chat episodes and Microsoft's conversation caps (2023).
Press reporting on Samsung's restriction of generative AI tools on work devices (2023).
NIST, Artificial Intelligence Risk Management Framework 1.0 (2023).
