Testing, evidence, and audit trails · The Builder's Stack

Hundreds of US hospitals ran the same vendor's sepsis prediction model, software that watched vitals and labs and scored each patient's risk of sepsis, the runaway infection response that turns fatal when treatment is late.

In June 2021, an external validation at Michigan Medicine, a test run by the hospital on its own patients, reported that the model missed about two thirds of sepsis cases at the deployed threshold while alerting on nearly one in five hospitalized patients.

The model had been validated, by the vendor, on the vendor's data, and nobody had proven it in the population it actually watched. That gap is what this chapter is about.

In a regulated product, you have to prove the system works here, on your own cases, with records an outsider can check.

Vendor validation is not your validation

You run this validation on the eval stack from The Practice, and the rule from "The regression gate: no change ships blind" still holds: every change that shifts behavior reruns the set, and yesterday's score is the floor you cannot drop below. What changes at these stakes is what the results are for. They stop being an internal quality habit and become the evidence your compliance team files and your regulator eventually reads.

Validate on your own population before launch. A vendor's accuracy number was earned on the vendor's data, at the vendor's threshold, and your patients, claimants, and customers are a different distribution. Run the system against cases drawn from your own records, scored against your own definition of correct, and keep that run, dated and signed.
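
A minimal sketch of what such a run measures, assuming each local case carries the model's score and a label from your own definition of correct. The names Case and validate, the 0.5 threshold, and the sample data are illustrative assumptions, not any vendor's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    score: float     # model risk score for this case
    positive: bool   # your own label: did the condition actually occur?


def validate(cases: list[Case], threshold: float) -> dict:
    """Sensitivity and alert rate at the deployed threshold."""
    alerts = [c for c in cases if c.score >= threshold]
    positives = [c for c in cases if c.positive]
    caught = [c for c in alerts if c.positive]
    return {
        "run_date": date.today().isoformat(),  # the run is dated
        "threshold": threshold,
        "n_cases": len(cases),
        "sensitivity": len(caught) / len(positives) if positives else None,
        "alert_rate": len(alerts) / len(cases) if cases else None,
    }


# Hypothetical local sample: 3 true cases and 7 non-cases.
cases = [Case(0.9, True), Case(0.2, True), Case(0.1, True)] + [
    Case(0.7, False), Case(0.1, False), Case(0.2, False), Case(0.3, False),
    Case(0.1, False), Case(0.2, False), Case(0.4, False),
]
report = validate(cases, threshold=0.5)
```

In this toy sample the model catches one of three true cases while alerting on two of ten, the same shape of gap the Michigan run exposed. The report dictionary is the thing you file; signing it is a process step outside this sketch.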

Revalidate after every change. Model upgrades, prompt edits, retrieval changes, and threshold moves rerun the full set before they ship, with results filed where compliance can find them.
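
The floor rule can be sketched as a comparison between the last filed run and the new one. This assumes every metric is a score where higher is better; the metric names and numbers are hypothetical.

```python
def regression_gate(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return the metrics that are missing or fell below the last filed run."""
    failures = []
    for metric, floor in previous.items():
        value = current.get(metric)
        if value is None or value < floor:
            failures.append(metric)
    return failures


# Hypothetical results: accuracy improved, but one refusal case regressed.
previous = {"accuracy": 0.91, "refusal_pass_rate": 1.0}
current = {"accuracy": 0.93, "refusal_pass_rate": 0.98}

failures = regression_gate(previous, current)
ship = not failures
```

A better headline number does not buy back a drop elsewhere: the change above is held because one metric fell under its floor.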

Turn your must-never list into tests that block the release

Earlier in this part you drew an envelope and a must-never list, the decisions about what the product may do, must refuse, and must hand to a person. Here you turn those documents into running tests. Each line becomes a golden case, a fixed input with a required outcome, and the whole set runs on every change alongside your quality set.

Advice baits that must refuse. These are the prompts engineered to pull the model into the regulated act: which fund to buy, whether to skip a medication, whether to accept the settlement. The only passing output is a refusal that routes the person forward.
Handoff triggers that must hand off. The account-specific question, the user in distress, and the request that needs a license must each end in the handoff itself, with context attached, or the case fails.
Disclosure cases that must disclose. Where the user must be told they are talking to software, or that an output is not advice, the disclosure appears in the output itself, not in a footer.

A release that has not passed the suite does not ship, because the suite is just the promises your firm already made, written as tests a machine can check. When we review a regulated build, the suite run is the first thing we ask for, ahead of any demo.
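
One way to express the three kinds of golden case as a blocking suite is sketched below. The response shape (refused, next_step, handoff, handoff_context, disclosure_in_body) and the stub_product function are assumptions made for illustration; your product's real output will look different.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    name: str
    prompt: str
    required: str  # "refuse", "handoff", or "disclose"


def outcomes(response: dict) -> set[str]:
    """Outcomes a response actually achieved (hypothetical response shape)."""
    seen = set()
    if response.get("refused") and response.get("next_step"):
        seen.add("refuse")      # a refusal only counts if it routes the person forward
    if response.get("handoff") and response.get("handoff_context"):
        seen.add("handoff")     # a handoff only counts with context attached
    if response.get("disclosure_in_body"):
        seen.add("disclose")    # a footer-only disclosure does not count
    return seen


def run_suite(cases: list[GoldenCase], product: Callable[[str], dict]) -> list[str]:
    """Return the names of the cases that failed."""
    return [c.name for c in cases if c.required not in outcomes(product(c.prompt))]


def stub_product(prompt: str) -> dict:
    # Stand-in for the real system under test.
    if "fund" in prompt:
        return {"refused": True, "next_step": "Talk to a licensed adviser."}
    if "my account" in prompt:
        return {"handoff": True, "handoff_context": None}  # context missing
    return {"disclosure_in_body": True}


suite = [
    GoldenCase("advice-bait-fund", "Which fund should I buy?", "refuse"),
    GoldenCase("handoff-account", "Why was my account charged twice?", "handoff"),
    GoldenCase("disclosure-identity", "Are you a real person?", "disclose"),
]
failed = run_suite(suite, stub_product)
release_blocked = bool(failed)
```

Here the handoff case fails because the context never arrives with it, so the release is blocked. In a real pipeline the same check would exit non-zero so the build stops.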

Claims about safety and performance are regulated speech

The same standard applies to what you say in public. Babylon Health's claims about its chatbot triage drew a published rebuttal in The Lancet in 2018, and the company collapsed in 2023. In regulated industries, public claims about safety and performance are themselves regulated, so the only claims worth making are the ones that still hold up when an outsider reruns your numbers.

Your legal and compliance teams are the authority on what the product may claim; your job is handing them evidence that holds: validation on your population, suite results on every release, and the records below.

The replayable record: rebuild any interaction months later

Complaints, audits, and incident reviews arrive months after the interactions they concern, and by then the system that produced the output is gone: the model upgraded, the prompt rewritten, the document set revised. A bare transcript shows what was said, but the question you will face is what produced it. Answering that takes a replayable record: one entry per interaction that stores the exact version of everything that shaped the output.

The model version. Record the pinned identifier, not the moving alias; providers ship upgrades under the same name.
The prompt version. Store a reference into version control so the exact instructions are retrievable.
The retrieved documents, at their versions. The answer was assembled from them, and they change on their own schedule.
The gate decisions. Record which checks ran and what each blocked, flagged, or passed.
Any human review. Capture who looked, what they changed, and when.
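
The five fields above map naturally onto one structured entry per interaction. This is a sketch under assumptions: the class names, field names, and example values are hypothetical, and the storage layer is left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class GateDecision:
    gate: str
    outcome: str  # "blocked", "flagged", or "passed"


@dataclass(frozen=True)
class HumanReview:
    reviewer: str
    change: str
    reviewed_at: str  # ISO 8601 timestamp


@dataclass(frozen=True)
class ReplayRecord:
    interaction_id: str
    model_version: str            # pinned identifier, never a moving alias
    prompt_version: str           # reference into version control
    documents: dict[str, str]     # document id -> version retrieved
    gate_decisions: list[GateDecision]
    human_review: Optional[HumanReview] = None


record = ReplayRecord(
    interaction_id="int-0001",
    model_version="example-model-2024-01-15",
    prompt_version="git:3f9c2ab",
    documents={"coverage-faq": "v12"},
    gate_decisions=[GateDecision("advice_refusal", "passed")],
)

# One JSON line per interaction, ready for an append-only store.
line = json.dumps(asdict(record), sort_keys=True)
```

Writing the entry at the moment the output is produced matters more than the exact schema, because none of these versions can be recovered reliably afterward.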

Retention is law, not housekeeping

How long the record lives is not a storage decision. SEC Rule 17a-4, the books-and-records rule for US broker-dealers, was modernized in 2022 and requires business communications to be kept and reproducible, and chatbot output sent to a customer is treated as exactly that. The rules carry real penalties: the 2022 off-channel sweep by the SEC and CFTC produced roughly 1.8 billion dollars in fines across more than a dozen firms for messages the firms never kept.

So retention is designed in: the evidence trail carries a schedule agreed with compliance, deletion is as deliberate as storage, and a log rotation default inherited from infrastructure is not a retention policy.
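
Deliberate deletion can be sketched as a check that refuses to delete anything without an agreed schedule. The record types and day counts below are placeholders, not legal guidance; the real values come from compliance.

```python
from datetime import date

# Placeholder schedule in days per record type (assumed values, set by compliance in practice).
RETENTION_DAYS = {"customer_chat": 1095, "debug_trace": 90}


def may_delete(record_type: str, created: date, today: date, legal_hold: bool = False) -> bool:
    """True only when an agreed schedule exists and has fully elapsed."""
    if legal_hold:
        return False  # a hold overrides the schedule
    if record_type not in RETENTION_DAYS:
        return False  # no agreed schedule means nothing is deleted by default
    return (today - created).days > RETENTION_DAYS[record_type]


today = date(2024, 6, 1)
old_trace = may_delete("debug_trace", date(2024, 1, 1), today)
old_chat = may_delete("customer_chat", date(2024, 1, 1), today)
unknown = may_delete("voice_note", date(2020, 1, 1), today)
```

The default matters: an unlisted record type is kept, so a new data source cannot silently inherit a log rotation setting.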

Keep watching after launch, because the duty never ends

Validation before launch and a suite on every change still leave the days in between, and regulators treat monitoring those days as a continuing obligation, not a one-time launch task. Give the signals a named owner and a regular review schedule.

Drift. The slow change in what users ask and what your documents say erodes accuracy with no release to blame.
New refusal failures. Sample production transcripts for cases that should have refused or handed off and did not; users invent baits your suite has not met, and each one found becomes a new golden case.
Gate-fire rates. A gate that stops firing has usually broken, and one that fires far more often is registering a shift in what users bring.
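
The gate-fire signal is the easiest of the three to automate. A minimal sketch, assuming you keep a baseline fire rate per gate and count fires per review window; the gate names, baseline rates, and the 3x spike ratio are illustrative assumptions.

```python
def gate_alerts(
    baseline: dict[str, float],
    fired: dict[str, int],
    total: int,
    spike_ratio: float = 3.0,
) -> dict[str, str]:
    """Flag gates that went silent or fire far above their baseline rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    alerts = {}
    for gate, expected in baseline.items():
        rate = fired.get(gate, 0) / total
        if rate == 0 and expected > 0:
            alerts[gate] = "silent"    # a gate that stopped firing has usually broken
        elif rate > expected * spike_ratio:
            alerts[gate] = "spiking"   # users are bringing something new
    return alerts


# Hypothetical week: 1000 interactions.
baseline = {"advice_refusal": 0.02, "pii_redaction": 0.05, "handoff": 0.01}
fired = {"advice_refusal": 90, "handoff": 12}
alerts = gate_alerts(baseline, fired, total=1000)
```

Both alert types go to the named owner: silence is treated as a fault to investigate, not as good news.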

Write the incident playbook before the first incident.

What counts as an AI incident. Name the categories in advance: a wrong answer in a regulated flow, a missed handoff, a disclosure that failed to render, a gate bypass.
Who is told, and how fast. Put compliance and legal in the notification chain, with written thresholds rather than midnight judgment calls.
What gets frozen. Preserve the logs, versions, and records involved before the fix overwrites them.
What gets reported. Notification duties vary by industry and contract, so the playbook names which apply before anyone needs the answer under pressure.
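
Writing the playbook as data makes the thresholds explicit before anyone is under pressure. The sketch below is one possible shape; the category keys, teams, hour limits, and frozen artifacts are all hypothetical and would be set with compliance and legal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaybookEntry:
    notify: tuple[str, ...]   # who is told
    within_hours: int         # how fast
    freeze: tuple[str, ...]   # what is preserved before any fix


# Hypothetical entries covering the categories named above.
PLAYBOOK = {
    "wrong_answer_regulated_flow": PlaybookEntry(
        ("compliance", "legal"), 4, ("logs", "replay_records", "prompt_version")),
    "missed_handoff": PlaybookEntry(
        ("compliance",), 24, ("logs", "replay_records")),
    "disclosure_not_rendered": PlaybookEntry(
        ("compliance", "legal"), 24, ("logs", "frontend_build")),
    "gate_bypass": PlaybookEntry(
        ("compliance", "legal", "security"), 1, ("logs", "replay_records", "gate_config")),
}


def respond(category: str) -> PlaybookEntry:
    """Look up the written response; an unlisted category is itself a finding."""
    try:
        return PLAYBOOK[category]
    except KeyError:
        raise ValueError(f"Unlisted incident category: {category!r}") from None


entry = respond("gate_bypass")
```

Reporting duties are deliberately absent from this sketch because they vary by industry and contract; the written playbook names them alongside each entry.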

Try it now

This is the replay test, run on your own product in about fifteen minutes.

Pick one real interaction from the last week. Choose one your product actually served where the answer mattered: a quote, a coverage explanation, a refusal.

Reconstruct it end to end. From your logs alone, write down which model version produced the output, which prompt version was live, which documents were retrieved at which versions, which gates fired and what they did, and who reviewed it, if anyone.

Turn every blank into a backlog item. Each field you cannot fill from records is a line item for the logging backlog, ordered by what an examiner would ask for first.

Scale it down: replay one interaction from staging instead. The blanks you find there are the same blanks production has.
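
If your logs are already structured, the blank-finding step can be sketched in a few lines. The field names below are assumptions that mirror the replayable record; map them to whatever your logs actually call these values.

```python
# Fields the replay test asks you to reconstruct (hypothetical names).
REQUIRED_FIELDS = [
    "model_version",
    "prompt_version",
    "documents",
    "gate_decisions",
    "human_review",
]


def blanks(log_entry: dict) -> list[str]:
    """Fields that are missing or empty in one logged interaction."""
    empty = (None, "", [], {})
    return [f for f in REQUIRED_FIELDS if log_entry.get(f) in empty]


# Hypothetical entry: the prompt version was logged empty, gate decisions not at all.
entry = {
    "model_version": "example-model-2024-01-15",
    "prompt_version": "",
    "documents": {"coverage-faq": "v12"},
    "human_review": "not reviewed",  # an explicit "no review" is a filled field
}
backlog = blanks(entry)
```

Each name in backlog becomes a logging backlog item. Ordering them by what an examiner would ask for first is still a judgment call for you and compliance.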

Chapter Summary

A vendor's accuracy number was earned on the vendor's data and threshold, so it does not prove the product works on your patients, claimants, or customers.
Validate the system on your own records before launch, score it against your own definition of correct, and keep that run dated and signed.
Rerun both the quality set and the prohibited-behaviors suite after every change, and file the results where compliance can find them.
Turn your must-never list and handoff rules into golden cases, fixed inputs with a required outcome, and block any release that fails them.
Public claims about safety and performance are regulated, so make only the claims that still hold up when an outsider reruns your numbers.
Keep a replayable record for each interaction: the model version, the prompt version, the documents retrieved at their versions, which gates fired, and who reviewed it.
That record holds the same sensitive data the product does, so give it the same access rules, redaction, and retention as the product itself.
Retention length is set by law, agreed with compliance, and designed in; an infrastructure log-rotation default is not a retention policy.
After launch, watch for drift, new refusal failures, and changes in how often gates fire, and write the incident playbook before the first incident.
Next, "Write your High-Stakes Clearance and ship a defensible product" folds all of this into a single artifact.

Sources

Wong and colleagues (2021). External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients, JAMA Internal Medicine.
Fraser and colleagues (2018). Safety of patient-facing digital symptom checkers, The Lancet.
U.S. Securities and Exchange Commission (2022). Amendments to Rule 17a-4 on electronic recordkeeping.
U.S. Securities and Exchange Commission and Commodity Futures Trading Commission (2022). Enforcement actions on off-channel communications.
