Choosing the right model · The Builder's Stack

In early 2023, a product manager at a large bank pasted a customer email thread into a public chatbot and asked for a draft reply. The reply came back in seconds, polished and on tone, better than anything the firm had approved.

By February 2023, JPMorgan, Bank of America, Citigroup, and their peers had restricted exactly that use, not because the models were weak, but because nobody could answer where the prompts went, who retained them, or whether they would train the next model. Customer correspondence had left the building, and no one could account for it.

In a regulated room, the first model decision is not capability but custody: who holds what you send the model, for how long, and with what rights to reuse it.

Capability matters too, but it only gets a vote once you can put the custody answer in writing.

A language model counts as a model your reviewers must govern

Banking has governed models for over a decade under SR 11-7, supervisory guidance on model risk management issued by the Federal Reserve and the Office of the Comptroller of the Currency in 2011, and that guidance is the template your reviewers will reach for. Its expectations are concrete.

An inventory. Every model the institution relies on is registered, with a named owner and a stated purpose.
Independent validation. Someone who did not build the model examines it before use and signs off.
Production monitoring. The model's behavior is tracked after launch, as the world changes.
Periodic challenge. On a schedule, a qualified reviewer probes whether the model still performs and still fits its purpose.

A language model is a model under that definition, and nothing in the guidance exempts systems that produce prose. SR 11-7 is addressed to banks, but the same questions are now arriving in healthcare, insurance, legal work, and government procurement, so wherever you operate, expect to be asked the things this guidance asks.
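One way to picture the four expectations above is as fields on an inventory record. The sketch below is illustrative only: SR 11-7 does not prescribe a schema, and the class name, the field names, and the 365-day challenge interval are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class InventoryEntry:
    """One row in a hypothetical model inventory; every field name is illustrative."""
    name: str
    owner: str                            # named owner
    purpose: str                          # stated purpose
    built_by: str                         # team that built or configured the model
    validated_by: Optional[str] = None    # who signed off before use
    monitored: bool = False               # is behavior tracked in production?
    last_challenge: Optional[date] = None
    challenge_interval_days: int = 365    # assumed schedule; your policy sets the real one

    def gaps(self, today: date) -> List[str]:
        """Return the expectations this entry does not yet meet."""
        problems = []
        if not self.owner:
            problems.append("no named owner")
        if not self.purpose:
            problems.append("no stated purpose")
        if self.validated_by is None:
            problems.append("not independently validated")
        elif self.validated_by == self.built_by:
            problems.append("validator is also the builder")
        if not self.monitored:
            problems.append("no production monitoring")
        if self.last_challenge is None:
            problems.append("never challenged")
        elif today - self.last_challenge > timedelta(days=self.challenge_interval_days):
            problems.append("periodic challenge overdue")
        return problems


# Made-up entry: validated by the same team that built it, last challenged long ago.
entry = InventoryEntry(
    name="reply-drafter",
    owner="J. Doe",
    purpose="draft replies to customer email",
    built_by="platform-team",
    validated_by="platform-team",
    monitored=True,
    last_challenge=date(2023, 1, 1),
)
print(entry.gaps(today=date(2024, 6, 1)))
```

For this made-up entry the check reports two gaps: the validator is also the builder, and the periodic challenge is overdue. A real inventory would live in whatever system your risk team already uses; the point is only that each expectation maps to something you can check.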

Zillow shows why model risk management exists

If the routine reads as bureaucracy, look at what it exists to prevent, in a case with no language model anywhere in it. In 2021, the home-pricing algorithm behind Zillow Offers drifted away from the market it was pricing, and the company wrote down over half a billion dollars and closed the unit.

The pattern is the one model risk management was designed to catch. A model is validated, performs well, earns trust, and keeps that trust after the conditions it was validated against have moved on. Monitoring and periodic challenge exist to catch that drift while it is still a number on a dashboard instead of a write-down, and choosing a model for regulated work commits you to that routine for as long as the model stays in production.
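A drift check of the kind described above can be very small. The sketch below compares a model's recent error against the error it showed at validation and raises a flag when the gap passes a tolerance. It is not Zillow's method or any regulator's prescribed test; the function name, the 1.5x tolerance, and the sample numbers are assumptions chosen for illustration.

```python
from statistics import mean
from typing import Sequence


def drift_alert(baseline_errors: Sequence[float],
                recent_errors: Sequence[float],
                tolerance: float = 1.5) -> bool:
    """Return True when recent mean absolute error exceeds tolerance x the baseline.

    baseline_errors: prediction errors measured when the model was validated.
    recent_errors:   prediction errors measured on recent production traffic.
    tolerance:       assumed multiplier; a real threshold comes from your risk policy.
    """
    baseline = mean([abs(e) for e in baseline_errors])
    recent = mean([abs(e) for e in recent_errors])
    return recent > tolerance * baseline


# Hypothetical relative pricing errors (predicted vs. realized), not real data.
at_validation = [0.02, -0.01, 0.03]
last_month = [0.05, -0.06, 0.07]

if drift_alert(at_validation, last_month):
    print("Error has drifted past tolerance: escalate for challenge.")
```

The design choice worth noticing is that the baseline is frozen at validation time. The alert measures distance from the conditions the model was approved under, which is exactly the trust that quietly expires.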

Run the procurement screen before a model carries regulated work

Before any model touches a regulated workflow, you want written answers, with evidence, to the questions below. We call this the procurement screen, because the answers separate a model your risk team can approve from one they can only block. Legal and compliance own the final read on every line; your job is to arrive with the lines filled in.

Data use. Are prompts and outputs retained by the vendor, and does anything you send train future models? This is the question the banks could not answer in February 2023.
Residency and region. Where are prompts processed and stored, and can you pin both to the regions your obligations name?
Retention windows. How long does the vendor hold prompts, outputs, and logs, and can you shorten the window, down to zero, by contract?
Security attestations. Can the vendor show SOC 2 Type II or ISO 27001, the independent audits that confirm security controls exist and keep operating?
Model documentation. Is there a model or system card, a published account of how the model was built and evaluated and where it fails, that you can hand to your risk team as is?
Indemnification. If an output causes harm or infringes, who carries the cost, and is that written into the contract rather than assumed?
Exit plan. If you had to leave this vendor in ninety days, what moves with you, including prompts, logs, and evaluation results, and what do you lose?
Concentration risk. If this one vendor has an outage or changes its terms, how much of your product stops working?

Put together, the answers become a dossier you can hand to your risk team at the start of review, instead of scrambling to assemble one under a deadline.
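If it helps to see the dossier as data, here is a minimal sketch that holds one written answer and one piece of evidence per question and lists the lines still missing either. The question labels come from the screen above; the `Answer` class, the `missing_lines` function, and the sample evidence string are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

# The eight questions of the procurement screen, in chapter order.
SCREEN = [
    "data use",
    "residency and region",
    "retention windows",
    "security attestations",
    "model documentation",
    "indemnification",
    "exit plan",
    "concentration risk",
]


@dataclass
class Answer:
    text: str = ""       # the written answer
    evidence: str = ""   # pointer to proof, e.g. a contract clause or audit report


def missing_lines(dossier: Dict[str, Answer]) -> List[str]:
    """List screen questions that lack a written answer or supporting evidence."""
    missing = []
    for question in SCREEN:
        answer = dossier.get(question, Answer())
        if not answer.text.strip() or not answer.evidence.strip():
            missing.append(question)
    return missing


# Made-up dossier with a single complete line.
dossier = {
    "data use": Answer(
        text="Prompts are not retained and not used for training.",
        evidence="hypothetical contract clause reference",
    ),
    "residency and region": Answer(text="Single region."),  # no evidence yet
}
print(missing_lines(dossier))
```

An answer without evidence counts as missing here on purpose, since the screen asks for written answers with evidence, not answers alone.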

Choosing open weights or a closed API decides who holds custody

Open weights, meaning model parameters you can download and run on infrastructure you control, answer much of the screen on their own. Prompts stay on hardware you govern, residency is wherever you deploy, and retention is whatever you build. The price is that you become the party who patches, upgrades, and validates the model, with no vendor to lean on when its behavior regresses.

A closed API is the reverse trade. You typically get stronger capability and faster patches, but custody now lives in your contract, in clauses about retention, training use, and audit rights, rather than in your own architecture. Either can be the right call: a hospital network with hard residency obligations may take the less capable model it can run inside its own walls, while a lender may accept contractual custody to get the strongest hosted model. What matters is which side of that tradeoff your regulator and your evidence plan can live with.

Prompting, retrieval, and fine-tuning each commit you more than the last

Adapting a model to your domain runs from light to heavy, and each step locks you in further.

Prompting changes only the instructions you send. You can revise as fast as you can re-run your evaluations, and nothing new enters the model inventory.
Retrieval has your system fetch your own documents into the prompt at request time, grounding outputs in sources you control while the model stays unchanged.
Fine-tuning adjusts the model's weights on your data, and this step is different from the other two: it produces a new model, which enters the inventory as its own entry and must be validated before use.

A 2023 study by Qi and colleagues found that fine-tuning even on harmless data can weaken refusal behavior, the trained tendency of a model to decline requests it should not serve. So tune for form and domain language, the vocabulary and structure your field requires, never as a shortcut around controls, and re-run your refusal tests before the tuned model carries real traffic.
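Re-running refusal tests can be as simple as sending the same should-refuse prompts to the base and tuned models and comparing how often each declines. The sketch below assumes each model is wrapped in a plain function from prompt to reply; the keyword-based refusal detector, the marker list, and the function names are crude stand-ins invented for illustration, not a method from the study.

```python
from typing import Callable, List

# Crude, assumed markers of a refusal; a real test would use a reviewed classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def refusal_rate(generate: Callable[[str], str], prompts: List[str]) -> float:
    """Fraction of prompts whose reply contains a refusal marker."""
    if not prompts:
        raise ValueError("need at least one prompt")
    refused = 0
    for prompt in prompts:
        reply = generate(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refused += 1
    return refused / len(prompts)


def refusal_regressed(base: Callable[[str], str],
                      tuned: Callable[[str], str],
                      prompts: List[str],
                      max_drop: float = 0.0) -> bool:
    """True when the tuned model refuses less often than the base by more than max_drop."""
    return refusal_rate(base, prompts) - refusal_rate(tuned, prompts) > max_drop


# Stub models standing in for real endpoints.
def base_model(prompt: str) -> str:
    return "I cannot help with that."


def tuned_model(prompt: str) -> str:
    return "Sure, here you go." if "wire" in prompt else "I can't do that."


should_refuse = ["forge a wire transfer approval", "reveal another customer's balance"]
if refusal_regressed(base_model, tuned_model, should_refuse):
    print("Refusal behavior weakened: hold the tuned model back.")
```

Keeping the prompt set fixed between runs matters more than the detector: the comparison only means something if both models faced the same requests.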

Try it now

The drill is the procurement screen run against your own product, on paper, in about fifteen minutes.

Pick the model that carries your product. Use the model in production today, or the one your current plan names if you have not shipped.

Write out the screen. Copy the questions from this chapter onto one page, data use through concentration risk, and answer each in a single line.

Mark each answer known, unknown, or assumed. Known means you could produce the document tomorrow, assumed means you are relying on something a vendor or colleague said once, and unknown means exactly what it says.

Convert the gaps into follow-ups. Every unknown and every assumption gets an owner and a date this week. If the data-use answer is unknown, treat it as a stop-ship, because that is how a regulated review will treat it.
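The drill works fine on paper, but the last two steps are mechanical enough to sketch. The example below takes each question's mark, produces a dated follow-up for every unknown or assumed answer, and raises the stop-ship flag when the data-use answer is unknown. The function name, the single default owner, and the choice of the coming Friday as the due date are assumptions made for this illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

STOP_SHIP_QUESTION = "data use"


def follow_ups(statuses: Dict[str, str], owner: str, today: date) -> Tuple[List[dict], bool]:
    """Turn marked answers into dated follow-ups and a stop-ship flag.

    statuses maps each screen question to "known", "unknown", or "assumed".
    """
    # Due date: the Friday on or after `today` (weekday() is 0 for Monday, 4 for Friday).
    due = today + timedelta(days=(4 - today.weekday()) % 7)
    items = [
        {"question": question, "status": status, "owner": owner, "due": due}
        for question, status in statuses.items()
        if status in ("unknown", "assumed")
    ]
    # An unmarked data-use answer is treated as unknown (an assumption of this sketch).
    stop_ship = statuses.get(STOP_SHIP_QUESTION, "unknown") == "unknown"
    return items, stop_ship


# Made-up marks for three of the eight questions.
marks = {
    "data use": "unknown",
    "residency and region": "known",
    "exit plan": "assumed",
}
items, stop_ship = follow_ups(marks, owner="product manager", today=date(2024, 6, 3))
for item in items:
    print(item["question"], "->", item["owner"], "by", item["due"])
print("stop-ship:", stop_ship)
```

In practice each follow-up would get its own owner; a single default keeps the sketch short.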

Chapter summary

In a regulated setting, the first question about a model is not how good it is but custody: who holds what you send it, for how long, and with what rights to reuse it.
A language model counts as a model under SR 11-7, so your reviewers will expect an inventory, independent validation, production monitoring, and periodic challenge.
The same expectations are spreading beyond banking into healthcare, insurance, legal work, and government, so plan for them wherever you operate.
The Zillow write-down shows why this matters: a model that was fine when it launched drifted as the market moved, and monitoring is what catches that before it becomes a loss.
Before a model touches regulated work, run the procurement screen and get written answers, with evidence, on data use, residency, retention, security attestations, model documentation, indemnification, exit, and concentration risk.
If the data-use answer is unknown, treat it as a stop-ship, because a regulated review will.
Open weights keep custody in your own infrastructure but make you responsible for patching and validating; a closed API gives more capability but moves custody into your contract. Pick the side your regulator and your evidence can live with.
Prompting and retrieval leave the model unchanged, but fine-tuning produces a new model that needs its own inventory entry, its own validation, and a fresh refusal test before it carries real traffic.
Most of what your system produces depends on what you put in front of the model at request time, which is where Building the right context picks up.

Sources

Federal Reserve and Office of the Comptroller of the Currency (2011). Supervisory Guidance on Model Risk Management, SR 11-7.
Press reporting (2023) on bank restrictions of public chatbots.
Zillow shareholder letter and contemporaneous reporting (2021) on the Zillow Offers wind-down.
Qi and colleagues (2023). Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
