A feature is almost right. The model's answers are close but not quite in the voice you want, and the first idea that comes up, in the meeting or in your own head, is "let's fine-tune it." It sounds like the serious, real-engineering move. It is also the most expensive, slowest, and least reversible thing you could reach for, and most of the time a change you could have made in five minutes would have done the job. The skill is knowing which lever to pull, and in what order.
Customizing a model runs from cheap and reversible to costly and locking: instructions, then context, then tools, then fine-tuning. The move is to reach for the cheapest lever that works and stop there.
The four rungs, cheapest first
Each rung bends the model further to your need, and each one costs more than the last.
- Instructions. Change the words you send. This is prompting, free to change and instant to test. A large share of "the model isn't doing what I want" ends right here, with a clearer role, a sharper task, or one good example.
- Context. Give the model your facts, by stuffing them into the prompt or retrieving the relevant few. This is the fix when the model is missing your facts, and it is still cheap and reversible.
- Tools. Let the model call your functions, so it can look up a real order, run an exact calculation, or check live data instead of guessing. This is where the model starts to act. The deep version, where the model strings tools together on its own, is the subject of "When your product starts doing things" in The Practice.
- Fine-tuning. Train the model further on your own examples, which produces a new model tuned toward your style or task. It is the heaviest rung: it costs money and time, and the result is a new model you now own and have to maintain.
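To make the first three rungs concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, the toy keyword retrieval, and the made-up order table stand in for whatever your product actually uses, and no real model API is called. It only shows what each rung changes: the words you send, the facts you attach, and the functions the model is allowed to request.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real text matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Rung 1, instructions: a clearer role, a sharper task, one good example.
def build_instructions(role: str, task: str, example: str) -> str:
    return f"Role: {role}\nTask: {task}\nExample of a good answer: {example}"


# Rung 2, context: retrieve the relevant few facts.
# Toy keyword overlap, assumed here in place of a real retrieval system.
def retrieve(question: str, facts: list[str], k: int = 2) -> list[str]:
    question_words = tokens(question)
    scored = [(len(question_words & tokens(fact)), fact) for fact in facts]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [fact for score, fact in ranked[:k] if score > 0]


def build_prompt(instructions: str, question: str, facts: list[str]) -> str:
    context = "\n".join(f"- {fact}" for fact in retrieve(question, facts))
    return f"{instructions}\n\nFacts:\n{context}\n\nQuestion: {question}"


# Rung 3, tools: functions the model may ask your code to run.
# The order table is made-up data for illustration.
ORDERS = {"A100": "shipped", "A101": "processing"}


def lookup_order(order_id: str) -> str:
    return ORDERS.get(order_id, "unknown order")


TOOLS: dict[str, Callable[[str], str]] = {"lookup_order": lookup_order}


def run_tool(name: str, argument: str) -> str:
    """Dispatch a tool call by name; reject tools that were never registered."""
    if name not in TOOLS:
        raise KeyError(f"unregistered tool: {name}")
    return TOOLS[name](argument)
```

Every piece here is ordinary code you can edit and rerun in seconds, which is what makes these rungs cheap and reversible. Fine-tuning has no equivalent in this sketch because its output is a new model, not a changed string or function.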
Why fine-tuning is the last resort, not the first
The instinct to fine-tune first usually rests on a misunderstanding of what it does.
Fine-tuning teaches a model form and behavior, not facts, so it is the wrong tool for a missing-facts problem, which is what retrieval is for.
If your feature gives wrong facts, more training on examples will not fix it; the facts belong in the prompt or in retrieval. Fine-tuning shifts how the model responds (its style, its format, a narrow behavior), not what information it has. The mainstream guidance from the providers themselves is blunt about the order: exhaust prompting, then retrieval, and fine-tune only once you have an eval showing the cheaper levers genuinely fell short. There is also a real risk to weigh: Qi and colleagues (2023) found that fine-tuning a model, even on harmless-looking data, can weaken its trained refusals, so a tuned model needs its safety re-checked before it carries real traffic.
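One way to apply the form-versus-facts distinction is to label each failing eval case by what went wrong before choosing a lever. The sketch below is a hypothetical illustration: the three labels and their mapping to levers are assumptions made for this example, not a standard taxonomy, and the labeling itself would still be done by a person reading the failures.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    WRONG_FORM = "wrong_form"            # style, format, or tone was off
    WRONG_FACT = "wrong_fact"            # information was missing or misstated
    NEEDS_LIVE_DATA = "needs_live_data"  # answer required a lookup or calculation


# Assumed mapping from failure type to the cheapest lever worth trying first.
LEVER = {
    Failure.WRONG_FORM: "instructions",
    Failure.WRONG_FACT: "context",
    Failure.NEEDS_LIVE_DATA: "tools",
}


def suggest_levers(failures: list[Failure]) -> list[tuple[str, int]]:
    """Count failures per lever, most common first."""
    return Counter(LEVER[failure] for failure in failures).most_common()
```

Fine-tuning is deliberately absent from the mapping. In this sketch it becomes a candidate only when form failures persist after the prompt has been pushed as far as it will go.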
What fine-tuning is actually good for
It does have a place. Fine-tuning can be the right call when prompting, context, and tools all work but you still need a very specific form at scale: a house style the model cannot reliably hit from instructions, or a narrow classification you run so often that a long prompt becomes a real cost. Even then it comes last, after the cheaper rungs have proven, with evidence, that they cannot get you there. In a regulated setting the stakes are higher still, because a tuned model is a new model your reviewers must validate from scratch, which "Choosing the right model" covers in The Frontier.
The rule in one line
Start at rung one and climb only when the rung you are standing on provably cannot do the job. Most features never leave instructions and context, and the ones that do should be able to say, with evidence, exactly why the cheaper lever was not enough. That sentence, "here is what we tried and here is why it fell short," is the difference between an engineering decision and an expensive guess.
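The rule can be sketched as a small eval loop, assuming you have a set of test cases and one candidate per rung you have built so far. The names, the case format, and the 0.9 target are all hypothetical choices for this example; a real eval would use your own cases and your own bar.

```python
from typing import Callable, Optional

# Cheapest first, matching the order of the rungs.
RUNGS = ["instructions", "context", "tools", "fine-tuning"]

Answerer = Callable[[str], str]              # takes a question, returns an answer
Case = tuple[str, Callable[[str], bool]]     # (question, check on the answer)


def pass_rate(answer: Answerer, cases: list[Case]) -> float:
    """Fraction of cases whose check accepts the candidate's answer."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    passed = sum(1 for question, check in cases if check(answer(question)))
    return passed / len(cases)


def choose_rung(
    candidates: dict[str, Answerer],
    cases: list[Case],
    target: float = 0.9,  # arbitrary bar, assumed for illustration
) -> tuple[Optional[str], dict[str, float]]:
    """Return the cheapest rung that meets the target, plus the evidence.

    Rungs are tried in order and the loop stops at the first one that
    works, so heavier rungs are never evaluated unnecessarily.
    """
    evidence: dict[str, float] = {}
    for rung in RUNGS:
        if rung not in candidates:
            continue
        evidence[rung] = pass_rate(candidates[rung], cases)
        if evidence[rung] >= target:
            return rung, evidence
    return None, evidence
```

The returned evidence is the useful part: it records the score of every rung tried on the way up, which is the "here is what we tried and here is why it fell short" sentence in numeric form.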
Try it now
No setup: For the feature you are building, write down which rung it actually needs and one sentence on why the next rung up is unnecessary. If your honest answer was "fine-tune," test it: is the model wrong about facts, which is a context problem, or wrong in form, which a sharper prompt might fix? Most "we need to fine-tune" notes turn out to be one of those two in disguise.
With your tools: Ask Claude Code which rung your feature really needs. Have it push the cheapest lever harder first (a tighter prompt or better context) and name the evidence that would justify climbing to the next rung. If your tools are not set up yet, The Setup Clinic gets you there in one sitting. In Codex or Cursor the move is the same: ask which lever the problem actually calls for, and exhaust the cheap ones first.
Chapter Summary
- Customizing a model runs from cheap and reversible to costly and locking: instructions, context, tools, then fine-tuning.
- Reach for the cheapest lever that solves the problem and stop there; most features never climb past instructions and context.
- Instructions mean changing the prompt, the free and instant fix where most problems end.
- Context means giving the model your facts by stuffing or retrieval, the fix when the model is missing your facts.
- Tools let the model call your functions for exact lookups and actions, with the deep agentic version waiting in The Practice.
- Fine-tuning trains a new model on your examples, and it is the heaviest rung: costly, slow, and a new model to maintain.
- Fine-tuning changes form and behavior, not facts, so it is the wrong tool for a knowledge gap, which retrieval solves cheaply.
- The providers' own guidance is to exhaust prompting and retrieval, proven by an eval, before fine-tuning, and a tuned model needs its safety re-checked.
- Climb a rung only when you can say, with evidence, why the cheaper one fell short.
- Next up, "Wire the model into your build" puts everything in this part together into a feature you ship.
Sources
- OpenAI model optimization and fine-tuning best-practices documentation, 2026.
- IBM, "RAG vs. fine-tuning vs. prompt engineering," 2026.
- Qi and colleagues (2023), research on fine-tuning weakening model safety.