Choose a model you can live with · The Builder's Stack


A new model lands almost every week, each announcement claiming the crown, and it is easy to freeze, certain that whatever you pick will be old news by the time you ship. That instinct gets the question backwards: you are not trying to find the single best model in the world, only one that fits this build, and that is a much smaller, much calmer decision.

Choosing a model at this stage means picking a strong default and a few axes to judge it on, not chasing the top of a leaderboard that reshuffles every month.

The five axes that actually decide

Almost every model choice comes down to how a candidate scores on five things, weighed against what your feature needs.

Capability. Is it good enough at your specific task, whether that is writing, pulling data out of a document, reasoning through a problem, or reading an image? More capability than the task needs is just money spent.
Cost per call. You met this in Make your first model call: you pay per token, so a pricier model's higher rate is multiplied across every call you make. The gap between the top tier and the cheaper tier is often large.
Speed. How long the user waits for a reply. A heavier model takes longer to answer, which is fine for a background job and painful in a live chat.
Context window. The most tokens the model can take in one call, which caps how much you can hand it at once. You will lean on this in Give the model the facts it wasn't trained on.
Privacy and hosting. Where your data goes when you call the model, which the last section returns to.

No model wins on all five, so the work is matching the axes your feature cares about to a model that scores well on those.
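One way to make that matching concrete is a small weighted score. The sketch below is only an illustration: the model names, the 0 to 10 scores, and the weights are all made up, and you would replace them with your own judgments for your own feature.

```python
# Illustrative only: candidate names, scores (0-10), and weights are assumptions.
AXES = ("capability", "cost", "speed", "context", "privacy")


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the five axis scores; weights need not sum to 1."""
    total_weight = sum(weights[axis] for axis in AXES)
    if total_weight <= 0:
        raise ValueError("at least one axis needs a positive weight")
    return sum(scores[axis] * weights[axis] for axis in AXES) / total_weight


# A live-chat feature: speed and cost matter most (hypothetical weights).
chat_weights = {"capability": 2, "cost": 3, "speed": 4, "context": 1, "privacy": 1}

# Higher is better on every axis, so a cheap model gets a HIGH cost score.
candidates = {
    "flagship-model": {"capability": 9, "cost": 3, "speed": 4, "context": 8, "privacy": 7},
    "fast-model": {"capability": 7, "cost": 9, "speed": 9, "context": 6, "privacy": 7},
}

ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], chat_weights),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name], chat_weights):.2f}")
```

Change the weights to describe a background job that needs deep reasoning and the ranking can flip, which is the point: the feature sets the weights, not the leaderboard.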

Flagship or fast: start high, then drop where quality holds

Each major provider, OpenAI, Anthropic, and Google among them, ships at least two tiers: a flagship built for the hardest work, and a smaller, faster, far cheaper model for everything else. Today's smaller tier is often more capable than the flagships of a year ago, which is why it carries so much real production traffic.
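To see what "far cheaper" means for your own feature, multiply it out. The sketch below assumes per-million-token pricing with separate input and output rates, and every price and token count in it is a placeholder, not a real quote. Check the provider's current pricing page for actual numbers.

```python
def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one call, given token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Hypothetical prices in dollars per million tokens (NOT real provider pricing).
FLAGSHIP = {"input": 5.00, "output": 15.00}
FAST = {"input": 0.50, "output": 1.50}

# Hypothetical typical call for the feature, and hypothetical monthly volume.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000, 500
CALLS_PER_MONTH = 100_000

flagship_call = cost_per_call(INPUT_TOKENS, OUTPUT_TOKENS, FLAGSHIP["input"], FLAGSHIP["output"])
fast_call = cost_per_call(INPUT_TOKENS, OUTPUT_TOKENS, FAST["input"], FAST["output"])

print(f"flagship: ${flagship_call:.5f} per call, ${flagship_call * CALLS_PER_MONTH:,.2f} per month")
print(f"fast:     ${fast_call:.5f} per call, ${fast_call * CALLS_PER_MONTH:,.2f} per month")
```

A fraction of a cent per call looks harmless until it is multiplied by your monthly volume, so run this with your own token counts before you commit to a tier.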

The working pattern is to start with a strong default, usually a flagship or a strong mid-tier model, and get the feature working well first. Then, on the calls where a cheaper, faster model holds the same quality, drop to it and pocket the savings. You confirm that quality holds by testing, not by guessing, which is the whole subject of The Practice. Switching between models from the same provider is usually a one-line change, as long as you kept the model name in one place rather than scattered through your code.
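Keeping the model name in one place can be as simple as a small config module. This is a minimal sketch, not a prescribed layout: the environment variable names, the placeholder model names, and the task names are all hypothetical.

```python
import os

# Single source of truth for model names. The defaults are placeholders, not
# real model IDs; the environment variable names are assumptions as well.
DEFAULT_MODEL = os.environ.get("APP_DEFAULT_MODEL", "provider-flagship-placeholder")
FAST_MODEL = os.environ.get("APP_FAST_MODEL", "provider-fast-placeholder")

# Tasks where your own testing showed the cheaper tier holds quality
# (hypothetical task names).
TASKS_OK_ON_FAST = {"classify_ticket", "extract_invoice_fields"}


def model_for(task: str) -> str:
    """Return the model name to use for a task; everything else reads from here."""
    return FAST_MODEL if task in TASKS_OK_ON_FAST else DEFAULT_MODEL


print(model_for("classify_ticket"))  # the fast tier
print(model_for("draft_reply"))      # the strong default
```

Every call site asks `model_for(...)` instead of hard-coding a name, so moving a task to the cheaper tier means adding one entry to the set, and swapping tiers means changing one value.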

How to compare: read the boards, then trust your own inputs

You do not have to benchmark models yourself. Several independent groups publish current rankings, and a few are worth knowing by name.

Arena ranks models by blind human preference: people compare two anonymous replies and vote, and the votes become a rating. It captures "which one feels better to a person" well.
Artificial Analysis is the most useful single view for a builder, because it plots quality, speed, and price together, so you can see what a little more capability actually costs.
Vellum tracks the harder, still-unbeaten public tests, which separate the current top models better than the older benchmarks that nearly everything now aces.
Scale SEAL and Epoch AI run more rigorous evaluations on private question sets, useful when you want a result a vendor could not have studied for in advance.
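If "the votes become a rating" sounds opaque, the sketch below shows one generic way pairwise votes can turn into ratings: an Elo-style update, the same idea used for chess rankings. It is an assumption for illustration only. Arena's actual statistical method is not described here and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two anonymous models start level; a voter prefers model A's reply.
model_a, model_b = update(1000.0, 1000.0, a_won=True)
print(round(model_a), round(model_b))
```

An upset moves the ratings more than an expected win does, which is why a rating built from many votes settles toward a stable ordering.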

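You will also hear benchmark names: SWE-bench Verified for real coding work, GPQA for hard reasoning, MMMU for understanding images. They are useful shorthand, but a benchmark measures a model in general, and you are shipping something specific. So use the boards to narrow the field to a few candidates, then judge those candidates on a handful of your own real inputs.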

Privacy and hosting: where your data goes is a decision

When you call a closed model over an API, your prompts leave your servers for the provider's. For most builds that is fine, but check two things in the provider's terms before you send anything real: whether the provider retains your prompts, and whether your data is used to train future models. The major providers now offer settings or tiers that turn both off, and that is the configuration you want for anything sensitive.

The alternative is an open-weights model you download and run on infrastructure you control, so the data never leaves. The price is that you become the one who patches and upgrades it. For a first AI feature, a closed API from a major provider, with retention and training turned off, is the sensible default. When you build for a regulated field, where custody is the first question and not the last, Choosing the right model in The Frontier covers that more demanding screening.

Try it now

No setup: Name your feature's task in one sentence, then open Artificial Analysis and Arena and shortlist two or three models that look strong on the axes you care about. Then open a provider console and run the same three real inputs through your top two candidates, reading the replies, the speed, and the token counts side by side. You have now chosen a model on evidence instead of headlines.
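If you would rather script the side-by-side run than click through a console, a small harness like the one below does the same job. It is a sketch under stated assumptions: `fake_call` is a hypothetical stand-in that you would replace with your provider's real client, and the model names and prompts are placeholders.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    model: str
    prompt: str
    reply: str
    seconds: float
    output_tokens: int


def fake_call(model: str, prompt: str) -> tuple[str, int]:
    """Hypothetical stand-in for a real provider call.

    Returns (reply_text, output_token_count). Word count is used here as a
    crude placeholder; a real client would report actual token usage.
    """
    reply = f"[{model}] reply to: {prompt}"
    return reply, len(reply.split())


def compare(
    models: list[str],
    prompts: list[str],
    call: Callable[[str, str], tuple[str, int]] = fake_call,
) -> list[Result]:
    """Run every prompt through every model, timing each call."""
    results = []
    for prompt in prompts:
        for model in models:
            start = time.perf_counter()
            reply, tokens = call(model, prompt)
            results.append(Result(model, prompt, reply, time.perf_counter() - start, tokens))
    return results


# Placeholder model names and inputs; use your own three real inputs.
rows = compare(
    ["candidate-one", "candidate-two"],
    ["real input 1", "real input 2", "real input 3"],
)
for row in rows:
    print(f"{row.model:14} {row.seconds:.4f}s {row.output_tokens:4d} tok  {row.reply[:50]}")
```

The printout puts replies, latency, and token counts next to each other for the same input, which is the evidence you are after. Reading the replies yourself is still the step that matters most.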

With your tools: Ask Claude Code to make the model name a single configuration value in your project, then run your feature once on a flagship and once on the cheaper tier and compare the output, the latency, and the cost. Keep the cheaper one wherever its answers hold up. If your tools are not set up yet, The Setup Clinic gets you there in one sitting. In Codex or Cursor the move is the same: pull the model name into one config value and run the same input through two models to compare.

Chapter Summary

You are not looking for the best model in the world, only one that fits this build, which is a smaller and calmer decision.
Five axes decide most choices: capability, cost per call, speed, context window, and privacy or hosting. Match the ones your feature needs to a model that scores well on them.
Every major provider ships a flagship for hard work and a smaller, faster, cheaper model for everything else, and the cheaper tier handles a lot of real traffic.
Start with a strong default, get it working, then drop to a cheaper, faster model on the calls where quality holds, confirmed by testing rather than by guessing.
Keep the model name in one place so switching models or tiers is a one-line change.
Independent boards help you compare: Arena for human preference, Artificial Analysis for quality against speed and price, Vellum for the hardest current tests, and Scale SEAL and Epoch for rigorous private evaluations.
A leaderboard ranks models in general, so narrow to a few candidates on the boards, then judge them on five of your own real inputs.
Calling a closed API sends your prompts to the provider, so turn off retention and training for anything sensitive, or run an open-weights model yourself when you must hold the data.
Next up, "Prompting is engineering, not wording" shows you how to get far more out of whatever model you picked.

Sources

Arena (arena.ai), Artificial Analysis (artificialanalysis.ai), Vellum LLM Leaderboard, Scale SEAL, and Epoch AI, 2026.
OpenAI, Anthropic, and Google model and pricing documentation, 2026.
