AI · Jan 5, 2026 · 7 min read

The cheapest model that passes your evals

The right model is the cheapest one that still passes your evals, and most teams are paying for headroom they never use.

Most teams pick a model the way they pick a laptop: they buy the biggest one they can justify and assume the headroom will pay off later. So the support classifier runs on the frontier model. The internal tagging job runs on the frontier model. The thing that turns a messy address into structured fields runs on the frontier model. None of these tasks are hard. They're running on an expensive model because nobody ever asked the only question that matters: what is the smallest model that still does this job?

The right model for a task is the cheapest one that passes your evals. Not the smartest one available, not the one with the best benchmark scores, not the one your competitor announced they're using. The cheapest one that clears the bar you actually set. If you don't have that bar written down as a test you can run, you're not choosing a model — you're buying reassurance.

Most of your tasks are not hard

The frontier exists for genuinely hard problems: multi-step reasoning, code that has to be correct, judgment calls with real ambiguity. Those are real, and they're worth paying for. But take an honest inventory of what your product actually sends to a model, and most of it is not that.

Classifying a ticket into one of eight buckets. Extracting four fields from an invoice. Deciding whether a comment is spam. Rewriting a sentence to be more polite. Summarizing a paragraph that was already pretty clear. These are tasks a small model handles at near-identical accuracy for a fraction of the cost and a fraction of the latency. The frontier model gets them right too — it just charges you ten or twenty times as much to do work a cheaper model would have nailed.

The waste hides because each call is cheap in isolation. A tenth of a cent feels like nothing. Then you run it forty million times a month and discover you've been paying frontier prices to decide whether something is spam.
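To make that arithmetic concrete, here is a back-of-envelope sketch using the figures above: a tenth of a cent per call, forty million calls a month. The 10x price ratio is an assumption taken from the low end of the "ten or twenty times" estimate, and the function and variable names are illustrative, not part of any billing API.

```typescript
// Back-of-envelope monthly spend. All figures are illustrative assumptions.
function monthlyCostUsd(costPerCallUsd: number, callsPerMonth: number): number {
  return costPerCallUsd * callsPerMonth;
}

const callsPerMonth = 40_000_000; // "forty million times a month"
const frontierPerCall = 0.001;    // "a tenth of a cent" per call
const priceRatio = 10;            // assumed: low end of "ten or twenty times"
const smallPerCall = frontierPerCall / priceRatio;

const frontierMonthly = monthlyCostUsd(frontierPerCall, callsPerMonth);
const smallMonthly = monthlyCostUsd(smallPerCall, callsPerMonth);

console.log(`frontier: $${frontierMonthly.toFixed(0)} per month`);
console.log(`small:    $${smallMonthly.toFixed(0)} per month`);
console.log(`saved:    $${(frontierMonthly - smallMonthly).toFixed(0)} per month`);
```

Under those assumptions the same workload costs about $40,000 a month on the frontier model and about $4,000 on the small one.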

Build the eval first, then shop

You cannot pick the cheapest passing model if you have no definition of passing. This is the step teams skip, and skipping it is why they default to the expensive model — it feels safe precisely because they never measured.

An eval doesn't need to be elaborate. For most tasks it's a few hundred real examples with known-correct answers and a script that scores a model's output against them. Pull the examples from your actual traffic, not from a benchmark. Your traffic has the typos, the half-sentences, the edge cases that benchmarks sand off.
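A scoring script in this spirit can be very small. The sketch below assumes exact-match scoring, which suits classification-style tasks, and uses a hypothetical `ModelFn` type as a stand-in for a real model call. The `scoreModel` function, the sample cases and the keyword "model" are all illustrative.

```typescript
// One eval case: a real input paired with its known-correct answer.
interface EvalCase {
  input: string;
  expected: string;
}

// Hypothetical stand-in for a model call; a real one would hit an API.
type ModelFn = (input: string) => string;

// Returns the fraction of cases answered correctly (0 to 1).
// Exact match after trimming and lowercasing is an assumption;
// free-text tasks need a different scorer.
function scoreModel(model: ModelFn, cases: EvalCase[]): number {
  if (cases.length === 0) throw new Error("eval set is empty");
  const normalize = (s: string) => s.trim().toLowerCase();
  const passed = cases.filter(
    (c) => normalize(model(c.input)) === normalize(c.expected)
  ).length;
  return passed / cases.length;
}

// Toy cases with the typos and half-sentences real traffic has.
const sampleCases: EvalCase[] = [
  { input: "cant log in!!", expected: "account" },
  { input: "refund pls", expected: "billing" },
  { input: "where is my invoice", expected: "billing" },
  { input: "app crashes on start", expected: "bug" },
];

// A deliberately naive keyword "model" so the sketch runs offline.
const keywordModel: ModelFn = (input) =>
  /refund|invoice/.test(input) ? "billing" : /log in/.test(input) ? "account" : "other";

console.log(scoreModel(keywordModel, sampleCases)); // 0.75
```

The keyword model misses the crash report, so it scores three out of four. With a few hundred real cases in place of the toy ones, that single number is what gets compared against the bar.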

eval.ts


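// Returns the cheapest candidate that clears the bar, or undefined if none does.
// `candidates` must be sorted cheap to expensive; `runEval` returns a score from 0 to 1.
function pickModel<M, C>(candidates: M[], cases: C[], runEval: (m: M, c: C[]) => number, bar = 0.95): M | undefined {
  for (const model of candidates) {
    const score = runEval(model, cases);
    if (score >= bar) return model; // first to clear the bar is the cheapest
  }
  return undefined;
}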

Pick the cheapest model that clears the bar.

Sort your candidates from cheapest to most expensive, run each against the eval, and stop at the first one that clears your bar. That's the model. The exercise usually surprises people: the cheap model passes far more often than the org's instinct assumed, and the few tasks that genuinely need the frontier become obvious because they're the ones where the cheap models visibly fall apart.

Headroom you never measure is not safety margin — it's a recurring bill for confidence you could have bought once.

Headroom is a cost, not a hedge

The argument for the bigger model is always the same: what if the inputs get harder, what if an edge case slips through, what if we need the extra capability later. So the team keeps the expensive model running on easy work as insurance.

But insurance you never price is just waste with a story attached. The edge cases either show up in your eval or they don't. If they do, the cheap model fails the bar and you move up to a bigger one, which is the process working as intended. If they don't, you're paying every single day for a failure mode that isn't in your data. The honest move is to put the scary inputs into the eval set and let the test decide, instead of carrying a permanent premium against a fear you never quantified.
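One way to let the test decide is to tag the worrying inputs as their own slice and require the bar on every slice, not just on the average. The sketch below assumes per-case pass/fail results have already been computed. The slice labels, the `passRate` and `clearsBarOnEverySlice` helpers, and the numbers are all hypothetical.

```typescript
// Outcome of one eval case. The slice labels are illustrative assumptions.
interface CaseResult {
  slice: string; // e.g. "typical" for sampled traffic, "edge" for the scary inputs
  passed: boolean;
}

// Pass rate within one slice. An empty slice scores 0, since it proves nothing.
function passRate(results: CaseResult[], slice: string): number {
  const inSlice = results.filter((r) => r.slice === slice);
  if (inSlice.length === 0) return 0;
  return inSlice.filter((r) => r.passed).length / inSlice.length;
}

// A model clears the bar only if every named slice clears it.
function clearsBarOnEverySlice(results: CaseResult[], slices: string[], bar: number): boolean {
  return slices.every((s) => passRate(results, s) >= bar);
}

// Made-up results: 95 typical cases all pass, only 2 of 5 edge cases do.
const results: CaseResult[] = [];
for (let i = 0; i < 95; i++) results.push({ slice: "typical", passed: true });
for (let i = 0; i < 5; i++) results.push({ slice: "edge", passed: i < 2 });

const overall = results.filter((r) => r.passed).length / results.length;
console.log(overall); // 0.97, which looks fine against a 0.95 bar
console.log(clearsBarOnEverySlice(results, ["typical", "edge"], 0.95)); // false
```

The overall score hides the problem and the per-slice check exposes it. If the edge slice matters, this model fails and the next one up gets its turn.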

There's a quieter cost too. The big model is usually the slow model. When a small model returns in 300 milliseconds and the frontier takes two seconds, you're not just overpaying — you're shipping a worse product. Latency is a feature, and on simple tasks the cheap model often wins on both axes at once.

Make it a routine, not a one-time audit

Models get cheaper and better on a cycle measured in months. The cheapest model that passed your eval last quarter has probably been undercut by something newer that also passes. If you choose once and forget, you lock in last year's prices on this year's volume.

So wire the eval into something you run on a schedule, the same way you'd re-benchmark a hot code path.

→ Keep the eval set in version control next to the code that calls the model.
→ Re-run it against the current model lineup when a new one ships, or quarterly at minimum.
→ Track cost-per-pass, not just accuracy, so a model that ties on quality but halves the price wins automatically.
→ Add every production failure you find back into the eval, so the bar only ever gets stricter.
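Cost-per-pass can be tracked with very little code. The sketch below assumes one plausible definition, cost per call divided by pass rate, meaning dollars spent per correct answer. The `EvalRun` shape, the model names and every number are made up for illustration.

```typescript
// One row per candidate from the latest eval run. All values are made up.
interface EvalRun {
  model: string;          // hypothetical model name
  passRate: number;       // fraction of eval cases passed, 0 to 1
  costPerCallUsd: number; // average price of one call
}

// Assumed definition: dollars spent per correct answer.
function costPerPass(run: EvalRun): number {
  return run.passRate > 0 ? run.costPerCallUsd / run.passRate : Infinity;
}

// Among the models that clear the bar, pick the lowest cost-per-pass.
function cheapestPassing(runs: EvalRun[], bar: number): EvalRun | undefined {
  const passing = runs.filter((r) => r.passRate >= bar);
  if (passing.length === 0) return undefined;
  return passing.reduce((best, r) => (costPerPass(r) < costPerPass(best) ? r : best));
}

const runs: EvalRun[] = [
  { model: "small-a", passRate: 0.93, costPerCallUsd: 0.0001 }, // below the bar
  { model: "mid-b", passRate: 0.96, costPerCallUsd: 0.0004 },
  { model: "mid-c", passRate: 0.96, costPerCallUsd: 0.0002 }, // ties on quality, half the price
  { model: "frontier-d", passRate: 0.99, costPerCallUsd: 0.001 },
];

const winner = cheapestPassing(runs, 0.95);
console.log(winner?.model); // "mid-c"
```

Because the comparison is a number, the newcomer that matches quality at half the price replaces the incumbent without anyone having to argue for it.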

Done this way, model selection stops being a vibe and a vendor relationship and becomes a number you can defend. When someone asks why the tagging job runs on a model that costs a fifth of the frontier, the answer isn't "it felt fine" — it's "here's the eval, here's the bar, here's the cheapest thing that clears it."

The frontier model is a fine default for the handful of tasks that genuinely need it. For everything else, it's headroom you're renting by the call. Write the eval, sort the candidates cheap to expensive, and buy exactly as much model as the job requires — no more.

#AI #Cost

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.
