AI · Jul 21, 2025 · 7 min read

Fine-tuning is the answer to a question you probably don't have

Fine-tuning feels like the serious option, which is exactly why it's usually the wrong first move.

Fine-tuning is the answer to a question you probably don't have. The question it answers is narrow: how do I make a model reliably produce a specific form, tone, or output shape that I cannot get from prompting and examples alone? Most of the time, the problem you actually have is something else wearing that costume — missing context, a vague spec, or a retrieval gap you haven't built yet.

I've watched teams reach for fine-tuning the way people reach for a rewrite: it promises that the messy, accumulated work of getting the basics right can be skipped in favor of one clean, decisive move. It almost never can. The model isn't the bottleneck. The bottleneck is that you haven't yet written down, precisely, what good looks like — and fine-tuning makes that omission permanent and expensive instead of cheap and editable.

What fine-tuning quietly costs

The training run is the part everyone budgets for. The part nobody budgets for is everything that surrounds it. You need a labeled dataset that genuinely represents the behavior you want, which means someone has to define that behavior well enough to produce a few hundred to a few thousand clean examples. That definition work is the actual product. If you can write that spec, you've usually also discovered you didn't need to fine-tune.

Then the dataset rots. Your domain shifts, your tone guidelines change, a new edge case shows up in support tickets, and your fine-tune still answers like it's three months ago. A prompt you can edit in an afternoon. A fine-tune you re-train, re-evaluate, and re-deploy. You've traded a config change for a release.

→ A prompt change is reviewable in a pull request. A weight change is a black box you can only judge by its outputs.
→ Base models improve every few months. A fine-tune pins you to the one you trained on, and migrating forward means redoing the work.
→ You now own an evaluation harness, because "it seems better" is not a deployment criterion for something you can't read.

Fine-tuning doesn't remove the hard work of defining good behavior — it converts that work into a form you can no longer edit by hand.

Try the cheap things first, in order

Before any training run, there's a ladder. Most teams find their answer two or three rungs up and never need the rest.

Start with the prompt. Not a paragraph — a real spec, with the role, the constraints, the format, and the failure modes spelled out. Then add examples directly in the context; few-shot demonstrations move output quality further than people expect, and you can change them in seconds. If the problem is that the model doesn't know your facts, that's retrieval, not fine-tuning — give it the documents at inference time. If the problem is multi-step reliability, that's orchestration: break the task into checkable stages and validate between them.
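
The first three rungs can live in one editable function. What follows is a minimal sketch, not a prescribed implementation: `PromptSpec`, `FewShotExample`, `buildPrompt`, and all the sample content are hypothetical names and placeholder text invented for illustration.

```typescript
// Hypothetical shapes for illustration; not taken from any real library.
interface PromptSpec {
  role: string;
  constraints: string[];
  format: string;
  failureModes: string[];
}

interface FewShotExample {
  input: string;
  output: string;
}

// Assemble the whole context as plain text: spec first, then any retrieved
// documents, then the demonstrations, then the live input.
function buildPrompt(
  spec: PromptSpec,
  examples: FewShotExample[],
  retrievedDocs: string[],
  userInput: string
): string {
  const bullets = (items: string[]) => items.map((item) => `- ${item}`).join("\n");
  const sections: string[] = [
    `Role: ${spec.role}`,
    `Constraints:\n${bullets(spec.constraints)}`,
    `Output format: ${spec.format}`,
    `Failure modes to avoid:\n${bullets(spec.failureModes)}`,
  ];
  // Retrieval rung: facts arrive at inference time instead of living in weights.
  if (retrievedDocs.length > 0) {
    sections.push(`Reference documents:\n${retrievedDocs.join("\n---\n")}`);
  }
  // Few-shot rung: each demonstration is an editable input/output pair.
  for (const example of examples) {
    sections.push(`Input: ${example.input}\nOutput: ${example.output}`);
  }
  sections.push(`Input: ${userInput}\nOutput:`);
  return sections.join("\n\n");
}

// Placeholder content, invented for the example.
const supportSpec: PromptSpec = {
  role: "Support assistant for a billing product",
  constraints: ["Answer in at most three sentences", "Never promise refunds"],
  format: "Plain text, no markdown",
  failureModes: ["Inventing policy", "Apologizing more than once"],
};

const supportPrompt = buildPrompt(
  supportSpec,
  [{ input: "Where is my invoice?", output: "You can download it under Billing > Invoices." }],
  ["Invoices are generated on the first day of each month."],
  "Why was I charged twice?"
);
```

Every rung here is a string or an array, so changing the behavior is a diff someone can review.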

Only when you've climbed all of that and still have a real, measured gap — a form the model won't hold, a latency budget that forbids long prompts, a token cost that doesn't survive contact with scale — does fine-tuning become the honest answer. And by then you'll have the dataset and the eval harness as a byproduct of the climb, which is the only way fine-tuning works anyway.
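
The token-cost argument is worth doing as arithmetic before anyone books a training run. Here is a rough sketch of the break-even calculation; `FineTuneCostInputs` and `breakEvenRequests` are hypothetical names, every number is a placeholder rather than a real price, and it assumes the fine-tuned model costs the same per token as the base model, which may not hold for your provider.

```typescript
// Every number below is a placeholder, not a real price or a measured cost.
interface FineTuneCostInputs {
  promptTokensSavedPerRequest: number; // spec and examples the fine-tune makes unnecessary
  pricePerMillionInputTokens: number; // in whatever currency you pay in
  fixedCost: number; // dataset, training, evaluation, deployment
}

// Number of requests before the per-request prompt saving repays the fixed cost.
// Simplification: assumes the fine-tuned model costs the same per token as the base.
function breakEvenRequests(inputs: FineTuneCostInputs): number {
  // tokens saved per request * price per million tokens = money saved per million requests
  const savedPerMillionRequests =
    inputs.promptTokensSavedPerRequest * inputs.pricePerMillionInputTokens;
  if (savedPerMillionRequests <= 0) return Infinity; // nothing saved, never pays back
  return Math.ceil((inputs.fixedCost * 1e6) / savedPerMillionRequests);
}

const requestsToBreakEven = breakEvenRequests({
  promptTokensSavedPerRequest: 2000,
  pricePerMillionInputTokens: 1,
  fixedCost: 5000,
}); // 2500000
```

With these made-up inputs the fine-tune pays for itself only after 2.5 million requests, and that is before the first re-train.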

The shape of a sane gate looks like this: you have an eval, it scores the prompted system, and the score decides whether you ship, go back to the prompt, or consider training at all.

gate.ts


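// Score the prompted system on a held-out test set before any training run.
const baseline = evaluate(promptedSystem, testSet);
if (baseline.passRate > 0.9) ship(promptedSystem); // clears the bar: no training needed
else if (baseline.passRate < 0.5) fixThePromptFirst(); // far off: the spec is the problem
else considerFineTuning(baseline.failures); // measured gap: training is on the table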

Fine-tune only when the prompted baseline still misses a measured bar.
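
The gate leaves `evaluate` and its inputs undefined. One way to fill them in, as a self-contained sketch: `EvalCase`, `EvalReport`, `GateDecision`, and the toy system are hypothetical, and the system is treated as a synchronous function for brevity, where a real model call would be asynchronous.

```typescript
// Hypothetical shapes for illustration; the gate itself does not define them.
interface EvalCase {
  input: string;
  passes: (output: string) => boolean; // the written-down definition of "good"
}

interface EvalReport {
  passRate: number;
  failures: EvalCase[];
}

type GateDecision = "ship" | "fix-the-prompt-first" | "consider-fine-tuning";

// Run every case through the system and record which ones fail.
function evaluate(system: (input: string) => string, testSet: EvalCase[]): EvalReport {
  if (testSet.length === 0) throw new Error("empty test set: nothing to measure");
  const failures = testSet.filter((testCase) => !testCase.passes(system(testCase.input)));
  return { passRate: (testSet.length - failures.length) / testSet.length, failures };
}

// Same thresholds as the gate: 0.9 to ship, below 0.5 back to the prompt.
function gate(report: EvalReport): GateDecision {
  if (report.passRate > 0.9) return "ship";
  if (report.passRate < 0.5) return "fix-the-prompt-first";
  return "consider-fine-tuning";
}

// Toy stand-in for a prompted model, invented for the example.
const toySystem = (input: string) => input.trim().toUpperCase();
const toyTestSet: EvalCase[] = [
  { input: " ok ", passes: (out) => out === "OK" },
  { input: "yes", passes: (out) => out === "YES" },
  { input: "no", passes: (out) => out === "NO" },
  { input: "maybe", passes: (out) => out === "Maybe" }, // deliberately fails
];

const toyReport = evaluate(toySystem, toyTestSet); // 3 of 4 pass
const decision = gate(toyReport); // "consider-fine-tuning"
```

Returning the decision instead of calling `ship` directly keeps the gate testable, and `failures` doubles as the first draft of a training set if you ever need one.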

The legitimate cases, named honestly

Fine-tuning earns its keep in a handful of places, and they share a trait: the requirement is structural, not informational. You want a model to consistently emit a rigid output format that prompting keeps drifting away from. You're distilling a large, expensive model's behavior into a small, cheap one to cut latency or cost at high volume. You have a tone or domain idiom so specific that examples in context can't carry it reliably, and you have the thousands of examples to teach it. You're operating at a scale where shaving prompt tokens pays for the whole training pipeline several times over.
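
"Prompting keeps drifting away from the format" is a claim you can measure before acting on it. A small sketch, assuming the rigid format is a JSON object with a fixed set of keys; `REQUIRED_KEYS`, `holdsFormat`, `driftRate`, and the sample outputs are all hypothetical.

```typescript
// Hypothetical contract: the output must be a JSON object with exactly these keys.
const REQUIRED_KEYS = ["intent", "priority", "reply"];

// True only when the raw model output parses and matches the contract.
function holdsFormat(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) return false;
    const keys = Object.keys(parsed);
    return (
      keys.length === REQUIRED_KEYS.length &&
      REQUIRED_KEYS.every((key) => keys.indexOf(key) !== -1)
    );
  } catch {
    return false; // prose around the JSON, truncation, or any other parse failure
  }
}

// Fraction of outputs that broke the format.
function driftRate(outputs: string[]): number {
  if (outputs.length === 0) throw new Error("no outputs to score");
  return outputs.filter((output) => !holdsFormat(output)).length / outputs.length;
}

// Invented samples: two clean, one with a chatty preamble, one missing a key.
const sampleOutputs = [
  '{"intent":"refund","priority":"high","reply":"On it."}',
  'Sure! Here is the JSON: {"intent":"refund","priority":"high","reply":"On it."}',
  '{"intent":"refund","reply":"On it."}',
  '{"intent":"question","priority":"low","reply":"See the docs."}',
];

const sampleDrift = driftRate(sampleOutputs); // 0.5
```

If that number stays high after the spec and the examples have been tightened, the requirement really is structural, and you now have the check that any fine-tune has to beat.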

Notice what's absent from that list: "the model doesn't know about our product," "it gets facts wrong," "it won't follow our policy." Those are context and retrieval problems. Fine-tuning a model to memorize facts is the most expensive and least reliable way to store information ever devised — the facts change, the weights don't, and you can't see what it actually learned versus what it confidently invented.

The tell is always the same. If you can't write the evaluation that proves the fine-tune is better, you are not ready to fine-tune. And if you can write that evaluation, run it against a well-prompted base model first. Most of the time the gap you were going to spend a month closing was a gap in the spec, not the weights.
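
Running the same evaluation against both systems is only useful if you compare them case by case, because an average hides regressions. A sketch of that comparison, assuming you already have per-case pass/fail results for both systems on the same test set in the same order; `PairedResults`, `Comparison`, and `compare` are hypothetical names, and the sample results are invented.

```typescript
// Per-case pass/fail for two systems on the SAME test set, in the same order.
interface PairedResults {
  baseline: boolean[]; // well-prompted base model
  fineTuned: boolean[]; // candidate fine-tune
}

interface Comparison {
  baselinePassRate: number;
  fineTunedPassRate: number;
  fixed: number[]; // case indices the fine-tune passes and the baseline fails
  regressed: number[]; // case indices the baseline passes and the fine-tune fails
}

function compare(results: PairedResults): Comparison {
  const total = results.baseline.length;
  if (total === 0 || total !== results.fineTuned.length) {
    throw new Error("both systems must be scored on the same non-empty test set");
  }
  const rate = (passed: boolean[]) => passed.filter(Boolean).length / passed.length;
  const fixed: number[] = [];
  const regressed: number[] = [];
  results.baseline.forEach((basePassed, index) => {
    const tunedPassed = results.fineTuned[index];
    if (!basePassed && tunedPassed) fixed.push(index);
    if (basePassed && !tunedPassed) regressed.push(index);
  });
  return {
    baselinePassRate: rate(results.baseline),
    fineTunedPassRate: rate(results.fineTuned),
    fixed,
    regressed,
  };
}

// Invented results for four cases.
const comparison = compare({
  baseline: [true, true, false, false],
  fineTuned: [true, false, true, true],
});
// baselinePassRate 0.5, fineTunedPassRate 0.75, fixed [2, 3], regressed [1]
```

The `regressed` list is the part to read first: a fine-tune that wins on average while breaking a case the prompt already handled is a trade, not an improvement.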

Fine-tuning is a real tool with a narrow, legitimate job. Reach for it last, on purpose, after the cheaper rungs have failed a test you actually wrote. The teams that get the most out of AI aren't the ones training the most models — they're the ones who keep their behavior in editable text for as long as they possibly can, and only bake it into weights when there's nothing left to learn by editing.

#AI #Fine-tuning

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.
