AI · Apr 20, 2026 · 8 min read

The eval is the spec

Most AI features fail because no one wrote down what good looks like; the eval set is where you finally do.

Most AI features die in a meeting where everyone nods. Someone demos a prompt, the output looks plausible, the room agrees it's "pretty good," and the feature ships on that vibe. Three weeks later support tickets arrive, and now there's an argument about whether the model regressed or whether it was always this bad. Nobody can settle it, because nobody ever wrote down what good looked like. There's no artifact to point at. There's just the memory of a demo that felt fine.

The eval set is that artifact. A few hundred labeled examples, each with an input, the expected behavior, and the line between pass and fail, make up the most honest product spec you will ever write, because writing them forces you to say what you actually want before you find out whether you got it. Most of the time, the painful part isn't running the eval. It's discovering, while writing it, that the team disagreed about the feature the whole time.

Writing the eval is writing the spec

A normal spec gets to stay vague. "Summarize the document accurately and concisely" survives review because every word is a place to hide. An eval doesn't allow that. To label one example you have to answer the questions the prose let you skip: Is a four-sentence summary too long? If the document contradicts itself, should the summary surface the conflict or pick a side? If a number is missing, is "not specified" a pass or a failure?

You can't write the example until you've answered. So you answer. And the moment you do, you've made a product decision that used to live, unspoken and inconsistent, in whoever happened to review the output that day.
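To make that concrete, here is a minimal sketch of one labeled case written as data. The shape is hypothetical (`SummaryCase`, `maxSentences`, `onContradiction`, and `notSpecifiedIs` are illustrative names, not from any framework); the point is only that each question the prose skipped becomes a field someone has to fill in.

```typescript
// Hypothetical shape for one labeled summarization case; all names are illustrative.
interface SummaryCase {
  id: string;
  input: string; // the document to summarize
  maxSentences: number; // decides "is four sentences too long?"
  onContradiction: "surface-conflict" | "pick-a-side";
  notSpecifiedIs: "pass" | "fail"; // verdict when a number is missing
  rationale: string; // why the label is what it is
}

// Naive sentence counter: splits on terminal punctuation, so decimals like "3.5"
// would be miscounted. Good enough for a sketch.
function countSentences(text: string): number {
  return text.split(/[.!?]+/).filter((s) => s.trim().length > 0).length;
}

// Checks only the length decision; the other two fields need their own graders.
function withinLength(c: SummaryCase, output: string): boolean {
  return countSentences(output) <= c.maxSentences;
}

const leaseCase: SummaryCase = {
  id: "lease-001",
  input: "Rent is due on the 1st. Section 4 says rent is due on the 5th.",
  maxSentences: 3,
  onContradiction: "surface-conflict",
  notSpecifiedIs: "pass",
  rationale: "Readers act on the due date, so hiding the conflict is worse than a longer summary.",
};
```

You can't construct `leaseCase` without picking a value for every field, which is exactly the conversation the vague spec let the team avoid.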

This is why the first fifty examples take so long and feel so unpleasant. It isn't the labeling. It's that you keep hitting cases where two reasonable people on the team want opposite things, and the eval refuses to let you ship the ambiguity. The disagreement was always there. The model just dragged it into the light.

The model didn't make the spec ambiguous. It exposed that the spec was ambiguous all along.

Cases beat criteria

The instinct is to write the spec as rules: be accurate, be concise, cite sources, refuse off-topic requests. Rules read well and decide nothing. "Be concise" doesn't tell you whether the model failed on a given output — two people will read the same paragraph and disagree. A labeled case does tell you. It's a fixed point. Either the output matches the expected behavior or it doesn't, and when you disagree with the label, you argue about that one concrete example until you've sharpened the rule behind it.

So the unit of a real spec isn't the criterion. It's the case. Build the set deliberately:

→ Mine production for the inputs that actually occur, not the clean ones you imagine.
→ Over-weight the edges: empty inputs, hostile inputs, the adversarial customer, the malformed document.
→ Keep the boring middle too, or you'll optimize the rare case and regress the common one.
→ Write down why each tricky example is labeled the way it is. The reason is the spec; the label is just its shadow.

A few hundred cases chosen this way pin down behavior far more precisely than any document of rules, because each one closes a door the prose left open.
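If you want to check that the set follows the list above, a small audit is enough. This is a rough sketch under assumed names (`LabeledCase`, `auditSet`, and the `kind`, `fromProduction`, and `tricky` fields are all hypothetical); it counts common versus edge cases, reports how much of the set came from production, and lists tricky cases that have no written reason.

```typescript
type CaseKind = "common" | "edge";

// Hypothetical metadata attached to each case in the set.
interface LabeledCase {
  id: string;
  kind: CaseKind;
  fromProduction: boolean;
  tricky: boolean;
  rationale?: string;
}

interface SetReport {
  total: number;
  byKind: Record<CaseKind, number>;
  productionShare: number; // fraction of cases mined from real traffic
  missingRationale: string[]; // ids of tricky cases with no written reason
}

function auditSet(cases: LabeledCase[]): SetReport {
  const byKind: Record<CaseKind, number> = { common: 0, edge: 0 };
  const missingRationale: string[] = [];
  let fromProduction = 0;

  for (const c of cases) {
    byKind[c.kind] += 1;
    if (c.fromProduction) fromProduction += 1;
    // A tricky case with an empty or absent rationale is a label without a spec.
    if (c.tricky && !c.rationale?.trim()) missingRationale.push(c.id);
  }

  return {
    total: cases.length,
    byKind,
    productionShare: cases.length === 0 ? 0 : fromProduction / cases.length,
    missingRationale,
  };
}
```

What ratio of common to edge cases is right depends on the feature; the audit only makes the current ratio visible so the choice is deliberate.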

Make the failures legible

A pass rate is a number, and a number is the least interesting thing the eval produces. The value is in reading the failures. When you sit with the cases the model got wrong, you often find that a large share of them aren't model failures at all: they're examples you mislabeled, or examples where the team still doesn't agree on the right answer. That's the eval doing its real job, which is not grading the model but auditing your own definition of correct.
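One way to keep that reading honest is to record a cause for every failure you review. The sketch below assumes a hypothetical three-way split (`model`, `mislabeled`, `disputed`) and illustrative names throughout; it tallies the causes and pulls out the cases that need a spec fix rather than a model fix.

```typescript
// Hypothetical causes assigned by a human while reading failed cases.
type FailureCause = "model" | "mislabeled" | "disputed";

interface ReviewedFailure {
  caseId: string;
  cause: FailureCause;
  note: string; // what the reviewer saw
}

// Count how many failures fall under each cause.
function tallyCauses(failures: ReviewedFailure[]): Record<FailureCause, number> {
  const tally: Record<FailureCause, number> = { model: 0, mislabeled: 0, disputed: 0 };
  for (const f of failures) tally[f.cause] += 1;
  return tally;
}

// Failures that point at the spec, not the model: fix the label or settle the disagreement.
function needsSpecWork(failures: ReviewedFailure[]): string[] {
  return failures.filter((f) => f.cause !== "model").map((f) => f.caseId);
}
```

If `needsSpecWork` returns a long list, the pass rate is measuring your labels as much as the model.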

Wire the threshold into the path that actually ships, so the spec has teeth instead of living in a notebook someone runs once a quarter.

gate.ts


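// runEvals and suite are assumed to be defined elsewhere in the codebase.
const score = runEvals(suite);
// Block the deploy when the pass rate falls under the agreed threshold.
if (score.pass < 0.92) {
  throw new Error("Below spec: blocking deploy.");
}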

The spec runs in CI, not in a meeting.
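The snippet above leaves `runEvals` undefined, so here is one possible self-contained version. Everything in it is an assumption made for illustration: the `GateCase` and `Score` shapes, the synchronous `model` function passed as a second argument, and the choice to treat an empty suite as a failing score.

```typescript
// Hypothetical case shape: each case carries its own pass/fail check.
interface GateCase {
  id: string;
  input: string;
  check: (output: string) => boolean;
}

interface Score {
  pass: number; // fraction of cases that passed, between 0 and 1
  failedIds: string[]; // so the failure message points at readable cases
}

// Run every case through the model and grade it with the case's own check.
function runEvals(suite: GateCase[], model: (input: string) => string): Score {
  const failedIds = suite.filter((c) => !c.check(model(c.input))).map((c) => c.id);
  // An empty suite scores 0 on purpose: no spec should never mean "passed".
  const pass = suite.length === 0 ? 0 : (suite.length - failedIds.length) / suite.length;
  return { pass, failedIds };
}

// Throwing makes the CI step exit non-zero, which is what blocks the deploy.
function gate(score: Score, threshold = 0.92): void {
  if (score.pass < threshold) {
    throw new Error(
      `Below spec (${score.pass.toFixed(2)} < ${threshold}). Failed: ${score.failedIds.join(", ")}`
    );
  }
}
```

Returning the failed ids alongside the rate is the design choice that matters here: a blocked deploy should hand you cases to read, not just a number to argue about.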

And separate the two ways a case can fail, because they send you to different places. A low-confidence answer is a candidate for review; a confident wrong answer is the dangerous one and belongs in its own bucket.

[Figure: an answer card. Generated answer: "The deadline is April 15." Confidence: 64%. Sources: irs.gov, state bulletin.]

Fig. 1. Confident enough to show, uncertain enough to flag. The eval decides which.
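A minimal sketch of that split, assuming each graded result carries a correctness flag and a confidence between 0 and 1. The names (`Graded`, `bucketFor`) and the 0.7 review cutoff are hypothetical; pick the cutoff from your own data.

```typescript
type Bucket = "pass" | "review" | "confident-wrong";

// Hypothetical graded result for one case.
interface Graded {
  id: string;
  correct: boolean;
  confidence: number; // 0 to 1
}

// Low confidence goes to review whether or not the answer happened to be right.
// High confidence plus a wrong answer is the dangerous bucket.
function bucketFor(r: Graded, reviewBelow = 0.7): Bucket {
  if (r.confidence < reviewBelow) return "review";
  return r.correct ? "pass" : "confident-wrong";
}

// Group case ids by bucket so each list can go to a different owner.
function groupByBucket(results: Graded[], reviewBelow = 0.7): Record<Bucket, string[]> {
  const groups: Record<Bucket, string[]> = { pass: [], review: [], "confident-wrong": [] };
  for (const r of results) groups[bucketFor(r, reviewBelow)].push(r.id);
  return groups;
}
```

Under this cutoff, an answer like the one in Fig. 1 at 64% confidence lands in `review` even if it is correct.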

The spec is alive

The document spec is written once and rots quietly. The eval can't, because production keeps feeding it new cases. A weird input shows up, a user does something nobody imagined, the model fumbles it — you add it to the set with a label, and the spec just got sharper. Over a few months the eval stops describing the feature you planned and starts describing the feature you actually have, which is the only version worth shipping against.
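Promoting a production input into the set can be a few lines. This sketch uses hypothetical names (`SpecCase`, `addFromProduction`) and a deliberately simple id scheme; the two rules it enforces are that a new case needs a written reason and that an input already in the set is not added twice.

```typescript
// Hypothetical stored case: the label plus the reason behind it.
interface SpecCase {
  id: string;
  input: string;
  expected: string;
  reason: string;
}

// Returns a new array; the existing set is never mutated.
function addFromProduction(set: SpecCase[], incoming: Omit<SpecCase, "id">): SpecCase[] {
  if (!incoming.reason.trim()) {
    throw new Error("A new case needs a written reason for its label.");
  }
  const normalized = incoming.input.trim();
  // Skip inputs the set already covers (exact match after trimming).
  if (set.some((c) => c.input.trim() === normalized)) return set;
  // Simplistic id for the sketch; it would collide if cases were ever deleted.
  const id = `case-${set.length + 1}`;
  return [...set, { id, ...incoming }];
}
```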

This is also what makes model upgrades boring, in the good way. A new model drops, you run the suite, and you get an answer in minutes instead of a week of nervous spot-checking. Regressions show up as specific failed cases you can read, not as a vague sense that something feels off. You're comparing against a written definition of good, so the comparison is mechanical.
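The mechanical comparison can be as small as a diff over per-case results. This is a sketch under assumptions: results are stored as a map from case id to pass or fail, and `compareRuns` and `RunResults` are illustrative names.

```typescript
// Hypothetical result format: case id mapped to whether that case passed.
type RunResults = Record<string, boolean>;

interface RunDiff {
  regressions: string[]; // passed on the old model, fail on the new one
  fixes: string[]; // failed on the old model, pass on the new one
}

function compareRuns(baseline: RunResults, candidate: RunResults): RunDiff {
  const regressions: string[] = [];
  const fixes: string[] = [];
  for (const id of Object.keys(baseline)) {
    // Cases the candidate run did not cover are skipped, not counted as failures.
    if (!(id in candidate)) continue;
    if (baseline[id] && !candidate[id]) regressions.push(id);
    if (!baseline[id] && candidate[id]) fixes.push(id);
  }
  return { regressions, fixes };
}
```

The output is a list of case ids, so "did the new model regress?" turns into "open these specific cases and read them."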

None of this requires a platform or a framework. It requires the willingness to write down what good looks like, one concrete example at a time, before you let the model — or your own optimism — tell you that you've already achieved it. Write the eval first and you'll usually find you didn't have a spec. You had a hope.

#AI #Evals

→ / AUTHOR

Ionut Dumitru

Full-stack engineer and product designer. Writes about building products where the engineering and the design are the same job.
