An AI Feature Has No “Tests Pass” Moment. So I Write the Eval First.

I was building an “Ask This Book” feature: readers can ask questions about a book while they’re reading it.

One requirement sounded simple:

A reader on chapter 3 must never receive spoilers from chapter 30.

My first instinct was the same as everyone else’s: tell the model not to spoil future chapters. Something like:

“Please don’t reveal information from chapters the reader hasn’t reached yet.”

And honestly, it mostly worked.

The problem is that “mostly” is useless. A user only needs one spoiler.

That was the moment I realized the feature had no definition of done.

With normal software, something pushes back. The compiler complains. The tests fail. The types don’t line up.

With an LLM feature, none of that happens. The output looks plausible by default — fluent, confident, well formatted — even when it’s wrong.

So “it looked right in the demo” quietly becomes the finish line.

That’s exactly why I write the eval before I write the feature.

The Eval Is the Specification

Most teams treat evals as QA. Build the feature, ship something that works, add evals later.

I increasingly think that’s backwards. For AI systems, the eval is often the only concrete definition of success.

The moment I wrote the spoiler eval, I had to define failure: spoiler leakage must be zero. Not low. Not acceptable. Zero.

And that requirement immediately exposed a problem. No prompt can guarantee zero.

A prompt is a soft instruction to a probabilistic model. Users can phrase questions differently. Models can interpret instructions differently. Future model updates can behave differently. You cannot get a hard guarantee from a soft instruction.

The Eval Changed the Architecture

Once the eval demanded zero spoilers, the solution stopped being a prompt problem. It became a retrieval problem.

Instead of telling the model not to reveal future chapters, I prevented future chapters from entering the context at all:

WHERE chapter_ord <= @maxChapterOrd

Anything beyond the reader’s progress never enters the retrieval set. The model can’t leak information it never saw.

And the eval that checks it is just as blunt — a retrieved chunk past the reader’s progress is a leak:

// One retrieved chunk past the reader's progress = one spoiler leak.
public static int LeakCount(IEnumerable<RetrievedChunk> retrieved, int gateChapterOrd) =>
    retrieved.Count(c => c.ChapterOrd > gateChapterOrd);

Across the adversarial test cases, that number has to be zero. That’s the moment the idea really clicked for me: the eval didn’t test the design. It produced the design.
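To make "across the adversarial test cases" concrete, here is a minimal sketch of a harness that sums the leak count over a set of cases. Only `LeakCount` and `ChapterOrd` come from the post. The `SpoilerCase` record, the `Text` property, the `retrieve` delegate, and the in-memory `InMemoryBook` are hypothetical stand-ins for the real retrieval pipeline, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes: the post only shows ChapterOrd and LeakCount.
public sealed record RetrievedChunk(int ChapterOrd, string Text);
public sealed record SpoilerCase(string Question, int GateChapterOrd);

public static class SpoilerEval
{
    // Same check as in the post: a chunk past the reader's progress is a leak.
    public static int LeakCount(IEnumerable<RetrievedChunk> retrieved, int gateChapterOrd) =>
        retrieved.Count(c => c.ChapterOrd > gateChapterOrd);

    // Runs every case through the retriever and sums the leaks.
    // The delegate (question, maxChapterOrd) is an assumed interface.
    public static int TotalLeaks(
        IEnumerable<SpoilerCase> cases,
        Func<string, int, IReadOnlyList<RetrievedChunk>> retrieve) =>
        cases.Sum(c => LeakCount(retrieve(c.Question, c.GateChapterOrd), c.GateChapterOrd));
}

// Toy corpus used only to exercise the harness.
public static class InMemoryBook
{
    private static readonly RetrievedChunk[] Chunks =
    {
        new(1, "opening"), new(3, "first twist"), new(30, "ending")
    };

    // In-memory equivalent of: WHERE chapter_ord <= @maxChapterOrd
    public static IReadOnlyList<RetrievedChunk> GatedRetrieve(string question, int maxChapterOrd) =>
        Chunks.Where(c => c.ChapterOrd <= maxChapterOrd).ToList();

    // Deliberately broken retriever: ignores the gate, so the eval must fail.
    public static IReadOnlyList<RetrievedChunk> UngatedRetrieve(string question, int maxChapterOrd) =>
        Chunks;
}
```

Calling `SpoilerEval.TotalLeaks(cases, InMemoryBook.GatedRetrieve)` should return zero for any set of cases, while the ungated retriever leaks the chapter 30 chunk once for every case whose gate is below 30. Running the eval against a retriever known to be broken is a cheap way to confirm the eval can fail at all.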

A measurable failure condition forced a better architecture than I would have built if I had started with prompt engineering.

The Same Thing Happened to Retrieval Quality

The spoiler requirement wasn’t the only eval. I also defined two other targets before building the feature:

Retrieval must surface the correct passage near the top of the results.
Answers must remain grounded in the passages they cite.

Because those requirements were measurable, every change received a verdict instead of an opinion.
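One way to turn "near the top of the results" into a number is hit rate at k and mean reciprocal rank. The post does not say which metric it uses, so treat this as an illustrative sketch: `RetrievalCase`, `RetrievalMetrics`, and the chunk ids are assumed names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical test case: the ranked chunk ids a query returned,
// plus the id of the passage that should have been found.
public sealed record RetrievalCase(IReadOnlyList<string> RankedIds, string ExpectedId);

public static class RetrievalMetrics
{
    // 1-based rank of the expected chunk, or 0 if it was not retrieved.
    public static int RankOf(RetrievalCase c)
    {
        for (int i = 0; i < c.RankedIds.Count; i++)
            if (c.RankedIds[i] == c.ExpectedId) return i + 1;
        return 0;
    }

    // Fraction of cases whose expected chunk appears in the top k results.
    public static double HitRateAtK(IEnumerable<RetrievalCase> cases, int k) =>
        cases.Select(c => { int r = RankOf(c); return r >= 1 && r <= k ? 1.0 : 0.0; })
             .DefaultIfEmpty(0.0)
             .Average();

    // Mean of 1/rank across cases; a miss contributes 0.
    public static double MeanReciprocalRank(IEnumerable<RetrievalCase> cases) =>
        cases.Select(c => { int r = RankOf(c); return r == 0 ? 0.0 : 1.0 / r; })
             .DefaultIfEmpty(0.0)
             .Average();
}
```

With a fixed case set, a retrieval change either raises these numbers or it does not, which is what gives each change a verdict rather than an opinion.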

A single semantic search wasn’t clearing the bar. So I ended up combining two retrieval approaches:

vector search for semantic similarity
full-text search for exact names, phrases, and quotations

The results are fused using Reciprocal Rank Fusion — less mysterious than it sounds. Each chunk scores Σ 1/(k+rank) across the lists it appears in, so anything ranked highly by both retrievers floats to the top:

// ranked highly by both vector AND lexical → floats to the top.
scores[item] += 1.0 / (k + i + 1); // i is 0-based; RRF rank is 1-based

I didn’t choose hybrid retrieval because it’s fashionable. I chose it because it moved the number. The eval said the system wasn’t good enough. The architecture changed until it was.
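The one-line accumulation above can be expanded into a complete fusion function. This is a minimal sketch, not the production code: `RankFusion.Fuse` is a hypothetical name, chunk ids are assumed to be strings that appear at most once per list, and `k = 60` is a commonly used default rather than a value the post states.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RankFusion
{
    // Reciprocal Rank Fusion: each id scores the sum of 1/(k + rank)
    // over every ranked list it appears in. Assumes ids are unique per list.
    public static IReadOnlyList<(string Id, double Score)> Fuse(
        IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
        {
            for (int i = 0; i < list.Count; i++)
            {
                // current stays 0.0 when the id has not been seen yet.
                scores.TryGetValue(list[i], out var current);
                // i is 0-based; RRF rank is 1-based.
                scores[list[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first; ties broken by id so output is deterministic.
        return scores
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => (Id: kv.Key, Score: kv.Value))
            .ToList();
    }
}
```

For example, fusing a vector ranking `a, b, c` with a lexical ranking `b, d, a` puts `b` first (1/62 + 1/61), then `a` (1/61 + 1/63), then `d` and `c`, each of which appears in only one list.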

A Note on the Stack

None of this is a no-dependencies flex. The judge that scores grounding is a custom evaluator on Microsoft.Extensions.AI.Evaluation:

public sealed class RubricEvaluator(string id, Rubric rubric) : IEvaluator

I lean on the Microsoft stack on purpose. What I keep hand-rolled is the part that decides quality — the retrieval, the fusion, the spoiler gate. The line I draw isn’t “no libraries.” It’s no agent framework hiding the parts that determine whether the thing actually works.

Eval-First Development

Traditional software development gives us confidence almost for free. Compilers. Type systems. Unit tests. Integration tests.

AI systems don’t. The difficult part isn’t implementing the feature. The difficult part is defining what “correct” means.

That’s why I increasingly think of eval-first development as the AI equivalent of TDD. With traditional software, tests verify the implementation. With AI systems, evals often define the implementation.

Build the feature first and the eval later, and the eval can only grade what you’ve already built. Build the eval first and it starts shaping the system itself.

It defines done. It tells you when you’ve regressed. And sometimes it forces a better architecture than the one you originally had in mind.

Otherwise you’re not shipping a feature. You’re shipping a guess that happened to demo well.

Vasyl’s Dev Notes
