Golden Datasets That Don’t Lie

Part 3 of a series on building production AI on .NET. Part 1 was the overview; Part 2 was error analysis. Now we turn the failure taxonomy you built into something you can measure against — without quietly fooling yourself.

A golden dataset is a set of representative inputs, each paired with a reference answer a knowledgeable human would accept. It’s the ruler you hold every model output against. And it is, in my experience, the single most important and most neglected asset in an eval pipeline — because a sloppy ruler doesn’t announce itself. Your scores still come out green. They’re just measuring the wrong thing.

This post is about building a golden set that tells the truth.

What it looks like in practice

In TextStack, each AI feature has ~30 hand-curated cases stored as plain JSON, loaded at runtime into a typed record that mirrors the inputs the production endpoint receives, plus the reference answer:

			
public record ExplainGolden(
    string Word,
    string Sentence,
    string? Genre,
    string TargetLang,
    string ExpectedExplanation);

		

Plain JSON on disk, deserialised case-insensitively. No database, no platform lock-in — the dataset is a checked-in artifact you can diff in code review:

var goldens = GoldenData.Load<ExplainGolden>("explain.json");
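The loader itself isn’t shown here, so what follows is only a minimal sketch of what a case-insensitive JSON loader could look like with System.Text.Json. `GoldenFile`, the temp-file path, and the sample case are my assumptions for illustration, not TextStack’s actual code, and the raw string literal needs C# 11 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Demo: write a one-case golden file (made-up sample data), then load it back.
var path = Path.Combine(Path.GetTempPath(), "explain.sample.json");
File.WriteAllText(path, """
[
  {
    "word": "idempotent",
    "sentence": "PUT requests should be idempotent.",
    "genre": "technical",
    "targetLang": "uk",
    "expectedExplanation": "Here it means repeating the request leaves the server in the same state."
  }
]
""");

IReadOnlyList<ExplainGolden> goldens = GoldenFile.Load<ExplainGolden>(path);
Console.WriteLine($"{goldens.Count} case(s); first word: {goldens[0].Word}");

public record ExplainGolden(
    string Word,
    string Sentence,
    string? Genre,
    string TargetLang,
    string ExpectedExplanation);

// Hypothetical stand-in for GoldenData.Load; the real implementation is not shown in the post.
public static class GoldenFile
{
    private static readonly JsonSerializerOptions Options = new()
    {
        // Lets camelCase JSON keys ("word") bind to PascalCase record members (Word).
        PropertyNameCaseInsensitive = true,
    };

    public static IReadOnlyList<T> Load<T>(string path)
    {
        var json = File.ReadAllText(path);
        var cases = JsonSerializer.Deserialize<List<T>>(json, Options)
            ?? throw new InvalidDataException($"{path} contained JSON null.");

        // An empty golden file would make every eval pass vacuously, so fail loudly.
        if (cases.Count == 0)
            throw new InvalidDataException($"{path} contained no cases.");
        return cases;
    }
}
```

Failing on an empty file is a deliberate choice in this sketch: a loader that silently returns zero cases is one more way to get a green score that measures nothing.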

The format is the easy part. The honesty is in four properties of the content.

1. Representativeness — mirror reality, not the demo

Your set should reflect the real distribution of inputs your feature meets in production, including the hard, weird, and adversarial cases. This is where Part 2 pays off: the failure taxonomy tells you which kinds of input break things, so you deliberately stock the set with them.

The opposite — a set of only easy, happy-path cases — is the most common way an eval lies. The model aces them, your average climbs, and meanwhile the inputs that actually matter never get measured. Stratify on purpose: domains, lengths, languages, edge cases. For TextStack’s Explain set that means technical passages and casual prose, common words and rare ones, several target languages — not thirty variations of the same easy lookup.
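Stratifying on purpose is easier to keep honest if something counts the strata for you. Here is a rough sketch of a coverage check that flags under-represented strata; `Coverage.FindGaps`, the expected genre labels, and the minimum of two cases are all assumptions made for the example, and the sample cases are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative cases only; reference text is omitted because it is not used here.
var cases = new List<ExplainGolden>
{
    new("idempotent", "PUT requests should be idempotent.", "technical", "uk", "(omitted)"),
    new("mutex", "Guard the counter with a mutex.", "technical", "es", "(omitted)"),
    new("binge", "We binged the whole season.", "casual", "uk", "(omitted)"),
};

var gaps = Coverage.FindGaps(
    cases,
    c => c.Genre ?? "(none)",
    expectedStrata: new[] { "technical", "casual" },
    minPerStratum: 2);

foreach (var (stratum, count) in gaps)
    Console.WriteLine($"{stratum}: {count} case(s), want at least 2");

public record ExplainGolden(
    string Word,
    string Sentence,
    string? Genre,
    string TargetLang,
    string ExpectedExplanation);

// Hypothetical helper: reports every expected stratum that has too few cases.
public static class Coverage
{
    public static IReadOnlyList<(string Stratum, int Count)> FindGaps<T>(
        IEnumerable<T> cases,
        Func<T, string> stratumOf,
        IEnumerable<string> expectedStrata,
        int minPerStratum)
    {
        // Count cases per stratum; strata with no cases at all are simply absent here.
        var counts = cases.GroupBy(stratumOf).ToDictionary(g => g.Key, g => g.Count());

        // Walk the EXPECTED strata so that a completely missing stratum shows up with count 0.
        return expectedStrata
            .Select(s => (Stratum: s, Count: counts.GetValueOrDefault(s)))
            .Where(x => x.Count < minPerStratum)
            .ToList();
    }
}
```

Iterating over the expected strata, rather than over whatever happens to be in the file, is the point: a stratum with zero cases is exactly the one a plain group-by would never mention.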

2. Reference quality — the ceiling you measure against

The reference answer defines what “good” means for that case, so a lazy reference caps the meaning of your whole score. If the reference for explaining idempotent is a paraphrased dictionary entry, your judge will happily reward dictionary entries — the exact failure mode you were trying to eliminate.

References should be written or vetted by someone who understands the domain. For Explain, that means genuinely good in-context explanations: what the word means here, in this sentence, the way you’d want it explained to you. The reference is the bar; set it where you actually want the product.

3. Leakage — keep a real train/test split

Here’s the subtle statistical sin. If you tune your prompt against the same cases you score against, you’re overfitting to the test, and your number is fiction — you’ve optimised for those thirty examples, not for the feature. It’s the prompt-engineering version of training on your test set.

Keep a slice you never look at while iterating. Tune on one part; report on the held-out part. This feels heavy for thirty cases, but the discipline is what keeps the score meaningful as you iterate. The split is just as real for prompts as it is for model weights.
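One way to make the split hard to cheat on is to assign it by hash instead of by hand. The sketch below is an assumption on my part, not how TextStack does it: `HoldoutSplit`, the case-key format, and the 30% fraction are all hypothetical. It uses SHA-256 instead of `string.GetHashCode`, because the latter is randomised per process on modern .NET and would reshuffle the split on every run.

```csharp
using System;
using System.Buffers.Binary;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Illustrative keys; any stable, unique identifier per case works.
string[] caseKeys =
{
    "idempotent|PUT requests should be idempotent.",
    "mutex|Guard the counter with a mutex.",
    "binge|We binged the whole season.",
};

var heldOut = caseKeys.Where(k => HoldoutSplit.IsHeldOut(k)).ToList();
var tuning = caseKeys.Except(heldOut).ToList();
Console.WriteLine($"tune on {tuning.Count}, report on {heldOut.Count}");

// Hypothetical helper: a deterministic, content-based tune/held-out assignment.
public static class HoldoutSplit
{
    public static bool IsHeldOut(string caseKey, double heldOutFraction = 0.3)
    {
        // Hash the key so the assignment depends only on the case itself,
        // not on file order or on which other cases exist.
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(caseKey));

        // Read the first 4 bytes with a fixed byte order so every machine agrees.
        uint bucket = BinaryPrimitives.ReadUInt32BigEndian(hash);

        // Map the bucket to [0, 1) and compare against the requested fraction.
        return bucket / 4294967296.0 < heldOutFraction; // 4294967296 = 2^32
    }
}
```

Because the assignment depends only on the key, adding a new case never moves an existing one across the line. The trade-off is that with thirty cases the held-out share is only approximately the fraction you asked for.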

4. Size and freshness — a floor, and a living asset

Thirty cases is a deliberate floor, not a target: enough to catch gross regressions cheaply, small enough to run often and to keep every reference high quality. (It’s statistically thin for detecting small changes; that’s the next post’s problem.) More important than size is that the set is alive: every new failure mode you find in production should earn a new case. A golden set that never changes gradually stops resembling reality, and a stale ruler is a lying ruler.
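To put a rough number on “statistically thin”: assume, purely for illustration, that each case is scored pass/fail and the true pass rate is 0.8. With n = 30 cases, the standard error of the measured pass rate under the usual normal approximation (itself crude at this size) is:

```latex
\mathrm{SE} = \sqrt{\frac{p(1-p)}{n}}
            = \sqrt{\frac{0.8 \times 0.2}{30}}
            \approx 0.073,
\qquad
1.96 \times \mathrm{SE} \approx 0.14
```

That is an approximate 95% interval of about ±14 percentage points, so a score that moves by three or four points between runs is well inside the noise.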

When you genuinely lack real examples — a brand-new feature with no traffic — you can bootstrap with synthetic cases (have a strong model generate realistic inputs across your taxonomy’s dimensions). It’s a legitimate starting point, but treat it as scaffolding: replace synthetic cases with real ones as traffic arrives, because real users are more creative than any generator.
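If you do bootstrap synthetically, it helps to enumerate the taxonomy grid first, so the generator is asked for one case per cell instead of thirty cases of its own choosing. A small sketch under my own assumptions: `CaseBrief`, `SyntheticPlan`, and the dimension values are hypothetical, and the model call that would turn each brief into an actual case is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative dimension values; yours come from the failure taxonomy.
string[] genres = { "technical", "casual" };
string[] rarities = { "common word", "rare word" };
string[] targetLangs = { "uk", "es" };

var briefs = SyntheticPlan.Briefs(genres, rarities, targetLangs).ToList();
foreach (var brief in briefs)
    Console.WriteLine(brief);

// One cell of the grid. Source marks the case as scaffolding to be replaced later.
public record CaseBrief(string Genre, string WordRarity, string TargetLang, string Source = "synthetic");

// Hypothetical helper: the cartesian product of the taxonomy dimensions.
public static class SyntheticPlan
{
    public static IEnumerable<CaseBrief> Briefs(
        IEnumerable<string> genres,
        IEnumerable<string> rarities,
        IEnumerable<string> targetLangs) =>
        from genre in genres
        from rarity in rarities
        from lang in targetLangs
        select new CaseBrief(genre, rarity, lang);
}
```

Tagging every generated case with its source is what makes the “scaffolding” promise checkable: you can count how many synthetic cases are still in the set and watch that number fall as real traffic arrives.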

The silent killer: eval drift from production

Now the trap that quietly invalidates an otherwise perfect golden set, and the one I’d most want a reviewer to check for.

You write your feature’s prompt in the API endpoint. You write the eval, and — naturally — you write the prompt again in the test. Two copies. Someone tweaks the production prompt for a hotfix and doesn’t touch the test copy. From that moment your eval measures a prompt that no longer exists in production. The score stays green; the product changed underneath it. Nobody notices, because the test reports with total confidence.

The fix is structural, not disciplinary: extract the prompt into one builder that both production and the eval call. There is no second copy to drift.

			
// Built once, called by BOTH the endpoint and the eval — they cannot disagree.
public static class ExplainPrompt
{
    public static string BuildSystemPrompt(string? genre, string targetLang) => /* ... */;
    public static string BuildUserPrompt(string word, string sentence) => /* ... */;
}

		

The eval’s case-to-request mapping wires that shared builder straight in, and crucially the request goes through the same model gateway production uses, selected by the feature’s tag:

			
private static LlmRequest ToRequest(ExplainGolden g) => new(
    SystemPrompt: ExplainPrompt.BuildSystemPrompt(g.Genre, g.TargetLang),
    Messages: [new LlmMessage("user", ExplainPrompt.BuildUserPrompt(g.Word, g.Sentence))],
    MaxOutputTokens: 500,
    FeatureTag: "explain"); // same routing, same model, same path as prod

		

If you remember one thing from this post: an eval that runs a copy of the prompt is worse than no eval, because it manufactures false confidence. Same prompt, same gateway, same path — or you’re measuring a ghost.
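The “same prompt, same path” rule can also be checked mechanically. The sketch below pushes one input through a production-style handler and through the eval mapping, with a recording test double in place of the model, and compares what each one sent. Everything beyond the shapes already shown above is an assumption: `ILlmGateway`, `RecordingGateway`, `ExplainEndpoint`, `ExplainEval`, and the prompt wording are placeholders, since the post elides the real ones.

```csharp
using System;
using System.Collections.Generic;

var gateway = new RecordingGateway();

// Production path: a live request hits the endpoint handler.
ExplainEndpoint.Handle(gateway, "idempotent", "PUT requests should be idempotent.", "technical", "uk");

// Eval path: the same input arrives as a golden case.
var golden = new ExplainGolden("idempotent", "PUT requests should be idempotent.",
    "technical", "uk", "(reference omitted)");
ExplainEval.RunCase(gateway, golden);

LlmRequest prod = gateway.Seen[0], eval = gateway.Seen[1];
bool same = prod.SystemPrompt == eval.SystemPrompt
    && prod.Messages[0].Content == eval.Messages[0].Content
    && prod.FeatureTag == eval.FeatureTag;
Console.WriteLine(same ? "eval and production sent the same prompt" : "DRIFT");

public record ExplainGolden(string Word, string Sentence, string? Genre, string TargetLang, string ExpectedExplanation);
public record LlmMessage(string Role, string Content);
public record LlmRequest(string SystemPrompt, IReadOnlyList<LlmMessage> Messages, int MaxOutputTokens, string FeatureTag);

// Hypothetical gateway seam; the real interface is not shown in the post.
public interface ILlmGateway { string Complete(LlmRequest request); }

// Test double: records every request instead of calling a model.
public sealed class RecordingGateway : ILlmGateway
{
    public List<LlmRequest> Seen { get; } = new();
    public string Complete(LlmRequest request) { Seen.Add(request); return "(stub)"; }
}

// Placeholder wording only; the real prompt text is elided in the post.
public static class ExplainPrompt
{
    public static string BuildSystemPrompt(string? genre, string targetLang)
    {
        var genreLabel = genre ?? "any";
        return $"Explain the word in context. Genre: {genreLabel}. Reply in {targetLang}.";
    }

    public static string BuildUserPrompt(string word, string sentence) =>
        $"Word: {word}\nSentence: {sentence}";
}

public static class ExplainEndpoint
{
    public static string Handle(ILlmGateway gw, string word, string sentence, string? genre, string targetLang) =>
        gw.Complete(new LlmRequest(
            ExplainPrompt.BuildSystemPrompt(genre, targetLang),
            new[] { new LlmMessage("user", ExplainPrompt.BuildUserPrompt(word, sentence)) },
            500,
            "explain"));
}

public static class ExplainEval
{
    public static string RunCase(ILlmGateway gw, ExplainGolden g) =>
        gw.Complete(new LlmRequest(
            ExplainPrompt.BuildSystemPrompt(g.Genre, g.TargetLang),
            new[] { new LlmMessage("user", ExplainPrompt.BuildUserPrompt(g.Word, g.Sentence)) },
            500,
            "explain"));
}
```

With a shared builder this check passes trivially, which is the point. It earns its keep on the day someone inlines a “temporary” prompt string into either path.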

The pitfalls

A happy-path-only set: the score rises while the product falls. Stock it from your failure taxonomy.
Weak reference answers: they cap your score’s meaning and can reward the very failure you’re chasing.
Train/test leakage: tuning and scoring on the same cases overfits to fiction.
A frozen set: inputs drift, and a dataset that never grows ends up measuring a reality that no longer exists.
Synthetic-forever: fine to bootstrap, dangerous to rely on; real traffic is weirder.
A duplicated prompt: the drift trap. One shared builder, through the real gateway.

The takeaway

A golden dataset is not a formality you generate once and forget. It’s a carefully curated, honestly split, continuously refreshed ruler, and it has to run the real prompt through the real path or it measures nothing. Get the dataset right and every downstream number means something. Get it wrong and you’ve built an instrument that lies to you in green.

Next in the series: LLM-as-judge, done right — how to turn a paragraph into a trustworthy number, the biases that wreck judges, and why your judge needs its own eval.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.

Vasyl’s Dev Notes
