Part 5, the finale, of a series on building production AI on .NET. We’ve built the pieces — what evals are, error analysis, golden datasets, and a trustworthy judge. Now we make them earn their keep.
By now you can produce a defensible quality score for an AI feature. But a score you only look at is a vanity metric. The entire point of all that work is to make quality something your engineering process acts on automatically — the same way a failing unit test stops a bad commit. That means two homes for your evals: a gate before you ship, and monitoring after.
Home 1: CI — a safety net against regressions
Because TextStack’s judge is a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, running an eval is just a dotnet test invocation. The evaluator emits each of the rubric’s axes plus an overall score as numeric metrics, and a quality floor is expressed as a Pass/Fail interpretation on the overall metric:
    // In the evaluator: the overall metric is interpreted Pass/Fail against a floor.
    if (overallFloor is { } floor)
        overall.Interpretation = new EvaluationMetricInterpretation(
            RatingFor(score.Mean),
            failed: score.Mean < floor,
            reason: $"floor {floor:0.0} (mean {score.Mean:0.00})");
That catches gross breakage — “something is badly wrong.” But the more valuable gate is relative: store a baseline score per feature, and fail the build when a change drops quality by more than a threshold versus that baseline. That turns “did this prompt change help?” into a red/green answer and makes improving a prompt a tight loop — change, run, compare, keep or revert. It’s the AI equivalent of TDD.
Honest status from our codebase: the floor and on-demand runs exist today; the automatic baseline-versus-regression gate is the next step. I’m flagging that deliberately, because plenty of “we do eval-driven development” claims are really “we have a number nobody gates on.” The hard 80% — the measuring instrument — is built; wiring the ratchet is the lighter remaining 20%.
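For a sense of how little wiring the ratchet needs, here is a minimal sketch of a baseline-versus-current comparison. It is not TextStack's code: the feature names, the scores, the 0.15 threshold, and the FindRegressions helper are all hypothetical, and it assumes the baseline is stored as one mean score per feature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical baseline means per feature (for example, loaded from a JSON file in the repo).
var baselineScores = new Dictionary<string, double>
{
    ["summarize"] = 4.20,
    ["explain"] = 3.90,
};

// Hypothetical means from the current eval run.
var currentScores = new Dictionary<string, double>
{
    ["summarize"] = 4.25,
    ["explain"] = 3.60,
};

// Returns the features whose mean dropped by more than maxDrop versus the baseline.
static List<string> FindRegressions(
    IReadOnlyDictionary<string, double> baseline,
    IReadOnlyDictionary<string, double> current,
    double maxDrop)
{
    var regressions = new List<string>();
    foreach (var (feature, baseScore) in baseline)
    {
        // A feature missing from the current run counts as a regression rather than being skipped.
        if (!current.TryGetValue(feature, out var score) || baseScore - score > maxDrop)
            regressions.Add(feature);
    }
    return regressions;
}

var regressed = FindRegressions(baselineScores, currentScores, maxDrop: 0.15);
Console.WriteLine(regressed.Count == 0
    ? "gate: green"
    : $"gate: red ({string.Join(", ", regressed)})");
// prints "gate: red (explain)"
```

In a test project, a non-empty result would become a failing assertion. Updating the baseline should be a deliberate, reviewed step, so a regression can't quietly become the new normal.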
The constraint CI forces: evals cost money
Every eval case is a real generation plus a real judge call. Running the full suite on every commit is slow and expensive, so evals have to be deliberate. TextStack’s are opt-in: tagged so default CI skips them, and they self-skip when the provider isn’t configured.
OPENAI_API_KEY=… dotnet test tests/TextStack.AiEvals --filter Category=Eval
Default CI stays green and free; the expensive truth runs on purpose. The pragmatic pattern: a small, cheap subset on pull requests for a fast signal, and the full suite nightly or pre-release. Treat eval spend like any cloud cost — budget it, don’t let it run unbounded.
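One way to picture the subset-on-PRs pattern is a runner that picks its cases from an environment variable and skips itself when no provider key is present. This is a sketch under assumptions: the EVAL_SCOPE variable, the case ids, the stride of 5, and the SelectCases helper are invented for illustration. Only OPENAI_API_KEY comes from the command above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical golden case ids; in practice these would come from the golden dataset.
var allCases = Enumerable.Range(1, 40).Select(i => $"case-{i:000}").ToList();

// Deterministic subset: every Nth case, so the PR subset is identical between runs
// and its scores stay comparable from one pull request to the next.
static List<string> SelectCases(IReadOnlyList<string> cases, string scope, int prStride = 5) =>
    scope == "full"
        ? cases.ToList()
        : cases.Where((_, index) => index % prStride == 0).ToList();

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var scope = Environment.GetEnvironmentVariable("EVAL_SCOPE") ?? "pr"; // hypothetical variable

if (string.IsNullOrWhiteSpace(apiKey))
{
    // Self-skip: with no provider configured, do nothing and cost nothing.
    Console.WriteLine("Provider not configured; skipping evals.");
}
else
{
    var selected = SelectCases(allCases, scope);
    Console.WriteLine($"Running {selected.Count} of {allCases.Count} cases (scope: {scope}).");
}
```

A fixed stride keeps the subset stable; a random sample would make PR scores noisy and harder to compare against a baseline.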
Home 2: Production — monitoring and guardrails
A curated golden set, however good, is a snapshot of inputs you imagined. Production sends inputs you didn’t. So the offline gate is only half the system; the other half runs against live traffic.
This is where evals and observability become one thing. Every AI call in TextStack is tagged with its feature and recorded — cost, latency, tokens, errors — and runs persist to an eval_runs table surfaced on an internal /ai-quality dashboard (Traces and Evals tabs), with an admin “Run evals” button to trigger the suite on demand. Because the judge is the same component offline and online, you can sample real outputs per feature and score them with the identical rubric. Two modes fall out of that:
- Background monitoring — sample a slice of live outputs, judge them, and watch the score over time to catch drift before users complain.
- Guardrails — for high-stakes outputs, judge in the critical path and block, retry, or fall back when a result fails. (Use sparingly: it adds a judge call’s worth of latency and cost to the request.)
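Both modes can be sketched in a few lines. The code below is illustrative rather than TextStack's implementation: the generate and judge delegates are hypothetical stand-ins for the real model and judge calls, and the floor, attempt count, and fallback text are made-up values.

```csharp
using System;
using System.Threading.Tasks;

// Background monitoring: judge only a fraction of live outputs, off the request path.
static bool ShouldSample(double rate, Random rng) => rng.NextDouble() < rate;

// Guardrail: judge in the critical path, retry on failure, then fall back.
static async Task<string> GenerateWithGuardrailAsync(
    Func<Task<string>> generate,        // hypothetical: calls the model
    Func<string, Task<double>> judge,   // hypothetical: returns the rubric's overall score
    double floor,
    string fallback,
    int maxAttempts = 2)
{
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        var output = await generate();
        if (await judge(output) >= floor)
            return output; // passed the rubric: serve it
    }
    return fallback; // every attempt failed: serve a safe default
}

// Usage with fake delegates: the first draft fails the judge, the retry passes.
var attempts = 0;
var result = await GenerateWithGuardrailAsync(
    generate: () => Task.FromResult(++attempts == 1 ? "weak draft" : "solid answer"),
    judge: text => Task.FromResult(text == "solid answer" ? 4.5 : 2.0),
    floor: 3.5,
    fallback: "Sorry, no answer is available right now.");
Console.WriteLine(result); // prints "solid answer"
```

Note what the retry costs: in the worst case the request pays for two generations and two judge calls before the fallback is served, which is why this belongs only on high-stakes outputs.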
The flywheel
Put the two homes together and you get a loop that compounds. Production surfaces a new failure mode → you do error analysis on it → it becomes a new golden case → your gate now defends against it → quality climbs → cleaner output produces cleaner traffic. Each turn makes the next regression harder to ship. That continuous-improvement flywheel — not any single dashboard — is the real product of an eval system.
The pitfalls
- A number nobody gates on — if a bad score can’t fail a build or page someone, it’s decoration.
- A fixed floor mistaken for a regression gate — a floor catches breakage, not a 2%-worse change. You want both.
- Evals on every commit — the bill and the wait will kill the habit; subset on PRs, full suite nightly.
- Offline-only — you’ll ship regressions from inputs your golden set never imagined.
- Guardrails everywhere — judging in the critical path is powerful but costs latency; reserve it for outputs that matter.
- Online scores you never read — monitoring you don’t look at is just a more expensive log.
The series, in one line each
That’s the whole discipline, start to finish:
- Evals are the test suite for non-deterministic code — graded judgement over a representative sample.
- Error analysis comes first — read your failures and name them; the taxonomy decides what to measure.
- The golden set is the ruler — representative, leak-free, fresh, and run through the real prompt and gateway.
- The judge is a model too — defensive, dedicated, routed, and validated against humans with Cohen’s κ.
- A score must become a gate — CI to catch regressions before ship, monitoring to catch drift after.
None of it requires Python or a heavyweight platform. On .NET it’s an ILlmService seam, a golden dataset in JSON, a custom IEvaluator on Microsoft.Extensions.AI.Evaluation, and an opt-in test category — built on a real product, in production. Done right, evals turn “I think this AI feature is fine” into “I can prove it, and I’ll know the moment it stops being true.” That’s the difference between shipping AI and gambling with it.
TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.