If you've shipped an AI agent into production, you already know the drill. Initial eval set looks solid. System goes live. Users find failure modes nobody predicted. You scramble to debug from forwarded complaints while your users lose trust. Aurimas Cernius, writing in the SwirlAI Newsletter, calls this pattern broken by design, and he's got a framework that actually works.
The core insight: most teams treat evals as something you write before shipping and then forget about. That's backwards. The failure modes that matter are the ones traffic shows you, not the ones you can guess upfront. What works is a two-phase lifecycle: a one-time pre-production phase gets you onto a recurring improvement loop, and that loop processes production traffic to surface new failures, attach evals to them, and land improved versions continuously.
Pre-Production: The One-Time Foundation
Problem definition comes first—naming what the agent does, what counts as correct, and which behaviors are failures regardless of output quality (off-brand tone, ungrounded citations, missing required fields). Proof of concept follows: a throwaway implementation that confirms your model and tool surface can handle the task at all. Performance metrics get decided before the prototype, not after—these are business outcomes like average ticket resolution time, not LLM eval scores. The initial eval set comes from two sources: synthetic data for edge cases you can imagine, plus historical human work if you're automating existing processes.
The Agentic AI Flywheel
Once shipped, the system enters a recurring loop: Ship, Observe, Diagnose, Improve, Ship again. Each turn surfaces new failure modes, attaches evals to them, and produces a new version that passes the team's accumulated eval set. Over time, the eval set becomes your quality bar, rather than whatever the team remembers to check.
Observing Production Traffic
Every invocation generates trace data: LLM calls, tool invocations, intermediate outputs. User feedback (thumbs up/down, structured signals) adds another layer. Cernius notes that teams often stall here by requiring alerting infrastructure before running error analysis, but you don't need alerts on day one. Run the loop with what you have, and add alerting when traffic volume demands it.
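To make the trace-plus-feedback idea concrete, here is a minimal sketch of what one stored invocation might look like. The `Step` and `Trace` classes, their field names, and the `needs_review` helper are all hypothetical, not something the article prescribes; real tracing tools define their own schemas.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    kind: str      # assumed values: "llm_call" or "tool_call"
    name: str      # model or tool name
    input: Any
    output: Any    # intermediate output of this step


@dataclass
class Trace:
    trace_id: str
    user_input: str
    steps: List[Step] = field(default_factory=list)
    final_output: str = ""
    feedback: Optional[str] = None  # assumed values: "up", "down", or None


def needs_review(traces):
    """Pick traces with an explicit thumbs-down for manual error analysis."""
    return [t for t in traces if t.feedback == "down"]
```

With no alerting in place, a filter like `needs_review` run by hand over a day's traces is already enough to start error analysis.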
Diagnosing Failures
Trace and feedback data gets clustered into named error modes: hallucinated citations (agent cites knowledge-base articles that don't support claims), wrong tool selection, missed retrieval (answer exists but never entered context), broken output format, off-brand tone. The discipline here is eval-first ordering—write the failing test the moment you name the error mode, then schedule the fix separately. This mirrors test-driven development and prevents three failure patterns: no way to verify fixes work, evals that never get written after shipping patches, and tests that describe the shape of the fix rather than the original failure.
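As a rough illustration of eval-first ordering, the sketch below records a failing case for the hallucinated-citation mode at triage time, before any fix exists. Everything here is an assumption made for the example: the `EvalCase` structure, the `[KB-123]` citation format, and the idea that a reviewer has labeled which articles actually support the answer.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    error_mode: str               # name given to the cluster during triage
    agent_input: str              # input taken from the failing trace
    check: Callable[[str], bool]  # True when an output is acceptable


def extract_citations(output):
    # Assumes citations look like [KB-123]; adjust to your own format.
    return set(re.findall(r"\[(KB-\d+)\]", output))


def citation_case(agent_input, supporting_ids):
    """Pass only if the output cites something and every cited ID is labeled as supporting."""
    def check(output):
        cited = extract_citations(output)
        return bool(cited) and cited <= set(supporting_ids)
    return EvalCase("hallucinated_citation", agent_input, check)


# Written the moment the error mode is named, from the observed failure.
case = citation_case("How do I reset my password?", {"KB-101"})
observed_output = "Use the reset link in your profile [KB-999]."
print(case.check(observed_output))  # False: the eval is red before the fix is scheduled
```

Because the case is derived from the original failure rather than from the patch, it keeps describing the failure even if the eventual fix takes a different shape.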
The Five Eval Types That Matter
Citation grounding checks use either programmatic string matching or LLM-assisted judges to verify that cited sources actually support the claims. Tool-use correctness is deterministic: label inputs with expected tool calls, then compare actual to expected in pure code, no model required. Retrieval recall@k measures whether known-relevant documents land in the top-k retrieved set, a metric with decades of search-engine precedent. Schema validators run structural checks against JSON schemas or type definitions. LLM-as-judge with a rubric handles subjective quality like tone and brand voice. It is riskier because judge models drift and rubrics need versioning, but it is essential for covering what code can't grade.
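The three fully deterministic eval types above can be sketched in a few lines of plain Python. The function names, the `(tool_name, args)` representation of a tool call, and the field-to-type mapping used in place of a real JSON Schema are all assumptions for illustration; a production setup would more likely use a schema library.

```python
def tool_calls_match(expected, actual):
    """Tool-use correctness: exact, ordered comparison of (tool_name, args) pairs."""
    return list(expected) == list(actual)


def recall_at_k(relevant_ids, retrieved_ids, k):
    """Share of known-relevant documents that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall@k needs at least one labeled relevant document")
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def schema_problems(payload, required):
    """Structural check: `required` maps field name to expected Python type."""
    problems = []
    for name, expected_type in required.items():
        if name not in payload:
            problems.append(f"missing: {name}")
        elif not isinstance(payload[name], expected_type):
            problems.append(f"wrong type: {name}")
    return problems


print(recall_at_k({"d1", "d2"}, ["d3", "d1", "d4", "d2"], k=2))  # 0.5
```

None of these functions call a model, so they are cheap enough to run on every commit and on a large share of production traffic.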
Drift: The Loop's Fuel
Drift is what makes the flywheel necessary. Four signals indicate it's happening: input distribution shift (new vendor names, SKUs, intents you've never seen), eval score decay over time on the same test set, climbing thumbs-down rates, and latency or cost spikes that usually precede quality drops. When drift appears, you reopen the loop on the drifted slice—pull traffic into error analysis, cluster new modes, write evals, update context engineering levers, ship.
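Two of the four drift signals are simple enough to sketch directly: a climbing thumbs-down rate and input distribution shift, approximated here as the share of recent values (vendor names, say) never seen in a baseline window. The function names and both thresholds are arbitrary assumptions for the example, not values from the article.

```python
def thumbs_down_rate(feedback):
    """Share of 'down' among explicit ratings; unrated entries (None) are ignored."""
    rated = [f for f in feedback if f in ("up", "down")]
    if not rated:
        return 0.0
    return rated.count("down") / len(rated)


def unseen_share(baseline_values, recent_values):
    """Fraction of recent values that never appeared in the baseline window."""
    if not recent_values:
        return 0.0
    seen = set(baseline_values)
    return sum(v not in seen for v in recent_values) / len(recent_values)


def drift_flags(base_fb, recent_fb, base_vals, recent_vals,
                rate_jump=0.10, unseen_limit=0.20):
    # Thresholds are placeholders; tune them against your own traffic.
    flags = []
    if thumbs_down_rate(recent_fb) - thumbs_down_rate(base_fb) > rate_jump:
        flags.append("thumbs_down_rate")
    if unseen_share(base_vals, recent_vals) > unseen_limit:
        flags.append("input_shift")
    return flags
```

A non-empty result is a cue to pull the drifted slice of traffic into error analysis, not a verdict on its own.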
The Eval Set Is Your Central Artifact
Coverage grows on every cycle because the loop produces evals as a byproduct. New evals originate from Prototype (synthetic data and historical work) or Error Analysis (every named failure mode). The same set runs in CI/CD gates and as continuous monitors on production traffic. Over months, this is the difference between a system whose quality bar is whatever the team remembers to check, and one whose bar is every failure mode ever seen in production.
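One way to picture the same check serving both roles is the sketch below: a CI gate that blocks a candidate version if any accumulated case fails, and a monitor that scores sampled production outputs with the same check function. The `(name, agent_input, check)` tuple layout, the `TICKET-<number>` format, and the stand-in agents are hypothetical.

```python
import re


def has_ticket_id(output):
    # Assumed output requirement: replies must reference an ID like TICKET-42.
    return re.search(r"\bTICKET-\d+\b", output) is not None


def ci_gate(agent, cases):
    """Run every accumulated case; the gate passes only if none fail."""
    failures = [name for name, agent_input, check in cases
                if not check(agent(agent_input))]
    return len(failures) == 0, failures


def pass_rate(outputs, check):
    """Monitor-style use: share of sampled production outputs that pass one check."""
    if not outputs:
        return 1.0
    return sum(check(o) for o in outputs) / len(outputs)


cases = [("includes_ticket_id", "Where is my refund?", has_ticket_id)]


def candidate(question):
    return "Refund issued, see TICKET-42."


def regressed(question):
    return "Refund issued."


print(ci_gate(candidate, cases))  # (True, [])
print(ci_gate(regressed, cases))  # (False, ['includes_ticket_id'])
```

Checks that need per-input labels, such as expected tool calls, fit the gate more naturally than the monitor, since production traffic arrives unlabeled.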
Key Takeaways
- Write evals at triage time, not when fixes land—decouple eval growth from engineering velocity
- Start error analysis on day one; no alerting infrastructure is required
- Your initial eval set comes from synthetic data plus historical human work if automating existing processes
- The same eval set runs in two places: CI/CD gates and continuous monitors on production traffic
- Watch for drift signals continuously—input shift, score decay, feedback rates, latency spikes
The Bottom Line
This isn't revolutionary stuff—it's test-driven development applied to AI systems with all the operational messiness that implies. But most teams still aren't doing it. If you're shipping agents without a flywheel like this in place, you're not building software—you're running an unpaid user research program while your users suffer the consequences.