There's a moment every developer hits when their AI agent works beautifully in a notebook but falls apart the second it touches real users. Rizwan Saleem has written a comprehensive guide on DEV.to that tackles this problem head-on, and it's exactly the kind of resource the community needs right now.
The Gap Between Notebook and Production
Saleem opens with a blunt truth: shipping an AI agent is not just "hook model to API." That stochastic magic you built locally becomes a liability when it starts making real decisions with real consequences. The guide assumes you're past the hello-world stage: you've got an LLM-based agent (or graph of agents) running in a notebook, and now you need to make it observable, debuggable, and safe enough for production traffic.
Evaluation: From Gut Feelings to Measurable Results
The first major pillar tackles evaluation, because "feels good" is not a success metric. Saleem breaks this into offline and online evaluation strategies. For task-level metrics, he suggests different baselines depending on your use case: Q&A systems need correctness versus ground truth and resolution rates, workflow agents require task completion tracking with tool call counts and latency measurements, while code agents should be measured by test pass rates and production bug frequency. The guide recommends starting with a small but labeled eval set of 50 to 200 realistic user prompts with expected outputs or rubrics. For open-ended tasks where human review is expensive, LLM-as-judge with well-crafted rubrics can scale your evaluation process. The critical piece here is CI integration: every model change, prompt update, or agent-graph modification should trigger the full eval suite with "must not regress" guardrails like accuracy thresholds and toxicity limits.
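To make the "must not regress" idea concrete, here is a minimal sketch of an offline eval gate that a CI job could run. It is not code from Saleem's guide: the names `EvalCase`, `run_eval`, `check_no_regression`, and `toy_agent`, the exact-match scorer, and the 0.70 threshold are all assumptions chosen for illustration. A real suite would load 50 to 200 labeled prompts and call the actual agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # ground-truth answer for this prompt


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible scorer: case- and whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the agent answers correctly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(exact_match(agent(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def check_no_regression(accuracy: float, min_accuracy: float) -> None:
    """Raise so the CI job fails when accuracy drops below the threshold."""
    if accuracy < min_accuracy:
        raise AssertionError(
            f"accuracy {accuracy:.2%} is below the required {min_accuracy:.2%}"
        )


# Hypothetical stand-in agent and a tiny eval set, for illustration only.
def toy_agent(prompt: str) -> str:
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("largest planet?", "Jupiter"),
    EvalCase("capital of France?", "Paris"),
]

accuracy = run_eval(toy_agent, CASES)
print(f"accuracy: {accuracy:.2%}")
check_no_regression(accuracy, min_accuracy=0.70)  # assumed threshold
```

Because `check_no_regression` raises on failure, running this script as a CI step turns a drop in accuracy into a failed build. Exact match only suits short factual answers; for open-ended output the scorer would be swapped for a rubric or an LLM-as-judge call.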
Observability: Seeing Inside Agent Behavior
Agents are graphs of steps, not single API calls, which means your observability stack needs to capture the full execution trace. Saleem advocates for structured tracing where each request produces a hierarchical span: a root span for the incoming request (user, timestamp, context) with sub-spans for each agent step, tool call, model invocation, and external API interaction. Metadata is everything here: prompt template name, model version, temperature settings, token counts in and out, latency per span, errors encountered, and cost estimates. Store these traces in a queryable format like JSON in a columnar store or a dedicated tracing platform, indexed by request ID, user ID, model version, agent version, and error type so you can answer questions like "show me failing traces for version 3 using tool X." Live dashboards should surface requests per minute, success rate broken down by error type, latency percentiles (p50, p95, p99), and cost per request with breakdowns by model. Production metrics also need task success indicators from user feedback buttons or downstream KPIs, along with periodic sampling of real traffic for human review.
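The hierarchical span structure described above can be sketched with nothing but the Python standard library. This is an illustrative toy, not the guide's implementation and not a substitute for a real tracing platform: the `Span` and `Tracer` classes, the field names, and the sample values (`req-001`, `user-42`, `model-x`) are all hypothetical.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Span:
    name: str
    request_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    metadata: dict = field(default_factory=dict)
    latency_ms: float = 0.0
    error: Optional[str] = None


class Tracer:
    """Collects spans for one request; nesting is tracked with a stack."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **metadata):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.request_id, uuid.uuid4().hex, parent_id, metadata)
        self.spans.append(current)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            current.error = type(exc).__name__  # record the error type, then re-raise
            raise
        finally:
            current.latency_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()

    def to_json_lines(self) -> str:
        """One JSON object per span, ready to load into a queryable store."""
        return "\n".join(json.dumps(asdict(s)) for s in self.spans)


tracer = Tracer(request_id="req-001")
with tracer.span("request", user_id="user-42", agent_version="3"):
    with tracer.span("model_call", model_version="model-x", temperature=0.2) as s:
        s.metadata.update(tokens_in=120, tokens_out=45)  # filled in after the call
    with tracer.span("tool_call", tool="search"):
        pass  # the real tool invocation would go here

print(tracer.to_json_lines())
```

Each emitted line carries `request_id`, `parent_id`, and the metadata, so a query such as "failing traces for version 3 using tool X" becomes a filter on `error`, `agent_version`, and `tool`. In production you would more likely rely on an established tracing library than hand-roll this.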
Key Takeaways
- Define concrete, use-case-specific success metrics before you ship anything
- Build repeatable offline evals integrated into CI to catch regressions early
- Structured traces with hierarchical spans are essential for debugging agent graphs
- Store trace metadata in queryable formats so you can slice failures by version and tool
- Online evaluation through user feedback and traffic sampling keeps your metrics grounded in reality