When an AI agent fails in production, the instinct is to tweak the prompt and rerun it. That impulse will destroy everything you need to actually fix the problem. According to developer opswald on DEV.to, that kneejerk rerun changes the model output, retrieved context, tool state, timing, permissions, and external API response. If your agent already sent an email, issued a refund, or called an MCP tool, you've just repeated a side effect while simultaneously wiping the evidence of what went wrong.

The Evidence Preservation First Approach

opswald's guide flips the debugging workflow entirely: capture before you touch anything. This isn't about making every run deterministic—it's about finding the first unsupported decision and turning that failure into something replayable for regression testing down the line. The checklist starts with basics that sound obvious until you realize how often teams skip them. Trace ID, run ID, user/session/job ID, agent version, deployment SHA, model and provider, prompt version, tool registry version, retrieval index version, environment and region, timestamp and timezone. Without this stable identifier joining model calls to application logs to external API writes, your incident review becomes archaeology.

Context Is Everything

The same user request produces different behavior depending on what the agent saw. You need the original user input or job payload, active system instructions, retrieved documents or chunks, memory entries used by the run, account/tenant/role scope, product state before execution, and feature flags. A common failure pattern is a plausible model answer built from incomplete or stale context—when you inspect only the final response, the agent looks unreasonable. When you inspect what it actually saw, the root cause jumps out.

Inspecting Decisions, Not Just Outputs

For agents, asking 'what did the model answer?' misses the point. The real question is 'why did the agent choose this next action?' Look at selected tools or branches, alternatives it could have chosen, guardrails that ran versus ones that should have run but didn't, missing facts, assumptions from summaries or retrieved content, and handoffs between agents or workflow steps.

Tool Calls Are Model Decisions Plus API Events

This is where agent debugging diverges from normal request tracing. For each tool call, preserve the generated arguments, validation result, permission scope, raw versus normalized response, latency, retry count, error payloads, and any external mutation IDs. A tool can return HTTP 200 while still taking the wrong action—issuing a refund to the wrong account, querying the wrong customer data, trusting a partial response, or duplicating a write.

Classify Your Tools Before Debugging

Not all tools debug the same way. Separate them into read-only, write, risky write, human approval required, and external side effect categories. For mutating tools—emails, tickets, refunds, database writes—capture before/after state and an idempotency key. The replay rule is absolute: debugging should not repeat production side effects.

Check Retrieval Before Blaming the Model

Many agent failures are context failures in disguise. Ask whether retrieval returned the right source, if that source was stale, if relevant facts were omitted by chunking or summarization, if memory introduced old user preferences, and whether tool results contradicted earlier retrieved context. If you gave the model bad evidence, prompt tuning hides the symptom without fixing the system.

Convert Incidents Into Regressions

Once root cause is found, create a regression fixture with minimal failing input, pinned context, expected decision or blocked action, tool output stubs, assertions for side effects, and a link back to the production trace. Good fixtures prevent the same class of failure from returning through a future prompt change, model upgrade, retrieval update, or tool schema edit.

Key Takeaways

  • Capture run identity before touching anything—trace ID, versions, timestamps, environment
  • Preserve original trigger and context—the agent's input state matters more than its output
  • Inspect the decision path, not just the final response—what did it choose and why?
  • Classify tools as read-only versus mutating—replay must not repeat side effects
  • Check retrieval before blaming the model—context failures masquerade as reasoning failures

The Bottom Line

Don't rerun until you've frozen everything. Production agent debugging isn't about watching a pretty trace—it is about preserving evidence behind a decision. If you can answer what the agent saw, what it chose, what changed, and how to replay it safely, you can debug the failure. If you cannot, you're guessing—and probably making things worse.