If you've shipped AI agents to production, you already know this story too well. You fix a bug in your agent on Monday. A week later someone swaps the model or tweaks a prompt for better performance. Suddenly that exact same bug is back—and nobody catches it until a user reports it.
The Core Problem
Traditional software has regression test suites for precisely this scenario. When you push new code, CI runs the suite and flags regressions before they hit production. AI agents largely don't have this luxury. Because large language models are non-deterministic, traditional assertion-based testing breaks down: you can't simply check that the output matches a fixed string, because the exact wording can change from one run to the next.
How replayd Works
Taimoor Khan built replayd (v0.1.1) to close exactly this gap. Here's the core loop: when your agent fails in production or during development, you capture that run and save it as a test case. Before shipping any new version (a model swap, a prompt change, or a code update), you replay those saved failures against it. If the same failure surfaces again, you catch it before users do. The SDK installs via pip, and its core package has zero runtime dependencies. It's also framework-agnostic, so you can integrate it whether you're running LangChain, AutoGen, or a custom agent stack. The project is available on GitHub at github.com/TaimoorKhan10/replayd.
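To make the capture-and-replay loop concrete, here is a minimal sketch in plain Python. It is not replayd's API: every name in it (CapturedRun, save_case, load_cases, replay, the toy agents) is hypothetical and invented for illustration. It assumes an agent is simply a function that takes the user input and returns the list of tool names it called.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, List

# NOTE: all names here are hypothetical; this is not replayd's actual API.


@dataclass
class CapturedRun:
    case_id: str
    user_input: str
    failure_note: str  # human-readable description of what went wrong


def save_case(run: CapturedRun, directory: Path) -> Path:
    """Persist a failed run as a JSON test case."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{run.case_id}.json"
    path.write_text(json.dumps(asdict(run), indent=2))
    return path


def load_cases(directory: Path) -> List[CapturedRun]:
    """Load every saved case, in a stable (sorted) order."""
    return [
        CapturedRun(**json.loads(p.read_text()))
        for p in sorted(directory.glob("*.json"))
    ]


def replay(
    cases: List[CapturedRun],
    agent: Callable[[str], List[str]],
    failure_recurs: Callable[[CapturedRun, List[str]], bool],
) -> List[str]:
    """Run each saved input through the agent; return ids that fail again."""
    return [c.case_id for c in cases if failure_recurs(c, agent(c.user_input))]


# Toy agents standing in for two versions of a real agent.
def fixed_agent(user_input: str) -> List[str]:
    return ["lookup_order", "reply"]


def regressed_agent(user_input: str) -> List[str]:
    return ["reply"]  # the old bug is back: replies without a lookup


def skipped_lookup(case: CapturedRun, tool_calls: List[str]) -> bool:
    return "lookup_order" not in tool_calls


with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp)
    save_case(
        CapturedRun(
            case_id="order-status-001",
            user_input="Where is my order?",
            failure_note="Replied without looking up the order.",
        ),
        case_dir,
    )
    cases = load_cases(case_dir)
    fixed_result = replay(cases, fixed_agent, skipped_lookup)
    regressed_result = replay(cases, regressed_agent, skipped_lookup)
```

In a CI job, a non-empty result list would be the signal to block the release. How replayd itself stores and replays runs may differ from this sketch.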
Grading: Where It Gets Interesting
The real innovation isn't capturing runs; it's grading them. Since exact output matching is off the table, Khan's solution separates failures into two categories: structural and semantic. Structural failures get deterministic assertions: you check the structure of what the agent did, even if the exact wording varies. Semantic failures, the tricky ones, use another LLM as a judge to determine whether the same failure mode occurred. This is the key insight: you assert on what the agent DID, not what it said. Did it call the right tool? Did it access the correct resource? That's testable in ways that pure output matching never will be. The LLM-as-judge approach lets you test for a failure mode without requiring byte-for-byte reproducibility.
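The two grading styles can be sketched as follows. Again, this is an illustration under stated assumptions rather than replayd's implementation: Trace, ToolCall, structural_failure, semantic_failure, and the tool names are all hypothetical, and the judge is a stub function standing in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# NOTE: hypothetical types and names; not replayd's actual API.


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Trace:
    tool_calls: List[ToolCall]
    final_text: str


def structural_failure(trace: Trace) -> bool:
    """Deterministic check on what the agent did.

    Fails if delete_record is called before any confirm_with_user call.
    """
    confirmed = False
    for call in trace.tool_calls:
        if call.name == "confirm_with_user":
            confirmed = True
        elif call.name == "delete_record" and not confirmed:
            return True
    return False


# A judge maps a prompt to a verdict string beginning with YES or NO.
Judge = Callable[[str], str]


def semantic_failure(trace: Trace, failure_description: str, judge: Judge) -> bool:
    """Ask a judge whether the described failure mode occurred again."""
    prompt = (
        "A previous run failed in this way:\n"
        f"{failure_description}\n\n"
        "Here is the new response:\n"
        f"{trace.final_text}\n\n"
        "Did the same failure occur? Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")


def fake_judge(prompt: str) -> str:
    """Offline stand-in for an LLM call; a real judge would query a model."""
    return "YES" if "will arrive on" in prompt else "NO"


bad_trace = Trace(
    tool_calls=[ToolCall("delete_record", {"id": 7})],
    final_text="Your package will arrive on Friday.",
)
good_trace = Trace(
    tool_calls=[ToolCall("confirm_with_user", {}), ToolCall("delete_record", {"id": 7})],
    final_text="I can't confirm a date yet, but here is the tracking link.",
)
description = "The agent promised a delivery date it had no way of knowing."

bad_structural = structural_failure(bad_trace)
good_structural = structural_failure(good_trace)
bad_semantic = semantic_failure(bad_trace, description, fake_judge)
good_semantic = semantic_failure(good_trace, description, fake_judge)
```

Keeping the judge behind a plain callable means the structural check stays fully deterministic, while the semantic check can be swapped between a stub in unit tests and a real model in the regression run.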
Key Takeaways
- replayd captures failed agent runs as reusable test cases
- Structural failures get deterministic assertions; semantic ones use an LLM judge
- Assert on agent behavior, not text output—works around LLM non-determinism
- Zero runtime dependencies, framework agnostic, v0.1.1 with rough edges expected
The Bottom Line
This is exactly the kind of tooling the AI agent ecosystem desperately needs right now. As more teams push agents into production-critical paths, having regression testing isn't optional—it's survival. Khan's approach to grading is clever and worth watching as the project matures.