Multi-agent AI systems have a dirty secret: they fail silently while looking perfectly healthy. That's the core insight from Hari's detailed breakdown of how he built production monitoring for a multi-agent system using Langfuse, Azure OpenAI, and TypeScript.

The Monitoring Gap Nobody Talks About

Traditional infrastructure monitoring will tell you everything is fine. APIs return 200 OK. Latency looks acceptable. Users get responses. But here's what it won't show you: whether the agent routed your query to the right specialist, whether it's hallucinating information, or whether it's completely ignoring outputs from downstream agents. These aren't infrastructure failures—they're decision-layer failures that standard dashboards simply cannot see.

Full Trace Visibility with Langfuse

The fix starts with tracing every single execution step. Hari instrumented his entire pipeline to capture tool calls, input/output payloads, token usage per agent, and latency at each stage. If an agent touched something, it got logged. No black boxes allowed. This gave him the visibility needed to actually understand what was happening inside the multi-agent workflow—not just whether it completed, but how and why.
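To make that concrete, here is a minimal sketch of what this kind of instrumentation can look like with the Langfuse TypeScript SDK. The agent name, payload shapes, and token counts are illustrative placeholders, not Hari's actual code; the pattern is one trace per end-to-end run, with spans for orchestration decisions and generations for LLM calls.

```typescript
import { Langfuse } from "langfuse";

// Keys come from environment variables in a real setup; values here are placeholders.
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_BASE_URL,
});

async function handleQuery(userQuery: string) {
  // One trace per end-to-end multi-agent run.
  const trace = langfuse.trace({
    name: "multi-agent-query",
    input: { userQuery },
  });

  // Span for the router's decision, so misroutes are visible later.
  const routing = trace.span({ name: "router", input: { userQuery } });
  const chosenAgent = "billing-specialist"; // illustrative routing result
  routing.end({ output: { chosenAgent } });

  // Generation for the specialist's LLM call, with token usage captured.
  const generation = trace.generation({
    name: `${chosenAgent}-completion`,
    model: "gpt-4o",
    input: [{ role: "user", content: userQuery }],
  });
  const answer = "(model output)"; // placeholder for the actual completion
  generation.end({
    output: answer,
    usage: { input: 512, output: 128 }, // token counts from the API response
  });

  trace.update({ output: { answer } });
  await langfuse.flushAsync(); // ensure events are sent before the process exits
  return answer;
}
```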

Deterministic Checks, Faithfulness Tests, and LLM Judges

Not everything requires expensive LLM evaluation. Hari added rule-based deterministic checks for binary validations: Did the agent call tools from the correct domain? Was the expected workflow followed? These run fast and cheap as simple pass/fail assertions. For hallucination detection specifically, he implemented faithfulness checks that compare final responses against specialist agent outputs—flagging anything introduced in the final layer that wasn't grounded in source material. The expensive part: using Azure OpenAI as LLM judges to evaluate routing correctness, response quality, attribution accuracy, and conflict handling for every multi-agent response. Worth it, according to Hari.
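A rough sketch of what the cheap layer can look like, assuming checks are written as plain TypeScript functions and their results are attached to traces as Langfuse scores. The domain-to-tool mapping and the string-overlap faithfulness heuristic below are illustrative stand-ins, not the original system's rules; in practice a grounding check like this is often stricter or delegated to a model.

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Illustrative mapping of which tools each specialist is allowed to call.
const ALLOWED_TOOLS: Record<string, string[]> = {
  "billing-specialist": ["get_invoice", "list_payments"],
  "support-specialist": ["search_kb", "create_ticket"],
};

// Deterministic check: did the agent only call tools from its own domain?
function toolsInDomain(agent: string, calledTools: string[]): boolean {
  const allowed = new Set(ALLOWED_TOOLS[agent] ?? []);
  return calledTools.every((tool) => allowed.has(tool));
}

// Crude faithfulness heuristic: every sentence in the final answer should be
// grounded in (here: overlap with) some specialist output.
function faithfulnessScore(finalAnswer: string, specialistOutputs: string[]): number {
  const sentences = finalAnswer.split(/(?<=[.!?])\s+/).filter(Boolean);
  const grounded = sentences.filter((s) =>
    specialistOutputs.some((out) =>
      out.toLowerCase().includes(s.toLowerCase().slice(0, 40))
    )
  );
  return sentences.length === 0 ? 1 : grounded.length / sentences.length;
}

function recordChecks(
  traceId: string,
  agent: string,
  calledTools: string[],
  finalAnswer: string,
  specialistOutputs: string[]
) {
  // Binary pass/fail for the workflow rule.
  langfuse.score({
    traceId,
    name: "tools-in-domain",
    value: toolsInDomain(agent, calledTools) ? 1 : 0,
  });
  // 0..1 grounding score for hallucination detection.
  langfuse.score({
    traceId,
    name: "faithfulness",
    value: faithfulnessScore(finalAnswer, specialistOutputs),
  });
}
```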

What Actually Got Caught

This monitoring pipeline surfaced issues that normal dashboards completely missed. Wrong attribution emerged when correct insights were credited to the wrong specialist agent. Ignored outputs appeared when the orchestration layer silently dropped specialist responses. Routing mistakes regularly sent queries to the wrong agent entirely. None of these triggered alerts in traditional infrastructure monitoring because, from a systems perspective, nothing was technically broken; the system was simply producing wrong answers.

Stack Breakdown

The observability layer runs on Langfuse for trace visibility and evaluation pipelines. LLM judges use Azure OpenAI for quality assessment. Deterministic validation logic lives in TypeScript as rule-based assertions. This combination provides both the low-level execution traces needed to debug issues and the high-level semantic evaluation needed to catch decision failures.
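For the judge layer, a hedged sketch of how the pieces might wire together: an Azure OpenAI chat completion grades a routing decision and the verdict lands back on the trace as a Langfuse score. The deployment name, rubric prompt, and JSON shape are assumptions for illustration, not the prompts from the original system.

```typescript
import { AzureOpenAI } from "openai";
import { Langfuse } from "langfuse";

const judge = new AzureOpenAI({
  endpoint: process.env.AZURE_OPENAI_ENDPOINT,
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-06-01",
});

const langfuse = new Langfuse();

async function judgeRouting(traceId: string, userQuery: string, chosenAgent: string) {
  const completion = await judge.chat.completions.create({
    model: "gpt-4o-judge", // with Azure, this is the deployment name (hypothetical here)
    messages: [
      {
        role: "system",
        content:
          "You evaluate agent routing. Reply with JSON: " +
          '{"correct": boolean, "reason": string}.',
      },
      {
        role: "user",
        content: `Query: ${userQuery}\nRouted to: ${chosenAgent}\nWas this the right specialist?`,
      },
    ],
    response_format: { type: "json_object" },
  });

  const verdict = JSON.parse(completion.choices[0].message.content ?? "{}");

  // Attach the judge's verdict to the trace so it shows up next to the raw spans.
  langfuse.score({
    traceId,
    name: "routing-correctness",
    value: verdict.correct ? 1 : 0,
    comment: verdict.reason,
  });
}
```

Keeping the verdicts as scores on the same traces is what ties the two layers together: the raw spans explain what happened, and the scores flag which runs are worth looking at.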

Key Takeaways

  • Traditional monitoring (uptime, latency) cannot detect AI decision-layer failures
  • Full trace instrumentation with Langfuse eliminates visibility blind spots in multi-agent workflows
  • Deterministic checks handle fast, cheap validation of workflow rules
  • Faithfulness tests compare final outputs against specialist responses to catch hallucinations
  • LLM judges evaluate semantic quality: routing correctness, attribution accuracy, conflict handling

The Bottom Line

If you're running multi-agent systems and only monitoring infrastructure metrics, you're flying blind. A successful response can still be completely wrong—and your dashboards will never tell you the difference until users start complaining about nonsense outputs they shouldn't have received in the first place.