How to Evaluate Any AI SRE Tool: A Framework Built From 15 Posts of Production SLIs

Your manager just forwarded you a Gartner report. Analyst recognition of AI SRE is here, on-call pressure is mounting, and the question landing in every SRE team's backlog right now is: should we buy something, build something, or wait? Ajay Devineni spent four months building the measurement layer for AI agents from scratch, and this post distills that work into a vendor evaluation framework you can actually use.

The Problem With Vendor Benchmarks

Let's be real about what's happening right now. Datadog's Bits AI SRE claims reductions in time to resolution of up to 95%. New Relic says users resolved incidents 25% faster with AI features. Both numbers are published. Both are real, but they were measured in environments that probably don't match yours. A 95% MTTR improvement on a system with clean telemetry, well-structured runbooks, and narrow incident categories is a completely different number from what you'll see with fragmented observability, complex dependency graphs, and novel failure modes.

Question 1: Does It Instrument the Reasoning Layer?

The semantic gap—the space between what an agent intended and what it executed—is invisible to traditional infrastructure APM. Existing tools observe high-level intent or low-level actions, not the correlation between them. Ask any vendor: do you track re-planning cycles per task? Can I see how many times the agent changed its approach before completing or escalating? Can I query that history after an incident? If the answer is "we log prompts and tool calls," that's Layer 1 observability—useful and necessary, but insufficient. You need Layer 3: one structured record per agent task showing the full decision sequence.
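To make "one structured record per agent task" concrete, here is a minimal sketch of what such a record might look like. The class and field names (DecisionStep, AgentTaskRecord, replan_cycles) are hypothetical and chosen for illustration; they are not taken from any vendor's schema or from the agentsre library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DecisionStep:
    """One step in an agent's decision sequence (hypothetical schema)."""
    intent: str              # what the agent planned to do
    action: str              # what it actually executed
    replanned: bool = False  # True when this step replaced an earlier plan


@dataclass
class AgentTaskRecord:
    """One structured record per agent task, queryable after an incident."""
    task_id: str
    steps: List[DecisionStep] = field(default_factory=list)
    outcome: str = "in_progress"  # set to "completed" or "escalated" at the end

    def replan_cycles(self) -> int:
        """Count how many times the agent changed its approach."""
        return sum(1 for step in self.steps if step.replanned)


# Illustrative data only: the agent tried one approach, re-planned, then escalated.
record = AgentTaskRecord(task_id="task-1")
record.steps.append(DecisionStep("restart service", "restart service"))
record.steps.append(DecisionStep("roll back deploy", "roll back deploy", replanned=True))
record.outcome = "escalated"
print(record.replan_cycles())
```

If a vendor can export something with roughly this shape, intent and action side by side plus a re-planning marker, the three questions above become simple queries rather than log archaeology.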

Question 2: What Is the Human Escalation Rate in Their Benchmark?

HER, the fraction of agent decisions that escalated to human judgment, is the most honest single metric for how autonomous a tool actually is. Here's the trap: a low MTTR number paired with a high HER means humans were doing most of the resolution work, faster only because the agent assembled context for them. That's valuable. It's not the same as autonomous remediation. Ask vendors to disclose what percentage of incidents the agent resolved without human action, and what percentage required human approval before execution.
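Taking the definition above literally, HER is escalated decisions divided by total decisions. The sketch below assumes each decision is recorded as a boolean "escalated" flag; the function name and input shape are illustrative, not a vendor or agentsre API.

```python
from typing import Iterable


def human_escalation_rate(escalated_flags: Iterable[bool]) -> float:
    """Fraction of agent decisions that escalated to human judgment."""
    flags = list(escalated_flags)
    if not flags:
        raise ValueError("no decisions recorded; HER is undefined")
    return sum(flags) / len(flags)


# Illustrative data only: three of four decisions needed a human.
decisions = [True, True, True, False]
her = human_escalation_rate(decisions)
print(f"HER = {her:.2f}")
```

A tool reporting a fast MTTR alongside an HER like this one is mostly accelerating humans, which is the distinction the vendor disclosure should expose.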

Question 3: Does It Check SLO State Before Acting?

An agent that remediates without checking your current error budget can compound a degraded situation. The Pre-Action SRE Gate requires three checks before any autonomous action: error budget remaining, AQDD state, and the agent's own HER trend. Ask vendors whether their agent checks SLO error budget before executing remediation. What happens if the error budget is critically low—does it act anyway or escalate? Can you configure the gate thresholds?
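The three checks can be sketched as a single gate function that returns either "proceed" or "escalate" along with the checks that failed. This is a sketch under stated assumptions: the threshold values are placeholders, AQDD state is treated as an opaque healthy/unhealthy flag, and "HER trend" is assumed to be the change in HER over some recent window. None of these names come from a specific product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateThresholds:
    """Configurable gate thresholds (placeholder values, not recommendations)."""
    min_error_budget: float = 0.20  # minimum fraction of error budget remaining
    max_her_trend: float = 0.05     # maximum tolerated rise in HER over the window


def pre_action_gate(
    error_budget_remaining: float,
    aqdd_healthy: bool,
    her_trend: float,
    thresholds: GateThresholds = GateThresholds(),
) -> Tuple[str, List[str]]:
    """Run all three checks; escalate if any of them fails."""
    failures: List[str] = []
    if error_budget_remaining < thresholds.min_error_budget:
        failures.append("error_budget")
    if not aqdd_healthy:
        failures.append("aqdd")
    if her_trend > thresholds.max_her_trend:
        failures.append("her_trend")
    return ("escalate" if failures else "proceed", failures)


# Illustrative call: budget is nearly exhausted, so the agent should not act alone.
decision, failed_checks = pre_action_gate(0.10, aqdd_healthy=True, her_trend=0.0)
print(decision, failed_checks)
```

Returning the list of failed checks, rather than a bare boolean, is a deliberate choice here: it gives the audit trail something specific to record when the gate blocks an action.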

Question 4: What Is the Defined Blast Radius Per Agent?

Every AI SRE tool has an implicit blast radius—the set of systems and failure modes it was trained and tested on. Good tools make this explicit. Komodor's Klaudia, for example, is trained specifically on pod crashes, failed rollouts, autoscaler friction, misconfigurations, and cascading failures in Kubernetes environments. That specificity is its blast radius—95% accuracy in that domain does not mean 95% accuracy outside it. Ask vendors what systems the agent can modify autonomously versus what's write-locked.
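One way to make a blast radius explicit on your side is an allowlist that maps each system to the actions an agent may take without approval. The sketch below is an assumption about how such a policy could be expressed; the system names, action names, and the is_permitted helper are all hypothetical.

```python
from typing import Dict, FrozenSet

# Hypothetical policy: system name -> actions the agent may take autonomously.
# An empty set means the system is write-locked for the agent.
BLAST_RADIUS: Dict[str, FrozenSet[str]] = {
    "checkout-deployment": frozenset({"restart", "rollback"}),
    "payments-db": frozenset(),
}


def is_permitted(
    system: str,
    action: str,
    policy: Dict[str, FrozenSet[str]] = BLAST_RADIUS,
) -> bool:
    """Deny by default: unknown systems and unlisted actions are not allowed."""
    return action in policy.get(system, frozenset())


print(is_permitted("checkout-deployment", "rollback"))
print(is_permitted("payments-db", "restart"))
```

The deny-by-default lookup matters more than the data structure: anything the policy does not name is treated as outside the blast radius.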

Question 5: What Is the Ownership Model When It's Wrong?

This is the question vendors like least. When an agent makes a bad remediation decision and compounds the incident, who is accountable? The vendor's SLA covers service availability, not operational consequences of agent actions. In your environment, this should map to ARO (Agent Reliability Ownership) registration: a named human owner, defined escalation path, and audit log of every gate check the agent ran before acting.
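The three parts of an ARO registration named above (a named human owner, an escalation path, and an audit log of gate checks) fit naturally into one record. This is a minimal sketch with hypothetical names; it is not the agentsre library's implementation, and the owner and escalation values are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class AroRegistration:
    """Agent Reliability Ownership record (hypothetical schema)."""
    agent_id: str
    owner: str                  # a named human, not a team alias
    escalation_path: List[str]  # ordered list of who gets paged next
    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def record_gate_check(self, check: str, passed: bool) -> None:
        """Append one gate-check result with a UTC timestamp."""
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "check": check,
            "passed": passed,
        })


# Illustrative registration; all values are placeholders.
aro = AroRegistration(
    agent_id="remediation-agent",
    owner="jane.doe",
    escalation_path=["on-call-primary", "sre-lead"],
)
aro.record_gate_check("error_budget", passed=True)
print(len(aro.audit_log))
```

In practice the log would need to be written to durable, append-only storage; the in-memory list here only shows the shape of each entry.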

Build vs Buy Decision Matrix

Given these five questions, here's how Devineni frames the decision: buy if your failure distribution maps closely to the tool's blast radius, you don't need custom SLIs beyond what the vendor provides, and they can answer all five questions with specifics. Build if your failure distribution is broad or novel, you need custom SLIs (DQR, RTD, HER, AQDD are absent from commercial tools today), or regulatory requirements mandate audit trails the vendor doesn't generate. The most realistic path for most teams? Hybrid: buy the investigation layer where vendors genuinely excel at assembling incident context faster than humans, but build the governance layer—Pre-Action Gates, ARO registration, Semantic Gap detection.

Key Takeaways

Vendor benchmarks measure tools in optimal conditions; always verify against your actual environment
Human Escalation Rate (HER) is the most honest metric for true autonomous capability
Any tool without a Pre-Action SLO gate isn't safe for production systems with burning error budgets
Blast radius must be explicit—if vendors can't define it, their accuracy numbers are marketing claims
Audit logs aren't optional in regulated environments; you literally cannot write complete postmortems without them

The Bottom Line

This framework won't make the decision for you, but it'll give you a defensible, technically grounded recommendation to bring back to your team. Whether the answer is buy, build, or hybrid, the five questions and the ToolEvaluationScore dataclass in Devineni's agentsre library give practitioners a way to stop being impressed by 95% MTTR claims and start asking what actually happens when the agent decides wrong.
