The Future of Observability Won't Be One Universal AI Agent, ClickHouse Argues

The observability vendor playbook currently looks much the same across the board: build one SRE agent, train it on your platform's telemetry, and let engineers ask questions in natural language instead of wrestling with dashboards or writing SQL. ClickHouse thinks that approach is fundamentally wrong. In a post published this week, the company argues that debugging does not converge as neatly as software vendors would prefer. Instead, it is shaped by the specific systems teams own, the way those systems fail, the runbooks they follow, and the operational scars they have accumulated over years. A database team, a frontend team, a payments team, and an infrastructure team do not investigate production issues the same way. That gap cannot be papered over with better language models or more training data on a vendor's proprietary schema.

Anil K, an engineering lead at DoorDash, described the same pattern emerging in the company's own Slack incident channels: "Engineers previously used to share links to logs or metrics. Now teams are sharing snippets of AI investigations and diving deep into it." The shift is already happening, but where it ends up depends heavily on organizational context that no universal agent can replicate.

The scale advantage agents bring to observability is real and substantial, and the post acknowledges as much. A human investigator opens a handful of dashboards, runs some queries, inspects a trace or two, and gradually narrows down the possibilities. An agent has no such cognitive constraints. While an engineer might compare two time windows, an agent can compare twenty. While a human investigates a few likely causes, an agent pursues dozens of hypotheses simultaneously, gathering evidence and eliminating dead ends continuously. The practical consequence is that investigations become broader and place significantly greater demands on the underlying infrastructure: more queries, more access to historical data, and low-latency responses across the board.
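To make the fan-out concrete, here is a minimal Python sketch of an agent comparing twenty time windows against a baseline in parallel. Everything in it is an illustrative assumption rather than anything from the ClickHouse post: the `Window` type, the `error_rate` function (a stand-in for a real telemetry query), and the 2.5x threshold are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical summary of one time window of telemetry."""
    label: str
    errors: int
    requests: int


def error_rate(window: Window) -> float:
    # Stand-in for a query against a telemetry store (assumption).
    return window.errors / window.requests


def compare_windows(baseline: Window, candidates: list[Window],
                    threshold: float = 2.5) -> list[str]:
    """Return labels of windows whose error rate is at least
    `threshold` times the baseline's error rate."""
    base = error_rate(baseline)
    # Each window is evaluated concurrently, which is why the query
    # load on the backing store grows with the breadth of the search.
    with ThreadPoolExecutor(max_workers=8) as pool:
        rates = list(pool.map(error_rate, candidates))
    return [w.label for w, r in zip(candidates, rates) if r >= threshold * base]


baseline = Window("last-week", errors=10, requests=1000)
candidates = [
    Window(f"window-{i}", errors=10 * (1 + i % 5), requests=1000)
    for i in range(20)
]
suspects = compare_windows(baseline, candidates)
```

Even in this toy form, one question from an engineer turns into twenty queries, which illustrates why broader investigations put more pressure on the storage layer.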

The Context Problem

This is where the universal agent thesis starts falling apart. Agents can only reason over the context they're given. If historical data has been discarded, important context vanishes. If telemetry was heavily sampled, critical evidence may simply not exist in the dataset. Unlike experienced engineers, who compensate for these gaps with intuition or institutional knowledge, agents are constrained entirely by the completeness and fidelity of the available data. Sushant Hiray, AI Leader at RingCentral, pointed to a problem that no vendor can solve from the outside: "RingCentral is a 25-year-old company. There is a lot of tribal knowledge within the operations team that is not documented anywhere, no matter whether we've connected all the wikis. If you don't have any data to give it, it's going to just hallucinate."

The next step in an investigation depends on far more than the telemetry sitting inside an observability platform. It depends on how a team operates, which signals it trusts, what has broken before, how ownership is divided, and where operational knowledge lives. Much of that context is scattered across runbooks, tickets, postmortems, Slack threads, internal documentation, deployment systems, and the heads of experienced engineers. A vendor can package best practices, but it cannot package the accumulated experience of every engineering team using its product. Two companies running similar technology stacks can investigate identical incidents in completely different ways because their systems, teams, and operational histories diverge.
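To put a rough number on the sampling point raised earlier in this section, the sketch below estimates how often the evidence for an incident is missing entirely. It assumes each relevant event is kept independently with the same probability, which is a simplification (real pipelines often sample per trace or per tenant). The function name and the figures are illustrative, not taken from the post.

```python
def chance_evidence_lost(sample_rate: float, relevant_events: int) -> float:
    """Probability that none of `relevant_events` survive sampling,
    assuming each event is kept independently with `sample_rate`."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if relevant_events < 0:
        raise ValueError("relevant_events must be non-negative")
    # Each event is dropped with probability (1 - sample_rate);
    # all of them are dropped with that probability raised to the count.
    return (1.0 - sample_rate) ** relevant_events


# Hypothetical incident: 50 failing requests, telemetry sampled at 1%.
lost = chance_evidence_lost(sample_rate=0.01, relevant_events=50)
```

Under these assumptions `lost` comes out at roughly 0.6, meaning that more often than not an agent would have no trace of the failure to reason over.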

Building for Thousands of Agents

DoorDash's approach reflects this reality directly. Instead of building a single all-purpose agent from the start, the company doubled down on what Anil K described as a headless platform: improving APIs and data storage while building an observability MCP that lets every engineer or team construct their own agentic workflows tailored to specific debugging use cases. This stands in direct opposition to the vendor pitch of one agent for everyone. RingCentral's Sushant Hiray made a similar point: "The way we prefer is to partner with platforms that give us enough flexibility that there's an opportunity for us to build on top of that."

ClickHouse frames this as a question of openness, by which it means more than open-source software. It means giving teams the freedom to choose the best technology at every layer: models, harnesses, tools, workflows, and interfaces. Teams should control where their data lives, how skills are developed, which MCP gateways sit in front of production systems, and how agent behavior is governed and integrated into existing engineering environments. The most successful observability platforms won't be those that force everyone into a single way of working. They will provide shared foundations on which thousands of different agents can be built, each optimized for a specific organization, team, or problem space.

Persistent Investigation Artifacts

When every team builds its own agents, running in different harnesses, using different models, and following different investigation paths, collaboration becomes fragmented. One engineer may start an investigation in an IDE, another in a notebook, another through an internal chat interface, and another via a custom incident workflow. Investigation output can't remain trapped inside transient chat sessions or private agent traces. Teams need durable, inspectable artifacts showing what was queried, what evidence was found, which hypotheses were explored, and why conclusions were reached. These investigation records become more than incident history: they grow into operational knowledge that future engineers and future agents can draw upon when similar problems arise again.
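One way to picture such an artifact is as a small, serializable record that any harness can write and any other tool can read back. The Python sketch below is a hypothetical schema, not a format proposed by ClickHouse or used by any of the companies quoted. The class names, field names, and sample values are assumptions chosen to mirror the four things the paragraph above says a record should capture.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Hypothesis:
    statement: str
    status: str  # e.g. "confirmed", "rejected", or "open"
    evidence: list[str] = field(default_factory=list)


@dataclass
class InvestigationRecord:
    incident_id: str
    queries: list[str] = field(default_factory=list)      # what was queried
    hypotheses: list[Hypothesis] = field(default_factory=list)
    conclusion: str = ""                                   # why it was decided

    def to_json(self) -> str:
        # Plain JSON keeps the record readable outside the agent that wrote it.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "InvestigationRecord":
        data = json.loads(raw)
        data["hypotheses"] = [Hypothesis(**h) for h in data["hypotheses"]]
        return cls(**data)


record = InvestigationRecord(
    incident_id="INC-EXAMPLE-1",  # made-up identifier
    queries=["SELECT count() FROM logs WHERE level = 'error'"],
    hypotheses=[
        Hypothesis("Bad deploy caused the spike", "rejected",
                   ["No deploy in the preceding hour"]),
        Hypothesis("Upstream dependency timed out", "confirmed",
                   ["Timeout errors cluster on one dependency"]),
    ],
    conclusion="Upstream timeouts explain the error spike.",
)
restored = InvestigationRecord.from_json(record.to_json())
```

Because the record round-trips through plain JSON, an investigation started in an IDE could in principle be picked up from a notebook or a chat interface, and rejected hypotheses stay visible to whoever, or whatever, investigates a similar incident later.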

The Bottom Line

The universal SRE agent vision is compelling vendor marketing, but it fundamentally misunderstands how debugging actually works in production environments shaped by years of accumulated, team-specific context. The future ClickHouse envisions, with thousands of specialized agents built around organizational runbooks, documentation, and operational knowledge, is messier than a single-vendor solution but far more aligned with how engineering teams actually investigate incidents. Whether that future arrives through open platforms that enable custom agentic workflows, or through vendors eventually building enough flexibility into their single-agent offerings, remains to be seen. Either way, the writing is on the wall for one-size-fits-all observability AI.
