The question Mendral hears most from engineers: "Why can't I just run Claude Code on my CI?" It's a fair question: both tools are coding agents, both read code and write patches, and both reason about failures. But according to the team behind Mendral, the difference is everything around the model itself. Same LLM weights. Completely different harness.

The Token Gap That Changes Everything

Claude Code wraps every message in a large payload optimized for writing software: system prompts tuned for development tasks, tool definitions for file operations and shell commands, context about your current codebase. Mendral's payload looks nothing like it. Their system prompts encode debugging patterns the team learned from years of wrangling CI at Docker and Dagger—rules like "a test that passes locally but fails intermittently on CI is almost never random; check for resource contention and shared state before blaming the code." The tool definitions expose operations that don't exist in a general coding agent: querying months of CI history, correlating failures across branches, tracing flaky tests back to transitive dependency bumps from three weeks prior. Where Claude Code sees your current file, Mendral sees billions of log lines, test execution history, and a living list of known issues in your delivery pipeline. The model is identical. Everything it sees is different—and that changes the entire output.
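
To make the contrast concrete, here is a minimal sketch of what a history-aware tool definition could look like on the harness side. The struct shape and the tool names (`query_test_history`, `correlate_failures`) are illustrative assumptions, not Mendral's published API:

```go
package harness

// ToolDef is one plausible shape for the tool definitions a harness
// sends alongside every message. Fields and names are hypothetical.
type ToolDef struct {
	Name        string
	Description string
	InputSchema map[string]any // JSON Schema for the tool's arguments
}

// A general coding agent ships file and shell tools; a CI-debugging
// harness ships history-aware operations instead.
var ciTools = []ToolDef{
	{
		Name:        "query_test_history",
		Description: "Pass/fail history for a test over a time window, across branches.",
		InputSchema: map[string]any{
			"type": "object",
			"properties": map[string]any{
				"test_name": map[string]any{"type": "string"},
				"days":      map[string]any{"type": "integer", "default": 90},
			},
			"required": []string{"test_name"},
		},
	},
	{
		Name:        "correlate_failures",
		Description: "Find failures on other branches sharing a signature with this one.",
		InputSchema: map[string]any{ /* elided */ },
	},
}
```

Swap the tool list and the same model stops reasoning about files and starts reasoning about pipelines.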

Built on Go, Sandboxed with Firecracker

Mendral isn't a wrapper around an LLM API. The agent loop runs on their Go backend with two categories of tools: native functions for fast, deterministic operations like querying ClickHouse for log analysis or fetching GitHub metadata, and sandboxed environments for riskier operations like cloning repos or applying patches. Those sandboxes run inside Firecracker microVMs with hardware-level isolation between tenants, booting in under 125ms. This architecture enables something crucial for CI work: suspend and resume. When the agent pushes a fix and needs to wait hours for the pipeline to complete, the sandbox suspends—no idle compute burning. When CI finishes, it resumes in under 25ms with full state preserved. Without this capability, you'd either waste resources or lose your entire execution context mid-investigation.
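
A rough sketch of how that two-category split and the suspend/resume hook might look in Go. The `Tool` and `MicroVM` interfaces below are assumptions for illustration, not Mendral's actual internals:

```go
package harness

import "context"

// Tool is the single surface the agent loop calls; both categories
// implement it.
type Tool interface {
	Call(ctx context.Context, args []byte) ([]byte, error)
}

// NativeTool wraps a fast, deterministic in-process function, e.g.
// a ClickHouse query or a GitHub metadata fetch.
type NativeTool struct {
	Fn func(ctx context.Context, args []byte) ([]byte, error)
}

func (t NativeTool) Call(ctx context.Context, args []byte) ([]byte, error) {
	return t.Fn(ctx, args)
}

// MicroVM abstracts the Firecracker sandbox. Suspend snapshots state
// and releases compute; Resume restores it when CI finishes.
type MicroVM interface {
	Exec(ctx context.Context, args []byte) ([]byte, error)
	Suspend(ctx context.Context) error
	Resume(ctx context.Context) error
}

// SandboxedTool proxies riskier operations (cloning repos, applying
// patches) into an isolated microVM.
type SandboxedTool struct {
	VM MicroVM
}

func (t SandboxedTool) Call(ctx context.Context, args []byte) ([]byte, error) {
	return t.VM.Exec(ctx, args)
}
```

When the agent pushes a fix and waits on a pipeline, the harness calls Suspend; Resume picks the investigation back up with state intact.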

Processing Billions of Log Lines Weekly

The agent is only as good as what it can see, and Mendral built their ingestion pipeline to process billions of CI log lines per week into ClickHouse, compressed at 35:1 and queryable in milliseconds. A typical investigation scans 335K rows across three or more queries; at P95, that climbs to 940 million rows. The agent writes its own SQL, with no predefined query library, and can pull a failing test's pass rate over 90 days, find the commit that introduced the regression, check whether that same test is flaky on other branches, and cross-reference infrastructure conditions at execution time.

Static analysis runs on every tool call the agent makes, inspecting both input and output. When the log query tool returns sparse results, say 3 data points where you'd normally expect 50, the analysis layer dynamically injects guidance: "Log coverage for this workflow appears incomplete. Consider expanding the window or checking ingestion delay metrics before concluding." This keeps prompts focused on reasoning while the tool layer handles domain guardrails.
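
As a sketch of both halves, here is the kind of ClickHouse SQL the agent might write for the pass-rate question, and one static check that could produce the injected guidance. The table name, schema, and 10% sparsity threshold are all assumptions:

```go
package harness

import "fmt"

// An example of the SQL the agent composes itself; test_runs and its
// columns are a hypothetical schema. {test:String} is ClickHouse's
// named-parameter syntax.
const passRateQuery = `
SELECT toDate(started_at) AS day,
       countIf(status = 'pass') / count() AS pass_rate
FROM test_runs
WHERE test_name = {test:String}
  AND started_at > now() - INTERVAL 90 DAY
GROUP BY day
ORDER BY day`

// checkLogCoverage is one static check of the kind that runs on every
// tool call's output. Under ~10% of the expected data points suggests
// an ingestion gap rather than a genuinely quiet workflow.
func checkLogCoverage(workflow string, expected, got int) []string {
	if expected > 0 && got*10 < expected {
		return []string{fmt.Sprintf(
			"Log coverage for workflow %q appears incomplete (%d of ~%d expected data points). "+
				"Consider expanding the window or checking ingestion delay metrics before concluding.",
			workflow, got, expected)}
	}
	return nil
}
```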

Insights: Pattern Recognition Built From Every Investigation

Mendral maintains an insights system—a continuously updated list of active issues in your delivery pipeline. Each anomaly becomes an insight: a flaky test, a CI incident, a security alert, a performance regression. Say the agent opens two separate insights for TestUserAuthFlow failures and TestSessionExpiry timeouts. After three investigations, it merges them—both trace back to the Redis connection pooling change in January. When someone fixes the issue outside Mendral, the agent detects the resolution and auto-closes the insight. If the problem recurs, it reopens with full history intact. After a month on your codebase, the agent knows that TestUserAuthFlow has been flaky since that Redis change, that builds fail Tuesday mornings because of scheduled jobs competing for DB connections, and that the last three @testing-library bumps each broke two E2E suites. Pattern recognition built from every investigation run—your specific codebase's institutional knowledge, encoded automatically.
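
A minimal sketch of that lifecycle as a data structure, assuming a simple state machine; the states, fields, and merge rule are illustrative, not Mendral's schema:

```go
package harness

import "time"

// InsightState mirrors the lifecycle described above.
type InsightState int

const (
	Open     InsightState = iota
	Merged   // folded into another insight with the same root cause
	Resolved // fix detected outside Mendral; auto-closed
	Reopened // recurrence detected; prior history intact
)

type Event struct {
	At   time.Time
	Note string // e.g. "auto-closed: fix detected on main"
}

type Insight struct {
	ID         string
	Title      string // e.g. "TestUserAuthFlow flaky since Redis pooling change"
	State      InsightState
	MergedInto string  // set when two insights trace to one root cause
	History    []Event // survives resolve/reopen cycles
}

// merge folds a duplicate insight into a canonical one, keeping both
// histories so a later reopen has full context.
func merge(dup, canonical *Insight) {
	dup.State = Merged
	dup.MergedInto = canonical.ID
	canonical.History = append(canonical.History, dup.History...)
}
```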

A Fleet Using All Three Claude Tiers

Internally, Mendral runs as a fleet of agents matched to cognitive demands: Haiku handles log parsing and data extraction (fast and cheap; thousands run daily), Sonnet tackles evidence collection, SQL queries, and deduplication (needs reasoning but not deep analysis), and Opus handles root cause analysis and fix writing (complex multi-step reasoning required). Using Opus for log parsing wastes tokens. Using Haiku for root cause analysis produces worse results. Security boundaries are enforced at the tool level—the agent can't delete branches, force-push, close PRs it didn't open, or modify CI config destructively. Not a prompt instruction the LLM could reason around, but hard enforcement at the harness boundary.
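
Both ideas reduce to small, deliberately boring code: routing is a lookup, and the security boundary is an allowlist with no destructive code path. A sketch with hypothetical task names and placeholder model identifiers:

```go
package harness

import (
	"errors"
	"fmt"
)

// modelFor routes a task class to a model tier; the task names and
// model strings here are placeholders.
func modelFor(task string) string {
	switch task {
	case "log_parsing", "data_extraction":
		return "haiku" // fast and cheap; thousands of calls daily
	case "evidence_collection", "sql_query", "dedup":
		return "sonnet" // reasoning without deep analysis
	case "root_cause_analysis", "fix_writing":
		return "opus" // complex multi-step reasoning
	default:
		return "sonnet"
	}
}

var ErrForbidden = errors.New("operation not permitted by harness policy")

// allowedGitOps is hard enforcement at the harness boundary: delete,
// force-push, and CI-config mutations simply have no code path.
var allowedGitOps = map[string]bool{
	"clone": true, "fetch": true, "push_branch": true, "open_pr": true,
}

func runGitOp(op string) error {
	if !allowedGitOps[op] {
		return fmt.Errorf("%s: %w", op, ErrForbidden)
	}
	// ... dispatch into the sandboxed tool
	return nil
}
```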

The Bottom Line

Claude Code on your CI will give you answers. Mendral gives you context—and in debugging, context is everything. Same model weights, radically different results, because the infrastructure and data layer around an AI agent matter as much as the model itself.