How Cloudflare Built an AI Agent to Query a Billion Events Per Second in Plain English

Cloudflare processes over one billion events every second across 330+ cities in 120 countries. For years, getting answers from that mountain of data was a nightmare of tribal knowledge and credential hunting. Their solution: build their own unified data platform called Town Lake, then slap an AI agent on top called Skipper that lets anyone ask questions in plain English and get correct, auditable answers back in seconds.

The Data Sprawl Problem

Before Town Lake, Cloudflare's data lived in dozens of production databases, ClickHouse clusters, Kafka streams, Google Cloud buckets, BigQuery datasets, and a long tail of pipelines. A simple question like 'How many domains that signed up today are in the Top 100 by traffic?' required knowing which system to ask, what credentials to use, what query language to write, and whether your data was fresh or seven days stale. Their analytics pipeline downsampled to handle 700M+ events per second—fine for dashboards but completely wrong when you're computing someone's billable usage.

Town Lake Architecture

Town Lake is a data lakehouse: a query engine reading from object storage with a metadata layer that makes storage behave like a database. They chose Apache Trino as the query engine—a single SQL query can join Postgres, ClickHouse, and Iceberg tables on R2 without materializing intermediate results. The platform includes DataHub for cataloging every table, column, owner, lineage edge, and glossary term; Lifeguard for access control via D1 with dynamic permission rendering over HTTP; Skimmer for continuous PII detection using Workers AI that scans columns and runs agentic second passes when needed; and Transformer built on Workflows for ELT processing with DAG definitions stored in YAML frontmatter.

Default-Closed Governance

Here's where it gets interesting: tables are inaccessible until reviewed. When a new database connects to Trino or a table is created, Skimmer scans it, classifies columns, and registers it as pending review. Users can't query unreviewed tables—not 'permission denied,' but an actual prompt to request access with Skipper suggesting the right RBAC group. PII is opt-in per session: by default, Trino redacts sensitive columns before they hit your screen. If you have a legitimate need for raw PII (fraud investigation, say), you flip the bit and every query is logged. Non-reviewed columns are hidden from DESCRIBE, SHOW COLUMNS, and SELECT *, so new unreviewed columns don't break existing dashboards.

Skipper: The AI Agent

Skipper goes from natural-language question to validated answer grounded in actual data, code, and institutional knowledge. Built entirely on Cloudflare's developer platform—Workers, Workers AI, Durable Objects, D1, R2, Workflows, KV—it finds the right tables via DataHub search, pulls schemas and lineage, writes SQL, submits to Trino, polls for results, and shows you a table or chart. If something looks wrong (join produced zero rows), Skipper investigates, adjusts, and tries again in closed-loop reasoning. They expose it both via Workers AI backend and as an MCP server for IDE integration.

The Context Problem

An LLM with a SQL prompt and list of table names will hallucinate joins and confidently produce completely wrong numbers. Cloudflare learned this the hard way. Their fix: multiple context layers. Layer 1 provides schema, types, primary keys, foreign keys, and commonly-joined tables based on historical query patterns via DataHub search tools. Layer 2 surfaces human annotations from owning teams. Layer 3 ingests actual SQL that produces each table—a customer_type column looks identical either way, but the SQL reveals it defaults to 'paygo' when Salesforce data is missing. Layer 4 offers curated data model pages as MCP resources. Layer 5 provides runtime introspection: DESCRIBE table, SELECT DISTINCT LIMIT 20—expensive but essential as a safety net.

Code Mode for MCP

For their MCP server implementation, they use something called Code Mode instead of defining 30 individual tools. They expose two: search and execute. The model writes JavaScript snippets that call the entire toolset programmatically in a single sandboxed Dynamic Worker isolate via WorkerLoader. One round-trip executes complex multi-step workflows instead of five separate chatty model calls. 'The model gets to express complex multi-step workflows in a single round-trip, in a language it already knows extremely well,' they noted. Faster, cheaper, and auditable as code.

Key Takeaways

Town Lake serves 91,760 billing queries from 324 employees; billing accounts for 53% of all queries
Everything built on Cloudflare's own products: R2, Workers, Access, Workflows, Durable Objects, D1, KV
Default-closed governance model catches PII automatically and requires human review before tables are queryable
Five context layers prevent hallucinations: schema metadata, human annotations, code-derived knowledge, curated data models, runtime introspection

The Bottom Line

Cloudflare eating their own dog food at this scale is the real story here. Most companies would have spun up Snowflake or Databricks and called it a day. Instead they built a production-grade lakehouse on R2 and Trino with an AI agent that actually understands context—not just schema but lineage, code, and business logic. That's the kind of internal tooling move that separates infrastructure teams from platform teams.

> How Cloudflare Built an AI Agent to Query a Billion Events Per Second in Plain English