Netflix Engineer Builds Token-Pruning Tool That Saved Users $700K, Then Open-Sourced It

When Tejas Chopra got hit with a $287 bill from Claude Sonnet for what should have been routine debugging and refactoring work, he didn't just pay up—he went to war against token waste. The Netflix senior engineer has developed Project Headroom, an open-source application that compresses agent instructions before they reach the LLM, stripping out redundant boilerplate like verbose JSON schemas, nested API response templates, and repeated database columns.

The $287 Spark

"A lot of our users are people who have been really burned by token costs, more than anything else," Chopra said in his presentation at the Open Source Summit last week. His analysis revealed that as much as 90% of tokens fed to LLMs are redundant machine metadata rather than meaningful human instructions. "This isn't prose. This isn't creative writing. This is compressible data masquerading as text." Since releasing Headroom in January, users have collectively saved an estimated $700,000 and freed up 200 billion tokens for other operations—a significant chunk of change for a project still sitting at v0.22.
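As a rough sanity check, the two estimates quoted above imply an average price for each token avoided. The snippet below is plain arithmetic on those figures; the variable names are illustrative, and the result is a blended average rather than any provider's list price.

```python
# Figures quoted in the talk; both are estimates.
dollars_saved = 700_000
tokens_saved = 200_000_000_000  # 200 billion

# Implied average price per million tokens that never reached the model.
price_per_million = dollars_saved / (tokens_saved / 1_000_000)
print(f"${price_per_million:.2f} per million tokens")  # $3.50 per million tokens
```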

How It Works

Headroom runs as a proxy on port 8787, intercepting LLM calls from tools launched through its CLI wrapper (e.g., "headroom wrap codex"). It uses several specialized compressors: an Abstract Syntax Tree compressor for code, JSON and DOM compressors for web boilerplate, and statistical "squashers" that learn from feedback loops to avoid over- or under-compression. A key component called CacheAligner ships only changed information, which prevents cache misses when session variables such as dates or UUIDs shift. As Chopra warned, "If your system prompt contains a date field... you are effectively getting a cache miss every single time. That will blow up your costs."
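The cache-miss problem Chopra describes can be illustrated with a minimal sketch. This is not Headroom's CacheAligner code: the function name, the regex patterns, and the placeholder format are all assumptions made for illustration, and the sketch assumes the provider's prompt cache keys on an unchanged prompt prefix. The idea is to swap volatile values for placeholders so the bulk of the prompt stays byte-identical between sessions, and to send the changing values separately at the end.

```python
import re

# Hypothetical patterns for values that change from session to session.
# UUIDs are listed first so they are replaced before the date pattern runs.
VOLATILE = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def align_for_cache(system_prompt: str) -> tuple[str, str]:
    """Split a prompt into a stable body and a small volatile tail.

    Volatile values become numbered placeholders in the body; the tail
    maps each placeholder back to its real value and is sent last.
    """
    bindings: list[tuple[str, str]] = []

    def make_replacer(kind: str):
        def replace(match: re.Match) -> str:
            placeholder = f"<{kind}_{len(bindings)}>"
            bindings.append((placeholder, match.group(0)))
            return placeholder
        return replace

    stable = system_prompt
    for kind, pattern in VOLATILE.items():
        stable = pattern.sub(make_replacer(kind), stable)

    tail = "\n".join(f"{name} = {value}" for name, value in bindings)
    return stable, tail


monday = "Today is 2026-01-15. Session 123e4567-e89b-12d3-a456-426614174000. Be concise."
tuesday = "Today is 2026-01-16. Session 00000000-0000-4000-8000-000000000001. Be concise."

stable_a, tail_a = align_for_cache(monday)
stable_b, tail_b = align_for_cache(tuesday)
print(stable_a == stable_b)  # True: the cacheable part no longer changes
```

Only the short tail differs between the two days, so a prefix-based cache can reuse everything before it.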

Reversible Compression Is the Killer Feature

Headroom's Compress Cache and Retrieve (CCR) module stores original prompts in Redis or SQLite and places markers where compression occurred, allowing the LLM to call back into the uncompressed data via an MCP tool if needed. This reversibility sets it apart from commercial token-trimming services like the Y Combinator-funded Token Company, which Chopra acknowledged but positioned as complementary rather than competing.
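The store-and-marker pattern can be sketched in a few lines. This is a simplified illustration, not Headroom's CCR module: the class name, the table layout, and the `[[ccr:...]]` marker syntax are hypothetical, and the summary is supplied by the caller rather than produced by a real compressor. It uses SQLite from the Python standard library; a `retrieve` method like this is the kind of lookup an MCP tool could expose to the model.

```python
import hashlib
import re
import sqlite3
from typing import Optional

# Hypothetical marker syntax for "compressed here, original available".
MARKER = re.compile(r"\[\[ccr:([0-9a-f]{16})\]\]")


class ReversibleStore:
    """Keeps originals in SQLite so compressed spans can be expanded later."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS originals "
            "(marker_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )

    def compress(self, original: str, summary: str) -> str:
        """Store the original and return the summary plus a retrieval marker."""
        marker_id = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self.db.execute(
            "INSERT OR REPLACE INTO originals (marker_id, body) VALUES (?, ?)",
            (marker_id, original),
        )
        self.db.commit()
        return f"{summary} [[ccr:{marker_id}]]"

    def retrieve(self, marker_id: str) -> Optional[str]:
        """Return the original text for a marker, or None if unknown."""
        row = self.db.execute(
            "SELECT body FROM originals WHERE marker_id = ?", (marker_id,)
        ).fetchone()
        return row[0] if row else None


store = ReversibleStore()
log = "\n".join(f"GET /health 200 {i}ms" for i in range(500))
compressed = store.compress(log, "500 health checks, all HTTP 200")

# Later, the model asks for the original behind the marker it saw.
marker_id = MARKER.search(compressed).group(1)
restored = store.retrieve(marker_id)
print(restored == log)  # True
```

Keying the marker on a hash of the original means identical content maps to the same entry, which is one possible design choice among several.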

Context Rot Is Real

Stanford researchers have confirmed that LLMs pay disproportionate attention to the beginning and end of context windows while ignoring middle sections, a phenomenon Chroma dubbed "context rot." Across 18 different models, Chroma's research found that performance becomes increasingly unreliable as input length grows. Compression also pays off in latency: Chopra relayed how one Headroom user forked the project for voice-activated applications, where silence itself generates tokens, and used compression to hit the sub-200ms latency thresholds required for natural-feeling interactions.

What's Next

Headroom currently boasts 2,000 GitHub stars and over 120 forks. Chopra acknowledged testing accuracy remains a priority area for improvement—though CCR's storage of original prompts should make validation easier. Future plans include specialized compressors for financial data and tackling audio, image, and video (one user has already forked the project for video parsing). A related tool called Headlight will track token origin across multi-model workflows.

Key Takeaways

Project Headroom has saved users an estimated $700K since January 2026 by pruning redundant tokens before LLM processing
The tool excels at compressing server logs (90% reducible), MCP tool outputs (70% JSON bloat), and repeated database/file metadata
CacheAligner prevents costly cache misses from session-variable changes like timestamps or UUIDs
CCR enables reversible compression, letting LLMs retrieve original context on demand via MCP tools
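To make the "repeated database/file metadata" point concrete, here is a minimal sketch of one well-known way to remove repeated column names from JSON records. It is offered as an illustration only and is not taken from Headroom's JSON compressor; the function names and the output layout are assumptions. Each record repeats every key, so storing the keys once and the values as rows removes that repetition without losing information.

```python
import json


def to_columnar(rows: list[dict]) -> dict:
    """Rewrite same-shaped records so each column name appears only once."""
    if not rows:
        return {"columns": [], "rows": []}
    columns = list(rows[0].keys())
    if any(set(row) != set(columns) for row in rows):
        raise ValueError("all rows must have the same keys")
    return {
        "columns": columns,
        "rows": [[row[col] for col in columns] for row in rows],
    }


def from_columnar(table: dict) -> list[dict]:
    """Invert to_columnar, rebuilding one dict per row."""
    return [dict(zip(table["columns"], values)) for values in table["rows"]]


records = [
    {"id": i, "status": "active", "region": "us-east-1"} for i in range(100)
]
chars_before = len(json.dumps(records))
chars_after = len(json.dumps(to_columnar(records)))
print(chars_before, chars_after)  # the columnar form is shorter
```

The comparison counts characters, not tokens, so it only approximates the saving an LLM tokenizer would see.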

The Bottom Line

Chopra built this because he got burned, and that's exactly the kind of insider pain that produces real solutions. With enterprise AI bills climbing to levels that COOs at Uber and Microsoft have publicly lamented, Headroom's $700K in savings across a few thousand users suggests token pruning is about to become as standard in dev pipelines as linting. The question isn't whether you'll need this. It's how fast you can fork it.
