Anthropic's Prompt Caching Docs Drop: Here's How to Actually Cut Your Claude API Bill

Anthropic just published comprehensive documentation for Prompt Caching on the Claude Platform, and if you're running any serious workload against their API, you need to understand this feature inside and out. The core pitch is straightforward: cache your prompt prefixes once, then reuse them across requests at a fraction of the cost. But dig into the docs and you'll find there's real nuance here—get it wrong and you're burning tokens on every request with nothing to show for it.

Two Ways to Cache

The platform supports two caching approaches. Automatic caching lets you drop a single cache_control field at the top level of your request, and the system automatically finds the last cacheable block as your conversation grows. This is the path of least resistance for multi-turn conversations where message history keeps expanding. The alternative is explicit breakpoints—placing cache_control directly on individual content blocks for fine-grained control over exactly what gets cached and when. You can define up to four separate breakpoints if you need different caching frequencies for tools versus system prompts versus conversation context. The pricing structure follows a predictable multiplier system. Cache writes (the initial save) run at 1.25x your base input token price for the standard 5-minute TTL, or 2x for the extended 1-hour option. But here's where it gets interesting: cache reads are only 0.1x base input pricing. That asymmetry is the entire value proposition. For prompts with heavy static context—tool definitions, system instructions, background documents—you're paying a premium once and then retrieving that content cheaply forever (well, for up to an hour).

The Mistake Everyone Makes

The documentation explicitly calls out a trap that trips up developers constantly: placing cache_control on blocks that change every request. If your final block contains a timestamp or per-request user message, the prefix hash changes every time, and the system walks backward looking for prior writes—but finds nothing because you never wrote an entry at any of those earlier positions. The result? Fresh cache write on every single request, zero reads, and you're paying extra for the privilege. The fix is placing your breakpoint on the last block that actually stays consistent across requests. The lookback mechanism compounds this issue. When a cache miss occurs, the system walks backward up to 20 blocks looking for a prior write entry. If you've added more than 20 blocks since your last successful write—common in long-running conversations—the old entry falls outside the window and you're back to processing everything fresh. The docs recommend adding multiple breakpoints from the start if you anticipate hitting this ceiling, rather than trying to retrofit caching after the fact.

Platform Availability and Minimums

Automatic caching works on the Claude API, Claude Platform on AWS, and Microsoft Foundry—but notably absent are Bedrock and Google Cloud, which don't support the automatic approach (though they do support explicit breakpoints). The minimum cacheable prompt length varies significantly by model: 512 tokens for Fable 5 and Mythos 5, scaling up to 4,096 tokens for Opus 4.6, Opus 4.5, and Haiku 4.5. Anything shorter gets processed without caching, silently, with no error returned.

What You Can Actually Cache

Most request components are fair game: tool definitions, system messages, text content in both user and assistant turns, images and documents, and tool use/results. Thinking blocks are the notable exception—they can't be cached directly but get pulled in as part of the overall request when you pass tool results back to continue conversations. On Opus 4.5+ and Sonnet 4.6+, thinking blocks are preserved by default even when non-tool-result content arrives, keeping your cache valid. Older models strip everything after thinking blocks and invalidate downstream cached content.

Key Takeaways

Place cache_control on the LAST block that stays consistent across requests—never on per-request varying content
Cache reads cost 10% of base input pricing; writes run 25% (5m TTL) or 100% more (1h TTL)
Automatic caching is easiest for growing conversations but doesn't work on Bedrock or Google Cloud
Use multiple breakpoints if your conversation exceeds 20 blocks between cache points
Verify caching worked by checking response fields: both cache_creation_input_tokens and cache_read_input_tokens at zero means no caching occurred

> Anthropic's Prompt Caching Docs Drop: Here's How to Actually Cut Your Claude API Bill

Two Ways to Cache

The Mistake Everyone Makes

Platform Availability and Minimums

What You Can Actually Cache

Key Takeaways

> RELATED DISPATCHES