Claude Code's Prompt Caching Explained: What Kills Your Cache and What Keeps It Warm

Claude Code just became a lot more transparent about how it keeps your coding sessions snappy and your API bills from exploding. Anthropic's official documentation for its CLI tool now spells out the internals of its prompt caching system—and it's a masterclass in understanding where performance comes from, and more importantly, where it goes when you accidentally nuke your cache mid-task.

The Layered Cache Architecture

Here's how it works: every time you send a message in Claude Code, it makes a fresh API request. The model has no memory between calls, so the entire context gets re-sent—system prompt, project files, conversation history, everything. Without caching, that's a massive amount of reprocessing on every single turn. With caching, the API matches the prefix (the start) of each new request against recently processed content and only computes what changed. Claude Code organizes that prefix into three distinct layers: system prompt (core instructions and tool definitions), project context (CLAUDE.md and auto-memory), and conversation history (your messages, Claude's responses, tool results). The ordering isn't arbitrary—content that rarely changes sits at the top. When you edit a file or send a new message, only the bottom layer shifts while everything above it stays cached. This layered approach means a change to your conversation doesn't touch the system prompt or project context layers. But flip the system prompt? Everything below it gets recomputed because all that content now sits behind a different prefix. The match is exact—change anything in the chain, and you break the cache for everything downstream.

What Actually Breaks Your Cache

Not all cache invalidations are obvious. Switching models obviously starts fresh (each model has its own cache), but did you know that enabling fast mode mid-session also forces a full rebuild? The fast mode header becomes part of the cache key, so toggling it after hours of work means paying for uncached input tokens at fast mode rates—which is exactly why turning it on at session start costs less than flipping it deep into a long conversation. MCP server connections are another sneaky one. When tools load directly into the prefix (the default behavior), connecting or disconnecting an MCP server invalidates everything below it. The fun part? This can happen without any action from you—a stdio server's process crashes, an HTTP session expires, or a server auto-reconnects after a blip. Your cache is toast, and you're stuck with a slower, pricier next turn. Plan mode users take note: the opusplan model setting resolves to Opus during planning and Sonnet during execution, which means every toggle between modes is technically a model switch starting a fresh cache. Automatic safety fallback on Fable 5 has the same effect—the session continues on default Opus after a re-run.

The TTL Game: Five Minutes vs. One Hour

Cache entries don't live forever. The API offers two time-to-live options: five minutes and one hour (billed at higher write rates). Claude Code picks automatically based on how you authenticate. On a Claude subscription? You get the one-hour TTL included in your plan, so longer breaks between coding sessions won't force a full rebuild when you come back. API key users aren't so lucky—the default is five minutes, which means stepping away for coffee might cost you. Set ENABLE_PROMPT_CACHING_1H=1 to opt into the extended TTL on providers like Bedrock, Vertex AI, or direct API access. On Bedrock specifically, one-hour TTL availability varies by model and region, so check the AWS docs if cache token counts stay stubbornly at zero.

Cache Scope: It's Not What You Think

Here's something that'll twist your brain—Claude Code's cache is effectively scoped to one machine and directory. The system prompt embeds your working directory, platform, shell, OS version, and auto-memory paths. Two sessions in different directories build completely different prefixes and miss each other's cache entirely. Even git worktrees of the same repository have their own caches because each worktree has its own working directory. Parallel sessions in the same directory? They'll read each other's cache if the prefix matches. Sequential sessions share a prefix only when the git status snapshot at startup is identical—branch name and recent commits are baked into the system prompt too.

Key Takeaways

Fast mode toggle mid-session forces a full rebuild; enable it at session start to minimize costs
MCP server disconnections (planned or accidental) invalidate cache when tools load directly into prefix
Plan mode with opusplan = model switch on every toggle = fresh cache each time
Set ENABLE_PROMPT_CACHING_1H=1 if you're burning through context and stepping away between sessions

The Bottom Line

Prompt caching is everything in Claude Code—it's the difference between a responsive coding assistant and a sluggish, overpriced one. Understanding which actions invalidate your cache isn't just trivia; it's operational knowledge that directly impacts how much you pay and how fast Claude responds when you're deep in a complex refactor. Bookmark that documentation page. You'll be referring back to it more than you'd expect.

> Claude Code's Prompt Caching Explained: What Kills Your Cache and What Keeps It Warm