Skills load new system context. Peer-agent workflows fork the prefix. Browser automation adds volatile tool output. Compression rewrites history. Model switching fragments cache namespaces. If you're building a capable AI agent and your prompt cache hit rate is lower than expected, this is probably why—and Yafei Lee at OpenClacky has spent two years proving it.
Generation 1: RAG Everything (2024–Early 2025)
OpenClacky's first architecture was textbook Retrieval-Augmented Generation. The team embedded user codebases, documentation, and conversation history into a vector store, ran every query through hybrid retrieval, re-ranking, and query rewriting before the LLM saw anything. It sounded right. It wasn't. Costs never stopped climbing because every codebase update required re-embedding, real-time sync was unreliable, and the vector store lagged behind actual code at all times. They were paying more to search an index that grew increasingly wrong. The recall problem compounded faster. Ninety percent retrieval accuracy sounds decent until you realize it means one in ten queries returns the wrong context—and for a multi-step agent, a wrong file in step two produces a wrong edit in step three and a wasted retry in step four. Lee estimated 97% minimum recall was needed to be net-positive; OpenClacky was nowhere close. On top of that, every extra piece between the user and the LLM is where latency hides and errors cascade. The vector database was one more component that could crash, lag, or return garbage. The team killed RAG entirely for local repos. No embeddings, no vector store, no retrieval pipeline. If the agent needs context, it reads files directly or searches with grep. If documentation needs to be accessible to an agent, make it readable on a website—don't shred it into embeddings. The lesson: the model is already smart enough to work with raw text.
Generation 2: Multi-Agent Orchestration (Mid-2025)
The second attempt borrowed from SWEBench leaderboard strategies: a Planner agent, Coder agent, Reviewer agent, and Tester agent coordinated through a message bus with role-specific prompts. The team got decent benchmark scores. The product was terrible. Every agent handoff was a cache miss—each sub-agent had its own system prompt and cache namespace, so passing context between them serialized state into messages and wiped the receiving agent's cache prefix. Useful context was lost at every boundary. A task one agent could finish in four minutes took fourteen with four agents coordinating. They waited for each other, re-read context the previous agent had already processed, and occasionally contradicted each other's decisions. Cost ran six times higher: four separate cache namespaces, four system prompts, constant serialization overhead. Lee's assessment cuts deep: 'The divide work among specialists intuition that works for human teams doesn't transfer to LLMs. A single frontier model is already a generalist. You're not dividing labor—you're multiplying overhead.' Debugging was a nightmare—tracing which agent caused a final output error meant untangling ambiguous instructions, misinterpreted directions, and missed bugs across the full pipeline. SWEBench scores didn't predict user satisfaction. The team could tune the multi-agent pipeline to pass specific benchmarks while failing at the modes of failure that actually annoyed real users: slow iteration, losing context across handoffs, inconsistent code style. They killed role-based multi-agent orchestration. One main agent, one conversation, one cache namespace. Sub-agents survived only as isolated skill execution contexts invoked through a single stable tool.
Generation 3: The Cache-First Architecture
Two failed generations produced the same conclusion and started generation three from an uncomfortable question: what if everything was optimized around a single agent's cache hit rate—not as a cost hack, but as an architectural principle? High cache hits mean consistent context, faster responses, lower costs. Every subsequent decision served that goal.
Decision 1: Double Cache Markers
Prompt caching works by prefix matching—the LLM provider stores a hash of the message prefix and reuses it on shared prefixes. The naive approach places one cache_control marker on the last message, which breaks in three ways: history grows monotonically so appending a new message shifts the old marker's position (cache miss), tool call retries discard the last message along with its marker (cache miss), and mid-session model switches move markers unnecessarily (cache miss). The fix visible across OpenClacky's git log progressed through incremental patches—'fix: cache,' 'fix: prompt cache,' 'feat: prompt cache works fine'—before arriving at a structural solution: two markers instead of one. Every turn, the system marks two consecutive messages. On the next turn, the provider matches the marker on msg_C and hits everything before it (system prompt plus tools plus full history minus the last message). A new marker is placed on msg_D for the following turn. This rolling double buffer means at any moment there are two breakpoints—one being read from the previous turn and one being written at the current tail. There's never a moment where both buffers are invalid simultaneously. Exactly two markers covers the failure boundary; three would land further back in the prefix writing a segment that will never be read independently, adding cost for no benefit. The double-marker approach also survives tool call retries. When the model retries a failed tool call or the user hits Ctrl-C, the last message gets discarded—but with two markers, the second-to-last marker usually survives, so single-step rollback still hits cache. The system explicitly skips any message tagged system_injected: true from marker selection; these ephemeral messages won't exist in the same form next turn.
Decision 2: Frozen System Prompt
OpenClacky's engineering discipline is brutal and simple: the agent's system prompt is built once at session start, then byte-frozen. Any requirement to put dynamic information in the system prompt gets redirected elsewhere. The reason is foundational—if the system prompt changes, every subsequent cache entry is invalidated with no partial recovery possible. But four kinds of information naturally want to live there: current date, working directory, and OS; current model ID; newly installed skills; and updated user preferences. The solution is a [session context] block injected as a regular user message in the conversation history rather than part of the system prompt. This message carries current date, model ID, OS, and working directory, tagged system_injected: true so cache markers skip it (Decision 1), it doesn't count as a real user turn, and compression discards it cleanly. Injection is date-gated: once per day plus once on model switch. Most sessions see exactly one injection. Skills are rendered into the system prompt at session start only, then frozen—a skill installed mid-session won't appear until next session. The team accepts this friction because skill installation is low-frequency while cache hits are per-turn.
Decision 3: One Meta-Tool for Skills and Sub-Agents
invoke_skill handles more work than any other tool in OpenClacky's toolkit—skill hot-loading, sub-agent architecture, memory recall, and skill self-evolution, all described in under 200 tokens of system prompt. It spawns a sub-agent with its own conversation history but the same tools, returning only invoke_skill → result to the main agent when finished. All intermediate steps stay isolated in the sub-agent's session, which matters for caching: a code review skill that reads dozens of files and produces long analysis would otherwise inflate the main agent's history, triggering compression earlier and increasing costs. For extensibility, adding a new capability means dropping a SKILL.md file into ~/.clacky/skills/. The invoke_skill tool is always present in the schema without needing to know about specific skills at compile time. This single tool replaces what would otherwise be roughly twenty specialized tools—each one bloating the schema and increasing cache invalidation surface area.
Decision 4: Exactly 16 Tools
Tool schemas sit right after the system prompt in the cache prefix. If the schema changes, everything after it is invalidated—and every additional tool isn't just extra tokens, it's extra risk surface for cache invalidation whenever any individual tool changes. The balance is real: too few tools means the model takes multiple steps where one well-designed tool could handle in a single turn, costing more on extra LLM calls. After months of iteration, OpenClacky settled on exactly 16 tools across file I/O (3), search (2), execution (1), browser (1), web (2), task management (4), interaction (1), extension (1), and safety (1). Design principles are straightforward: minimize parameters per tool to reduce ways the model gets it wrong, no overlap between tools, heavy RSpec coverage on every implementation. When a new capability was needed but didn't fit as a tool, it became a skill routed through invoke_skill—code analysis, memory, scheduling, and sub-agent orchestration all invisible to the schema.
Decision 5: Insert-Then-Compress
Context windows are finite. Long tasks fill them. Compression is the single biggest threat to cache hit rates because replacing old messages with a summary changes the prefix, guaranteeing a miss on every subsequent request until the cache warms again. Many agents compress using a separate LLM call with a cheap model and independent system prompt—but that compression call has zero shared prefix with the main session's cache (100% miss on compression itself), and after compression the main history is restructured so its cache is also invalidated, running cold for four to five turns afterward. OpenClacky's approach inserts the compression instruction as a system_injected message at the end of the current conversation before sending a normal request. The compression call hits the existing cache—same system prompt, same tools, same history prefix—with only roughly 500 cold tokens instead of approximately 50,000 for a separate model approach on a 50K-token session. After compression, rebuilding history as [system_prompt, summary, last_N_messages] misses once but only once; from the second turn onward, double markers handle cache continuity automatically. The team compresses at idle rather than waiting for the next message. LLM providers expire prompt caches after roughly five minutes of inactivity—once expired, the next turn is fully cold at ten times cached pricing. An idle timer triggers compression when the user stops typing for 90 seconds and history approaches threshold, establishing a fresh cache breakpoint before TTL expiry while the existing cache is still warm. The million-token context trap gets explicitly rejected: even with perfect caching (0.1× price), one million tokens of input costs the equivalent of 100K full-price tokens per turn. One cache miss means paying for one million tokens at full rate, plus well-documented attention degradation in ultra-long contexts degrades output quality. OpenClacky's strategy is aggressive compression—10K tokens of compressed history at 95% cache hit outperforms one million raw tokens at 99% cache hit on both cost and effectiveness.
Key Takeaways
- RAG adds latency, staleness risk, and recall gaps that compound across multi-step tasks—for local repos, direct file reading beats vector retrieval
- Multi-agent orchestration multiplies overhead rather than dividing it; a single generalist model outperforms specialist agents coordinated through message passing
- Double cache markers create a rolling buffer covering history growth, tool call retries, and mid-session model switches simultaneously with minimal extra cost
- System prompt immutability forces discipline on dynamic information placement—session context lives in tagged messages, not frozen prompts
- One meta-tool (invoke_skill) replaces dozens of specialized tools while keeping the schema stable and cache-friendly
The Bottom Line
Every 'helpful' feature you add to an AI agent is a new surface for cache invalidation—and if you're not thinking about prefix stability from day one, your architecture will fight you at every turn. OpenClacky's two-year journey proves the model is already smart enough; what it needs isn't more models or more tools, it's a better harness that lets prompt caching do its job.