The Anthropic docs are solid, but they paper over the friction points that show up at 2 AM when your production bill triples or requests start silently failing. After a few months running Claude in production across several experiments, here's what actually bites you—and how to stop it.
Prompt Caching Isn't Free
Anthropic's caching documentation makes it sound like a free lunch. It isn't. Cache writes cost about 1.25× the normal input rate, while reads come in at 0.10×. Per cache entry the economics are fine: the first read after a write already recoups the 0.25× write premium. The trap is A/B experiments that randomize system prompts: each variant gets its own cache partition, each partition gets written far more often than it's read, and your bill goes up instead of down. Check your usage block for the telltale pattern: high cache_creation_input_tokens with low cache_read_input_tokens, say a 10:1 ratio, means you're bleeding money on caching. Fix: keep system prompts stable across variants and push A/B variation into messages[]. Rule of thumb: at these prices you need roughly one cache read for every 3-4 writes just to break even, so a write-heavy ratio means caching is working against you.
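Here's a minimal sketch of that fix with the Python SDK. The model id, prompt text, and variant wiring are placeholders; the cache_control marker and the usage counters are the fields the Messages API exposes for caching.

```python
import anthropic

client = anthropic.Anthropic()

# One stable system prompt shared by every experiment variant, marked cacheable.
STABLE_SYSTEM = [{
    "type": "text",
    "text": "You are a support assistant for Acme Corp. <long shared instructions>",
    "cache_control": {"type": "ephemeral"},
}]

def run_variant(variant_instructions: str, user_text: str):
    # A/B variation lives in messages[], so all variants hit the same cache entry.
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=STABLE_SYSTEM,
        messages=[{"role": "user", "content": f"{variant_instructions}\n\n{user_text}"}],
    )
    usage = response.usage
    # Watch the ratio: lots of cache_creation with little cache_read means you're
    # paying the 1.25x write premium without collecting the 0.10x reads.
    print("cache_creation:", usage.cache_creation_input_tokens,
          "cache_read:", usage.cache_read_input_tokens)
    return response
```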
Build for 529 Errors by Default
Anthropic returns HTTP 529 (overload) more often than newcomers expect. At peak hours on Sonnet, I've measured 1-3% of requests hitting this wall even at reasonable concurrency levels. New developers treat 529 as a bug and surface it to users. That's the wrong call. The production-ready approach is a fallback chain: try Sonnet 4.5 first, retry up to three times with exponential backoff on 529, drop to Sonnet 4 if it's still failing, then fall back to Haiku before serving a cached response as a last resort. Nobody loves canned replies, but they beat breaking the UI for 1% of requests over an error you never chose to ship.
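A rough sketch of that chain with the Python SDK follows. The model ids, retry counts, and backoff schedule are illustrative; a real version would also cap total latency and log which tier served the response.

```python
import time
import anthropic

client = anthropic.Anthropic(max_retries=0)  # we handle retries ourselves

# Capacity fallback chain, primary model first. Model ids are placeholders.
MODEL_CHAIN = ["claude-sonnet-4-5", "claude-sonnet-4-0", "claude-3-5-haiku-latest"]
RETRIES_PER_MODEL = 3

def complete_with_fallback(messages, cached_fallback: str) -> str:
    for model in MODEL_CHAIN:
        for attempt in range(RETRIES_PER_MODEL):
            try:
                response = client.messages.create(
                    model=model, max_tokens=1024, messages=messages
                )
                return response.content[0].text
            except anthropic.APIStatusError as err:
                if err.status_code != 529:   # only treat overload as retryable here
                    raise
                time.sleep(2 ** attempt)     # exponential backoff: 1s, 2s, 4s
    # Every model is overloaded: serve a canned/cached response, not an error page.
    return cached_fallback
```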
You Need to Roll Your Own Idempotency
Anthropic's API doesn't expose idempotency keys as of mid-2026. Network timeout plus your retry logic means the same prompt gets billed twice—sometimes multiple times. This one sneaks into your cost analysis and quietly eats margin. The minimum viable approach uses Redis with SHA-256 hashing of model, system, messages, and max_tokens to generate a deduplication key. Mark requests as pending on first flight, store successful responses for 24 hours. Critical detail: don't include metadata.user_id or stream parameters in the hash—they vary per call but don't affect the result.
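A minimal sketch of that approach, assuming a local Redis instance; the key prefix, TTLs, and function names are mine, not anything the SDK or API provides.

```python
import hashlib
import json

import redis
import anthropic

client = anthropic.Anthropic()
store = redis.Redis()

def dedupe_key(model: str, system: str, messages: list, max_tokens: int) -> str:
    # Hash only the fields that determine the output. Deliberately leave out
    # metadata.user_id and stream: they vary per call but don't change the result.
    payload = json.dumps(
        {"model": model, "system": system, "messages": messages, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return "claude:idem:" + hashlib.sha256(payload.encode()).hexdigest()

def create_idempotent(model: str, system: str, messages: list, max_tokens: int) -> dict:
    key = dedupe_key(model, system, messages, max_tokens)

    existing = store.get(key)
    if existing == b"pending":
        raise RuntimeError("identical request already in flight")
    if existing is not None:
        return json.loads(existing)    # replay the stored response instead of re-billing

    store.set(key, "pending", ex=300)  # mark in flight so a retry can't double-fire
    response = client.messages.create(
        model=model, system=system, messages=messages, max_tokens=max_tokens
    )
    result = {"text": response.content[0].text, "usage": response.usage.model_dump()}
    store.set(key, json.dumps(result), ex=86400)  # keep successes for 24 hours
    return result
```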
Streaming Drops More Than You'd Guess
Streaming failures aren't rare flukes: expect 0.1-0.5% of streams to die mid-response once they pass through proxies, CDNs, and corporate networks. The worst failure mode is a client hanging forever, waiting for a message_stop event that never arrives because the SSE connection closed silently. You need a streaming state machine with timeout logic: track the timestamp of the last event, and return partial output on any failure rather than swallowing it silently. Set a silence threshold (30 seconds is a reasonable starting point) and treat unexpected closes as errors you can recover from gracefully.
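One way to sketch that state machine with the async Python SDK, using asyncio.wait_for as the silence threshold. The 30-second limit, the model id, and the (text, complete) return shape are just the starting point described above, not a prescribed interface.

```python
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()
SILENCE_THRESHOLD_S = 30  # give up if the stream goes quiet for this long

async def stream_with_timeout(messages) -> tuple[str, bool]:
    """Return (text, complete); complete is False when the stream died early."""
    chunks: list[str] = []
    clean_finish = False
    stream = await client.messages.create(
        model="claude-sonnet-4-5", max_tokens=1024, messages=messages, stream=True
    )
    events = stream.__aiter__()
    try:
        while True:
            try:
                # Per-event silence timeout: a hung SSE connection surfaces here
                # instead of waiting forever for a message_stop that never comes.
                event = await asyncio.wait_for(events.__anext__(), SILENCE_THRESHOLD_S)
            except StopAsyncIteration:
                break            # connection closed; clean only if message_stop arrived
            except asyncio.TimeoutError:
                break            # silent hang; bail out with whatever we have
            if event.type == "content_block_delta" and event.delta.type == "text_delta":
                chunks.append(event.delta.text)
            elif event.type == "message_stop":
                clean_finish = True   # the only event that counts as a clean finish
                break
    finally:
        await stream.close()
    # Partial output is always returned, never dropped; the caller decides whether
    # to retry, stitch, or show the fragment with a warning.
    return "".join(chunks), clean_finish
```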
Sonnet Is Usually Wrong for Classification at Scale
This is the most expensive mistake teams make with Claude in production. Classification tasks mean short outputs, correctness that's easy to evaluate, and millions of repeated calls: a perfect storm for burning money on Sonnet's pricing when Haiku performs similarly on well-defined labels. Run the math: 10 million support tickets cost roughly $30,000 with Sonnet but only ~$2,500 with Haiku at comparable accuracy. The move is a two-stage pipeline: classify everything with Haiku first, and escalate to Sonnet only for low-confidence results or hard-label edge cases. In every measurement I've run, Haiku hits 92%+ on the common 80% of labels; only the remaining 20% that need escalation get billed at premium rates.
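A rough sketch of the two-stage pipeline; the label set, confidence floor, prompt, and model ids are illustrative, and production code would validate the model's JSON more defensively before trusting it.

```python
import json
import anthropic

client = anthropic.Anthropic()

LABELS = ["billing", "bug_report", "feature_request", "account", "other"]
CONFIDENCE_FLOOR = 0.8   # below this, escalate to the bigger model

PROMPT = (
    "Classify the support ticket into exactly one of these labels: "
    + ", ".join(LABELS)
    + '. Reply with JSON only, e.g. {"label": "billing", "confidence": 0.95}.'
    + "\n\nTicket:\n"
)

def _classify(model: str, ticket: str) -> dict:
    response = client.messages.create(
        model=model,
        max_tokens=64,
        messages=[{"role": "user", "content": PROMPT + ticket}],
    )
    return json.loads(response.content[0].text)

def classify_ticket(ticket: str) -> dict:
    # Stage 1: Haiku handles the easy majority at a fraction of Sonnet's price.
    try:
        result = _classify("claude-3-5-haiku-latest", ticket)
        if result.get("label") in LABELS and result.get("confidence", 0) >= CONFIDENCE_FLOOR:
            return result
    except json.JSONDecodeError:
        pass  # malformed output counts as low confidence
    # Stage 2: only ambiguous or malformed results pay Sonnet rates.
    return _classify("claude-sonnet-4-5", ticket)
```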
Key Takeaways
- Cache writes cost money: watch your cache_creation vs cache_read ratio closely
- Always build a fallback chain for 529s—don't surface overload errors to users
- Implement client-side idempotency before you ship anything non-trivial
- Streaming requires timeout logic and state machines, not fire-and-forget
- Use Haiku with Sonnet escalation for classification workloads at scale