On June 2, 2026, Claude, ChatGPT, and Grok all experienced outages within the same window. Anthropic's status page showed a fix deployed by 10:42 UTC; OpenAI and xAI recovered around that same stretch. For countless teams building on these providers, their own products went dark—not because of anything wrong in their code, but because they'd wired their uptime to a single vendor's health dashboard.

This Wasn't a Vendor Problem—It Was an Architecture Problem

Here's the hard truth nobody wants to hear: single-vendor reliance on an LLM provider is not a "which provider should we pick" problem. It's a fundamental architecture flaw. Every major model provider has had an outage this year. There is no reliable one to switch to. If your takeaway from June 2 was "we need to move to provider X," you've just traded one status page hostage situation for another. The teams that sailed through yesterday didn't pick the right provider; they built the right shape.

The Three-Piece Architecture That Survived

The setup that shrugged off the outage puts a gateway in front of multiple providers, with failover that reroutes failing requests to equivalent-capability models on healthy providers. But the naive version, a simple try/except falling back from GPT to Claude, breaks fast: you downgrade from a frontier model to a tiny one, hammer an already-degraded provider, or fail over to one that's also down. Doing this right takes three non-obvious pieces.

Capability-Bucket Failover

Don't hard-code "if GPT-5.4 fails, try Claude Opus." Instead, bucket your catalog into capability tiers (small, medium, large, frontier, code, reasoning, long-context) and route within the bucket when a provider wobbles. The replacement stays genuinely equivalent in capability. This approach replaced an explicit model-to-model fallback map, which grows as O(N²) because every model needs an entry for every other model, and which became unmaintainable past a handful of models.
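To make the bucketing idea concrete, here is a minimal sketch in Python. The catalog entries, provider names, and the `failover_candidates` helper are all hypothetical placeholders invented for illustration, not real model names or any gateway's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    provider: str
    name: str
    bucket: str  # capability tier, e.g. "small" or "frontier"


# Hypothetical catalog: one tag per model instead of an N x N fallback map.
CATALOG = [
    Model("provider_a", "a-frontier", "frontier"),
    Model("provider_b", "b-frontier", "frontier"),
    Model("provider_c", "c-frontier", "frontier"),
    Model("provider_a", "a-small", "small"),
    Model("provider_b", "b-small", "small"),
]


def failover_candidates(requested: str, unhealthy: set[str]) -> list[Model]:
    """Return models to try, in order, without ever leaving the bucket."""
    by_name = {m.name: m for m in CATALOG}
    primary = by_name[requested]
    same_bucket = [
        m for m in CATALOG
        if m.bucket == primary.bucket and m.provider not in unhealthy
    ]
    # Stable sort: the requested model goes first if its provider is healthy,
    # and the remaining candidates keep their catalog order.
    return sorted(same_bucket, key=lambda m: m.name != requested)
```

Adding a model to this catalog means tagging it with one bucket, so maintenance grows linearly with the catalog rather than quadratically. A request for `a-frontier` while `provider_a` is unhealthy yields only the other frontier-bucket models, never a small one.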

Health-Weighted Routing

Failover that retries a dead provider on every request turns one provider's outage into your own latency spike. Keep a rolling window of each provider's recent success rate in Redis and weight routing accordingly: healthy providers (at least 95% success) stay at full weight, degrading ones (at least 50% but below 95%) drop to a tenth, and clearly-down providers (below 50%) get skipped entirely until they recover. The system routes around the outage instead of into it.
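The weighting rule can be sketched as follows. This is an illustration under stated assumptions: it keeps the rolling window in process memory with a `deque` rather than in Redis, and the window size, the `HealthTracker` class, and the choice to treat a provider with no data as healthy are all assumptions of this sketch. Only the thresholds (95% and 50%) and the weights (full, one tenth, skipped) come from the description above.

```python
import random
from collections import deque

WINDOW = 100  # assumed window size: the last 100 outcomes per provider


class HealthTracker:
    def __init__(self, window: int = WINDOW):
        self._window = window
        self._outcomes: dict[str, deque] = {}

    def record(self, provider: str, ok: bool) -> None:
        """Store one request outcome; the deque drops the oldest entry itself."""
        self._outcomes.setdefault(provider, deque(maxlen=self._window)).append(ok)

    def success_rate(self, provider: str) -> float:
        outcomes = self._outcomes.get(provider)
        if not outcomes:
            return 1.0  # assumption: no data yet means "treat as healthy"
        return sum(outcomes) / len(outcomes)

    def weight(self, provider: str) -> float:
        rate = self.success_rate(provider)
        if rate >= 0.95:
            return 1.0  # healthy: full weight
        if rate >= 0.50:
            return 0.1  # degrading: a tenth of the traffic share
        return 0.0      # clearly down: skipped

    def pick(self, providers: list[str], rng=random):
        """Weighted random choice; returns None if every provider is down."""
        weights = [self.weight(p) for p in providers]
        if sum(weights) == 0:
            return None
        return rng.choices(providers, weights=weights, k=1)[0]
```

One gap worth noting: a provider at weight zero receives no traffic, so its window never refreshes on its own. A real deployment would need something extra to detect recovery, such as periodic probe requests or time-based expiry of old outcomes, and this sketch implements neither.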

Optional Hedging for Latency-Critical Calls

For calls that can't afford tail latency, race two providers in parallel, take whichever responds first, and cancel the loser. A slow response from one provider no longer sets your tail: a p99 that included provider wobble moves toward the p50 of the faster of the two. It costs roughly 1.3× tokens on hedged calls, so it's a knob you turn on for traffic that warrants it, not a default.
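A hedged call of this kind might look like the sketch below, written with Python's `asyncio`. The `hedged_call` and `fake_provider` names are hypothetical, and `fake_provider` stands in for a real provider request with a simulated delay. One choice in this sketch goes beyond the description above: if the first provider to finish returns an error, it waits for the other one instead of failing immediately.

```python
import asyncio


async def hedged_call(calls):
    """Race the given coroutines; return the first success, cancel the rest."""
    tasks = [asyncio.create_task(c) for c in calls]
    try:
        pending = set(tasks)
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # A finished task that raised is skipped, so the slower
                # provider still gets its chance to answer.
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all hedged providers failed")
    finally:
        # Cancel any loser still in flight.
        for task in tasks:
            if not task.done():
                task.cancel()


async def fake_provider(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for a real provider request (hypothetical, for illustration)."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(name)
    return name
```

For example, `asyncio.run(hedged_call([fake_provider("slow", 0.2), fake_provider("fast", 0.01)]))` returns the fast provider's result and cancels the slow task. Whether cancelling a real request also saves you the tokens depends on the provider, which is presumably part of why hedging costs more than 1× rather than a flat 2×.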

The Honest Caveats

Full disclosure: I build Prism, an OpenAI-compatible gateway implementing the above. Gateways add a hop and are themselves dependencies; our origin runs in Mumbai today, fronted by a global edge. Cross-provider failover protects against provider outages but doesn't make any gateway immune to its own failures. Anyone selling you 100% uptime from their proxy is selling you something. And equivalent isn't identical: a replacement frontier model keeps you up but has its own quirks.

The Bigger Pattern: Uptime and Cost Are the Same Story

The reliability angle hit hard this week, but it rhymes with cost dynamics already reshaping the industry. On June 2 itself, Microsoft unveiled in-house models at Build explicitly to reduce OpenAI reliance and lower costs. DeepSeek V4 is selling flagship-class output at $0.86 per million tokens—roughly 28× cheaper than frontier incumbents at near-parity on coding benchmarks—and gaining traction precisely because teams want an exit from single-provider pricing lock-in.

What You Should Actually Do

Hobby project or pre-traffic? Call one provider directly and move on. Premature failover is its own complexity tax. But if you have real users and a real bill, put a gateway with genuine cross-provider, health-weighted, capability-bucketed failover between your app and the providers. Buy it, or build it properly. The try/except version will fail you exactly when you need it most.

The Bottom Line

June 2 proved what the industry has been quietly learning: don't bet your production on a single AI provider's status page. There's no such thing as the reliable one—just different shapes of failure. Build for the outage, not for the provider.