Anthropic dropped Claude Opus 4.8 on May 28, 2026, and the headline isn't just 'better benchmarks' โ€” it's better benchmarks *and* efficiency gains that actually matter in production. The model tops Artificial Analysis's GDPval-AA real-work leaderboard at 1890 Elo, pulling 121 points clear of GPT-5.5 in second place and gaining a massive +137 over its own predecessor. That puts it ahead on tasks modeled after 44 real occupations with actual economic value โ€” not synthetic coding puzzles. And it's doing all this while consuming roughly 35% fewer output tokens per task than Opus 4.7.

GDPval-AA: The Number That Actually Matters

The GDPval-AA benchmark is worth understanding because it isn't another 'pass a test' metric. Independent evaluators at Artificial Analysis gave models shell access and web browsing within an agentic loop, then measured performance on real economic work across industries. Opus 4.8 reached that 1890 Elo using 15% fewer turns and over a third less output than 4.7 required for equivalent results. That's the kind of efficiency improvement that translates directly to API bills โ€” same quality, lower cost per task.

Benchmark Breakdown: Where Opus 4.8 Wins

The numbers paint a clear picture across several benchmarks: | Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | |-----------|----------|----------|---------| | SWE-bench Pro | 69.2% | 64.3% | 58.6% | | OSWorld-Verified (computer use) | 83.4% | 82.8% | 78.7% | | Terminal-Bench 2.1 | 74.6% | 66.1% | 78.2% | The one exception: GPT-5.5 still wins Terminal-Bench 2.1 at 78.2% versus Opus 4.8's 74.6%. If your workload is heavy on raw terminal command sequences, that's a real data point, not noise.

Fast Mode and Agentic Improvements

The headline features are Fast Mode โ€” a research preview serving the same Opus 4.8 model at up to 2.5x higher output tokens per second (at premium pricing) โ€” and mid-conversation system messages that preserve prompt-cache hits on earlier turns. There's also adaptive thinking via thinking: {"type": "adaptive"} that adjusts effort dynamically, plus better tool triggering and compaction for long-horizon agentic coding tasks.

Claude Code Dynamic Workflows: 750K Lines in 11 Days

Launched alongside the model, dynamic workflows let Claude orchestrate tens to hundreds of parallel subagents within a single session. The showcase example involved porting Bun from Zig to Rust โ€” roughly 750,000 lines of code โ€” achieving a 99.8% test-suite pass rate in just 11 days. Two caveats: it's plan-gated (Claude Code Max, Team, and Enterprise only), and token consumption runs substantially higher than normal sessions.

Prompting Opus 4.8: What Actually Changed

If you're coming from older models, the behavioral shifts matter. Effort is now the main dial โ€” start at xhigh for coding and agentic work, keep a minimum of high for anything intelligence-sensitive. The model follows instructions literally now, won't silently generalize or infer unstated requests, and raises tool use substantially when effort hits high/xhigh. One gotcha: Opus 4.8 is genuinely better at finding bugs with higher precision and recall in Anthropic's evals, but if your review harness says 'only report high-severity issues' or 'be conservative,' it follows that more faithfully than older models.

Key Takeaways

  • Same $5/$25 pricing as 4.7 โ€” no price hike for better performance
  • Tops GDPval-AA real-work leaderboard at 1890 Elo (+121 over GPT-5.5)
  • Uses ~35% fewer output tokens per task than its predecessor
  • Fast Mode offers 2.5x throughput at premium pricing (research preview)
  • Claude Code dynamic workflows: plan-gated, high token consumption

The Bottom Line

Opus 4.8 is the rare upgrade that doesn't require asterisks โ€” same price, meaningfully better on real work, and more efficient to boot. If your stack relies on AI for coding or agentic tasks, this is worth evaluating seriously. The one exception remains terminal-heavy workloads where GPT-5.5 still has a narrow edge.