Claude Sonnet 5 has landed, and Anthropic's latest mid-tier offering is flexing harder than expected on agentic workloads—though the performance gains come with a price tag that won't please budget-conscious engineering teams. According to benchmarks from Artificial Analysis, Sonnet 5 scores 53 on their Intelligence Index, tying GPT-5.5 (high reasoning) and landing just 2-3 points behind Opus 4.8 and GPT-5.5 xhigh. That's respectable positioning for a model that costs considerably less than its flagship sibling on paper.
Token Hunger Scales With Performance
Here's where things get interesting for anyone running these models at scale: Sonnet 5 is hungry. The evaluation data shows it used roughly 40% more output tokens per Intelligence Index task than Sonnet 4.6, and approximately three times the agentic turns when run at max effort on the AA-Briefcase and GDPval-AA knowledge work evaluations. The "effort" dial, now expanded with a new xhigh setting to match Opus 4.8's five levels, doesn't just adjust quality; it directly controls how many API calls your pipeline will fire off. On GDPval-AA specifically, max effort consumed around six times the turns of low effort.
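To get a feel for what those ratios could mean for a pipeline, here is a rough Python sketch that projects turn counts and output-token budgets from the multipliers quoted above. The baseline inputs (`low_effort_turns`, `sonnet_46_output_tokens`) are hypothetical placeholders, not measurements. Only the ~6x and ~40% ratios come from the evaluation data, and the ~6x figure was reported for GDPval-AA specifically, so it may not carry over to other workloads.

```python
# Ratios quoted in the article; rough, benchmark-specific figures.
MAX_VS_LOW_TURN_RATIO = 6.0   # max effort vs. low effort turns (GDPval-AA)
OUTPUT_TOKEN_GROWTH = 1.40    # Sonnet 5 vs. Sonnet 4.6 output tokens per task


def projected_max_effort_turns(low_effort_turns: float) -> float:
    """Scale a measured low-effort turn count by the reported ~6x ratio."""
    return low_effort_turns * MAX_VS_LOW_TURN_RATIO


def projected_output_tokens(sonnet_46_output_tokens: float) -> float:
    """Scale a Sonnet 4.6 output-token count by the reported ~40% increase."""
    return sonnet_46_output_tokens * OUTPUT_TOKEN_GROWTH


if __name__ == "__main__":
    # Hypothetical baselines: replace with numbers from your own logs.
    low_effort_turns = 10
    sonnet_46_output_tokens = 50_000

    print("Projected max-effort turns:", projected_max_effort_turns(low_effort_turns))
    print("Projected output tokens:", projected_output_tokens(sonnet_46_output_tokens))
```

Treat the result as a planning estimate only; measuring turns per task on your own traffic at each effort level will be more reliable than extrapolating from benchmark ratios.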
The Cost Reality Check
Let's talk money. At standard pricing of $3 input / $15 output per million tokens, Claude Sonnet 5 averaged $2.29 per task on the Intelligence Index, roughly double the per-task cost of Sonnet 4.6 and about 15% more than Claude Opus 4.8. Yes, you read that correctly: a "cheaper" model is costing more per completed task because it generates so many additional tokens trying to get things right. Anthropic is offering promotional pricing of $2/$10 until September 1 (a one-third reduction), which helps, but teams running high-volume agentic pipelines should calculate whether the performance gains justify the burn rate.
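The pricing arithmetic above can be sketched in a few lines of Python. The per-million-token rates and the $2.29 average come from the figures quoted here; the token counts in the example are hypothetical, and the sketch assumes a task is billed purely at the listed input and output rates (no caching discounts or other adjustments).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


STANDARD = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)
PROMO = Pricing(input_per_mtok=2.0, output_per_mtok=10.0)


def task_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one task, billed only on input and output tokens."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Hypothetical token counts for a single agentic task.
example_input_tokens = 200_000
example_output_tokens = 100_000

standard_cost = task_cost(example_input_tokens, example_output_tokens, STANDARD)
promo_cost = task_cost(example_input_tokens, example_output_tokens, PROMO)

# Both rates drop by one third, so any input/output mix scales by 2/3.
REPORTED_AVG_STANDARD = 2.29
estimated_avg_promo = REPORTED_AVG_STANDARD * 2 / 3

if __name__ == "__main__":
    print(f"Example task, standard pricing: ${standard_cost:.2f}")
    print(f"Example task, promo pricing:    ${promo_cost:.2f}")
    print(f"Estimated promo average:        ${estimated_avg_promo:.2f}")
```

Because input and output rates fall by the same fraction, the promotional price would bring the reported $2.29 average down to about $1.53 under these assumptions, whatever the input/output split.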
Where Sonnet 5 Actually Wins
On pure agentic knowledge work, specifically the AA-Briefcase and GDPval-AA benchmarks run with Artificial Analysis's open-source Stirrup reference harness, Sonnet 5 matches or slightly outperforms Opus 4.8. This matters for anyone building document processing, research automation, or multi-step analysis pipelines where the model needs to browse, reason, and produce polished professional outputs. Sonnet 5 also posts solid gains over its predecessor elsewhere: 9 points on Terminal-Bench v2.1, 10 points on Humanity's Last Exam, and 7 points on SciCode.
Where It Still Falls Short
Heavy lifting remains the domain of larger models. On CritPt, a frontier physics reasoning benchmark from Argonne and UIUC researchers, Sonnet 5 scored 17%, which represents a healthy 14-point jump over Sonnet 4.6 but still trails GLM-5.2, Claude Opus/Fable, and GPT-5.5 variants. If your use case involves complex multi-step deduction or deep domain expertise beyond general knowledge work, Opus 4.8 might actually be the more economical choice despite its higher per-token price, once you measure tokens and dollars spent per correct answer.
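One way to make the "per correct answer" comparison concrete is to divide average cost per task by accuracy. The Python sketch below is a minimal illustration: the two models and their numbers are hypothetical stand-ins, not reported results, and the function name is made up for this example.

```python
def cost_per_correct_answer(cost_per_task: float, accuracy: float) -> float:
    """Average spend needed to obtain one correct answer.

    cost_per_task: average USD spent per attempted task.
    accuracy: fraction of tasks answered correctly, in (0, 1].
    """
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in the interval (0, 1]")
    return cost_per_task / accuracy


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    cheaper_model = cost_per_correct_answer(cost_per_task=2.00, accuracy=0.25)
    pricier_model = cost_per_correct_answer(cost_per_task=3.00, accuracy=0.50)

    print(f"Cheaper model: ${cheaper_model:.2f} per correct answer")
    print(f"Pricier model: ${pricier_model:.2f} per correct answer")
```

In this made-up case the model with the higher per-task cost comes out ahead once accuracy is factored in, which is the pattern the paragraph above warns about for hard reasoning workloads.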
Key Takeaways
- Sonnet 5 ranks #5 on the Artificial Analysis Intelligence Index with a score of 53, matching GPT-5.5 (high reasoning)
- Agentic performance rivals Opus 4.8 on AA-Briefcase and GDPval-AA benchmarks, making it viable for knowledge automation pipelines
- Token usage scales dramatically with effort settings: max effort uses ~6x the turns of low effort on GDPval-AA
- Cost per task ($2.29 at standard pricing) is roughly 2x that of Sonnet 4.6 and ~15% above Opus 4.8, despite per-token rates below Opus 4.8's
The Bottom Line
Sonnet 5 is a capable agentic model that punches above its weight class—but the "mid-tier" pricing narrative obscures how expensive these models become when you actually let them work hard. For teams building production AI agents, the math matters more than the benchmark scores: run the numbers on your specific use case before assuming Sonnet 5 is the budget option.