When you're running production AI agents, you don't get to rely on launch-day benchmarks alone. That's the philosophy behind MarginLab's daily tracker, which tests Claude Code against a curated subset of SWE-Bench-Pro using the plain CLI with no custom harness, catching exactly the kind of silent, day-to-day regressions that published numbers can't reveal. In late May, their monitoring caught something significant: Opus 4.7's pass rate dropped well below its roughly 64% baseline and stayed there for five consecutive days. It recovered on May 27, one day before Opus 4.8 shipped on May 28.
What the Tracker Found
The degradation wasn't subtle. Daily pass rates fell from a stable 64% around mid-May to the 50-54% range through late May, a sustained drop of 10 to 14 percentage points that the tracker reports as falling outside its ±13.0% significance threshold (p < 0.05). The weekly rolling average told the same story in aggregate. Other per-run metrics, such as output tokens and runtime, remained steady, making this look less like a model capability issue and more like something breaking down in the execution layer.
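To make the threshold comparison concrete, here is a minimal sketch using only the pass rates quoted in this article. It assumes the ±13.0% threshold is measured in percentage points against the 64% mid-May level. The article does not spell that out, and MarginLab's tracker may instead apply it to a relative change or a rolling window, so treat the names and the interpretation below as illustrative.

```python
# Pass rates (percent) quoted in the article for the degraded window.
DEGRADED_DAYS = {
    "May 22": 50.0,
    "May 23": 50.0,
    "May 24": 54.0,
    "May 25": 50.0,
    "May 26": 52.0,
}

BASELINE = 64.0    # stable mid-May pass rate quoted in the article
THRESHOLD = 13.0   # assumption: threshold read as percentage points


def drop_from_baseline(pass_rate, baseline=BASELINE):
    """Return how many points a pass rate sits below the baseline."""
    return baseline - pass_rate


def days_beyond_threshold(rates, baseline=BASELINE, threshold=THRESHOLD):
    """Return the days whose absolute deviation exceeds the threshold."""
    return [
        day
        for day, rate in rates.items()
        if abs(rate - baseline) > threshold
    ]


for day, rate in DEGRADED_DAYS.items():
    print(f"{day}: {rate:.0f}% ({drop_from_baseline(rate):.0f} points below baseline)")

print("Beyond threshold:", days_beyond_threshold(DEGRADED_DAYS))
```

Under this reading, the three 50% days clear the threshold on their own, while the 52% and 54% days sit just inside it. That is one reason a multi-day or rolling view matters: the signal is the persistence of the drop, not any single reading.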
The Smoking Gun: CLI Version Correlation
Here's where it gets interesting. Every benchmark run records which Claude Code CLI version executed it. When MarginLab lined those versions up against daily pass rates, the pattern became unmistakable. On May 21, Claude Code v2.1.148 ran and posted a clean 64%. The very next day, May 22, v2.1.150 dropped performance to 50%. That same CLI version persisted through May 25, with pass rates of 50%, 54%, and 50% on May 23, 24, and 25. Even after the upgrade to v2.1.152 on May 26, numbers stayed depressed at 52%. The moment v2.1.153 landed on May 27, the pass rate snapped back to 66%. By May 28, with Opus 4.8 now in the mix, it hit 72%, well above baseline.
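The version-by-version comparison can be sketched in a few lines. The records below are the dates, CLI versions, and pass rates quoted above. The May 28 run is left out because the article does not say which CLI version executed it. The record layout and function name are assumptions made for illustration, not MarginLab's actual schema.

```python
from collections import defaultdict

# (date, CLI version, pass rate in percent), all on Opus 4.7, as quoted above.
RUNS = [
    ("May 21", "v2.1.148", 64),
    ("May 22", "v2.1.150", 50),
    ("May 23", "v2.1.150", 50),
    ("May 24", "v2.1.150", 54),
    ("May 25", "v2.1.150", 50),
    ("May 26", "v2.1.152", 52),
    ("May 27", "v2.1.153", 66),
]


def mean_pass_rate_by_cli(runs):
    """Group daily pass rates by CLI version and average each group."""
    grouped = defaultdict(list)
    for _date, cli_version, pass_rate in runs:
        grouped[cli_version].append(pass_rate)
    return {
        version: sum(rates) / len(rates)
        for version, rates in grouped.items()
    }


for version, average in mean_pass_rate_by_cli(RUNS).items():
    print(f"{version}: {average:.1f}%")
```

Grouping by version rather than by date is what makes the pattern visible: with the model held constant, the two degraded versions average 51% and 52%, while the versions on either side sit at 64% and 66%.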
What Changed in the Agent Behavior
Alongside the CLI version correlation, two behavioral metrics stood out during the degraded window: tool calls and input tokens. Tool calls spiked roughly 60% per task across those five days while input tokens simultaneously dropped. The agent was looping harder, making more function invocations to accomplish the same work, but feeding less context in each attempt. Both patterns vanished the instant v2.1.153 shipped. This points at something introduced in or around v2.1.150, and still present in v2.1.152, that changed how Claude Code interacts with tools: potentially a prompt injection issue, a tool result parsing change, or a retry loop problem.
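A check for this two-sided signature (tool calls up, input tokens down) might look like the sketch below. The article gives only the approximate 60% rise in tool calls, so the per-task figures here are hypothetical placeholders chosen to reproduce that ratio, and the 50% alert cutoff is likewise an arbitrary assumption rather than anything MarginLab has published.

```python
def relative_change(current, baseline):
    """Return the fractional change of current relative to baseline."""
    return (current - baseline) / baseline


def matches_degraded_signature(baseline, current, tool_call_jump=0.5):
    """Flag runs where tool calls rise sharply while input tokens fall.

    tool_call_jump is a hypothetical cutoff (0.5 means a 50% increase).
    """
    tool_delta = relative_change(current["tool_calls"], baseline["tool_calls"])
    token_delta = relative_change(current["input_tokens"], baseline["input_tokens"])
    return tool_delta >= tool_call_jump and token_delta < 0


# Hypothetical per-task averages, NOT figures from MarginLab's tracker.
healthy = {"tool_calls": 25, "input_tokens": 1_000_000}
degraded = {"tool_calls": 40, "input_tokens": 900_000}  # 60% more tool calls

print(matches_degraded_signature(healthy, degraded))
print(matches_degraded_signature(healthy, healthy))
```

Requiring both conditions is deliberate. A rise in tool calls alone could simply mean harder tasks, whereas more calls paired with less input context is the combination the article associates with the degraded window.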
Conclusion: Harness Issue, Not Model Regression
MarginLab's verdict is clear: this appears to be a harness issue in the CLI layer rather than an Opus 4.7 model regression. The timing aligns with Claude Code updates while the underlying model remained unchanged, and the behavioral signature (increased tool calls, decreased input tokens) is more typical of execution-layer bugs than of capability drift. This isn't MarginLab's first time catching pre-release degradation aligned with new model launches, either.
Key Takeaways
- Daily production benchmarks caught a five-day performance collapse in Opus 4.7 that stayed hidden from launch-day metrics
- The regression tracked Claude Code CLI versions v2.1.150 and v2.1.152, not the model version
- Agents made ~60% more tool calls per task during the degraded period, suggesting execution-layer issues rather than capability problems
- Performance snapped back to 66% immediately upon upgrading to v2.1.153, which is consistent with a CLI-level fix
The Bottom Line
This is exactly why you can't trust benchmark numbers alone: someone has to be watching production behavior day to day. MarginLab's monitoring caught something that Anthropic's own release process apparently missed. If you're running Claude Code in critical pipelines, now might be a good time to audit what version you've got deployed.
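A version audit along these lines could be scripted roughly as follows. This is a sketch under stated assumptions: the affected range of v2.1.150 through v2.1.152 is inferred from the runs described in this article (v2.1.151 never appears in the data), and the helper assumes that `claude --version` prints a string containing a dotted version number. Check both against your own installation before relying on it.

```python
import re
import subprocess

# Assumption: range inferred from the article's degraded runs.
AFFECTED_MIN = (2, 1, 150)
AFFECTED_MAX = (2, 1, 152)


def parse_version(text):
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def in_affected_range(text):
    """Return True if the version falls inside the inferred degraded range."""
    return AFFECTED_MIN <= parse_version(text) <= AFFECTED_MAX


def installed_version(command=("claude", "--version")):
    """Ask the CLI for its version string (assumes the command is on PATH)."""
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

In a pipeline, calling `in_affected_range(installed_version())` before a run and failing fast on `True` would be one way to keep a known-degraded build out of critical jobs. Versions are compared as integer tuples rather than strings so that, for example, 2.1.99 sorts below 2.1.150.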