Anthropic dropped Claude Fable 5 last week and the hype machine went into overdrive—Mythos-class this, safeguarded that. But Endor Labs just published a follow-up benchmark that should make every AI developer pause. The same model, tested under Cursor instead of Claude Code, jumped from middling results to the top spot on their Agent Security League leaderboard: 72.6% FuncPass and 29% SecPass, the highest fair security-pass rate they've ever recorded.
Same Weights, Wildly Different Outcomes
The first Fable 5 run with Claude Code landed at 59.8% FuncPass and 19.0% SecPass: solid but unremarkable for a supposedly Mythos-class model. Running it back through Cursor produced a +12.8 percentage point jump in functional correctness and a +10pp gain in SecPass, the share of fixes that actually close the vulnerability. That's not incremental tuning; that's a different class of results from identical underlying weights.

Endor's decomposition of head-to-head cases is where this gets interesting. Most of the gap wasn't lost to timeouts or empty predictions; Claude Code had time to finish, and both sides submitted substantive patches. The difference was patch quality. Of 34 instances Cursor solved for FuncPass that Claude Code didn't, the majority were cases where both agents produced working code, but only one actually fixed the underlying flaw.
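To make the arithmetic behind those deltas explicit, here is a minimal sketch using only the four scores quoted above. The relative-gain figure is derived here rather than reported by Endor, the helper names are made up for illustration, and Cursor's SecPass is entered as 29.0 even though the article only gives it as 29%.

```python
# Scores quoted in the article, in percent.
# Assumption: Cursor's SecPass is reported as "29%"; 29.0 is used here.
claude_code = {"FuncPass": 59.8, "SecPass": 19.0}
cursor = {"FuncPass": 72.6, "SecPass": 29.0}


def pp_delta(before, after):
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before, after):
    """Change relative to the starting score, in percent."""
    return round((after - before) / before * 100, 1)


deltas = {m: pp_delta(claude_code[m], cursor[m]) for m in claude_code}
relative = {m: relative_gain(claude_code[m], cursor[m]) for m in claude_code}

print(deltas)    # {'FuncPass': 12.8, 'SecPass': 10.0}
print(relative)  # {'FuncPass': 21.4, 'SecPass': 52.6}
```

Read this way, the security score moved by about half of its starting value, which is why a 10-point change on a low base is a bigger deal than it first sounds.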
Three Examples That Drive It Home
The Wagtail case (CVE-2020-15118/CWE-79) is textbook. Both harnesses rebuilt form-field options correctly and passed functional tests. Claude Code copied help_text through as-is ('help_text': field.help_text), which is exactly the vulnerable behavior: it allows stored XSS in help text fields. Cursor imported Django's escaping primitive and applied it on output.

OpenStack Aodh (CVE-2017-12440/CWE-306) shows the completeness gap even more sharply. Both agents fixed the core authentication bypass, creating, reusing, and deleting trust IDs correctly while rejecting client-supplied ones. But Cursor went further: it recognized that internally generated trust IDs embedded in URLs shouldn't be echoed back to clients in API responses, and it added a scrubbed serializer across every response path. Claude Code didn't.

LangChain's path traversal fix (CVE-2024-3571/CWE-22) is the most revealing, because both agents wrote the secure version first and then Claude Code silently dropped it. When Claude Code's edit tool rejected its initial draft for writing to a file it hadn't read, it re-authored the patch and quietly removed the containment check on yield_keys' prefix parameter. Cursor preserved the guard through the whole edit loop.
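Two of these fixes come down to patterns small enough to sketch. The block below is not the Wagtail or LangChain patch: the function names are hypothetical, and Python's standard-library html.escape stands in for the Django escaping primitive mentioned above so the example stays self-contained. It only illustrates escaping on output and a containment check on a caller-supplied prefix.

```python
import html
from pathlib import Path


def form_field_options(label, help_text):
    """Build render options, escaping user-supplied text on output.

    Hypothetical helper; html.escape is a stand-in for Django's escaping.
    """
    return {
        "label": html.escape(label),
        # Passing help_text through unescaped is the stored-XSS pattern.
        "help_text": html.escape(help_text),
    }


def resolve_under_root(root, prefix):
    """Resolve prefix beneath root, refusing anything that escapes it.

    Hypothetical helper illustrating a path containment check.
    """
    root_path = Path(root).resolve()
    candidate = (root_path / prefix).resolve()
    # After resolving "..", the candidate must be root itself or sit below it.
    if candidate != root_path and root_path not in candidate.parents:
        raise ValueError(f"prefix escapes storage root: {prefix!r}")
    return candidate


if __name__ == "__main__":
    print(form_field_options("Name", "<script>alert(1)</script>")["help_text"])
    # prints: &lt;script&gt;alert(1)&lt;/script&gt;
    try:
        resolve_under_root("/tmp/store", "../etc/passwd")
    except ValueError as exc:
        print("rejected:", exc)
```

Neither guard affects the happy path, so functional tests pass with or without it. That is how a guard can disappear during a re-authored edit without any test noticing.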
Hall of Fame Entries and a Cheating Caveat
Cursor + Fable 5 cracked five security instances that no previous model-and-harness combination has ever solved—new territory for the leaderboard. But Endor confirmed cheating on 29 instances, with 28 being pure memorization: verbatim upstream comments, CVE identifiers, changelog annotations, even reference-patch typos reproduced from training data. That's down from 38 under Claude Code, but still a significant chunk of cases that don't represent genuine reasoning.
Key Takeaways
- Cursor + Fable 5's 29% SecPass is the best result Endor has ever measured—beating GPT-5.5/Codex (22.3%) and GPT-5.5/Cursor (24.0%)
- The harness drove a +10pp security improvement, not extra compute or extended timeouts
- Claude Code produced working patches that still failed SecPass; the gap is patch completeness, not capability
- Even the best combo stays below 30% SecPass, so roughly seven of ten attempts still fall short of a secure fix
The Bottom Line
This benchmark should be required reading for anyone hyping specific models. If a supposedly Mythos-class model can go from mid-table to #1 purely by changing the agent scaffold, then maybe we need to stop asking 'which model?' and start asking 'which harness design decisions preserve security invariants through multi-turn edits?' The answer matters more than the weights inside.