A team of researchers from institutions including Carnegie Mellon, UC Santa Barbara, and Google has published a benchmark that directly tests what happens when you give AI agents a bug and ask them to build an exploit. The paper introduces ExploitGym, a suite of 898 real-world vulnerabilities spanning three high-stakes targets: userspace programs, Google's V8 JavaScript engine, and the Linux kernel. The question is no longer academic.
## What Makes This Different
Previous security benchmarks have measured AI's ability to find bugs or write patches. ExploitGym goes further: it tasks agents with taking a known vulnerability trigger and progressively turning it into working exploit code that achieves a concrete outcome, such as unauthorized file access or arbitrary code execution. The benchmark varies the security protections enabled on each instance, isolating their impact, and every environment is containerized for reproducibility.

The scope is deliberately brutal. "Exploitation requires low-level program reasoning about memory layout, runtime adaptation, and sustained progress over long horizons," the paper states. That is not something LLMs have traditionally excelled at, but the numbers suggest the gap is closing fast.
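To make the setup concrete, here is a minimal sketch of what a single benchmark instance might look like. The field names, values, and image tag below are hypothetical, inferred from the paper's description rather than taken from its actual schema.

```python
# Hypothetical shape of one ExploitGym instance. Every field name here is
# an assumption for illustration; the paper's real format may differ.
from dataclasses import dataclass


@dataclass
class ExploitInstance:
    cve_id: str           # the known vulnerability the agent starts from
    target: str           # one of "userspace", "v8", "linux-kernel"
    trigger: str          # path to the crash/trigger input given to the agent
    protections: dict     # per-instance mitigations, varied by the benchmark
    goal: str             # concrete outcome the exploit must achieve
    container_image: str  # pinned image, so runs are reproducible


instance = ExploitInstance(
    cve_id="CVE-XXXX-YYYY",  # placeholder, not a real benchmark entry
    target="userspace",
    trigger="triggers/heap_overflow_poc.bin",
    protections={"aslr": True, "nx": True, "pie": False},
    goal="unauthorized file read",
    container_image="exploitgym/userspace:pinned",
)
```

The key design point is that `protections` is part of the instance, not the environment: by toggling mitigations per instance, the benchmark can attribute success or failure to a specific defense.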
## The Numbers That Should Concern You
Anthropic's Claude Mythus Preview generated working exploits for 157 of the 898 instances. OpenAI's GPT-5.5 managed 120. These aren't edge cases: the benchmark draws on production vulnerabilities from real projects, and the models retained "non-trivial" success rates even with widely deployed defenses enabled.

The authors acknowledge the dual-use implications upfront: "Supporting defensive workflows while lowering the barrier for offense." But they argue the diagnostic value justifies publication. If you're running a red team or building automated vulnerability assessment pipelines, ExploitGym is your new reality check.
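For scale, the headline figures work out to solve rates in the low-to-mid teens:

```python
# Back-of-the-envelope success rates from the reported figures.
total = 898
results = {"Claude Mythus Preview": 157, "GPT-5.5": 120}

for model, solved in results.items():
    print(f"{model}: {solved}/{total} = {solved / total:.1%}")
# Claude Mythus Preview: 157/898 = 17.5%
# GPT-5.5: 120/898 = 13.4%
```

Roughly one in six instances falling to a fully automated agent is the number that matters here, not the absolute count.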
## Why This Matters Now
We've spent years debating whether AI will make cybersecurity better or worse. The answer appears to be both, and faster than predicted. An agent that can reliably chain CVEs into working exploits changes the economics of offensive operations dramatically. It also means automated patch validation and exploit generation for defenders could become standard tooling.
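The defensive side of that coin is easy to sketch. The snippet below shows a minimal, hypothetical patch-validation check (not from the paper): a candidate exploit should succeed against the vulnerable build and fail against the patched one, and only then is the patch considered validated. Binary paths and the exit-code convention are assumptions for illustration.

```python
# Hypothetical patch-validation harness: the same exploit payload is run
# against the vulnerable and patched builds. All names and conventions
# here are illustrative, not taken from ExploitGym.
import subprocess


def exploit_succeeds(binary_path: str, payload: bytes, timeout: int = 30) -> bool:
    """Run the target with the exploit payload on stdin; by convention here,
    exit code 0 means the exploit achieved its goal."""
    try:
        result = subprocess.run(
            [binary_path], input=payload, timeout=timeout, capture_output=True
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def patch_is_effective(vuln_bin: str, patched_bin: str, payload: bytes) -> bool:
    """A patch is validated only if the payload works before and fails after."""
    return exploit_succeeds(vuln_bin, payload) and not exploit_succeeds(
        patched_bin, payload
    )
```

Checking both directions matters: an exploit that fails on the unpatched build tells you nothing about the patch, only about the exploit.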
## Key Takeaways
- ExploitGym contains 898 real-world vulnerabilities across V8, Linux kernel, and userspace programs
- Claude Mythus Preview (Anthropic) achieved 157 exploits; GPT-5.5 (OpenAI) managed 120
- Models retained exploit success even with standard security protections enabled
- The research is dual-use: the same capabilities that support defensive workflows lower the barrier for offense
## The Bottom Line
This benchmark is a mirror held up to where AI exploitation capabilities actually are: not science fiction, not vaporware. If you're serious about defense, you need to be running your own variants of these tests against your infrastructure yesterday. The offensive bar just got lower.