The Great Benchmark Freakout

Remember when everyone lost their minds last week over METR's latest "time horizon" graph? The evaluation group dropped its assessment of Anthropic's Claude Mythos Preview, showing frontier AI can now complete software development tasks that would take humans 16 hours, and the Twitterverse promptly melted down. Forecasters worried Mythos had "broken" the measurement tools entirely. Some went full doom mode. But Gary Marcus is here to pour cold water on the panic in his latest Substack piece, and honestly? He makes some solid points worth considering before you start building your bunker.
What METR Is Actually Measuring

Let's get technical for a second. METR's "time horizon" graph measures the length of software development tasks, expressed in how long they take human engineers, that frontier models can complete at parity with those engineers. That horizon has been doubling: an hour, then two, four, eight, and now sixteen. Impressive trajectory on the surface. But Marcus flags two critical asterisks that Twitter decided to ignore entirely. First asterisk? METR's headline metric uses a 50% success rate: the model completes the task just half the time, not 90%, not 99%. The 80% version of the same benchmark tells a very different story: the same shape, but much lower overall performance. Second asterisk? These are specifically software development tasks in controlled conditions. Ask these systems to watch a two-hour movie nobody's seen before and discuss the plot coherently, and you'll quickly remember why hallucinations remain a fundamental problem.
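To make the 50%-versus-80% distinction concrete, here's a minimal sketch of how a "time horizon at p% success" can be read off task results: fit a logistic curve of success probability against log task length, then solve for the length where the curve crosses p. All numbers, the synthetic data, and the fitting recipe below are illustrative assumptions, not METR's actual data or methodology.

```python
import numpy as np

# Sketch of a "time horizon at p% success" estimate. Everything here is
# synthetic and illustrative -- NOT METR's data or exact methodology.
rng = np.random.default_rng(0)

# Task lengths in human-engineer minutes, log-uniform from 1 min to ~33 h.
durations = np.exp(rng.uniform(np.log(1), np.log(2000), 400))

# Simulate pass/fail with a planted "true" 50% horizon of 4 hours.
true_h50 = 240.0
p_true = 1 / (1 + np.exp(1.5 * (np.log(durations) - np.log(true_h50))))
success = (rng.random(400) < p_true).astype(float)

# Logistic regression of success on centered log-duration, by gradient descent.
x = np.log(durations)
xm = x.mean()
xc = x - xm
w = b = 0.0
for _ in range(20_000):
    p = 1 / (1 + np.exp(-(w * xc + b)))
    w -= 0.05 * np.mean((p - success) * xc)
    b -= 0.05 * np.mean(p - success)

def horizon(p_target):
    """Task length at which predicted success probability equals p_target."""
    logit = np.log(p_target / (1 - p_target))
    return float(np.exp(xm + (logit - b) / w))

# The 80% horizon is necessarily shorter than the 50% one.
print(f"h50 = {horizon(0.5)/60:.1f} h, h80 = {horizon(0.8)/60:.1f} h")
```

Reading both crossings off the same fitted curve makes Marcus's point visible: quoting only the 50% horizon always flatters the model relative to the 80% one.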
The Symbolic AI Vindication

Here's where things get spicy from an insider perspective. Marcus argues that much of Mythos's recent gains come not from raw model scaling but from incorporating symbolic tools: code interpreters, formal verification systems, test harnesses. This is essentially neurosymbolic AI winning in practice, even as the hype machine keeps repeating "more parameters = more intelligence." Ramez Naam corroborated this angle by showing that when you normalize Anthropic's internal ECI benchmark against Epoch AI Research's public numbers, Mythos sits right on trend with GPT 5.4, not some discontinuous leap into superintelligence. The techniques work great for coding and math, where formal verification applies cleanly. But that's a very specific domain, not general reasoning capability.
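The "on trend" claim is just curve fitting: regress log2(time horizon) on release date across past models, then check whether the new point lands within the historical scatter. A sketch with made-up dates and horizons (not Epoch AI's or Anthropic's actual numbers):

```python
import numpy as np

# Is a new result "on trend"? Fit log2(horizon) against release month and
# check the residual. Dates and horizons below are made up for illustration.
history = np.array([
    [0, 1], [7, 2], [14, 4], [21, 8], [28, 16], [35, 32],
], dtype=float)  # (months since baseline, 50% time horizon in minutes)

months, horizon_min = history[:, 0], history[:, 1]
slope, intercept = np.polyfit(months, np.log2(horizon_min), 1)

def on_trend(month, observed_min, tolerance_doublings=1.0):
    """True if the observation is within +/- tolerance doublings of the fit."""
    predicted = slope * month + intercept
    return abs(np.log2(observed_min) - predicted) <= tolerance_doublings

# 64 minutes at month 42 continues the doubling exactly; 960 minutes
# (16 hours) at the same date would be a genuine discontinuity.
print(on_trend(42, 64), on_trend(42, 960))  # True False
```

Working in log space is the whole trick: a steady exponential becomes a straight line, so "right on trend" just means a small residual.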
The Trillion Pound Baby Fallacy

Marcus has a term for the extrapolation everyone's making: "the trillion pound baby fallacy." Just because an infant doubles its weight in four months doesn't mean it'll keep doubling until college graduation. Exponential processes always eventually hit resource constraints: energy, compute, cooling, physical infrastructure. We might also see "benchmarkmaxxing" limits as systems get better at gaming the specific METR task suite. Formal verification techniques may stall on messier real-world problems that lack clean mathematical structure. And let's not forget: solving software design doesn't equal open-ended intelligence. Marcus predicts Mythos will score under 20%, possibly under 10%, on the Remote Labor Index, which measures what percentage of online gig work bots can actually automate. Physical jobs remain firmly in human territory for now.
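The fallacy is easy to make concrete with back-of-the-envelope arithmetic: extrapolate an infant's early doubling rate out to age 18 and you blow past a trillion pounds long before graduation. The birth weight and doubling time below are rough assumed figures.

```python
# Back-of-the-envelope version of the "trillion pound baby" extrapolation.
# Assumed numbers (typical birth weight, early doubling time) are rough.
birth_lb = 7.5
doubling_months = 4
months_to_graduation = 18 * 12  # birth to roughly college age

doublings = months_to_graduation / doubling_months  # 54 doublings
college_weight = birth_lb * 2 ** doublings
print(f"{college_weight:.2e} lb")  # on the order of 10**17 lb
```

Fifty-four doublings turns a seven-pound newborn into something heavier than a small moon, which is exactly why a few early doublings tell you nothing about where the curve flattens.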
Key Takeaways
- METR's headline metric uses a 50% success threshold—a low bar that overstates capability
- Mythos gains come heavily from symbolic tools, not pure scaling—vindication for neurosymbolic approaches
- ECI benchmark normalization shows Mythos is on-trend with previous models, not revolutionary
- Exponential curves always plateau; resource constraints and task complexity limits will eventually bite