Anthropic just dropped BioMysteryBench, a new bioinformatics evaluation that puts Claude through its paces on real-world biological datasets. The results? Claude Mythos Preview solved 30% of the 23 problems that stumped a panel of five domain experts. That's not incremental improvement; that's crossing a threshold.

Why Bioinformatics Is Brutal for AI

Benchmarks like MMLU-Pro and GPQA test what models know, but they don't capture the messy workflow of actual research—reading papers, querying databases, running experiments, writing analysis code. Biology makes this harder: there are often multiple 'right' ways to solve a problem, individual decisions are wildly subjective, and some questions humans literally cannot answer yet. The metformin response puzzle is a perfect example. Researchers have spent a decade chasing predictors of why some type 2 diabetics respond to the drug while others don't—different study designs led to entirely different conclusions, and nobody's pinned down the mechanism of action even after 30 years.

BioMysteryBench by the Numbers

The benchmark contains 99 questions derived from real genomic data (WGS, scRNA-seq, methylation, ChIP-seq, metagenomics) plus proteomics and metabolomics. What makes it clever: answers come from controllable properties of the data or orthogonally validated metadata, not scientist conclusions that might be colored by subjective choices. Questions include identifying which organ a cell-type dataset comes from, determining what gene was knocked out in experimental samples, or matching samples to parent-child relationships from WGS sequences.

On the 76 human-solvable problems, Claude Opus 4.6 was consistent: of the problems it solved at all, it solved 86% reliably (at least 4 out of 5 attempts). The model either knows how to get there or it doesn't. But here's where it gets interesting: on the 23 human-difficult questions, that reliability plummeted to 44%. Fewer than half of Claude's 'wins' on hard problems were reproducible; the rest came from reasoning paths it couldn't follow consistently.
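To make the reliability numbers concrete, here is a minimal Python sketch of how such a metric could be computed from per-problem attempt counts. It assumes five attempts per problem and uses the article's thresholds: 4 or more correct counts as reliable, and 1 or 2 correct counts as 'brittle'. The "intermediate" label for exactly 3 correct, the function names, and the sample counts are assumptions made up for illustration; this is not Anthropic's scoring code or data.

```python
from collections import Counter

ATTEMPTS = 5  # attempts per problem, as described in the article


def classify(successes: int) -> str:
    """Bucket one problem by how many of its 5 attempts were correct."""
    if not 0 <= successes <= ATTEMPTS:
        raise ValueError("successes must be between 0 and 5")
    if successes == 0:
        return "unsolved"
    if successes >= 4:
        return "reliable"      # 4+ out of 5, per the article
    if successes <= 2:
        return "brittle"       # 1 or 2 out of 5, per the article
    return "intermediate"      # exactly 3: label assumed for this sketch


def reliability_profile(success_counts):
    """Share of each bucket among problems solved at least once."""
    buckets = Counter(classify(s) for s in success_counts)
    solved = sum(n for b, n in buckets.items() if b != "unsolved")
    if solved == 0:
        return {}
    return {b: n / solved for b, n in buckets.items() if b != "unsolved"}


# Made-up success counts for eight problems (NOT benchmark data)
example = [5, 4, 0, 1, 2, 3, 5, 0]
profile = reliability_profile(example)
print(profile)
```

Note that the denominator is the set of problems solved at least once, not all problems, which matches how the article phrases it ("of the problems it solved at all").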

What Claude Does That Humans Don't

Anthropic identified two strategies separating Claude from human benchmarkers. First: pure knowledge retrieval. Where a human expert would need to run meta-analyses or stitch together multiple databases, Opus solved the task directly, combining its internal understanding of mechanisms and ontologies with live analysis. Second, and this one surprised even the researchers: when uncertain about an answer, Claude layered multiple methods and looked for convergence across different lines of evidence. Human scientists tend to commit to a single approach; Claude hedges by running several in parallel.

The reliability gap tells a deeper story than headline accuracy numbers. Sonnet 4.6 showed the sharpest shift: reliable wins fell from 75% on solvable problems to just 22% on difficult ones, while 'brittle' solves (one or two out of five attempts) jumped from 9% to 56%. When Mythos Preview analyzed its own performance, it noted that accurate-but-unreliable solutions on hard problems are essentially lucky guesses rather than genuine capability. The model knows something is there but can't always find the path twice.
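The multi-method strategy described in this section can be pictured as a simple agreement check: run several independent analyses, then accept an answer only when enough of them point the same way. The Python sketch below is one hypothetical way to express that idea. The method names, the `converge` function, and the 0.6 agreement threshold are all assumptions for illustration, not a description of how Claude actually reasons.

```python
from collections import Counter


def converge(calls, min_agreement=0.6):
    """Return (answer, support) if enough methods agree, else (None, support).

    `calls` maps a method name to the answer that method produced.
    `support` is the fraction of methods backing the most common answer.
    """
    if not calls:
        raise ValueError("need at least one method call")
    votes = Counter(calls.values())
    answer, count = votes.most_common(1)[0]
    support = count / len(calls)
    return (answer if support >= min_agreement else None), support


# Hypothetical outputs of three independent tissue-of-origin analyses
calls = {
    "marker_genes": "liver",
    "reference_correlation": "liver",
    "classifier": "kidney",
}
answer, support = converge(calls)
print(answer, round(support, 2))
```

Returning `None` when agreement is weak keeps low-confidence answers visible instead of silently picking one, which is the kind of signal that matters when a solve might not be reproducible.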

Independent Validation From Big Pharma

The findings got reinforcement from an unexpected source: Genentech and Roche just released CompBioBench with 100 computational biology tasks using synthetic data and metadata scrambling. Their results echo BioMysteryBench almost exactly—Claude Opus 4.6 hit 81% overall and 69% on the hardest questions. Different dataset, different methodology, same conclusion: frontier models are genuinely useful collaborators for bioinformatics work now.

Key Takeaways

  • Claude Mythos Preview solved 30% of problems that stumped panels of five domain experts
  • On human-difficult tasks, only 44% of Claude Opus 4.6's solved problems were solved reliably; the rest came from reasoning paths it couldn't reproduce consistently
  • Two strategies differentiate AI: vast pretraining knowledge (skipping meta-analysis) and multi-method evidence convergence when uncertain
  • Independent validation from Genentech/Roche's CompBioBench confirms the findings with different methodology

The Bottom Line

BioMysteryBench puts numbers on what many in the research community suspected but couldn't quantify: AI isn't just catching up to trained scientists on bioinformatics; on a meaningful subset of problems, it has already surpassed them. But the reliability gap on hard tasks is the real story for anyone building AI agents meant to operate autonomously in scientific workflows. A 30% solve rate means something only if you can make it reproducible. The frontier's moving fast, and this benchmark just drew the starting line.