Researchers from Beth Israel Deaconess Medical Center and Harvard University just dropped a study in Science that should make every medical professional uncomfortable—and that's probably an understatement. The team tested OpenAI's o1 model against attending physicians across five diagnostic tasks, and the results weren't even close. In early emergency room cases where patients provide limited information about their ailments, o1 identified exact or very close diagnoses 67% of the time, compared to roughly 50% to 55% for doctors given identical case presentations.

The Numbers Don't Lie

On clinical reasoning, the metric measuring how well it explained its diagnostic thinking and next steps, o1 earned a perfect score on 98% of cases. Attending physicians managed that same standard only 35% of the time. When the researchers tested o1 against two physicians on real patient cases from Beth Israel's ER, the AI maintained its edge throughout all three stages of emergency care: intake, evaluation, and treatment decision. The gap narrowed as more information became available, but o1 still outperformed doctors by 2 to 10 percentage points even in later stages of care.

Adam Rodman, an internist at Beth Israel and co-author of the paper, admitted the team worried nobody would believe results this stark. 'The gap between the model and humans was so robust across all of the tasks,' he noted. Thomas Buckley, a computer scientist at Harvard who co-authored the study, called the final ER diagnostic test 'the most important' part of their research because it used real-world data that was incomplete, biased, and messy: exactly what AI will encounter in actual hospital settings.

The Ancient History Problem

Here's where things get interesting from an insider perspective: OpenAI's o1 was first released in late 2024. In machine learning terms, that's 'ancient history.' Stanford internist Eric Strong, who wasn't involved with the study, called the model's age 'irrelevant' because newer models are almost certainly performing at least as well, if not significantly better. The researchers tested a version that has already been superseded by newer iterations in the AI development cycle, a reminder that benchmarks for AI medical systems may be outdated before they're even published.

What This Can't Do Yet

The study has real constraints worth noting. Researchers provided o1 with only written case information, excluding imaging data like CT scans, X-rays, and pathology slides, inputs central to diagnosing blood clots, cancers, and dozens of other conditions. The model also wasn't tested on hospitalized patients requiring days of accumulated medical history. Rodman himself acknowledged the current architecture likely 'wouldn't work for a hospitalized patient who has days and days of information' and warned performance would drop off in those scenarios. The team is already running new experiments with longer-term, broader real-world data, but the fundamental question remains: can these systems improve actual patient outcomes outside controlled research conditions? Daniel McDuff, a computer scientist at Google who was uninvolved with the work, called testing AI 'in a real-world setting' exciting, but emphasized that understanding how models perform as someone's care 'evolves over time' is the next critical challenge.

Key Takeaways

  • OpenAI's o1 diagnosed correctly in 67% of early ER cases versus ~50-55% for physicians
  • The LLM earned a perfect clinical reasoning score on 98% of examined cases; doctors hit that standard only 35% of the time
  • Researchers used real patient data from Beth Israel Deaconess Medical Center across all three stages of emergency care
  • The model tested (o1, released late 2024) is already outdated—newer AI likely performs even better

The Bottom Line

This isn't a future concern anymore; it's happening now. Critics will rightly point out that o1 lacked access to imaging and long-term patient histories, but the trajectory is unmistakable: AI systems are already matching or beating trained physicians on core diagnostic tasks under real-world conditions. Healthcare's going to look very different in ten years, and this study suggests that a growing share of the diagnostic reasoning on those ER floors might not be human at all.