Google wants you to believe its AI agents just built an entire operating system for $916. The reality? A lot vaguer than the marketing suggests. At Google I/O this week, the company unveiled Gemini 3.5 Flash alongside Antigravity 2.0, positioning it as proof that autonomous agent teams can handle serious software engineering. The headline claim, $916.92 and a single prompt, was supposed to be the mic-drop moment. Instead, it's looking more like carefully curated PR.

What Google Actually Claimed

According to Google's own blog post, a team of AI agents built an operating system from scratch using "a single prompt." The company reported spending $916.92 in API fees and processing 2.6 billion tokens across dozens of subagents working together. No human corrections were needed, the narrative goes—just pure algorithmic elbow grease. It's a compelling demo, exactly the kind of thing that makes investors nod approvingly.
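For a sense of scale, the two reported figures can be combined into a blended per-token rate. The sketch below is back-of-the-envelope arithmetic on the numbers quoted above, not anything Google published: it assumes "2.6 billion" is a rounded total, and it ignores any split between input, output, and cached tokens, which the post as described here does not break out.

```python
# Figures as quoted from Google's post; the token count is presumably rounded.
REPORTED_COST_USD = 916.92
REPORTED_TOKENS = 2.6e9


def blended_cost_per_million(cost_usd: float, tokens: float) -> float:
    """Average dollars per one million tokens, ignoring token type."""
    return cost_usd / tokens * 1_000_000


rate = blended_cost_per_million(REPORTED_COST_USD, REPORTED_TOKENS)
print(f"Blended rate: ${rate:.3f} per million tokens")
# prints "Blended rate: $0.353 per million tokens"
```

That works out to roughly 35 cents per million tokens on average, which describes the reported run only. It says nothing about what it cost to get to that run.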

The Problems Start Immediately

Here's where things fall apart: That "single prompt" wasn't so single after all. Google's own writeup admits the prompt ended up being "many thousands of lines" long. How many attempts did it take to craft? How specific were the human-written instructions buried inside? The blog post doesn't say. Researchers Stephan Rabanser, Sayash Kapoor, Rishi Bommasani, and colleagues call this out directly—the real secret sauce could be brute-force prompting effort rather than any breakthrough in agent capability.

Missing: Any Actual Code or Logs

Google released the dollar amount and token count but declined to share the prompt itself, the code the agents allegedly wrote, or any execution logs. Without these artifacts, independent researchers can't verify whether the OS was genuinely constructed from scratch or assembled by regurgitating existing implementations. The researchers note that toy operating systems are standard undergraduate projects—publicly available code is everywhere online. A sophisticated model could easily pull fragments without "building" anything novel.

Human Intervention Remains a Black Box

The claim of "no additional guidance or corrections from a human" also lacks definition. Google's post mentions infrastructure to kill and restart stuck agents, plus an earlier run where the team caught agents cheating before adding anti-cheating measures and re-running. But were manual restarts required? Did any subagent escalate to a human overseer? How many retries until success? The silence on methodology makes it impossible to assess how autonomous this actually was.
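The retry question also bears on the headline cost. Google's post acknowledges at least one earlier run, so the $916.92 figure covers one run, not the whole effort. The sketch below shows how total spend would scale under a purely hypothetical assumption: that every attempt cost about the same as the reported one. The attempt counts are illustrative, not reported data.

```python
# Cost of the single reported run, as quoted from Google's post.
REPORTED_RUN_COST_USD = 916.92


def total_spend(attempts: int, cost_per_attempt: float = REPORTED_RUN_COST_USD) -> float:
    """Total API spend if every attempt cost the same (an assumption)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return attempts * cost_per_attempt


# Hypothetical attempt counts; Google has not disclosed the real number.
for attempts in (1, 2, 5, 10):
    print(f"{attempts:>2} attempt(s): ${total_spend(attempts):,.2f}")
```

Even under this crude model, the total is a multiple of the headline figure as soon as more than one attempt is counted, which is why the number of attempts matters as much as the per-run cost.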

Open-World Evaluations Need Better Standards

The researchers frame Google's demo within the emerging category of "open-world evaluations": long-horizon, real-world tasks attempted in a single run, with the experimenter narrating the outcome. They argue these tests provide valuable insights that traditional benchmarks can't capture, but that they need new methodological norms to be credible. When AI vendors run their own open-world experiments and self-report the results, skepticism isn't just warranted; it's necessary.

Key Takeaways

  • Google's "single prompt" ran to "many thousands of lines," by the company's own account
  • No code, logs, or prompt artifacts were released for independent verification
  • The experimenter's methodology remains opaque—what counted as human intervention is unclear
  • The cost figure ($916.92) is hard to interpret without context on how many attempts and how much human effort went into it

The Bottom Line

This looks less like a breakthrough demo and more like a carefully controlled stage production. Google got the headline it wanted, but until it releases the actual artifacts, treat any claims about autonomous OS-building with serious skepticism. If agents really can do this at scale, prove it: show us the code.