The numbers are staggering. Code Metal just raised $125 million to rewrite defense industry code using AI. Google and Microsoft both report that 25–30% of their new code is AI-generated. AWS used AI to modernize 40 million lines of COBOL for Toyota. Anthropic built a 100,000-line C compiler with parallel AI agents in two weeks, for under $20,000—code that boots Linux and compiles SQLite, PostgreSQL, Redis, and Lua. Microsoft's CTO predicts 95% of all code will be AI-generated by 2030. The rewriting of the world's software is not coming. It is underway. And almost no one is formally verifying any of it.

The Review Problem

Andrej Karpathy described his own pattern with brutal honesty: "I 'Accept All' always, I don't read the diffs anymore." When AI code is good enough most of the time, humans stop reviewing carefully. Nearly half of AI-generated code fails basic security tests—and newer, larger models do not generate significantly more secure code than their predecessors. The errors are there. The reviewers are not. Even Karpathy later outlined a cautious workflow for code he actually cared about, and when he built his own serious project, he hand-coded it. Consider what happens at scale. A single bug in OpenSSL—Heartbleed—exposed the private communications of millions of users, survived two years of code review, and cost the industry hundreds of millions of dollars to remediate. That was one bug, introduced by one human, in one library. AI is now generating code at a thousand times the speed across every layer of the software stack, and the defenses we relied on—code review, testing, manual inspection—are the same ones that missed Heartbleed for two years.

Mathematical Proof vs. Testing

Testing catches bugs quickly and cheaply. But testing provides confidence. Proof provides a guarantee—and those are not the same thing. Consider an AI rewrite of a TLS library. The code passes every test. But the specification requires constant-time execution: no branch may depend on secret key material, no memory access pattern may leak information. The AI's implementation contains a subtle conditional varying with key bits—a timing side-channel invisible to testing, invisible to code review. A formal proof of constant-time behavior catches it instantly. Without the proof, that vulnerability ships to production. The Harvard Business Review calls this "workshop": AI-generated work that looks polished but requires someone downstream to fix. When that work is a memo, it's annoying. When it's a cryptographic library, it's catastrophic. Chris Lattner, creator of LLVM and Clang, put it bluntly: AI amplifies both good and bad structure. Bad code at AI speed becomes incomprehensible nightmares. Poor software quality already costs the U.S. economy $2.41 trillion per year—a number calculated before AI began writing a quarter or more of new code at leading companies.

Why Lean Is Winning

The article makes a compelling case that verification is no longer just a cost—it can be a catalyst when AI makes proof cheap enough to apply broadly. The key claim: the bottleneck has always been implementation, not specification. Good engineers know what they want to build; they just couldn't afford to prove it correct. If proof cost drops near zero, every domain where correctness matters accelerates. The math community and AI research community have already converged on Lean as their platform of choice. Google DeepMind's AlphaProof, Harmonic's Aristotle, ByteDance's SEED Prover, Axiom, Aleph from Logical Intelligence, and Mistral AI all build on Lean. Every major AI reasoning system that achieved medal-level performance at the International Mathematical Olympiad used Lean—no competing platform was touched by any of them. Lean is backed by Mathlib: over 200,000 formalized theorems and 750 contributors. Five Fields Medalists engage with it. ACM SIGPLAN recognized this convergence with its 2025 Programming Languages Software Award. Enterprise teams are already using Lean in production. AWS verified its Cedar authorization policy engine. Microsoft is using Lean to verify its SymCrypt cryptographic library. Over 8,000 GitHub repositories contain Lean code. More than 700 people are active in the Lean Zulip channel every day.

The zlib Proof and What It Means

At a recent Lean FRO, Kim Morrison ran an experiment that went beyond expectations. A general-purpose AI—Claude, out of the box with no specialized training for theorem proving—converted zlib to Lean with minimal human guidance. It produced a clean Lean implementation, passed the existing test suite confirming behavioral equivalence, and then key properties were stated and proved as mathematical theorems—not tests. The capstone: a machine-checked proof that decompressing a compressed buffer always returns the original data, at every compression level, for the full zlib format. This was not expected to be possible yet. No custom model was needed. No specialized tooling was built. The barrier to verified software is no longer AI capability—it is platform readiness: how rich is the feedback the platform gives AI, how powerful is the automation, how large is the library of prior knowledge?

Key Takeaways

  • Nearly 50% of AI-generated code fails basic security tests—and model size doesn't fix this
  • Testing provides confidence. Proof provides a mathematical guarantee. For cryptographic side-channels, only proof catches them
  • Lean has won the formal verification platform war: every major AI math system uses it, and enterprise teams at AWS and Microsoft already use it in production
  • A general-purpose AI proved zlib correct with no specialized training—demonstrating that verified software generation is possible today

The Bottom Line

AI can write a 100,000-line compiler in two weeks for $20,000. It cannot prove that compiler correct—not without the right platform. That platform exists. Lean has already won the AI theorem-proving space, and it's being used to verify production systems at AWS and Microsoft right now. The question is whether the rest of the industry wakes up before the next Heartbleed—or something far worse—exposes what happens when you let machines write code faster than humans can verify it.