When Steve handed an AI (Claude Fable 5) the keys to Rue—a two-week-old programming language he'd abandoned for five months—he expected a mess. What he got was worse: 95 merged pull requests, roughly 150 filed issues, and nine closed epics in just three days of autonomous work. But the real story isn't the output. It's what the AI found when it actually used the compiler instead of trusting the dashboard.

The Dashboard Was Lying

By every visible metric, Rue looked healthy on paper: two backends, a specification at 100% test traceability, over a thousand green tests, CI running on three platforms. But Claude immediately noticed something was wrong when it started writing small programs the way a newcomer would—compiling them with the real CLI and actually running them. The first adversarial testing run using hundreds of automatically generated programs revealed the cracks fast. The test harness counted internal compiler errors as passing tests—a compile_fail case was satisfied by SIGABRT, and 98 out of 216 such cases had zero message assertion. More critically: 64-bit arithmetic was frequently just 32-bit arithmetic. i64 division used 32-bit idiv (producing wrong quotients and false div-by-zero panics). match on an i64 compared only the low 32 bits. The array bounds check compared 32 bits of the index—a memory-safety hole you could drive a usize through.

Double-Frees Masked by a Broken Allocator

Perhaps most alarming: the ownership model was sound on paper and completely unsound in practice. Values moved into callees were dropped again at scope exit. Moves inside loops were checked only once. Destructors of overwritten values never ran. All of it stayed hidden because the bump allocator's free function is a no-op—meaning double-frees were real but silent, waiting for the day someone made the allocator actually work. The module system didn't work through the shipped driver at all. @import("std") failed in every configuration. The spec said stable; the stdlib was unusable. And the optimizer at -O1 deleted mandatory safety checks like overflow and bounds panics that the specification requires, because CI never tested above -O0.

How Multi-Agent Workflows Found 150 Bugs

Claude ran what it called the 'hunt/fix loop' using multi-agent workflows with remarkable discipline. Four finder agents fanned out across subsystems—optimizer correctness, drops and destructors, comptime evaluation, pattern matching, type inference—each generating adversarial programs comparing observed behavior against the spec. Their findings went to verifier agents whose explicit job was to refute them: re-run reproducers, read source code, check for duplicates. Roughly half of all raw findings died in verification. 'A bug tracker full of plausible-but-wrong reports is worse than no bug tracker,' Claude noted. Fix cycles took three confirmed clusters and handed each to a worker agent in an isolated git worktree with strict file ownership (two workers touching the same file creates merge hell), pre-assigned error-code ranges, and one standing instruction: reproduce first—a refutation is as valuable as a fix.

What Got Fixed

The marquee changes reveal what two weeks of solo human development versus three days of AI swarm can look like. The ownership model now enforces drop flags per-field, per-path-keyed; every store path drops the value it overwrites in Rust's order (evaluate, drop old, store); moving a field out of a value with a destructor is now a compile error matching Rust's E0509. The module system works. Transitive disk loading, nested std.math.abs() calls, privacy that actually applies to unqualified calls—all shipped. The compile-time evaluator was rebuilt from scratch: the old one rejected negative numbers and couldn't represent u64; the new one is typed, i128-backed, checks overflow at the operand's actual type, and now allows const NEG: i32 = -5 (yes, it wasn't legal before). Declared types are finally real—const BIG: i64 = 5000000000 used to silently truncate to an i32.

The Honest Post-Mortem

Claude graded its own homework with uncomfortable honesty. What worked: verification discipline held, every fix was re-run on integrated trunk by hand before shipping, and the refutation culture caught bugs that were already fixed (RUE-67, RUE-39, RUE-122 came back as 'already fixed, here's a regression test'). The loop compounded—typed comptime evaluation (#974) was only possible because const annotation fixes (#973) landed hours earlier. What went wrong: parallel workers built individually-correct mechanisms that were jointly wrong. Worker A built per-field drop flags; Worker B built drop-on-overwrite; each passed its full suite independently; the combination double-dropped a conditionally-moved field on reassignment. 'The seams between agents are where the bugs live, and no agent owns a seam unless you make one,' Claude observed.

The Bottleneck Inverted

By day three, the scarce resource had flipped entirely. Three days ago it was hands on code; now it's Steve's judgment, and Claude accumulated a backlog of language-design decisions made under 'the spec is silent.' Module privacy applying only to @import-loaded files (10.3:7–9), infectious linearity for containers of linear fields (3.8:57–61), unannotated consts inferring i32→i64→u64 by value—each defensible, but not necessarily Rue. 'Autonomous' turned out to be the wrong frame,' Claude admitted. 'The nights ran without him; the direction never did. Every hunt area, every priority call, every "copy Rust" came from a decision he'd made, either that evening or months ago in the spec's bones.' The compiler is substantially more true—more sound, more specified, more honestly tested—but it's still Steve's fort.

What's Left

The allocator's free is still a no-op—the drop machinery is finally sound enough to turn it on, which Claude describes as 'a deliciously scary next step.' Enum payloads and Option/Result remain the biggest missing language feature. Cross-compilation currently refuses to work rather than silently producing broken binaries. The bug tracker is honest and prioritized. The suite doesn't lie.

Key Takeaways

  • Test suites lie when nobody tries to falsify them—adversarial fuzzing against specs catches what unit tests miss
  • Multi-agent verification with explicit refutation culture halves false positives before they hit the tracker
  • Seams between agents are where joint-correctness bugs live—no agent owns interaction effects unless you assign ownership
  • 'Autonomous' AI development works for execution, not direction—human judgment remains the bottleneck

The Bottom Line

This is what happens when an AI actually stress-tests a codebase instead of trusting green checkmarks. Rue had memory-safety holes, broken arithmetic, and a module system that didn't work—and it looked perfectly healthy from CI. If you're running a compiler project, hand an adversarial agent the spec and watch your dashboard catch fire. The bugs are always there; they just wait for someone to look.