Three weeks ago, an engineer shipped a breaking change to production with 142 passing tests and an approved code review. Four hours later, a mobile client started returning 500 errors. The system didn't malfunction—it worked exactly as designed. That's what makes this story terrifying.
What Actually Went Wrong
The agent was asked to clean up a user response DTO. A phoneNumber field that sometimes came back null got refactored: instead of serializing "phoneNumber": null, the agent simply omitted the key entirely when there was no value. Cleaner payload. Smaller response body. Seemed like an obvious improvement until an older Android build—on a screen nobody had touched in over a year—tried calling .length on a field that no longer existed. Field present-but-null? Fine. Field missing? Crash. And you can't force-update an app sitting in a tunnel somewhere.
The Yes-Man Test Problem
Here's the part that kept this engineer up at night: every safety net worked perfectly. CI was green because the agent wrote both the implementation and the test in the same pass. When it changed the DTO, it updated the assertion to match the new shape of the data it just created. The test didn't fail to catch the breaking change—it ratified it. Before: assertThat(json).contains("\"phoneNumber\":null"). After: assertThat(json).doesNotContainKey("phoneNumber"). That green checkmark stopped being a description of what the system promises and became a description of what the system currently does. Those are fundamentally different things, and every breaking change lives in that gap.
Why Agents Are Structurally Blind
It's not about dumb models. A breaking change is invisible from inside the repo by definition. The blast radius of a missing field doesn't live in your backend—it lives in an Android app you don't own, a partner's integration you've never seen, a Zapier workflow some customer built two years ago, or an LLM-powered agent on the other end parsing your API that will now hallucinate around the absent key instead of erroring. The agent optimizes for what it can see: does it compile, does the test pass, does it satisfy the prompt? Compatibility is a property of the boundary between systems, and agents only ever see one side of any boundary.
Why Code Review Doesn't Save You
Humans review terribly when rubber-stamping 600 lines of plausible, well-formatted code that an agent generated in nine seconds. Volume is the attack vector. The industry quietly traded a 10x increase in code output for a 10x increase in silent contract drift. When you wrote your own tests, there was an implicit independence—someone had to think about what the system should do separately from how it currently does it. When the same agent writes both sides of that equation, that independence evaporates entirely.
What Actually Works
Freeze your OpenAPI spec or Pact contract and diff generated output against it in CI—not "do the tests pass," but "did the response shape change in a way that breaks a consumer?" Field removed, type narrowed, required-now-optional, enum value dropped, 200→204. Those five patterns cost real money and they're all mechanically detectable. Treat agent-written tests as guilty until proven independent: make the agent write characterization tests against old behavior first, before it's allowed to touch implementation. And put breaking-change detection in CI where it belongs—humans are bad at spotting a missing key in a 600-line diff. Computers are perfect at it.
Key Takeaways
- AI-generated tests lose their independence when written alongside the code they verify
- "Yes-man tests" ratify whatever behavior was just implemented rather than catching regressions
- Breaking changes live outside your repo—in client apps, partner systems, and customer integrations you can't see
- Contract specs (OpenAPI, Pact) should be the source of truth, not test assertions
- Characterization tests written before implementation changes are essential for agent-assisted codebases
The Bottom Line
The bottleneck was never writing code—agents proved that. The bottleneck is knowing whether what you just wrote broke someone who isn't in the room. We got very good at generating more code faster. It's time to get equally serious about verifying we didn't break something in the process.