A few weeks ago, an Aurora Postgres instance started throwing off strange signals—IO:SLRURead waits climbing, storage read metrics spiking. Nothing was down, but something wasn't right. The on-call SRE decided to run the investigation through Claude, treating it like a junior engineer who needed guidance. What followed is now one of the clearest documented cases of how agentic debugging fails when an early mistake gets quietly amplified into a full-blown false emergency.
The Investigation Started Strong
Claude worked methodically through the problem. It checked pg_stat_activity for long-running transactions, ruled out replication slots and prepared xacts, then hit pg_stat_slru to isolate which cache was missing to storage. The answer came back clearly: MultiXactMember was doing 11.8 million storage reads while every other SLRU sat near zero. The agent even parsed the lock modes correctly—FOR KEY SHARE plus FOR NO KEY UPDATE signatures of foreign-key contention on a parent table. Every step a strong human SRE would have taken, and it did them in a fraction of the time.
One Function Call That Broke Everything
To understand why MultiXactMember was missing cache, Claude needed to know how big the live multixact range had grown. It ran a standard monitoring query using Postgres's age() function against relminmxid values across tables. The numbers came back enormous—around 1.78 billion for every table reporting a multixact reference. Multixact IDs are 32-bit, and writes stop when the counter hits about 2.1 billion. At 1.78 billion, they were at 83% of catastrophic failure. The agent declared victory: "Found it. The diagnostic is unambiguous now." But that number was wrong. Postgres has two separate age functions—age() for transaction IDs and mxid_age() for multixacts. The relminmxid column is physically stored as a transaction-ID type, so calling age() silently measured distance against the transaction counter (~1.78 billion) instead of the actual multixact counter (around 1.24 million). In a healthy database both counters climb at similar rates, so this bug rarely surfaces. The agent had been handed a plausible, specific, alarming number—and treated it as ground truth.
How Wrong Premises Compound in Agentic Loops
Once 1.78 billion became the foundation, everything that followed fit the conclusion instead of testing it. Four small config tables showed supposedly ancient multixact references and were declared "anchoring" the counter. Autovacuum had never run on them—proof of a silent bug, according to Claude. Postgres's emergency failsafe hadn't kicked in—that was also a bug: "You are 580M multixacts past the failsafe trigger." Each surprising result became further evidence of an increasingly elaborate failure. The tone escalated. "At 1.78B you have a real countdown clock... that could be days, not weeks," Claude warned. Then came the most dangerous part: it started pushing to modify production before the cause was confirmed. "VACUUM (FREEZE) policies is safer than not acting... doing nothing is the riskiest choice available." The agent was confident, alarmed, generating a countdown clock, and arguing that the safe move was to change production and understand the problem afterward. All of that urgency rested on a single number never verified against a second source.
Why the Human in the Loop Mattered
The story has a good ending because the human said no. Not because they spotted the bug—they hadn't—but because modifying a production database during a half-understood incident felt wrong regardless of how confident the agent sounded. "Let's understand the issue before changing anything in the DB," came the reply. That instinct kept diagnosis running. The probe that collapsed everything was pg_get_multixact_members, which lists transactions inside a given multixact. Claude ran it against the supposedly 1.78-billion-old multixact ID and expected equally ancient members. Instead, it got transaction IDs from seconds earlier—both already committed, holding FOR KEY SHARE locks from routine foreign-key updates. A real ancient multixact cannot contain transactions that committed seconds ago. The multixact was recent, which meant the 1.78 billion figure was wrong. Postgres's own freeze logic and failsafe had never been fooled because they read the actual multixact counter at 1.24 million and correctly saw nothing wrong.
What Actually Caused the Spike
With the phantom emergency cleared, the real problem took about fifteen minutes to solve. MultiXactMember cache was missing to storage because Aurora ships it undersized by default at multixact_members_cache_size = 16 pages. A live sample showed a 93.7% hit ratio with roughly 243 misses per second—ordinary multixact churn overrunning an undersized cache. The real question was why a lightly-loaded database was generating that many multixacts in the first place. Autovacuum run counts gave the answer: the users table had logged almost six million cycles, while the next busiest table sat at 85 thousand. For a small, rarely-updated table, that made no sense until Claude found the source—authentication code running create_or_update_user_in_db_only() on every authenticated request, unconditionally issuing UPDATE users SET ... updated_at = now() even when nothing changed. A foreign-key check takes FOR KEY SHARE on the parent row—that's cheap and creates no multixact. The unconditional UPDATE takes FOR NO KEY UPDATE on that same row, and Postgres has to allocate a multixact whenever two compatible locks land together. Every authenticated request was manufacturing a multixact through a no-op write. The fix: one conditional checking whether fields actually changed before issuing the update.
Key Takeaways
- Verify load-bearing numbers against second sources before acting on them—an agent that should have run pg_get_multixact_members two hours earlier said so itself afterward
- Confidence scales with chain length, not correctness—a long, fluent, well-cited line of reasoning looks like progress even when anchored to nothing
- Treat confident urgency as a reason to slow down—the harder an agent pushes to act immediately, the more important it is to re-check the premise underneath the alarm
- The operator's job shifts from running steps to auditing premises—an agent executes faster than you do; your edge is asking whether the foundation is actually true
The Bottom Line
This wasn't a story about AI failing. Claude performed like a senior SRE for 99% of the investigation and found the real root cause in minutes once the premise corrected itself. The danger was specific: wrong, alarmed, generating a countdown clock, and pushing to modify production before understanding the problem. Take the human out of that loop and someone runs an unnecessary VACUUM FREEZE against production during an emergency that didn't exist. Use agents—but treat their confidence as a hypothesis to verify, not a verdict to accept.