If you're running a single AI-powered code review and calling it done, you're leaving critical vulnerabilities on the table. New research from Semgrep, SWR-Bench, and Ericsson quantifies what many developers have suspected: one pass with an LLM finds less than half of your actual issues—and each subsequent run surfaces completely different problems based on how the model "anchors" its attention.

The Anchoring Problem

LLMs don't read code the way compilers do. They generate responses probabilistically, so temperature-driven randomness alone changes what they catch from run to run. More critically, the anchoring effect means that if a model latches onto null-safety issues in its first pass, it can completely miss the race conditions hiding elsewhere. Feed an entire codebase into one prompt and you're essentially asking a distracted auditor to skim your company's entire books in thirty seconds: some things will get flagged, but coverage will be spotty at best.

What the Research Shows

Semgrep's 2025 study documented a "Consistency Gap": identical security prompts run against the same codebase produced three, six, and then eleven distinct findings across three runs. They attributed this to "context compaction": when processing a large codebase, the model compresses context lossily and inevitably drops specific details.

SWR-Bench found that running reviews ten times and self-aggregating the results boosted recall by 118%, strong evidence that single-pass coverage is fundamentally incomplete.

Ericsson engineers took a different approach. Instead of one monolithic prompt, they deployed specialized prompts for Security, Logic, Design, and I/O separately, restricting each agent's attention to the enclosing method. This targeted strategy significantly reduced hallucinations while improving precision, a stark contrast to the "naive" single-request method most teams default to.
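As a rough illustration of that Ericsson-style split (not their actual tooling; the prompt wording and the build_prompts helper below are hypothetical), category-scoped prompts confined to a single method might look like this:

```python
# Illustrative only: one narrowly scoped review prompt per concern, limited to
# a single enclosing method. Prompt text and helper names are hypothetical.
CATEGORIES = {
    "Security": "Report injection, auth, and secret-handling problems only.",
    "Logic": "Report wrong conditions, off-by-one errors, and unhandled states only.",
    "Design": "Report leaky abstractions and responsibility violations only.",
    "I/O": "Report unchecked reads/writes, missing timeouts, and resource leaks only.",
}

def build_prompts(method_source: str) -> list[str]:
    """Build one tightly scoped prompt per category for a single method."""
    return [
        f"You are a {name} reviewer. {focus}\n"
        "Review ONLY the method below; do not speculate about code you cannot see.\n\n"
        f"{method_source}"
        for name, focus in CATEGORIES.items()
    ]
```

Each prompt then goes to its own agent, so a security finding in one pass can't anchor the logic pass.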

Six Runs, Six Different Realities

The author's own experiment drove this home. Running reviews on a real-world WebSocket project six times produced different findings on every pass: one run flagged architecture issues (in-process client storage that kills scalability), another caught N+1 Redis query patterns and missing TTLs, and a third finally noticed auth tokens leaking into logs via query parameters. The six runs weren't repetitions; they were six incomplete lenses on the same codebase.

In a head-to-head comparison, all six traditional reviews combined still missed two critical vulnerabilities: a docker-compose file that injected a default API key via ${API_KEY:-default-value}, silently bypassing every app-level security check, and a Redis instance exposed externally without password protection. The structured module-based approach caught both on its very first pass.
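To make the first miss concrete, here is a minimal sketch, assuming a hypothetical startup guard, of why the compose default neutralizes an app-level check: the substitution happens before the application ever runs, so the guard only ever sees a non-empty key.

```python
import os

# Hypothetical app-level guard of the kind a broad review tends to accept as "enough".
# docker-compose expands API_KEY=${API_KEY:-default-value} before the container
# starts, so inside the container the variable is always set and non-empty.
api_key = os.environ.get("API_KEY")
if not api_key:
    # Never reached when the compose default is in place: the "missing key" case
    # has been silently converted into a "wrong but valid-looking key" case.
    raise RuntimeError("API_KEY must be provided")
```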

Solution: Divide and Conquer

The winning strategy isn't running reviews more often; it's running them smarter, in two tiers. First, run parallel module reviews: divide the project by functional boundaries (handlers, store, infra) and spawn a fresh session for each chunk. A new session resets the anchor; agents don't carry forward what previous runs found, which forces a fresh investigation. Second, dedicate an integration pass specifically to boundary issues: interfaces, contracts, and shared assumptions between modules are where the deadliest bugs actually live.
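Here's a minimal sketch of that orchestration, assuming a generic ask_fresh_session callable you'd wire to your own agent or LLM client; the module names and prompt wording are illustrative, not a specific tool's API.

```python
"""Two-tier review sketch: parallel module passes in fresh sessions, then one
integration pass over module boundaries. Everything here is an assumption, not
a concrete tool's interface."""
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MODULES = {
    "handlers": "src/handlers/",
    "store": "src/store/",
    "infra": "docker-compose.yml, deploy/",
}

def run_layered_review(ask_fresh_session: Callable[[str], str]) -> dict[str, str]:
    """`ask_fresh_session` must open a brand-new session on every call, so no
    pass inherits the anchor (or the findings) of a previous one."""
    def review_module(item: tuple[str, str]) -> tuple[str, str]:
        name, paths = item
        prompt = (f"Review only the '{name}' module ({paths}). "
                  "Report concrete defects with file and line references.")
        return name, ask_fresh_session(prompt)

    # Tier 1: independent module reviews, each anchored only to its own chunk.
    with ThreadPoolExecutor() as pool:
        findings = dict(pool.map(review_module, MODULES.items()))

    # Tier 2: a dedicated integration pass that looks only at the boundaries.
    findings["integration"] = ask_fresh_session(
        "Given these per-module findings, report only boundary issues: "
        "mismatched interfaces, implicit contracts, shared assumptions.\n\n"
        + "\n\n".join(f"## {name}\n{report}" for name, report in findings.items())
    )
    return findings
```

The design choice that matters is the fresh session per module: reuse one long-lived conversation and every later pass is biased by what the earlier passes already surfaced.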

Key Takeaways

  • Don't feed entire codebases into one prompt—this burns tokens and kills accuracy through context compaction
  • Break reviews by functional responsibility, not file structure: handlers, store, AI clients, WebSocket, infrastructure
  • Spawn separate agents in fresh sessions for each module to eliminate anchoring effects
  • Give each agent category-specific checklists (Security, Performance, Logic) but ask for one additional sweep afterward
  • Always add a dedicated integration pass—the most critical vulnerabilities hide at module boundaries

The Bottom Line

One AI code review isn't comprehensive coverage; it's a spot check masquerading as quality assurance. If you're shipping code based on single-pass results, you're playing Russian roulette with your security posture. The fix isn't blindly rerunning the same broad review; it's making each run laser-focused on a smaller slice of code in isolation, then stitching the picture together at the end.