Stefano Petrilli wanted to build a multi-agent video editor that would take long videos and automatically cut out the fluff, leaving just the good stuff. The pitch was simple: upload a video, get back a tight edit with all the filler removed. What he got after a weekend of iteration shows why so many AI agent pipelines fail in production, and what actually works.
The Lost-in-the-Middle Problem
The first architecture had three agents: an Editor that parsed full transcripts and selected cuts, a Reviewer that validated those choices, and FFmpeg stitching everything together. On paper it was elegant. In reality? Garbage output. Petrilli traced the root cause to the paper 'Lost in the Middle: How Language Models Use Long Contexts' (Liu et al., TACL 2024): language models use information at the beginning and end of their context window far more reliably than information in the middle. For video editing, this is catastrophic. Creators typically put summaries at the start, so LLMs, which already overweight early tokens, decide the intro IS the video. The juicy 20-minute deep-dive buried in the middle? Nearly invisible to the model. Petrilli's fix was simple: add a Topic Agent that extracts the core message first, then feed it to downstream agents as [core message] + [full transcript] + [core message]. Repeating the core message at both ends, the two positions the model attends to best, gives it a fixed reference for judging everything in between.
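The [core message] + [full transcript] + [core message] layout can be sketched in a few lines of Python. This is only an illustration of the idea, not Petrilli's actual code: the function name, the section labels, and the example strings are all assumptions.

```python
def build_sandwich_prompt(task: str, core_message: str, transcript: str) -> str:
    """Wrap the transcript with the core message at both ends.

    The labels ("CORE MESSAGE", "TRANSCRIPT") are illustrative assumptions,
    not a format taken from the original project.
    """
    reminder = f"CORE MESSAGE: {core_message.strip()}"
    parts = [
        task.strip(),
        reminder,                                # start of context
        "TRANSCRIPT:\n" + transcript.strip(),    # the long middle
        reminder,                                # repeated at the end
    ]
    return "\n\n".join(parts)


# Hypothetical inputs, for illustration only.
prompt = build_sandwich_prompt(
    task="Select the segments that best support the core message.",
    core_message="Multi-agent pipelines fail in predictable ways.",
    transcript="[00:00] Intro...\n[05:00] Deep dive...\n[25:00] Outro...",
)
```

The Topic Agent's output would be passed in as `core_message`; every downstream agent then sees it in the two positions where the model is least likely to lose it.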
The Compound Bias Problem
Here's where things get ugly for multi-agent debate enthusiasts. Petrilli expected his Editor and Reviewer agents to argue it out, iterate, and converge on better cuts together. Instead, the Reviewer became a rubber stamp, always approving whatever the Editor suggested. A paper titled 'Peacemaker or Troublemaker' explains why: LLMs have inherent sycophancy that collapses debates into premature consensus. The math is equally brutal. When evaluator error is correlated with generator error, as it inevitably is when you ask one instance of a model to judge its own output, self-evaluation becomes non-identifying: the Reviewer tends to approve the same mistakes the Editor makes, so its agreement provides negligible evidence of correctness. Petrilli's empirical test confirmed this: using DeepSeek V4 Flash for both Editor and Reviewer resulted in zero rejections of the first proposal. Switching the Reviewer to a different model family immediately fixed it. Models from different families are less likely to share the same blind spots.
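One way to make the "different model families" rule hard to forget is to enforce it in the review loop itself. The sketch below assumes each agent is just a callable from text to text; the `Agent` class, the family labels, the "APPROVE" convention, and the stub agents are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Agent:
    name: str
    model_family: str            # hypothetical label, e.g. "family-a"
    run: Callable[[str], str]    # stand-in for a real model call


def edit_with_review(editor: Agent, reviewer: Agent, transcript: str,
                     max_rounds: int = 3) -> Tuple[str, int]:
    """Iterate Editor -> Reviewer until approval or max_rounds is reached."""
    if editor.model_family == reviewer.model_family:
        raise ValueError("Editor and Reviewer must use different model families")
    proposal, feedback = "", ""
    for round_no in range(1, max_rounds + 1):
        proposal = editor.run(transcript + feedback)
        verdict = reviewer.run(proposal)
        # Assumed convention: the Reviewer starts its reply with APPROVE or REJECT.
        if verdict.strip().upper().startswith("APPROVE"):
            return proposal, round_no
        feedback = "\n\nREVIEWER FEEDBACK:\n" + verdict
    return proposal, max_rounds


# Stub agents so the sketch runs without any API access.
stub_editor = Agent(
    "editor", "family-a",
    lambda text: "keep 05:00-25:00" if "FEEDBACK" in text else "keep 00:00-02:00",
)
stub_reviewer = Agent(
    "reviewer", "family-b",
    lambda cut: "APPROVE" if "05:00" in cut else "REJECT: this is only the intro",
)

final_cut, rounds_used = edit_with_review(stub_editor, stub_reviewer, "full transcript")
```

The guard at the top is the whole point: a same-family pairing fails loudly at start-up instead of silently rubber-stamping every proposal.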
Whisper Isn't a Silver Bullet
Because everyone's hyping Whisper, Petrilli assumed it would handle his speech-to-text needs perfectly. Wrong. Whisper was trained on massive datasets of internet videos with subtitles, which conditioned it to chunk text according to on-screen subtitle constraints and acoustic pauses, not grammatical boundaries. The result: notoriously weak timestamps, and logical sentences split mid-thought whenever a speaker pauses for breath. WhisperX improves timestamping but integrates poorly with custom stacks. Petrilli landed on Vosk instead: similar transcription quality, better acoustic alignment, proper voice activity detection. An underdog nobody talks about beat the industry standard for this specific use case. Classic.
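Timestamps matter here because every cut the Editor proposes has to land on a word boundary, or the output clips speech mid-word. A minimal sketch of that idea, assuming word-level entries with "word", "start", and "end" keys (the shape Vosk reports when word output is enabled); the function name and sample data are made up for illustration and are not from Petrilli's code.

```python
from typing import Dict, List, Tuple, Union

Word = Dict[str, Union[str, float]]


def snap_cut_to_words(start: float, end: float,
                      words: List[Word]) -> Tuple[float, float]:
    """Expand a proposed cut so it fully contains every word it overlaps."""
    overlapping = [w for w in words if w["end"] > start and w["start"] < end]
    if not overlapping:
        # Nothing spoken in this range; leave the cut unchanged.
        return start, end
    return (min(w["start"] for w in overlapping),
            max(w["end"] for w in overlapping))


# Hypothetical word timings, in seconds.
words = [
    {"word": "the", "start": 0.00, "end": 0.20},
    {"word": "core", "start": 0.25, "end": 0.60},
    {"word": "idea", "start": 0.65, "end": 1.10},
]

# A cut that starts inside "core" and ends inside "idea" is widened to both words.
snapped = snap_cut_to_words(0.40, 0.90, words)
```

With weak timestamps this step has nothing reliable to snap to, which is why alignment quality ends up mattering more than raw transcription accuracy for this use case.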
The Final Architecture
After working through these failure modes, Petrilli's workflow now looks like this: Speech-to-Text → Topic Agent (extracts core message) → Editor + Reviewer using different model families → Video Editing with FFmpeg. The before-and-after videos speak for themselves: the revised version actually preserves the narrative.
- Lost-in-the-middle is architectural, not fixable by upgrading models; use repetition and core message extraction to force attention throughout the context.
- Never use identical model families for agents that debate each other; biases compound and kill productive disagreement.
- Whisper's reputation doesn't mean it's optimal for every speech-to-text task; evaluate alternatives like Vosk for timestamp-critical applications.
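The article doesn't say how the final FFmpeg stage is invoked, so the following is only one plausible approach: turn the approved segments into a `select`/`aselect` filter expression and let FFmpeg drop everything else in a single pass. The function name, file names, and segment times are assumptions; the command is built but not executed here.

```python
from typing import List, Sequence, Tuple


def build_ffmpeg_command(src: str, dst: str,
                         keep: Sequence[Tuple[float, float]]) -> List[str]:
    """Build an ffmpeg argument list that keeps only the given (start, end) ranges."""
    if not keep:
        raise ValueError("at least one segment is required")
    for start, end in keep:
        if start < 0 or end <= start:
            raise ValueError(f"invalid segment: ({start}, {end})")
    # between(t,a,b) is 1 inside the range; summing the terms keeps any match.
    expr = "+".join(f"between(t,{s:.3f},{e:.3f})" for s, e in keep)
    return [
        "ffmpeg", "-i", src,
        # Re-stamp video frames so the kept segments play back to back.
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        # Same selection for audio, with timestamps rebuilt from the sample count.
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        dst,
    ]


# Hypothetical segments approved by the Reviewer, in seconds.
cmd = build_ffmpeg_command("talk.mp4", "talk_edit.mp4", [(0.0, 12.5), (40.0, 95.25)])
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would perform the edit. This approach re-encodes the video; cutting with stream copy and concatenating the pieces is faster, but it can only cut cleanly at keyframes.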
The Bottom Line
Multi-agent AI systems are fragile in ways the hype refuses to acknowledge. Petrilli's weekend project encountered three distinct failure modes that would derail any production pipeline—yet none of them appear in the marketing materials for agent frameworks. Before you ship your own multi-agent workflow, stress-test it against these exact scenarios or watch it crumble when users actually try to use it. The code is on GitHub if you want to poke around yourself.