A developer known as codecowboy has open-sourced an experimental multi-agent system where AI agents write code, score it for quality, and refine it—all without human intervention. The pipeline runs in a loop until the generated code hits a configurable quality threshold, then executes it as an actual subprocess. It's a clean proof-of-concept for what fully autonomous code generation could look like.
How It Works
The system uses three distinct agents with different roles. Agent 1 is the generator—prompted to respond only with raw Python code, no markdown, no commentary. It creates what becomes Agent 2: an actual script that gets executed later as a child process. The scorer evaluates that generated code against the original prompt, and if it falls below threshold, the refiner applies changes based on structured diff feedback before cycling back for another score.
Scoring With History
The scoring mechanism is where this gets interesting. The scorer uses claude-haiku-4-5-20251001—a cheaper, faster model—and returns a structured diff format with exact REMOVE/ADD pairs rather than vague suggestions like 'improve error handling.' More importantly, the full history of previous attempts gets passed to each scoring pass. Without it, scores regress—the scorer forgets what it already rewarded and starts penalizing things it previously accepted. Codecowboy notes they learned this the hard way on early runs.
Model Tiers for Different Jobs
The generator and refiner both run claude-opus-4-8 because applying structured diffs correctly requires solid reasoning capacity. The scorer gets by on haiku since evaluation is cheaper work. These aren't arbitrary choices—the article emphasizes deliberate model selection based on what each role actually needs to do well.
Configurable Thresholds
Two constants control the loop: MAX_REFINEMENTS = 3 and MIN_SCORE = 9.6 out of 10. If code hits 9.6, it gets accepted. Otherwise it's refined up to three times before the script exits with a non-zero code. Once accepted, Agent 1 writes the final Python to a temp file (with automatic cleanup in a finally block) and runs it as a subprocess—capturing stdout and stderr for debugging.
Key Takeaways
- Structured diff format with exact REMOVE/ADD pairs makes refinement deterministic rather than vague
- Full history passing prevents score regression between iterations—a critical detail
- Different models for different roles: opus for generation/refinement, haiku for scoring
- The system generates actual executable code that runs as a subprocess, not just suggestions
The Bottom Line
This isn't theoretical—it's a working pattern that proves AI agents can handle the write-score-refine loop autonomously. The structured diff approach and history tracking are the real innovations here; they transform flaky 'improve this' prompts into precise surgical edits. Next up: distributed message bus so multiple agents can refine each other across a network. That's when things get genuinely interesting.