Manual design reviews are one of those tasks that eat up your week but rarely feel satisfying. You're checking for missing context, hunting for security gaps, making sure someone actually thought about rollback plans—all tedious work that a well-prompted model can handle in seconds. A new tutorial on DEV.to walks through building an AI-powered design review agent from scratch using Python and Llama 3.3 70B running on Oxlo.ai.

What You're Building

The goal is straightforward: feed the agent an engineering spec written in markdown, and get back a structured JSON report scoring your documentation across seven categories—goals clarity, non-goals boundaries, dependencies mapping, security posture, observability planning, rollback strategy, and capacity estimates. The tool flags anything below a 7/10 and pairs each finding with a specific recommendation you can paste directly into a PR comment or ticket.

Setting Up the Client

You'll need Python 3.10+, an Oxlo.ai API key (free tier available at portal.oxlo.ai), and the OpenAI SDK installed via pip. The author uses Llama 3.3 70B specifically because it handles long context windows reliably, and since Oxlo.ai charges flat per-request rather than per-token, feeding a 3,000-word spec costs exactly the same as a one-liner—no bill shock when documents get lengthy.

Defining Your Review Rubric

The agent needs rigid evaluation criteria or it'll start inventing its own standards. The tutorial stores everything in module-level constants so you can version-control your rubric alongside the code. The system prompt instructs the model to return only valid JSON with scores for each category and specific recommendations for anything scoring below 7. This is where you'd customize the categories if your team has different review priorities.

Two-Pass Architecture

The approach uses a clever two-pass design: first, the main review generates scores and findings, then a second API call classifies severity (LOW, MEDIUM, HIGH, CRITICAL). Anything involving auth, PII exposure, or missing rollback on data migrations gets escalated to CRITICAL automatically. Because both calls are separate requests, Oxlo.ai's per-request pricing keeps costs predictable even as you chain reasoning steps together.

The Full Pipeline in Action

The CLI reads a markdown file, runs both review passes, and prints a formatted report with color-coded severity tags. Example output shows a thin spec for a Postgres-to-DynamoDB migration scoring 0/10 on capacity planning and 1/10 on observability—the kind of gap that usually gets caught three days before launch instead of during initial design.

Making It Part of Your Workflow

The author suggests wiring this into a GitHub Action so every PR containing a design doc gets an automatic review comment. You could also swap in DeepSeek V3.2 or Qwen 3 32B for heavier reasoning workloads without touching the client code, since everything uses the OpenAI SDK compatibility layer.

Key Takeaways

  • Start with rigid evaluation rubrics to prevent hallucinated criteria—vague prompts give you vague results
  • Per-request pricing matters when you're chaining multiple API calls per review cycle
  • Two-pass architecture (review then classify) keeps each model call focused and costs predictable
  • The OpenAI SDK compatibility layer means you can swap models without refactoring your code

The Bottom Line

This isn't about replacing senior engineers with AI—it's about handling the boring consistency checks so your staff can spend time on actual technical decisions. If your team writes design docs, this script will pay for itself in the first week.