Large Language Models are powerful, but they have a dirty secret in engineering environments: they hallucinate configuration syntax, invent Terraform flags, and answer from training data instead of your actual infrastructure docs. That's the problem VizLab.xyz solved by building a retrieval-first RAG architecture specifically for technical documentation—and the lessons from that build should terrify anyone who thinks "just add a vector database" is production-ready AI.
The Hallucination Problem Nobody Talks About
Generic LLMs can explain Docker or NGINX concepts at a broad level, but when engineers need accurate commands for their specific stack, generic responses become a liability. A model generating incorrect IAM policies or broken NGINX directives doesn't just return wrong answers; it risks infrastructure stability. VizLab's approach was architectural: trust only indexed documentation, retrieve relevant context first, and force the LLM to answer primarily from that retrieved content. This isn't a chatbot with RAG bolted on; it's a retrieval system that happens to use an LLM for synthesis.
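What does "answer primarily from retrieved content" look like in practice? Here's a minimal sketch of that prompt contract; the template wording is illustrative, not VizLab's actual prompt:

```python
# Illustrative retrieval-first prompt contract (not VizLab's real template):
# the model is told to answer only from retrieved chunks, cite them, and
# refuse rather than guess when the chunks don't cover the question.

def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[doc {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the documentation excerpts below. "
        "Cite excerpts as [doc N]. If the excerpts do not contain the "
        "answer, say so instead of guessing.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```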
Hybrid Search: FAISS Meets BM25
The core insight driving VizLab's architecture is combining dense vector search (FAISS) with sparse keyword matching (BM25). Dense retrieval handles semantic understanding and paraphrased questions beautifully—but it struggles with exact CLI commands, configuration syntax, and version identifiers. BM25 solves the keyword gap. The system merges results using Reciprocal Rank Fusion, scoring chunks by their combined ranking across both methods. This dramatically improves performance for engineering queries involving specific tooling flags or infrastructure terminology that semantic similarity alone would miss.
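VizLab hasn't published its fusion code, but RRF itself is a simple, well-known formula: each chunk earns 1/(k + rank) from every ranked list it appears in, so documents that rank well in both FAISS and BM25 float to the top. A minimal sketch, assuming chunk IDs and the conventional damping constant k = 60:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-ID lists (e.g. one from FAISS, one from BM25).

    Each chunk scores 1 / (k + rank) per list it appears in; k = 60 is the
    conventional constant from the original RRF paper.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked in result_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = reciprocal_rank_fusion([faiss_ids, bm25_ids])
```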
Query Rewriting and Re-Ranking: Closing the Gap Between Users and Docs
Naive RAG systems fail because users don't ask perfectly structured questions. "How do I configure that for NGINX?" requires understanding conversational context—which VizLab handles through query rewriting that expands queries into multiple search variants before retrieval. A question about container security might internally generate searches around Docker isolation, runtime permissions, and TLS hardening simultaneously. After retrieval, a custom re-ranking step boosts chunks based on keyword density, domain matching, and even introductory section preference. The result is responses that actually match what engineers need—not just semantically close approximations.
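As a rough illustration of the re-ranking half, here's a sketch that scores chunks on those three signals. The field names and weights are our guesses, not VizLab's tuned values:

```python
def rerank(chunks: list[dict], query_terms: set[str], domain: str) -> list[dict]:
    """Re-order retrieved chunks before prompt assembly.

    Signals mirror the ones described above: keyword density, domain match,
    and a preference for introductory sections. Weights are illustrative.
    """
    def score(chunk: dict) -> float:
        words = chunk["text"].lower().split()
        hits = sum(1 for w in words if w in query_terms)
        s = hits / max(len(words), 1)               # keyword density
        if chunk.get("domain") == domain:           # e.g. "nginx", "docker"
            s += 0.2
        if chunk.get("section_index", 99) == 0:     # boost introductory sections
            s += 0.1
        return s

    return sorted(chunks, key=score, reverse=True)
```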
Conversational Memory Without the Chaos
The memory system uses an in-memory sliding window maintaining six user turns and six assistant turns before automatic truncation. This prevents prompt explosion and token overflow while enabling follow-up conversations where "configure that reverse proxy" actually makes sense. The stored history feeds two stages: query rewriting (understanding what "that" refers to) and final prompt compilation (injecting conversation history). For single-instance deployments, this Python dictionary approach is lightweight and fast, but the team notes it would need Redis or distributed memory for multi-replica scaling.
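A minimal sketch of that sliding window, assuming alternating user/assistant turns so a single deque with maxlen 12 yields the six-plus-six cap described above (the class and names are ours, not VizLab's):

```python
from collections import deque

class SlidingWindowMemory:
    """Per-session history capped at six user and six assistant turns.

    A deque with maxlen gives the automatic truncation described above:
    once full, appending a new turn silently drops the oldest one.
    """

    def __init__(self, max_turns_per_role: int = 6):
        self.turns = deque(maxlen=2 * max_turns_per_role)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))  # oldest turn falls off automatically

    def as_prompt_history(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

# One plain dictionary of session_id -> memory suffices for a single
# instance; multi-replica deployments would move this into Redis.
sessions: dict[str, SlidingWindowMemory] = {}
```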
Infrastructure That Actually Works in Production
The deployment stack includes FastAPI, Gunicorn, AWS Bedrock with Titan Embeddings, Caddy reverse proxy, Tailscale private networking, and Docker containerization. A critical design decision: the backend never touches the public internet. It's locked inside a Tailscale mesh network with Caddy handling TLS automatically. Raw scraped documentation gets dumped to S3 before any processing, preserving source material even if downstream chunking or embedding pipelines fail. Garbage input produces garbage embeddings, so cleaning matters more than most developers realize.
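That raw-first ordering is easy to sketch. Assuming boto3 and a hypothetical bucket name, something like this archives each scraped page before any cleaning or chunking touches it:

```python
import boto3
from datetime import datetime, timezone

# Archive raw scraped HTML to S3 before any cleaning or chunking, so the
# source material survives downstream pipeline failures. The bucket name
# and key layout are hypothetical; only the "raw first, process later"
# ordering comes from the article.
s3 = boto3.client("s3")

def archive_raw_page(url: str, html: str, bucket: str = "vizlab-raw-docs") -> str:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    slug = url.replace("https://", "").replace("/", "_")
    key = f"raw/{stamp}/{slug}.html"
    s3.put_object(Bucket=bucket, Key=key, Body=html.encode("utf-8"))
    return key
```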
What Actually Broke (And How They Fixed It)
The hardest problems weren't the LLM APIs; they were infrastructure and operational consistency. Dockerizing FAISS native bindings alongside Bedrock credentials caused repeated CI/CD failures. Getting Tailscale and Caddy networking right, with proper TLS handling, headers, and browser compatibility, took multiple debugging iterations. But the most instructive failure was prompt instability: early versions occasionally ignored retrieved context entirely, producing weak citations and hallucinated explanations. The fix required extensive retrieval refinement alongside prompt engineering, a lesson worth tattooing on every RAG developer's forehead.
Key Takeaways
- Retrieval quality matters more than model size in production RAG systems
- Hybrid search combining FAISS and BM25 handles both semantic understanding and exact syntax matching
- Query rewriting expands poorly-formulated questions into multiple retrieval variants
- Re-ranking based on keyword density, domain matching, and chunk position improves precision significantly
- Raw documentation should be persisted to S3 before any processing, both for debugging visibility and so source material survives pipeline failures
- Conversational memory needs distributed backing (Redis) for multi-replica deployments
The Bottom Line
Building RAG systems isn't "connecting an LLM to a vector database"—it's serious systems engineering where retrieval pipelines, chunking strategies, caching layers, and prompt orchestration determine whether the system feels reliable or like another AI demo that falls apart in production. As these architectures mature, the engineering surrounding retrieval may matter more than the models themselves.