Flat Embeddings Can't Follow Edges: Why GraphRAG Dominates Relationship Queries

Standard RAG treats your knowledge base as a pile of text chunks ranked by embedding similarity. That works for factual lookups like 'What is the SLA for Gold tier vendors?' — one chunk, one answer. It breaks completely on relationship questions that require traversing connections between entities. Team BroCode proved this at the TigerGraph GraphRAG Inference Hackathon 2026 with hard numbers: their graph retrieval pipeline hit 96.7% accuracy while consuming 86% fewer tokens and running 17.5% faster than basic vector search.

The Geometry Problem Killing Your Accuracy

The question that exposes flat similarity's fundamental flaw: 'Which customers were impacted by OUTAGE-001 through their shared vendor and region?' No single document contains that answer. It lives in a traversal path: OUTAGE-001 → REGION-FRANKFURT → VEND-01 → [250 customers]. Cosine similarity finds chunks that mention the outage but cannot follow edges to aggregate the connected entities. BasicRAG had the documents for every eval entity in its index: full coverage, no missing context. It still capped at 71.1% accuracy. The overwhelming majority of its failures were multi-entity relationship questions requiring edge traversal, which flat search structurally cannot perform. This isn't a tuning problem. It's a geometry mismatch between the retrieval method and the shape of your data.
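To make the traversal path concrete, here is a minimal Python sketch of following typed edges from OUTAGE-001 to the impacted customers. It is a toy illustration, not Team BroCode's pipeline: the edge type names (occurred_in, hosts, serves), the customer IDs, and VEND-02 are hypothetical, and the real graph lives in TigerGraph rather than in an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical toy edges as (source, edge_type, target). Only OUTAGE-001,
# REGION-FRANKFURT and VEND-01 come from the article; everything else,
# including the edge type names, is made up for illustration.
EDGES = [
    ("OUTAGE-001", "occurred_in", "REGION-FRANKFURT"),
    ("REGION-FRANKFURT", "hosts", "VEND-01"),
    ("VEND-01", "serves", "CUST-A"),
    ("VEND-01", "serves", "CUST-B"),
    ("VEND-02", "serves", "CUST-C"),
]


def build_adjacency(edges):
    """Index edges as adjacency[source][edge_type] -> set of targets."""
    adjacency = defaultdict(lambda: defaultdict(set))
    for src, edge_type, dst in edges:
        adjacency[src][edge_type].add(dst)
    return adjacency


def follow_path(adjacency, start, edge_types):
    """Follow one typed edge per step, carrying the whole frontier along."""
    frontier = {start}
    for edge_type in edge_types:
        frontier = {
            dst for node in frontier for dst in adjacency[node][edge_type]
        }
    return frontier


adjacency = build_adjacency(EDGES)
impacted = follow_path(
    adjacency, "OUTAGE-001", ["occurred_in", "hosts", "serves"]
)
print(sorted(impacted))  # ['CUST-A', 'CUST-B']
```

CUST-C never appears in the result because no edge path connects it to the outage, even though a chunk about CUST-C could easily score well on embedding similarity to an outage-related query.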

Building the Test Bed

Team BroCode constructed a synthetic CRM knowledge base with 158.5M tokens across 100,820 documents embedded into 577,175 vector chunks using TigerGraph's native HNSW index (768-dimensional gemini-embedding-001). The schema models typed relationships as edges — Customers depend_on Vendors, Vendors experienced Outages, Customers located_in Regions, and so on. Every edge is traversable at query time.

Two-Phase Retrieval Pipeline

The pipeline runs in two phases. Phase one seeds with semantic vector similarity to find relevant Document nodes. Phase two runs a GSQL multi-hop traversal from those seeds, using a SetAccum to avoid revisiting nodes and a MapAccum to score chunks by hop distance (1.0 for direct neighbors, 0.5 for second-degree). For the OUTAGE-001 question, the seed step finds the outage document and Hop 1 traverses to the connected vendors and regions. The resulting context window is only ~1,483 tokens, compared with BasicRAG's retrieval averaging 10,867 tokens.
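The hop-distance scoring can be approximated outside GSQL with a plain breadth-first expansion. The sketch below is a Python analogue, not the team's GSQL query: a set stands in for the SetAccum, a dict stands in for the MapAccum, and the node IDs, the adjacency dictionary, and the function name score_by_hop are hypothetical. The 1.0 and 0.5 weights are the ones stated above; the 0.0 weight beyond two hops is an assumption.

```python
def score_by_hop(adjacency, seeds, max_hops=2):
    """Expand breadth-first from the seed nodes and score by hop distance."""
    hop_weight = {1: 1.0, 2: 0.5}  # weights from the article
    visited = set(seeds)           # plays the role of the SetAccum
    scores = {}                    # plays the role of the MapAccum
    frontier = set(seeds)
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor in visited:
                    continue  # never revisit or rescore a node
                visited.add(neighbor)
                # Assumption: anything past hop 2 gets weight 0.0.
                scores[neighbor] = hop_weight.get(hop, 0.0)
                next_frontier.add(neighbor)
        frontier = next_frontier
    return scores


# Hypothetical toy graph; in the real pipeline the seeds come from the
# vector similarity search in phase one.
toy_graph = {
    "DOC-OUTAGE-001": ["VEND-01", "REGION-FRANKFURT"],
    "VEND-01": ["CUST-A", "DOC-OUTAGE-001"],
    "REGION-FRANKFURT": ["CUST-A", "CUST-B"],
}
hop_scores = score_by_hop(toy_graph, seeds=["DOC-OUTAGE-001"])
print(hop_scores)
```

Because a node is scored only the first time it is reached, CUST-A keeps its second-hop score of 0.5 even though two different first-hop neighbors point at it.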

Evaluation Methodology

Three safeguards guard against self-scoring bias: an independent judge model, Groq Llama 3.1 8B Instant, from a different family than the Gemini generator; the same LLM across all pipelines, so accuracy differences reflect retrieval quality only; and canonical BERTScore with roberta-large and rescale_with_baseline=True.

Results That Actually Mean Something

GraphRAG: 96.7% accuracy (87/90), 1,483 avg tokens, 7.5s latency. BasicRAG: 71.1% accuracy (64/90), 10,867 avg tokens, 9.1s latency. LLM-Only baseline: 3.3% accuracy with just 14 prompt tokens — proving the retrieval layer was doing the heavy lifting.
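The headline percentages can be rechecked from the reported figures with a few lines of Python. This is only an arithmetic check on the numbers quoted above; the variable and function names are illustrative. Note that the rounded latencies (7.5s and 9.1s) give about 17.6%, slightly different from the quoted 17.5%, which was presumably computed from unrounded timings.

```python
# Figures as reported in the article.
graph_rag = {"correct": 87, "total": 90, "tokens": 1483, "latency_s": 7.5}
basic_rag = {"correct": 64, "total": 90, "tokens": 10867, "latency_s": 9.1}


def pct(fraction):
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(100 * fraction, 1)


graph_accuracy = pct(graph_rag["correct"] / graph_rag["total"])
basic_accuracy = pct(basic_rag["correct"] / basic_rag["total"])
token_reduction = pct(1 - graph_rag["tokens"] / basic_rag["tokens"])
latency_reduction = pct(1 - graph_rag["latency_s"] / basic_rag["latency_s"])

print(graph_accuracy, basic_accuracy)      # 96.7 71.1
print(token_reduction, latency_reduction)  # 86.4 17.6
```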

Where Graph RAG Still Has Headroom

The three honest misses were all complex multi-hop aggregation questions requiring filtered joins and counts across multiple hops. The current depth-limited traversal collects connected subgraphs but doesn't express join conditions explicitly — leaving more inference work for the LLM to perform on its own, which it sometimes gets wrong.
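To illustrate what an explicit join condition looks like, here is a small Python sketch of a filtered count across two relationships. It is not taken from the team's code: the dictionaries, the customer IDs, VEND-02, REGION-DUBLIN, and the function name count_customers are all hypothetical. The point is only that the filter (same vendor and same region) is evaluated by the retrieval layer rather than left for the LLM to infer from a collected subgraph.

```python
# Hypothetical toy relationships; only VEND-01 and REGION-FRANKFURT
# appear in the article.
depends_on = {
    "CUST-A": {"VEND-01"},
    "CUST-B": {"VEND-01"},
    "CUST-C": {"VEND-02"},
}
located_in = {
    "CUST-A": "REGION-FRANKFURT",
    "CUST-B": "REGION-DUBLIN",
    "CUST-C": "REGION-FRANKFURT",
}


def count_customers(vendor, region):
    """Count customers that depend on `vendor` AND are located in `region`."""
    matches = sorted(
        customer
        for customer, vendors in depends_on.items()
        if vendor in vendors and located_in.get(customer) == region
    )
    return len(matches), matches


count, customers = count_customers("VEND-01", "REGION-FRANKFURT")
print(count, customers)  # 1 ['CUST-A']
```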

What Makes TigerGraph Different

HNSW + GSQL in a single engine eliminates the two-system problem (separate vector DB plus graph DB) that makes most graph RAG implementations impractical. The community edition handled 100K documents and 577K indexed chunks without hitting limits — one Docker container, no external vector database, no managed cloud overhead.

When to Use Which Approach

Flat RAG: document QA, knowledge bases with independent facts, self-contained text per chunk. Graph RAG: any domain where entities have typed relationships — CRM, supply chain, security incident graphs, financial networks, healthcare. If your question contains 'through', 'via', 'related to', 'impacted by' — it's a traversal question, not a similarity question.
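The keyword rule of thumb can be turned into a rough first-pass router. The sketch below is a naive heuristic under stated assumptions, not part of the evaluated pipeline: the cue list is exactly the four phrases above, and the function name looks_like_traversal is hypothetical. A production router would likely need more than substring cues.

```python
import re

# The four cue phrases named in the article.
TRAVERSAL_CUES = ("through", "via", "related to", "impacted by")


def looks_like_traversal(question: str) -> bool:
    """Return True if the question contains a traversal cue as a whole word."""
    text = question.lower()
    return any(
        re.search(rf"\b{re.escape(cue)}\b", text) for cue in TRAVERSAL_CUES
    )


print(looks_like_traversal(
    "Which customers were impacted by OUTAGE-001 "
    "through their shared vendor and region?"
))  # True
print(looks_like_traversal("What is the SLA for Gold tier vendors?"))  # False
```

The word-boundary match keeps a cue like 'via' from firing inside unrelated words such as 'deviation'.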

The Bottom Line

Flat embedding similarity is the wrong tool for relationship data because it finds related text rather than traversable edges. Team BroCode's graph retrieval approach using TigerGraph Community Edition hit 96.7% accuracy where BasicRAG capped at 71.1%, with 86% fewer tokens and faster execution — not because graph RAG is more complex, but because the structure of the retrieval matched the structure of the data.
