Meet Graft: The Local-First Memory Layer Your AI Coding Agent Desperately Needs

If you've spent any real time with AI coding agents like Claude Code or OpenAI's Codex, you know the drill: they're brilliant in a session and utterly useless the next morning. That three-hour debugging session? Gone. The architectural decision your team agonized over? Forgotten. Every fresh context window starts from zero, and developers end up explaining the same project quirks over and over to models that should already know better.

The Memory Problem Nobody's Talking About

Graft tackles this head-on with a local-first approach that keeps agent knowledge persistent across sessions, machines, and context resets. Built in C11 on top of SQLite, sqlite-vec for vector operations, FTS5 for full-text search, and BGE-M3 embeddings via llama.cpp, Graft runs entirely on your machine, with no cloud dependency and no API key required. The whole system ships as a single binary with one database file. You install it with Homebrew or a shell script, run graft stats, and you're done. No daemon to babysit, no models to download manually, no YAML configuration nightmares.

How Graft Actually Works

The architecture splits into a thin CLI client that talks over an AF_UNIX socket to a graftd daemon, which handles all the heavy lifting. When you query Graft with something like graft query "spring boot validation cascade nested DTO", it runs a hybrid search that combines dense semantic embeddings (BGE-M3 cosine similarity) with lexical BM25 scoring, fused via Reciprocal Rank Fusion.

But here's what separates Graft from generic vector databases: the verify gate. Before returning a result, Graft runs trigram-Jaccard plus cosine similarity through a confidence check that returns STRONG, WEAK, or MISS. No hallucinated hits, no confident nonsense. A STRONG hit means both the semantic and the lexical signal passed muster. WEAK means only the semantic signal made it through. MISS means Graft couldn't verify anything useful.

The insert pipeline is equally deliberate. When you save a memory with graft insert --title "..." --body "..." --keyword spring-boot, Graft embeds the title, upserts keywords, runs vector_topk across keyword and semantic edges with MMR diversity to avoid redundant entries, then commits everything atomically. The result: weeks later, a question about cascading validation annotations, phrased quite differently, still returns your original memory as a verified STRONG hit.
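The query path described above can be sketched in a few lines. This is a minimal illustration, not Graft's actual C implementation: the function names, the RRF constant k = 60 (a commonly used default), and both gate thresholds are assumptions chosen for the example.

```python
import math
from typing import Dict, List, Sequence, Set


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to an item's score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def trigrams(text: str) -> Set[str]:
    """Character trigrams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets (0.0 if either set is empty)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def verify(cos_sim: float, jaccard: float,
           cos_min: float = 0.6, jac_min: float = 0.2) -> str:
    """Hypothetical gate: thresholds are illustrative, not Graft's real values."""
    if cos_sim >= cos_min and jaccard >= jac_min:
        return "STRONG"  # semantic and lexical signals both pass
    if cos_sim >= cos_min:
        return "WEAK"    # only the semantic signal passes
    return "MISS"        # nothing could be verified
```

In this sketch, rrf_fuse takes the dense ranking and the BM25 ranking as two lists of memory ids, and verify is then applied to the top candidate using its cosine score and the trigram overlap between the query and the stored title. Keeping the gate separate from the ranking means a high rank alone never produces a STRONG verdict.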

Built for Real Developer Workflows

Graft ships integrations for Claude Code (via skills), Codex (skills), ChatGPT and Claude Desktop (MCP server), Gemini CLI (GEMINI.md hook), and Open Code. The MCP implementation supports both stdio transport and an HTTP gateway with OAuth for microservice access. Each integration includes skill definitions that teach the model when to search Graft's memory and when to save new findings, which is crucial context your agents would otherwise lose forever.

The multi-tenant profile system isolates different workspaces (work, personal, project-scoped) into separate SQLite databases and daemon sockets. Export, import, and merge operations work on plain SQLite files you can back up with cp graft.db dest/. Optional GPU acceleration targets NVIDIA CUDA or AMD ROCm 6/7 via build flags for teams running beefy workstations.

The Microservice Cache Pattern (Experimental)

Beyond individual agent workflows, Graft's creators outline an intriguing three-tier architecture: Redis as L1, Graft's semantic cache as L2, and Graft plus an LLM as L3. Exact prompt matches are served from Redis for the price of a RAM lookup; paraphrase-aware lookups go through the verified STRONG/WEAK/MISS gate in L2 on local CPU; only confirmed misses route to the full LLM synthesis pipeline, with writeback for future calls. No benchmarks have been published yet, and the cross-encoder reranker remains a stub (it returns -1), but the pattern looks promising for teams that want to trim token costs without sacrificing recall quality.
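The tiering can be sketched as a small router. This is an illustration under stated assumptions, not Graft's API: TieredCache, its callback parameters, and the choice to serve both STRONG and WEAK verdicts from L2 are hypothetical, and a plain dict stands in for Redis.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

# Hypothetical callback shapes, used only for this sketch.
SemanticLookup = Callable[[str], Tuple[str, Optional[str]]]  # -> (verdict, answer)
SemanticStore = Callable[[str, str], None]                   # (prompt, answer)
Llm = Callable[[str], str]


class TieredCache:
    """L1 exact match, L2 verified semantic match, L3 LLM with writeback."""

    def __init__(self, lookup: SemanticLookup, store: SemanticStore, llm: Llm,
                 serve: FrozenSet[str] = frozenset({"STRONG", "WEAK"})) -> None:
        self.l1: Dict[str, str] = {}  # in-memory stand-in for Redis
        self.lookup = lookup
        self.store = store
        self.llm = llm
        self.serve = serve            # verdicts trusted enough to answer from L2

    def answer(self, prompt: str) -> Tuple[str, str]:
        """Return (answer, tier) where tier is "L1", "L2", or "L3"."""
        # L1: exact prompt match.
        if prompt in self.l1:
            return self.l1[prompt], "L1"

        # L2: paraphrase-aware lookup behind the verify gate.
        verdict, hit = self.lookup(prompt)
        if verdict in self.serve and hit is not None:
            self.l1[prompt] = hit     # promote so the next exact repeat is an L1 hit
            return hit, "L2"

        # L3: confirmed miss, so call the LLM and write back to both caches.
        result = self.llm(prompt)
        self.store(prompt, result)
        self.l1[prompt] = result
        return result, "L3"
```

A stricter deployment could pass serve=frozenset({"STRONG"}) so that WEAK hits also fall through to the LLM, trading token cost for extra confidence.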

What's Still Rough

Graft's honest disclosures are refreshing: active alpha status means APIs and storage formats may shift before 1.0. The neural cross-encoder reranker using BGE-reranker-v2-m3 is on the roadmap but not wired up yet, so today's verify gate relies on trigram-Jaccard plus cosine alone, which reportedly handles most corpora well enough. Team-shared memory across machines isn't shipped; today you're limited to per-machine local profiles with manual export/import cycles.

The Bottom Line

Graft fills a gap that's been haunting AI-assisted development since day one: ephemeral context windows that wipe hours of accumulated insight. It's not trying to be another vector database or RAG framework. It's the persistent reasoning layer sitting above your file search index, built by developers who clearly feel the pain of re-explaining project conventions every single session. If you're serious about shipping code with AI agents in 2026 and want memory that actually sticks around, Graft is worth a weekend experiment.
