Enki Benchmarks Show Memory Engine Achieves Comparable Accuracy With Half the Storage

A new entrant in the hotly contested memory layer space is making some bold efficiency claims. Enki, developed by UK-based Enki Labs and currently closed-source, has published its first public benchmarks comparing against rival mem0—and the results are worth dissecting for anyone building AI agent systems that need persistent context.

Benchmark Methodology

Both systems were evaluated using LongMemEval-S with identical conversation histories fed into each engine. Critically, the same model—Claude Haiku—was used to answer questions from both Enki's and mem0's retrieved memories, and grading was handled by an identical LLM-as-judge setup at equal retrieval depth (K=10). The only variable being tested is the memory layer itself. A validated slice of 25 instances has been published so far, with a full benchmark run reportedly in progress. The question types span multi-session reasoning, knowledge updates, single-session user queries, assistant interactions, and preference-based selection. This breadth gives us an early look at where each system excels under different retrieval scenarios—though the sample size warrants caution when drawing strong conclusions.
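
To make the controlled setup concrete, here is a minimal sketch of a harness that swaps only the memory layer while holding the answering model, the judge, and K fixed. Everything in it is an assumption for illustration: the `MemoryEngine` interface, the `Instance` shape, and the toy keyword engine, answerer, and judge are hypothetical stand-ins, not Enki's, mem0's, or LongMemEval's actual APIs.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class MemoryEngine(Protocol):
    """Hypothetical interface; not the real API of Enki or mem0."""

    def ingest(self, history: list[str]) -> None: ...
    def retrieve(self, question: str, k: int) -> list[str]: ...


@dataclass
class Instance:
    """Simplified, assumed shape of one benchmark item."""
    history: list[str]
    question: str
    reference: str


def evaluate(
    make_engine: Callable[[], MemoryEngine],
    instances: list[Instance],
    answer: Callable[[str, list[str]], str],
    judge: Callable[[str, str, str], bool],
    k: int = 10,
) -> int:
    """Score one engine; answerer, judge, and k are shared across engines."""
    score = 0
    for inst in instances:
        engine = make_engine()  # fresh store for every instance
        engine.ingest(inst.history)
        memories = engine.retrieve(inst.question, k)
        prediction = answer(inst.question, memories)
        score += int(judge(inst.question, inst.reference, prediction))
    return score


class KeywordEngine:
    """Toy engine: stores every turn, ranks by word overlap with the question."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def ingest(self, history: list[str]) -> None:
        self.facts.extend(history)

    def retrieve(self, question: str, k: int) -> list[str]:
        q = set(question.lower().split())
        ranked = sorted(
            self.facts,
            key=lambda f: len(q & set(f.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def toy_answer(question: str, memories: list[str]) -> str:
    # Stand-in for the shared answering model: echo the retrieved memories.
    return " ".join(memories)


def toy_judge(question: str, reference: str, prediction: str) -> bool:
    # Stand-in for the LLM-as-judge: substring match on the reference answer.
    return reference.lower() in prediction.lower()
```

Running `evaluate` once per engine with the same `instances`, `answer`, `judge`, and `k` is what isolates the memory layer as the only variable.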

Performance Breakdown

Enki scored 14 out of 25 versus mem0's 12 out of 25 across the validated slice. The two-point gap is modest and within the noise of a 25-question sample, but one category stands out: multi-session reasoning, where Enki scored 4/5 compared to mem0's 2/5. That's a potentially meaningful difference for agents handling ongoing projects or multi-day workflows where historical context compounds across sessions, although five questions per category is a very thin basis. Knowledge update accuracy was equal at 3/5 for both systems, and single-session performance (user queries, assistant responses, and preference selection) was identical for both, at 2 to 3 out of 5 in each category. The takeaway so far: comparable answer quality overall, with Enki pulling ahead specifically on tasks requiring reasoning across extended interaction histories.
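
To put a rough number on how much a 25-question sample can show, here is a sketch that computes 95% Wilson score intervals for the two overall scores. This is an illustration added here, not part of Enki's published analysis, and it treats each score as an independent binomial proportion. Because both systems answered the same questions, a paired test on the per-question results would be more appropriate, but those results are not reproduced in this article.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


enki = wilson_interval(14, 25)  # Enki's overall score from the article
mem0 = wilson_interval(12, 25)  # mem0's overall score from the article

print(f"Enki: {enki[0]:.0%} to {enki[1]:.0%}")
print(f"mem0: {mem0[0]:.0%} to {mem0[1]:.0%}")
```

The intervals come out at roughly 37% to 73% for Enki and 30% to 67% for mem0. They overlap almost entirely, which is consistent with reading the overall result as "comparable" rather than as a win.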

Storage Efficiency Numbers

Here's where things get interesting for production deployments. On the same conversation sets, Enki answered from an average of 138 stored facts, while mem0 kept 283. That's roughly a 51% reduction in storage footprint, measured in stored facts (Enki keeps about 49% as many), while maintaining competitive answer accuracy. For agentic systems running at scale with tight infrastructure budgets, this kind of efficiency matters. Enki's own framing is refreshingly honest: the overall margin sits within what a 25-item sample can demonstrate. The company calls out the "robust, repeatable result" as comparable accuracy at roughly half the memory footprint, with multi-session reasoning as the clear differentiator. Further evaluation is underway, and full methodology plus per-question results are available on request for those wanting to dig deeper.
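
The "roughly half" figure is easy to check from the two reported averages. The short calculation below uses only the numbers quoted above; the function name is illustrative. It also separates two quantities that are easy to conflate: the share of facts Enki keeps relative to mem0, and the reduction that share implies.

```python
ENKI_FACTS = 138  # average stored facts reported for Enki
MEM0_FACTS = 283  # average stored facts reported for mem0


def footprint(smaller: float, larger: float) -> tuple[float, float]:
    """Return (ratio, reduction) of `smaller` relative to `larger`, as fractions."""
    ratio = smaller / larger
    return ratio, 1.0 - ratio


ratio, reduction = footprint(ENKI_FACTS, MEM0_FACTS)
print(f"Enki keeps {ratio:.1%} as many facts as mem0")  # 48.8%
print(f"Reduction relative to mem0: {reduction:.1%}")   # 51.2%
```

So 138 is about 49% of 283, which corresponds to a reduction of about 51%. Note that this counts facts, not bytes, so the actual on-disk saving would also depend on how large each stored fact is.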

Retrieval Latency

For teams evaluating CPU-only deployments without GPU acceleration, Enki's latency numbers on a ~139-fact store (240 samples) show mean retrieval at 7.6ms with p95 at 11.9ms and p99 at 13.0ms. Median latency sits at 6.1ms—fast enough for most interactive use cases where memory lookups need to stay snappy behind an LLM's response time.
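
For readers who want to produce comparable figures on their own hardware, here is a minimal sketch of how mean, median, p95, and p99 can be derived from a set of timing samples. It assumes the nearest-rank percentile definition; Enki has not said which definition or tooling it used, and `time_calls` takes any zero-argument callable as a stand-in for a real retrieval call.

```python
import math
import statistics
import time
from typing import Callable


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Summary statistics of the kind quoted in the article."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


def time_calls(fn: Callable[[], object], runs: int = 240) -> list[float]:
    """Time `fn` repeatedly and return per-call latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Hypothetical usage: replace the lambda with a real retrieval call.
stats = summarize(time_calls(lambda: sum(range(1000)), runs=240))
print({name: round(value, 4) for name, value in stats.items()})
```

A mean (7.6ms) sitting above the median (6.1ms), as in Enki's numbers, usually points to a right-skewed distribution in which a minority of slower lookups pull the average up. That is why the tail percentiles are worth reporting alongside the mean.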

Key Takeaways

Enki achieves comparable overall accuracy (14/25 vs 12/25) to mem0 on the validated benchmark slice
Multi-session reasoning shows the clearest advantage: 4/5 versus 2/5, suggesting better handling of extended agent workflows
Storage footprint is roughly half: 138 facts versus 283 for comparable answer quality from the same underlying conversations
CPU-only retrieval latency has a mean of 7.6ms, a median of 6.1ms, and a p99 of 13.0ms on a modest fact store
The current dataset is a small slice (25 instances); full-benchmark results pending

The Bottom Line

If these numbers hold up at scale, Enki's approach to memory compression could be significant for agent builders prioritizing infrastructure efficiency over raw recall. The multi-session reasoning advantage alone justifies watching this one—cross-session context handling remains one of the harder problems in practical agent design, and any system that demonstrably improves it without bloating storage deserves closer inspection.
