If you've been evaluating agent memory libraries lately, here's what the vendor docs won't tell you straight up: most of these systems aren't implementing cognitive science's four kinds of memory—they're building a user profile database and borrowing terminology from Endel Tulving's 1972 chapter on episodic versus semantic memory to dress it up. The vocabulary is lifted; the engineering underneath doesn't match.
Anatomy of an Agent Memory System
Every agent memory library breaks down into three components, whatever the library calls them: an extractor, a store, and a retriever.
The extractor reads conversation transcripts and decides what to keep. Usually it's an LLM call with a typed output schema that produces short, abstracted facts about the user or task. The most consequential choice here is timing: extract after every message and you're burning tokens on small talk; extract at session end and you've already lost the context you needed for pronoun resolution. Neither approach is wrong, but each loses what the other preserves.
The store is the database layer: vector indexes for semantic similarity, relational tables with filterable columns, or knowledge graphs with typed edges connecting entries. Here's where things get interesting: when a new statement contradicts an old one (the user lived in Paris until April, then moved to Amsterdam), how does the system handle it? Overwrite and lose history? Append both and let retrieval sort it out? Mark the old statement as superseded? A store that can't answer 'what did I believe last month?' isn't a memory system; it's a timestamped snapshot.
The retriever turns a query into a search at runtime and returns the statements most likely to be relevant. Vector similarity is the baseline; keyword search on top is common; a reranker is the standard third layer. Some libraries add time filters and presupposition checks that block retrieval when the question itself assumes stale facts. Structurally this is RAG: the corpus just happens to be accumulated user statements instead of a document library.
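To make the "mark the old as superseded" option concrete, here is a minimal sketch of a store that closes out an old value instead of overwriting it, so the 'what did I believe last month?' question stays answerable. Everything in it is hypothetical and for illustration only: the `Fact` and `SupersedingStore` names, the in-memory list standing in for a real database, and the specific dates (the Paris and Amsterdam example above gives no year).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still believed"


class SupersedingStore:
    """Toy in-memory store (hypothetical): new values close out old ones."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, attribute: str, value: str, when: date) -> None:
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_to is not None:
                continue  # already superseded earlier
            if fact.value == value:
                return  # already the current belief; nothing to record
            fact.valid_to = when  # mark superseded, keep the history
        self.facts.append(Fact(subject, attribute, value, valid_from=when))

    def as_of(self, subject: str, attribute: str, when: date) -> Optional[str]:
        """Return the value believed on the given day, or None if unknown."""
        for fact in self.facts:
            if (fact.subject, fact.attribute) != (subject, attribute):
                continue
            if fact.valid_from <= when and (fact.valid_to is None or when < fact.valid_to):
                return fact.value
        return None


# Illustrative dates only; they are assumptions, not taken from any real log.
store = SupersedingStore()
store.assert_fact("user", "city", "Paris", date(2024, 1, 10))
store.assert_fact("user", "city", "Amsterdam", date(2024, 4, 2))

believed_in_march = store.as_of("user", "city", date(2024, 3, 1))
believed_in_may = store.as_of("user", "city", date(2024, 5, 1))
```

An overwrite-only store would return Amsterdam for both queries; the validity interval is what lets the March query still see Paris.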
The Four Kinds of Memory
Cognitive science's canonical taxonomy includes episodic memory (specific events tied to a time and place), semantic memory (decontextualized facts about the world), procedural memory (how to do things; muscle memory, essentially), and working memory (for an agent, the context window). Most agent libraries handle none of these cleanly. Episodic gets compressed into semantic at extraction time: 'over coffee on Tuesday, the user mentioned they prefer TypeScript' becomes just 'user prefers TypeScript.' The situated event is gone; what remains is a fact.
Procedural memory is the litmus test for whether a library is being honest about its capabilities. LangMem treats it as a distinct mechanism: it evolves the system prompt from scored trajectories, so what's remembered isn't a retrievable fact but a behavioral disposition encoded in instructions. Mem0 exposes the procedural label but writes it into the same index used for facts, with metadata.memory_type = "procedural" as the only difference. Graphiti doesn't expose procedural memory at all; everything lands in the same bitemporal graph regardless of source. Same vocabulary, completely different engineering underneath.
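The difference between "procedural as a label" and "procedural as a mechanism" is easier to see side by side. The sketch below is a toy contrast under stated assumptions: both classes are hypothetical and are not the API of LangMem, Mem0, or any other library. Keyword matching stands in for a vector index, and appending a line to the prompt stands in for a real prompt-optimization step driven by scored trajectories.

```python
from dataclasses import dataclass, field


@dataclass
class LabelOnlyMemory:
    """Design A (hypothetical): one shared index; 'procedural' is only a tag."""
    entries: list = field(default_factory=list)

    def add(self, text: str, memory_type: str) -> None:
        self.entries.append({"text": text, "metadata": {"memory_type": memory_type}})

    def search(self, keyword: str) -> list:
        # Facts and procedures go through the same retrieval path.
        return [e["text"] for e in self.entries if keyword.lower() in e["text"].lower()]


@dataclass
class PromptEvolvingMemory:
    """Design B (hypothetical): procedural memory rewrites the instructions."""
    system_prompt: str
    facts: list = field(default_factory=list)

    def add_fact(self, text: str) -> None:
        self.facts.append(text)

    def learn_procedure(self, instruction: str) -> None:
        # The lesson shapes every later turn; nothing has to retrieve it.
        self.system_prompt += "\n- " + instruction


lesson = "Run the tests before proposing a refactor"

a = LabelOnlyMemory()
a.add("User prefers TypeScript", "semantic")
a.add(lesson, "procedural")

b = PromptEvolvingMemory(system_prompt="You are a coding assistant.")
b.add_fact("User prefers TypeScript")
b.learn_procedure(lesson)

# In design A the lesson only surfaces if the query happens to match it.
found_by_matching_query = a.search("refactor")
found_by_unrelated_query = a.search("deploy")
```

In design A the procedure influences behavior only when a query retrieves it; in design B it is part of the instructions on every turn. That is the distinction to look for when reading a library's docs.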
Where the Biological Analogy Breaks Down
The analogies to biological memory are useful for vocabulary but dangerous as design guides. Consolidation, the slow compression from situated experience to decontextualized fact, has a real analog worth importing: Anthropic's Dreams and Letta's sleep-time compute run offline passes over accumulated material, deduplicating and resolving contradictions between sessions. That's the version that matches biology. Libraries that run extraction synchronously on every message are doing consolidation under live latency budgets, which is a degenerate case.
But some biological properties shouldn't transfer. Emotional salience, where the amygdala flags experiences with strong affect for stronger encoding, is structurally absent from text-only agents. There's no body and no autonomic system producing fear or surprise signals. Attempts to add importance scoring via LLM-judged proxies (Park et al.'s Generative Agents rate memories 1-10 on poignancy) ask the same model that lacks affect to estimate it. That gap is structural: it follows from operating on text alone.
Forgetting is where most libraries get it backwards. Biological memory forgets because it can't afford to store everything: that is a constraint, not a feature. An agent system has no such constraint. Disk is cheap, and a system that keeps everything can answer 'what did we know last March?', which makes it auditable, debuggable, and often what users actually need. The framing 'biological memory forgets so agent memory should too' imports the constraint as if it were the lesson.
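An offline consolidation pass that deduplicates and resolves contradictions without forgetting anything can be sketched in a few lines. This is an illustration under assumptions, not how any named product works: the `Statement` type, the `consolidate` function, the "newest statement wins" rule, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Statement:
    key: str           # the slot the statement fills, e.g. "user.city"
    value: str
    observed_on: date


def consolidate(raw: list[Statement]) -> tuple[dict[str, Statement], list[Statement]]:
    """Batch pass: drop repeats, keep the newest value per key, archive the rest."""
    current: dict[str, Statement] = {}
    archive: list[Statement] = []
    for stmt in sorted(raw, key=lambda s: s.observed_on):
        previous = current.get(stmt.key)
        if previous is not None:
            if previous.value.strip().lower() == stmt.value.strip().lower():
                continue  # repeat of the current belief: deduplicate
            archive.append(previous)  # contradicted, but retained for audit
        current[stmt.key] = stmt
    return current, archive


# Illustrative input accumulated across sessions (dates are made up).
raw_statements = [
    Statement("user.city", "Amsterdam", date(2024, 4, 2)),
    Statement("user.city", "Paris", date(2024, 1, 10)),
    Statement("user.city", "paris ", date(2024, 2, 3)),
    Statement("user.language", "TypeScript", date(2024, 3, 5)),
]

current, archive = consolidate(raw_statements)
```

Because the pass runs between sessions, it can sort and compare the whole backlog at once, which a per-message extractor under a latency budget cannot. The archive is the part that keeps 'what did we know last March?' answerable.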
Key Takeaways
- Every library breaks down into extractor, store, and retriever; read any library's docs with these three parts in mind
- Procedural memory is a litmus test: check whether it's a distinct mechanism or just metadata labeling
- Episodic gets compressed to semantic at extraction—you're losing the situated event regardless of what the docs say
- Biological forgetting is a constraint, not a feature—don't import it as design guidance
The Bottom Line
The vocabulary around agent memory is more stable than the products built on top of it. When you evaluate these libraries, place their choices on this map before trusting the marketing: which kinds of memory they actually handle, which anatomical parts are real versus stubbed out, and where they took the cognitive science terminology without doing the engineering underneath. The field's central problem is narrower than 'memory', and clearer when you name it: autobiographical semantic memory with extra steps.