Engineering · 11 min read

How AI Agents Actually Remember (An Architecture Field Guide)

Cognitive science has four memory categories. LLM agents need all four, but the architecture for each is different. A walk through working / episodic / semantic / procedural memory in production agents — what patterns work, what breaks, and where teams actually land in 2026.

The framing that's missing

Most discussions of "AI agent memory" collapse into a single shape — a vector database, sometimes with a knowledge graph bolted on, queried with cosine similarity, returned as context for the next prompt. That's not a memory system. That's one component of one kind of memory.

Cognitive science distinguished four memory categories decades ago: working, episodic, semantic, and procedural. Each is a different shape, accessed by different operations, with different retention curves. Production LLM agents need all four — and the engineering patterns that work for each are different.

This essay walks through the four. For each: what it stores, how the LLM uses it, the architectural patterns that have shipped in production, the failure modes, and where teams land in 2026.

Working memory

In humans: the scratchpad. The thing you're actively thinking about right now. Limited capacity, lost the moment you stop attending to it.

In LLM agents: the context window. Working memory is the slice of tokens the model sees on this turn — the system prompt, the conversation history, the retrieved context, the tool definitions. Everything else might as well not exist.

The pattern that works: aggressive context curation. The 200k-token window is a resource to be budgeted, not filled. Most agents waste 60% of their window on stale chat history; the production pattern is to summarize older turns, retrieve only the relevant slice of long-term memory, and reserve the recent N turns verbatim.
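
The curation step above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `estimate_tokens`, `curate_history`, and `take_within_budget` are hypothetical names, the 4-characters-per-token estimate is a rough assumption standing in for a real tokenizer, and `summarize` is whatever summarization call your stack provides.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def curate_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the last `keep_recent` turns verbatim; collapse older turns into one summary."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(turns) <= keep_recent:
        return list(turns)
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = f"[summary of {len(older)} earlier turns] {summarize(older)}"
    return [summary, *recent]


def take_within_budget(ranked_items: list[str], budget_tokens: int) -> list[str]:
    """Take recalled items in rank order until the next one would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for item in ranked_items:
        cost = estimate_tokens(item)
        if used + cost > budget_tokens:
            break
        chosen.append(item)
        used += cost
    return chosen
```

Stopping at the first item that does not fit (rather than skipping it and trying smaller ones) keeps the highest-ranked recall results contiguous; whether that is the right trade-off depends on how much you trust your ranking.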

Failure mode: the "lost in the middle" problem. LLMs attend most strongly to the beginning and end of the context, and material in the middle is recalled less reliably. Production agents fight this by placing the most critical instructions at the top and bottom of the prompt, with recall results sandwiched in between but explicitly marked.
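
One way to lay out that sandwich, as a sketch. The section headings and the `[recalled N]` markers are assumptions made up for illustration; any consistent, explicit labeling would serve. Placing the user's question between the recall block and the closing reminder is also a choice, not a rule.

```python
def assemble_prompt(critical: str, recalled: list[str], question: str) -> str:
    """Put critical instructions first and last, with labeled recall results in between."""
    recall_block = "\n".join(
        f"[recalled {i}] {text}" for i, text in enumerate(recalled, start=1)
    )
    parts = [
        "## Instructions\n" + critical,
        "## Recalled context\n" + recall_block,
        "## Current question\n" + question,
        # Repeat the critical instructions so they also sit at the end of the prompt.
        "## Reminder\n" + critical,
    ]
    return "\n\n".join(parts)
```

The repetition costs a few tokens per turn, which is usually cheap next to the recall block it brackets.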

Where teams land in 2026: chunked context with explicit role markers, dynamic budget allocation per turn, and summarization passes that compress the conversation history into a one-paragraph state once it crosses ~30k tokens of chat.

Episodic memory

In humans: specific autobiographical events. Where you were, who you were with, what happened. Time-stamped, context-rich, often vivid but unreliable on details.

In LLM agents: turn-level conversation history. What did the user ask last Tuesday? What did the agent answer? Episodic memory lets the agent say "you mentioned your team uses Postgres back in March" instead of asking the same onboarding question again.

The pattern that works: append-only event logs with semantic indexing on top. Every user message, every tool call, every notable agent decision lands as a timestamped row. Retrieval is "find episodes similar to the current question": a vector search over the embedded event text.

Failure mode: recall noise. Vector search on raw chat history surfaces a lot of irrelevant, similar-sounding episodes. "What did we say about the database?" returns five conversations from five different projects, none of them scoped to the project the user is asking about. The fix is per-user, per-project namespacing on the episodic index, and almost no team gets this right on v1.
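
A small in-memory sketch of the namespacing fix: filter by user and project first, then rank by similarity. The `Episode` and `EpisodicLog` names and the `project_id` field are hypothetical, and the embeddings are assumed to come from whatever embedding model you already use. A production version would push the filter into the vector index rather than scanning a list.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Episode:
    user_id: str
    project_id: str
    text: str
    embedding: tuple[float, ...]
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EpisodicLog:
    """Append-only event log with namespace-scoped similarity recall."""

    def __init__(self) -> None:
        self._events: list[Episode] = []

    def append(self, episode: Episode) -> None:
        self._events.append(episode)  # events are never updated or deleted

    def recall(
        self,
        query_embedding: tuple[float, ...],
        user_id: str,
        project_id: str,
        top_k: int = 3,
    ) -> list[Episode]:
        # Scope to the namespace BEFORE ranking, so other projects cannot leak in.
        scoped = [
            e for e in self._events
            if e.user_id == user_id and e.project_id == project_id
        ]
        scoped.sort(key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
        return scoped[:top_k]
```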

Where teams land in 2026: a Postgres table with turn_id, user_id, session_id, created_at, and an embedding column, plus a project key to support the namespacing described above. Indexed for vector search with a metadata filter. Hosted memory APIs handle this well; OSS forks of Mem0 and Letta do too.

Semantic memory

In humans: facts and concepts, decoupled from the specific episode you learned them in. You know Paris is the capital of France; you don't remember the exact moment you learned that.

In LLM agents: extracted facts and relationships. Not the raw conversation about Paris; the assertion capital_of(Paris, France). This is where knowledge graphs live. This is where the wiki view lives. This is what most teams mean when they say "memory" without realizing they're only describing one category.

The pattern that works: client-side or server-side extraction passes that turn raw episodes into structured facts. Those facts are deduplicated against existing facts, conflict-resolved when new facts contradict old ones, and linked to related entities via a graph. The retrieval shape is different from episodic: instead of "find similar episodes" it's "walk the graph from this entity to find related ones, plus return the facts attached to each node."
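
The entity-walk retrieval can be illustrated with a bounded breadth-first search. This is a toy sketch under stated assumptions: `FactGraph` is a hypothetical in-memory structure, facts are plain strings, and edges are undirected. A real graph store would have typed, directed relationships.

```python
from collections import deque


class FactGraph:
    """Entities as nodes, facts attached to nodes, undirected links between entities."""

    def __init__(self) -> None:
        self.facts: dict[str, list[str]] = {}
        self.edges: dict[str, set[str]] = {}

    def add_fact(self, entity: str, fact: str) -> None:
        self.facts.setdefault(entity, []).append(fact)
        self.edges.setdefault(entity, set())

    def link(self, a: str, b: str) -> None:
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def walk(self, start: str, max_hops: int = 1) -> dict[str, list[str]]:
        """Return the facts for every entity within `max_hops` of `start`."""
        if start not in self.edges:
            return {}
        seen = {start}
        queue = deque([(start, 0)])
        result: dict[str, list[str]] = {}
        while queue:
            entity, hops = queue.popleft()
            result[entity] = list(self.facts.get(entity, []))
            if hops == max_hops:
                continue  # do not expand past the hop limit
            for neighbor in sorted(self.edges.get(entity, ())):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
        return result
```

The hop limit matters in practice: without it, a well-connected entity pulls most of the graph into the context window.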

Failure mode: the "facts pile up forever" problem. Without conflict resolution at ingest, the agent ends up storing "the deploy command is bash deploy.sh" alongside "the deploy command is ./scripts/deploy" and recalling both, contradicting itself in the answer. This is the silent quality failure that production teams hit around month three, at which point they start looking for a fix.
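
A sketch of conflict resolution at ingest, using the deploy-command example. It assumes each (subject, predicate) pair holds a single current value and that the most recently observed fact wins; both are simplifying assumptions, and `Fact` and `FactStore` are hypothetical names. Superseded facts stay in the history for auditing but are never recalled.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    observed_at: datetime
    superseded_by: Optional["Fact"] = None


class FactStore:
    def __init__(self) -> None:
        self._current: dict[tuple[str, str], Fact] = {}
        self.history: list[Fact] = []

    def ingest(self, fact: Fact) -> None:
        key = (fact.subject, fact.predicate)
        existing = self._current.get(key)
        if existing is not None and existing.value == fact.value:
            return  # exact duplicate of the current fact: nothing to store
        self.history.append(fact)
        if existing is None or fact.observed_at >= existing.observed_at:
            if existing is not None:
                existing.superseded_by = fact  # the old fact stops being recalled
            self._current[key] = fact
        else:
            # A late-arriving older fact is recorded but already superseded.
            fact.superseded_by = existing

    def recall(self, subject: str, predicate: str) -> Optional[str]:
        fact = self._current.get((subject, predicate))
        return fact.value if fact else None
```

With this in place, ingesting "bash deploy.sh" and later "./scripts/deploy" for the same subject and predicate leaves only the later value recallable.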

Where teams land in 2026: a knowledge graph with automatic entity resolution and supersedence rules. Either embedded in the memory layer (Ricord, Zep's bi-temporal graph, Cognee) or built as a custom service on top of a pgvector + Neo4j stack. The expensive part is getting extraction right; the graph itself is cheap once you have well-typed facts going in.

Procedural memory

In humans: how to ride a bike. How to drive a car. Skills and habits, embedded in behavior rather than in explicit recall. You can't describe it; you can only do it.

In LLM agents: instructions, preferences, and standard operating procedures. "Always use TypeScript explicit return types in this codebase." "Format JSON with 2-space indent." "When asked to write a PR description, use the structure from .github/PULL_REQUEST_TEMPLATE.md." This is the layer most teams call "system prompt" or "custom instructions" or CLAUDE.md, and it's the one most often confused with semantic memory.

The pattern that works: a separate collection of instructions, scoped per project or per agent, injected into the system prompt every turn. NOT mixed into the same vector store as facts and chat history — when instructions get vector-searched, they only show up when their text happens to match the query, which is the opposite of what you want. Instructions should always be present.

Failure mode: instruction drift. Users update procedures; the agent keeps following the old ones because the new instructions never made it into the always-injected prompt. Production agents need a clean update path for procedural memory — and an easy way to see which procedures are currently active.

Where teams land in 2026: a small set of structured procedure objects with id, kind, scope, body, and an updated_at. Always loaded into the prompt at session start. Exposed in a UI users can read and edit directly.
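
A sketch of what such procedure objects might look like, using the fields named above. The `ProcedureStore` class, the "global" scope value, and the rendered prompt format are assumptions for illustration. The point is that procedures are looked up by scope and always rendered, never retrieved by similarity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Procedure:
    id: str
    kind: str
    scope: str
    body: str
    updated_at: datetime


class ProcedureStore:
    def __init__(self) -> None:
        self._by_id: dict[str, Procedure] = {}

    def upsert(self, procedure: Procedure) -> None:
        # Replace by id, so an edited procedure cannot coexist with its old version.
        self._by_id[procedure.id] = procedure

    def active(self, scope: str) -> list[Procedure]:
        """Every procedure that applies to this scope, in a stable order."""
        return sorted(
            (p for p in self._by_id.values() if p.scope in (scope, "global")),
            key=lambda p: p.id,
        )

    def render(self, scope: str) -> str:
        """Text to inject into the system prompt, regardless of the user's query."""
        lines = [f"- [{p.kind}] {p.body}" for p in self.active(scope)]
        return "Standing instructions:\n" + "\n".join(lines)
```

Keying on `id` is what addresses instruction drift here: an update overwrites the old body, and `active` doubles as the "which procedures are live right now" view.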

Putting it together — what a production memory layer looks like

A working memory layer in 2026 is the combination of all four:

  • Working memory: aggressive in-prompt context budget, recent-N-turns verbatim, older turns summarized.
  • Episodic memory: append-only event log, vector-indexed, namespaced per user + project, returns top-K similar episodes for the current query.
  • Semantic memory: extracted facts + knowledge graph, deduplicated and conflict-resolved at ingest, retrievable via entity walks or fact filters, browsable as a wiki.
  • Procedural memory: structured instruction objects, scoped per project, always loaded into the prompt, separately editable in a UI.

Most teams build the episodic layer first (it's the cheapest), realize three months later that they need semantic too (because contradictions are eating their credibility), then realize six months later that they should have built procedural memory separately the whole time. The pattern is so consistent it's almost a rite of passage.

Where Ricord fits

Ricord ships all four memory categories as a single MCP-compatible layer. Procedural memory lives in the procedures table and is loaded into ricord_get_context at session start. Episodic memory lives in memories, indexed for sub-second recall. Semantic memory is the knowledge graph and the auto-generated wiki pages — extracted client-side from your conversations, deduplicated and conflict-resolved server-side. Working memory is what your LLM does with the recall results — Ricord doesn't replace it, just feeds it the right slice.

bun add -g ricord
ricord login
ricord install   # auto-detects Claude Desktop, Claude Code, Cursor, Codex

Three commands. Restart your MCP client. The agent now has all four memory categories wired in, plus a wiki view in the dashboard that is populated enough to browse by week two.

If you're going the OSS route (fork Mem0 or Letta and stand it up), the four-category framing above is the blueprint for what you'll need to build over the next quarter. Procedural and semantic are the two layers OSS baselines typically don't ship, and they're where the engineering time goes.
