Best AI Memory for LlamaIndex (2026)
LlamaIndex's ChatMemoryBuffer keeps the recent turns inside one session. It doesn't remember across sessions, across users, or recall older knowledge by meaning. Six ways to add real memory to LlamaIndex — built-in, hosted, OSS — evaluated honestly.
Why this is two questions, not one
LlamaIndex has built-in memory, and it answers two different problems. Knowing the split is the start of picking the right tool.
- ChatMemoryBuffer (and ChatSummaryMemoryBuffer): keeps a token-limited rolling window of the current conversation so the model sees recent turns. It's session memory, not knowledge memory. It resets when the session ends and can't recall older facts by meaning.
- Vector memory blocks (VectorMemory / the newer Memory block system): back the conversation with a vector store so older turns can be retrieved by similarity. Cross-session if you wire the persistence yourself. You bring your own vector store, your own embeddings, your own entity extraction.
Vector memory is the closer cousin to what hosted memory layers ship — but it's a primitive, not a product. You still decide what to store, how to retrieve it, how to handle contradictions, how to scope per user. The hosted layers below answer those questions out of the box.
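To make "retrieved by similarity" concrete, here is a minimal, dependency-free sketch of the idea. It is not LlamaIndex's implementation: the `embed`, `cosine`, and `recall` helpers are hypothetical, and a bag-of-words counter stands in for a real embedding model and vector store.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(query, turns, top_k=2):
    """Return the top_k stored turns most similar to the query."""
    q = embed(query)
    ranked = sorted(turns, key=lambda turn: cosine(q, embed(turn)), reverse=True)
    return ranked[:top_k]


past_turns = [
    "we deploy the api with docker on fridays",
    "my favourite editor is vim",
    "the staging database runs postgres",
]
top = recall("how do we deploy the api", past_turns, top_k=1)
```

Note what the sketch leaves out: it stores every turn verbatim, has no notion of users, and cannot tell a stale fact from a current one. Those are the decisions the primitive hands back to you.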
The quick answer
If you want a hosted memory layer that drops into a LlamaIndex chat engine as a BaseMemory with one pip install: Ricord. If you're happy wiring your own vector store and persistence on LlamaIndex's memory blocks: built-in VectorMemory. If retrieval quality is your competitive edge: Mem0 OSS. The full matrix is below.
The decision matrix
Nine criteria, six options. The two LlamaIndex built-ins (ChatMemoryBuffer and vector memory) are evaluated separately because they answer different problems.
| Criterion | Ricord | Mem0 | Letta | Cognee | ChatMemoryBuffer | VectorMemory |
|---|---|---|---|---|---|---|
| Keeps the recent chat window (in-session) | ||||||
| Persists memory across sessions | DIY persistence | |||||
| Persists memory across users (multi-tenant) | DIY | DIY | ||||
| Semantic recall of older knowledge | Vector only, BYO embeddings | |||||
| Entity extraction + conflict resolution | Manual | |||||
| Browsable wiki of what was learned | ||||||
| Official LlamaIndex BaseMemory adapter | Community | Built in | Built in | |||
| Cross-client (same memory from Claude Desktop / Cursor) | API only | |||||
| Cost (smallest tier with memory features) | $12/mo annual | $249/mo for graph | Self-host + LLM | $0 OSS / self-host | $0 (built in) | $0 (built in) |
Slot-by-slot — which fits your LlamaIndex build
If you only need the recent window in one session
ChatMemoryBuffer alone is enough. The model sees recent turns, the buffer trims to a token budget, and there's no "memory" problem because there's no cross-session knowledge to keep.
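The "trims to a token budget" behaviour can be illustrated with a small sketch. This is an assumption-laden stand-in, not ChatMemoryBuffer's actual code: `RollingBuffer` and `count_tokens` are hypothetical names, and whitespace-separated words approximate tokens.

```python
from collections import deque


def count_tokens(text):
    # Assumption: whitespace-split words stand in for a real tokenizer.
    return len(text.split())


class RollingBuffer:
    """Hypothetical in-session buffer that keeps only the newest turns within a token limit."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.turns = deque()

    def put(self, turn):
        self.turns.append(turn)
        # Drop the oldest turns until the total fits; always keep the newest turn.
        while len(self.turns) > 1 and sum(count_tokens(t) for t in self.turns) > self.token_limit:
            self.turns.popleft()

    def get(self):
        return list(self.turns)


buffer = RollingBuffer(token_limit=8)
for turn in ["hello there", "i prefer typescript examples", "please keep answers concise"]:
    buffer.put(turn)
```

After the third turn the oldest one falls out of the window. Nothing is persisted, so a new process starts from an empty buffer.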
If you want cross-session recall and you have engineers
Vector memory blocks + your own extraction layer. The built-in vector memory gives you similarity retrieval over past turns. Layer your own persistence, your own extraction prompt, your own contradiction handling, your own namespace-per-user logic. This approach is common in production today, and it is the one teams come to regret around month three when the contradiction handling gets brittle.
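As a rough picture of the glue this route implies, the sketch below shows one possible shape for namespace-per-user scoping with a naive "newest fact wins" contradiction rule. `ScopedFactStore` and its methods are hypothetical, the facts are assumed to be already extracted into key/value pairs, and a production system would need persistence and a far more careful conflict policy.

```python
from collections import defaultdict


class ScopedFactStore:
    """Hypothetical per-user fact store: one namespace per user, newest fact wins."""

    def __init__(self):
        self._facts = defaultdict(dict)       # user_id -> {key: current value}
        self._superseded = defaultdict(list)  # (user_id, key) -> older values

    def remember(self, user_id, key, value):
        current = self._facts[user_id].get(key)
        if current is not None and current != value:
            # Contradiction: keep the old value for audit, let the new one win.
            self._superseded[(user_id, key)].append(current)
        self._facts[user_id][key] = value

    def recall(self, user_id, key):
        # Reads never cross namespaces: only this user's facts are visible.
        return self._facts.get(user_id, {}).get(key)

    def history(self, user_id, key):
        return list(self._superseded.get((user_id, key), []))


store = ScopedFactStore()
store.remember("alice", "language", "Python")
store.remember("alice", "language", "TypeScript")  # contradicts the earlier fact
store.remember("bob", "language", "Go")
```

Last-write-wins is the simplest policy and also the one that tends to break first, for example when a user mentions an old preference in passing. That brittleness is the kind of maintenance cost this route carries.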
If you want hosted memory that drops into LlamaIndex
Ricord ships an official LlamaIndex integration. RicordChatMemory is a drop-in BaseMemory: it saves every turn to Ricord and injects relevant past context on each call, with no glue code and no separate vector store to run.
pip install ricord-llamaindex
import os
from ricord_llamaindex import RicordChatMemory
from llama_index.core.chat_engine import SimpleChatEngine
memory = RicordChatMemory(
api_key=os.environ["RICORD_API_KEY"],
session_id="user-123",
search_limit=5,
)
engine = SimpleChatEngine.from_defaults(memory=memory)
# Every turn is persisted; Ricord injects relevant past context each call
engine.chat("I prefer concise answers and TypeScript examples")
And because saved turns become a browsable wiki, you can pull that knowledge straight back into a LlamaIndex index with RicordReader:
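import os
from ricord_llamaindex import RicordReader
from llama_index.core import VectorStoreIndex
# Load wiki entries matching the query as LlamaIndex Document objects
reader = RicordReader(api_key=os.environ["RICORD_API_KEY"])
docs = reader.load_data(query="deployment process", limit=20)
# Build a regular vector index over the loaded documents
index = VectorStoreIndex.from_documents(docs)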
If retrieval is your product's edge and you want OSS
Mem0 OSS (Apache 2.0) is a clean fit alongside LlamaIndex: Python-first, well-documented, modifiable. You'll spend real time on the production-grade work (conflict resolution, multi-tenant, hard delete). Worth it if retrieval is your product's edge.
If your agent framework is itself the value-add
Letta is an agent runtime and a memory layer in one. If your product is shipping a custom agent runtime, Letta gives you both pieces with one architectural decision, at the cost of LlamaIndex's flexibility on the retrieval side.
If you need extraction-pipeline depth + OSS
Cognee (AGPL-3) is the right pick. The extraction pipeline is configurable in ways Ricord and Mem0 hide. Be aware of the AGPL license implications for commercial products.
Why Ricord wins for most LlamaIndex builders
- Official integration, not glue code. RicordChatMemory implements LlamaIndex's BaseMemory, so it drops into any chat engine. No separate vector store to run, no retrieval plumbing.
- Entity extraction + conflict resolution out of the box. The two problems the built-in vector memory forces you to solve yourself, handled at ingest by the hosted layer.
- Per-user scoping is a parameter, not an architecture. Pass session_id to RicordChatMemory; the layer handles isolation. No namespace management in your app code.
- Two surfaces from one SDK. RicordChatMemory for live recall, and RicordReader to pull the auto-built wiki back into any LlamaIndex index as Document objects.
- Cross-client memory. The same memory your LlamaIndex backend writes is reachable from Claude Desktop, Cursor, and Codex via Ricord's MCP server.
Getting started
Pick the slot. If it's Ricord, the snippet above is your starting point — one install and a BaseMemory. If it's built-in vector memory, follow LlamaIndex's docs and plan the extraction-layer engineering as a quarter of work.
pip install ricord-llamaindex
# Get an API key at https://ricord.ai/login?signup=true
export RICORD_API_KEY=rc_live_...
# Pass RicordChatMemory to SimpleChatEngine.from_defaults(memory=...)