
The State of AI Memory, April 2026

Three papers, six benchmarks, and why latency just became the real story.

April 2026 was the month "agent memory" stopped being a research backwater and became a procurement conversation. In four weeks we got a 47-author survey paper, a new ground-truth-preserving architecture on arXiv, and three separate vendors claiming state-of-the-art on the same benchmark.

If you are responsible for shipping an agent to production this quarter, here is what actually changed, what is noise, and what to evaluate.

1. The survey that defined the field

Memory in the Age of AI Agents is the first comprehensive survey to treat memory as a first-class architectural component instead of a RAG appendix. Forty-seven authors. A taxonomy that finally distinguishes episodic, semantic, procedural, and profile memory in a way practitioners can use.

The single most useful thing in the survey is the framing: memory is no longer a feature you bolt on with a vector store. It is the layer that decides whether your agent feels like a coworker or a goldfish.

That framing is why every memory startup published a benchmark result in the same month.

2. The benchmark race got crowded — fast

LongMemEval (ICLR 2025) is now the de facto scoreboard. It tests five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention, across 500 questions with conversation contexts of up to 115K tokens.
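A headline score hides which of the five abilities a system is actually failing. A minimal sketch of per-ability scoring on LongMemEval-style results (the ability labels come from the benchmark; the record format with "ability" and "correct" keys is our own assumption):

```python
from collections import defaultdict

def per_ability_accuracy(results):
    """results: iterable of {"ability": str, "correct": bool} records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["ability"]] += 1
        hits[r["ability"]] += bool(r["correct"])
    return {a: hits[a] / totals[a] for a in totals}

# Toy result set, not real benchmark output.
sample = [
    {"ability": "temporal_reasoning", "correct": True},
    {"ability": "temporal_reasoning", "correct": False},
    {"ability": "abstention", "correct": True},
]
print(per_ability_accuracy(sample))
# {'temporal_reasoning': 0.5, 'abstention': 1.0}
```

Two systems with the same aggregate score can have very different per-ability profiles, and temporal reasoning is where most of them diverge.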

Here is the April 2026 leaderboard, sourced from each vendor's own published numbers:

System                     Score
OMEGA                      95.4
Mastra Observational       94.87
Ricord                     94.2
Ricord (LoCoMo)            93.0
Hindsight (Vectorize)      91.4
Supermemory                85.2
Letta (LoCoMo)             74.0
Zep / Graphiti             63.8
Mem0                       49.0

Two things stand out. First, the spread between the top and bottom is now over 46 points on the same benchmark. That is no longer a market where everyone is roughly comparable. Memory architecture decisions have become decisive.

Second, the top of the leaderboard is so crowded that benchmark accuracy alone has stopped being a moat. When five systems are within four points of each other, the conversation moves to the next variable.

3. The next variable is latency

Hindsight published the most uncomfortable number in the space last month: Zep recall in production averages around 4 seconds. Mem0 averages 7-8 seconds. At interactive agent volume, that compounds into a UX that feels broken regardless of accuracy.

For comparison, Ricord's recall path is sub-second on the same benchmark traffic. We are not publishing the architecture that gets us there, but the production reality is simple: an agent that recalls in 600ms feels alive; an agent that recalls in 4 seconds feels like Slack DMs to a contractor in another timezone.

Latency is the second axis of the new market. Accuracy gets you on the shortlist. Latency decides whether you stay there.
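If you want to check a vendor's latency claim yourself, measure it under concurrency, not one call at a time. A minimal sketch (the `recall` stub stands in for whatever retrieval call your memory layer exposes; everything else is stdlib):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def recall(query: str) -> list:
    # Stub standing in for a real memory-layer retrieval call;
    # the sleep simulates a 10ms network round trip.
    time.sleep(0.01)
    return []

def timed_recall(query: str) -> float:
    start = time.perf_counter()
    recall(query)
    return time.perf_counter() - start

def latency_percentiles(queries, concurrency=100):
    # Fire queries from `concurrency` workers at once and collect
    # per-call wall-clock times.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(timed_recall, queries))
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94]}

print(latency_percentiles([f"q{i}" for i in range(500)]))
```

Run it against the real endpoint with your production concurrency, and compare the p95, not the demo-friendly p50.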

4. The third variable is what you store, not just what you retrieve

The most interesting paper of the month was MemMachine, which makes an argument the rest of the field has been quietly avoiding: lossy LLM-based extraction is destroying ground truth. When you summarize a conversation into "facts", you lose the timestamps, the qualifiers, the contradictions, and the user's actual phrasing — and then you try to answer questions that depend on exactly those things.

MemMachine's answer is to preserve the entire conversational episode and let retrieval do the work. A-Mem makes a related argument: memory operations should be tools the agent calls, not a pipeline that happens to it.

These two papers are pointing at the same thing from different angles. The systems that win the next round of the benchmark race will be the ones that stop treating extraction as a one-shot lossy compression step.
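The episode-preserving idea is easy to see in miniature. The sketch below is our own toy illustration, not MemMachine's implementation: the stored unit is the full turn sequence with verbatim phrasing and timestamps, and retrieval returns whole episodes rather than extracted facts.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str
    text: str        # the user's actual phrasing, kept verbatim
    timestamp: float

@dataclass
class EpisodeStore:
    episodes: dict = field(default_factory=dict)

    def append(self, episode_id: str, role: str, text: str) -> None:
        self.episodes.setdefault(episode_id, []).append(
            Turn(role, text, time.time())
        )

    def recall(self, keyword: str) -> list:
        # Naive retrieval: return every episode containing the keyword.
        # A real system would use embeddings; the point is that the
        # retrieved unit is the full episode, qualifiers and all.
        return [
            turns for turns in self.episodes.values()
            if any(keyword.lower() in t.text.lower() for t in turns)
        ]

store = EpisodeStore()
store.append("s1", "user", "I might move to Berlin next year, not sure yet")
store.append("s1", "assistant", "Noted, tentatively Berlin.")
print(len(store.recall("berlin")))  # 1 matching episode
```

A one-shot extractor would likely store "user is moving to Berlin" and lose the "might" and "not sure yet" that a knowledge-update question depends on.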

This is consistent with what we've seen internally. The single biggest lift on Ricord's LongMemEval score this quarter did not come from retrieval changes. It came from being more conservative about what we threw away during ingestion.

5. What to actually evaluate

If you are picking a memory layer this month, the benchmark scores matter less than they did six months ago. Here is the evaluation checklist we'd use if we were the buyer:

  1. LongMemEval-S score with the same model you plan to ship. Vendor numbers using gpt-5-mini are not comparable to your gpt-4o-mini production target.
  2. p50 and p95 recall latency under load. Not the demo number. The number at 100 concurrent users.
  3. What happens to a fact when it gets contradicted. Does the system silently keep both? Deprecate the old one? Surface the conflict? This is the difference between a memory and a junk drawer.
  4. Can you delete a fact and have it actually be gone? GDPR, SOC2, and the principle of not being creepy all depend on this.
  5. Pricing at 100M tokens/month. The free tier is irrelevant. The number that matters is what it costs when you are actually using it.
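Items 3 and 4 are behaviors you can probe directly. A minimal sketch of what "surface the conflict" and "hard delete" mean in practice (field and method names are our own; no vendor implements exactly this):

```python
import time
from dataclasses import dataclass

@dataclass
class Fact:
    value: str
    recorded_at: float

class FactStore:
    def __init__(self):
        # key -> fact history, newest last
        self._facts: dict = {}

    def assert_fact(self, key: str, value: str) -> None:
        self._facts.setdefault(key, []).append(Fact(value, time.time()))

    def current(self, key: str):
        history = self._facts.get(key)
        return history[-1].value if history else None

    def conflicts(self, key: str) -> list:
        # Superseded values are kept visible, not silently merged.
        history = self._facts.get(key, [])
        return [f.value for f in history[:-1]]

    def hard_delete(self, key: str) -> None:
        # Actually gone: no tombstone, no soft-delete flag.
        self._facts.pop(key, None)

store = FactStore()
store.assert_fact("employer", "Acme")
store.assert_fact("employer", "Globex")
print(store.current("employer"))    # Globex
print(store.conflicts("employer"))  # ['Acme']
store.hard_delete("employer")
print(store.current("employer"))    # None
```

When you evaluate a vendor, ask for the equivalent of each of these three calls and test them against their live API, not the docs.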

6. Where Ricord lands

Ricord scored 471/500 (94.2%) on the full LongMemEval 500-question suite — the first time we ran the complete set. Our first 100 questions still hit 98%. On LoCoMo, the second major memory benchmark, we scored 93.0% after a full integrity audit that stripped prompt-level contamination — beating MemMachine's 91.69%. We ship sub-second recall, automatic conflict resolution, hard delete, and graph-aware retrieval on every paid tier. We do not publish the internals because we'd like to keep the lead.

If you want to run the same eval against our API, the keys are free at ricord.ai and the LongMemEval harness is open source. We'd rather you reproduce the number than take our word for it.

Try Ricord

Get a free API key at ricord.ai: 1K memories free, no credit card. Works with Claude Desktop, Cursor, and every major agent framework through MCP.
