Engineering · 10 min read

How to Evaluate a Memory Layer in 1 Hour: A Buyer's Checklist

Picking an AI memory layer for a production agent is a decision you'll live with for years. You don't need a quarter to evaluate it; you need an hour spent on the right tests. Below are eight 5-minute checks that will tell you more than a vendor demo.

Why an hour is the right amount of time

A 15-minute look is a feature-list comparison, which the vendor page already won. A two-week POC is enough time to convince yourself of the wrong answer. An hour spent running the right eight tests, against the actual API of the layer you're considering, is the sweet spot.

Each test below takes about five minutes. If the layer fails one, that's information. If it passes all eight, you can spend the next two weeks on the interesting work — wiring it into your agent — rather than discovering basic gaps.

Setup (5 minutes)

Get an API key. Open a notebook or a scratch file. Have the docs open. You don't need a real product integration — you're testing the layer's primitives, not your end-to-end pipeline.

For each test below: write the call, run it, and answer the "pass" question. Note the result. Move on.

Test 1 — Save-then-recall correctness (5 min)

The simplest test. Save three facts. Recall them by query. Did all three show up with sensible ranking?

save: "Acme uses Postgres for their primary database"
save: "Acme deploys to Cloud Run, not Kubernetes"
save: "Acme's primary developer is Alex"
recall: "What database does Acme use?"
recall: "Who's the primary developer for Acme?"

Pass: Both recalls return the right fact in position 1. Bonus points if the irrelevant facts show up only in positions 2 and 3, with clearly lower scores.

Fail patterns to watch for: embedding collapse (all three return on every query, no ranking), missed retrievals (the right fact is in the corpus but doesn't show up), or hallucinated content (the layer returns something that looks like the answer but isn't in your data).
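If you'd rather script this check than eyeball it, here is a minimal sketch. It assumes you wrap the layer's real SDK or HTTP API in two hypothetical callables, `save(text)` and `recall(query)`, where `recall` returns `(text, score)` pairs sorted best-first. Those names and shapes are assumptions for illustration, not any vendor's API.

```python
from typing import Callable, Sequence

# Hypothetical adapter signatures: wrap the vendor's real calls in these.
Save = Callable[[str], object]
Recall = Callable[[str], Sequence[tuple[str, float]]]  # (text, score), best first

FACTS = [
    "Acme uses Postgres for their primary database",
    "Acme deploys to Cloud Run, not Kubernetes",
    "Acme's primary developer is Alex",
]

# Each query is paired with a substring that must appear in the top-ranked hit.
QUERIES = {
    "What database does Acme use?": "Postgres",
    "Who's the primary developer for Acme?": "Alex",
}


def check_save_then_recall(save: Save, recall: Recall) -> dict[str, bool]:
    """Save the three facts, then report per query whether position 1 is correct."""
    for fact in FACTS:
        save(fact)
    results = {}
    for query, expected in QUERIES.items():
        hits = recall(query)
        # An empty result counts as a miss.
        results[query] = bool(hits) and expected in hits[0][0]
    return results
```

The test passes when every value in the returned dict is `True`. The substring match only catches missed retrievals and bad ranking; spotting hallucinated content still means reading the returned text against what you saved.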

Test 2 — Conflict resolution (5 min)

Save a fact, then save a contradicting fact. Run the recall query. Which one wins?

save: "Acme uses MySQL for their primary database"
[wait 30 seconds]
save: "Actually, Acme migrated from MySQL to Postgres in May 2026"
recall: "What database does Acme use?"

Pass: The recall returns "Postgres", with the MySQL fact either marked superseded or no longer returned at all.

Fail patterns: Both facts are returned with equal weight (the model now has to guess between them, and sometimes guesses wrong), or the older MySQL fact is returned because its embedding is closer to the query than the longer Postgres fact. Either is a multi-month problem in production.
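The pass rule for this test can be sketched in code. The `Hit` type and its `superseded` flag below are hypothetical; map them from whatever the layer actually returns (a status field, a validity timestamp, or simply the absence of the old fact).

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Hit:
    text: str
    superseded: bool = False  # hypothetical flag; derive it from the layer's response


def check_conflict_resolution(
    save: Callable[[str], object],
    recall: Callable[[str], Sequence[Hit]],
    wait_seconds: float = 30.0,
) -> bool:
    """Return True if the newer Postgres fact wins and the stale MySQL fact is demoted."""
    old = "Acme uses MySQL for their primary database"
    new = "Actually, Acme migrated from MySQL to Postgres in May 2026"
    save(old)
    time.sleep(wait_seconds)  # mirrors the 30-second gap between the two saves
    save(new)

    hits = recall("What database does Acme use?")
    if not hits or "Postgres" not in hits[0].text:
        return False  # stale or missing answer in position 1
    # The old fact may still be returned, but only if it is flagged as superseded.
    return all(h.superseded for h in hits if h.text == old)
```

Returning both facts unflagged fails this check even when Postgres ranks first, which matches the "equal weight" fail pattern.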

Test 3 — p99 recall latency (5 min)

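Save 100 facts (splitting a few pages of your own product docs into sentences works fine). Run 50 recall calls back-to-back and note the time of the slowest call. With only 50 samples, the slowest call is your best available approximation of p99.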

Pass: p99 under 1 second for retrieving top-5 from a 100-fact corpus.

Why it matters: 100 facts is small. If p99 is already 800ms here, expect it to be considerably worse at production scale, plausibly 3-4 seconds. The latency ceiling on a small corpus is a strong predictor of latency at 10k facts.
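A small timing harness makes the measurement repeatable. This sketch uses only the Python standard library; the `call` argument is a placeholder for a zero-argument function that issues one recall against the layer you're testing.

```python
import math
import time
from typing import Callable


def time_calls(call: Callable[[], object], n: int = 50) -> list[float]:
    """Run `call` n times back-to-back and return each duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        durations.append(time.perf_counter() - start)
    return durations


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

For example, `percentile(time_calls(lambda: recall("What database does Acme use?")), 99)` gives the figure to compare against the 1-second bar, where `recall` is your own wrapper. With 50 samples the nearest-rank p99 is simply the slowest call, so run more calls if you want a steadier estimate.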

Test 4 — Per-user isolation (5 min)

Save facts under user_id=alice. Run a recall under user_id=bob. What comes back?

save (user_id=alice): "Alice's deploy command is ./deploy.sh"
recall (user_id=bob): "What's the deploy command?"

Pass: Bob's recall returns nothing (or only facts saved under user_id=bob).

Fail pattern: Alice's deploy command shows up in Bob's recall. This is the most expensive failure mode to discover in production. Most early-stage memory layers fail this test because tenant isolation was bolted on after the fact. See multi-tenant memory at 10k users for the full picture.
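As a sketch, the isolation check reduces to one save, one recall, and a leak filter. The two callables are hypothetical adapters that take the user ID explicitly; how the real layer scopes requests (a parameter, a header, a separate API key) will vary.

```python
from typing import Callable, Sequence

# Hypothetical adapters: both take the user_id explicitly.
Save = Callable[[str, str], object]           # (user_id, text)
Recall = Callable[[str, str], Sequence[str]]  # (user_id, query) -> returned texts


def check_isolation(save: Save, recall: Recall) -> list[str]:
    """Return any of Alice's facts that leak into Bob's recall (empty list = pass)."""
    secret = "Alice's deploy command is ./deploy.sh"
    save("alice", secret)
    bob_hits = recall("bob", "What's the deploy command?")
    return [text for text in bob_hits if "./deploy.sh" in text]
```

Returning the leaked texts rather than a bare boolean gives you something concrete to attach to a bug report if the layer fails.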

Test 5 — Hard delete completeness (5 min)

Save a memorable fact. Delete it. Then run the recall query that would have returned it.

save: "Acme's API key is RC_TEST_DELETE_ME_8429"
delete (the memory ID just returned)
recall: "What is Acme's API key?"
[also recall the same query 30 minutes later]

Pass: The fact doesn't appear in either recall. The 30-minute follow-up is important because some layers soft-delete and rely on async propagation; if the vector index isn't updated, the content stays recallable for hours.

Fail pattern: The fact reappears even once. This is a GDPR exposure. If the vendor can't commit to an SLA for hard delete across every storage layer (primary, vector index, graph, derived pages, cache), they don't actually support GDPR "right to be forgotten".
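The delete test is easy to automate because the canary string is unique. In the sketch below, `save`, `delete`, and `recall` are hypothetical wrappers around the layer's real calls, with `save` assumed to return the new memory's ID.

```python
import time
from typing import Callable, Sequence


def check_hard_delete(
    save: Callable[[str], str],  # assumed to return the new memory's ID
    delete: Callable[[str], object],
    recall: Callable[[str], Sequence[str]],
    recheck_after_seconds: float = 30 * 60,
) -> bool:
    """Return True only if the canary is absent right after delete and on the recheck."""
    canary = "RC_TEST_DELETE_ME_8429"
    memory_id = save(f"Acme's API key is {canary}")
    delete(memory_id)

    def leaked() -> bool:
        return any(canary in text for text in recall("What is Acme's API key?"))

    if leaked():
        return False
    time.sleep(recheck_after_seconds)  # catches stale indexes or caches that resurface later
    return not leaked()
```

The default recheck blocks for 30 minutes, so start it first and run the remaining tests in another session while it waits.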

Test 6 — Integration surface (5 min)

Open whatever AI client you'll use the memory layer with (Ricord or otherwise): Claude Desktop, Cursor, Codex, Zed, Windsurf, Gemini CLI. Try to wire the memory layer in.

Pass: Three commands (or one config-file edit) wire it in. The agent can call save and recall as tools without you writing agent-side wrapper code.

Fail pattern: You need to write a custom tool wrapper, the layer has no MCP server, or the integration story is "here's an SDK, write your own agent integration." That's real engineering work that compounds across every client you support.

Test 7 — Observability (5 min)

Open the layer's dashboard. Can you see what the agent has saved? Can you find a specific fact? Can you inspect why a recall returned what it did?

Pass: You can browse the memories, see which entities have been extracted, and trace a recall back to its sources. Bonus points for a wiki view — it means non-technical teammates can audit what the agent has learned.

Fail pattern: No dashboard, or the dashboard is just a vector dump. You'll be debugging memory issues by reading logs.

Test 8 — Cost predictability (5 min)

From the pricing page, work out: what does it cost when one user makes 1,000 save calls and 5,000 recall calls in a month?

Pass: You can compute this in under five minutes without a sales call, and the answer scales roughly linearly with user count.

Fail pattern: Pricing is "contact sales" for any production volume, or the cost depends on opaque factors (model choice, extraction depth, retrieval mode) that you can't reason about from outside.
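If the pricing page is usage-based, the arithmetic fits in one function. The price sheet below is a hypothetical shape (a flat fee plus per-1,000-call rates), and any numbers you pass in are placeholders to be replaced with the vendor's published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Placeholder price sheet; fill in from the vendor's pricing page."""
    base_per_month: float   # flat platform fee in dollars
    per_1k_saves: float     # dollars per 1,000 save calls
    per_1k_recalls: float   # dollars per 1,000 recall calls


def monthly_cost(p: Pricing, users: int, saves_per_user: int = 1_000,
                 recalls_per_user: int = 5_000) -> float:
    """Total monthly bill for `users` users at 1,000 saves and 5,000 recalls each."""
    per_user = (saves_per_user / 1000 * p.per_1k_saves
                + recalls_per_user / 1000 * p.per_1k_recalls)
    return p.base_per_month + users * per_user
```

With made-up rates of `Pricing(20.0, 0.5, 0.25)`, one user costs 21.75 and 100 users cost 195.0, so the usage portion scales linearly with user count. If you can't fill in the three fields from the pricing page alone, that is the fail pattern.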

Bonus test — License and lock-in (3 min)

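Read the license. For OSS layers such as Mem0, Letta, Cognee, and MemoBase, check the LICENSE file in the repository itself rather than relying on a summary: licenses change, and permissive terms (Apache 2.0, MIT) and copyleft terms (AGPL) have very different commercial implications. For hosted SaaS: read the data-export terms. Can you export your memories in a standard format? Can you cancel and walk?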

Pass: Open standard data export, no data-poison clauses, no "your data is our training data" surprises in the ToS.

Scoring

Out of 8 tests, the layer should pass at least 7 to be production-ready for an agent product. The most important tests (because the cost of failing them in production is highest) are 4 (isolation), 5 (delete completeness), and 2 (conflict resolution). A failure on any of those three is a hard no — not because the layer is broken, but because the cost of working around it in your application code is higher than the cost of picking a different layer.

Tests 1 and 3 (correctness and latency) are easier to improve over time. Tests 6, 7, 8 (integration, observability, cost) are properties of the product shape, not the underlying retrieval — they tell you whether the team building the layer understands what agent builders need beyond just "recall facts."
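The scoring rule can be written down as a short function, which helps when you're comparing several layers side by side. This is a sketch of the rule as stated in this section; the function name and return strings are illustrative.

```python
# Tests where a single failure disqualifies the layer:
# 2 (conflict resolution), 4 (isolation), 5 (delete completeness).
HARD_REQUIREMENTS = {2, 4, 5}


def verdict(passed: set[int], total_tests: int = 8, min_passes: int = 7) -> str:
    """Apply the scoring rule to the set of test numbers (1-8) that passed."""
    failed = set(range(1, total_tests + 1)) - passed
    blocking = sorted(failed & HARD_REQUIREMENTS)
    if blocking:
        return f"hard no: failed test(s) {blocking}"
    if len(passed) < min_passes:
        return f"not production-ready: {len(passed)}/{total_tests} passed"
    return f"production-ready: {len(passed)}/{total_tests} passed"
```

For example, a layer that passes everything except test 4 gets `hard no: failed test(s) [4]` even though it scored 7 out of 8, because the hard requirements are checked before the pass count.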

Where Ricord stands on the checklist

We'll apply our own checklist to ourselves:

  • Test 1 (correctness): sub-second recall with ranking. Run the test yourself — the API is open.
  • Test 2 (conflict resolution): resolved at write time; the older fact is marked superseded and recalls return the current truth.
  • Test 3 (latency): p99 under 1s for top-k recall at typical corpus sizes; latency is published on the status page.
  • Test 4 (isolation): user_id is a required parameter on every save/recall call; tenant data is partitioned at storage level; cross-tenant recall is an error, not a default.
  • Test 5 (delete): hard delete propagates to primary store, vector index, graph, derived wiki pages, and cached recalls. SLA documented in the DPA.
  • Test 6 (integration): bun add -g ricord && ricord login && ricord install for Claude Desktop, Cursor, Claude Code, Codex, Zed, Gemini CLI, and Windsurf. MCP-native.
  • Test 7 (observability): the dashboard ships auto-generated wiki pages per entity, a 3D graph view, backlinks, and recall provenance.
  • Test 8 (cost): flat-rate paid plans starting at $15/mo, billed annually. No metered surprise pricing.

If you're evaluating any memory layer (ours or anyone else's), run this checklist against them before signing. The hour you spend on it is the most leveraged hour you'll spend in the procurement cycle.
