Engineering · 12 min read

Multi-Tenant AI Memory at 10k Users: A Production Playbook

Memory works great in the demo. Then you have 10,000 paying users, and the same patterns that shipped your MVP become a series of expensive failure modes. A walk through the five problems that show up at scale — isolation, noisy neighbors, recall cost, GDPR delete, and per-tenant budget — and what each layer of the stack should do about them.

The shift from one user to ten thousand

When you ship the first version of an agent with memory, you store everything in one place. One vector index, one graph, one set of facts. The agent recalls well. Demos go great. You ship.

Around the time your product crosses 1,000 active users, recall quality starts drifting. By 10,000 users, you have five distinct problems that all look like "the memory is broken," and none of them have the same root cause. Worth knowing the failure modes before you hit them.

Problem 1 — Isolation

The first thing that breaks is also the most embarrassing: user A's memory leaks into user B's recall.

This usually happens because the single-tenant prototype filtered by user_id in the application layer, not the storage layer. Application-layer filtering is fine until it's not: a missed filter, a default query, a background job, a debug command, a developer running an unfiltered SELECT against production. Any one of those can surface a fact that belongs to someone else.

What to do about it: push the tenant boundary down to the storage layer. The query API should require user_id as a parameter and refuse to run without it. The vector index should be physically partitioned per tenant (or per tenant cohort if the user count makes that impossible). The graph database should enforce per-tenant subgraphs. Treat "no tenant specified" as a 400-class error, not a default-to-all behavior.
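A minimal sketch of what "refuse to run without it" can look like, assuming an in-memory store with one partition per tenant. The class and method names (TenantScopedStore, MissingTenantError, save, recall) are hypothetical and not any particular product's API, and the substring match stands in for real vector retrieval.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class MissingTenantError(ValueError):
    """Raised when a call arrives without a tenant; map this to an HTTP 400."""


@dataclass
class TenantScopedStore:
    # Hypothetical layout: one partition per tenant, keyed by user_id.
    _partitions: dict[str, list[str]] = field(default_factory=dict)

    @staticmethod
    def _require_tenant(user_id: str | None) -> str:
        # There is deliberately no "all tenants" fallback.
        if not user_id or not user_id.strip():
            raise MissingTenantError("user_id is required")
        return user_id

    def save(self, user_id: str | None, fact: str) -> None:
        tenant = self._require_tenant(user_id)
        self._partitions.setdefault(tenant, []).append(fact)

    def recall(self, user_id: str | None, query: str, top_k: int = 5) -> list[str]:
        tenant = self._require_tenant(user_id)
        # Only this tenant's partition is ever scanned.
        facts = self._partitions.get(tenant, [])
        matches = [f for f in facts if query.lower() in f.lower()]
        return matches[:top_k]
```

The point of the design is that the check lives inside the store, so a background job or debug script that forgets the filter gets an error instead of another tenant's facts.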

What "done" looks like: a pen test where the tester is given a valid API key for user B and probes every documented and undocumented endpoint to surface user A's data. The test passes when no endpoint returns cross-tenant data, even when intentionally misused.

Problem 2 — Noisy neighbors

Tenant 47 has 8 million memories. Tenant 482 has 12. Tenant 47 hammers your write endpoint at 200 req/s while tenant 482 makes a recall call once an hour. Whose latency suffers when 47 goes off?

In a naive setup: everyone's. Background extraction queues fill up, the embedding service backpressures, the graph writer locks. Tenant 482's rare recall call takes 4 seconds because tenant 47 is in the middle of an 8M-row backfill.

What to do about it: per-tenant rate limits at the API edge. Per-tenant queues for background work (separate workers per tenant cohort, not a global FIFO). A per-tenant token bucket on embedding calls. Recall queries should bound their scan size and time out rather than fail open.
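The per-tenant token bucket can be sketched in a few lines. This is an illustration under stated assumptions: a single process, in-memory state, and hypothetical names (TokenBucket, TenantRateLimiter). A production limiter would usually keep bucket state in a shared store so every API node sees the same counts.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Credit tokens for the time elapsed since the last call.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One independent bucket per tenant, created on first use."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant_id, cost=1.0):
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow(cost)
```

Because each tenant draws from its own bucket, tenant 47 exhausting its tokens has no effect on whether tenant 482's next call is admitted.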

What "done" looks like: p99 recall latency for any tenant is independent of any other tenant's write volume. You can demonstrate this with a synthetic load test where one synthetic tenant writes at peak rate while another reads at low rate; the reader's p99 doesn't move.

Problem 3 — Recall cost growing super-linearly

Recall at 1k users is "query the index, return top-k." Recall at 10k users where each user has 50k facts is the same shape but ~500× more data — and naive vector retrieval scales roughly with corpus size. You can hit seconds-per-recall and a per-call cost that breaks unit economics before you notice.

The bigger problem isn't the recall latency. It's the cost stack: every recall call burns embedding compute for the query, retrieval compute against the index, optional rerank compute on candidates, and prompt-token cost when the recalled context gets passed to the model. Multiply by N agent turns per user per day, then by tenant count, and you have your AWS bill.
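The cost stack above is simple arithmetic, and writing it down makes it easier to see which term dominates. The sketch below is a hypothetical model, not a pricing formula from any provider: every unit cost is a parameter you would fill in from your own bills, and recall_rate (the fraction of turns that actually call recall) is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecallUnitCosts:
    # All values are per recall call, in whatever currency unit you track.
    embed_query: float           # embedding compute for the query
    retrieval: float             # retrieval compute against the index
    rerank: float                # optional rerank compute on candidates
    prompt_per_1k_tokens: float  # model price per 1,000 prompt tokens


def cost_per_recall(costs, context_tokens, use_rerank=True):
    """Sum the four components of a single recall call."""
    total = costs.embed_query + costs.retrieval
    if use_rerank:
        total += costs.rerank
    total += costs.prompt_per_1k_tokens * (context_tokens / 1000)
    return total


def daily_recall_cost(costs, context_tokens, turns_per_user_per_day,
                      active_users, recall_rate=1.0, use_rerank=True):
    """Per-call cost x turns per user per day x users x share of turns recalling."""
    calls = turns_per_user_per_day * active_users * recall_rate
    return calls * cost_per_recall(costs, context_tokens, use_rerank)
```

Lowering recall_rate, shrinking context_tokens, and skipping rerank are the three levers this model exposes. It does not capture caching, which removes calls entirely.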

What to do about it:

  • Cache recall responses with tenant-scoped keys. A high cache-hit rate for a chat assistant is normal — users ask similar questions in the same session.
  • Cap recall top-k aggressively and rely on rerank to compensate for the smaller candidate set rather than bumping k.
  • Recall only when the model actually needs it. Agent loops that always call recall() at every turn waste 80% of the budget on calls the model would have answered without context.
  • Move from per-fact embedding to per-fact-cluster embedding for static facts that don't change after extraction — fewer vectors, smaller index, cheaper retrieval.
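The first bullet, tenant-scoped cache keys, can be sketched as follows. This assumes an in-memory cache with a fixed TTL. RecallCache and its methods are hypothetical names, and the whitespace and case normalization is one possible choice rather than a recommendation.

```python
import hashlib
import time


class RecallCache:
    """In-memory recall cache; every key starts with the tenant."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (tenant_id, digest) -> (expires_at, results)

    @staticmethod
    def _key(tenant_id, query, top_k):
        if not tenant_id:
            raise ValueError("tenant_id is required for cache access")
        # Normalize case and whitespace so near-identical queries share an entry.
        normalized = " ".join(query.lower().split())
        digest = hashlib.sha256(f"{top_k}:{normalized}".encode("utf-8")).hexdigest()
        return (tenant_id, digest)

    def get(self, tenant_id, query, top_k):
        key = self._key(tenant_id, query, top_k)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, results = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return results

    def put(self, tenant_id, query, top_k, results):
        key = self._key(tenant_id, query, top_k)
        self._entries[key] = (self._clock() + self._ttl, list(results))

    def purge_tenant(self, tenant_id):
        """Drop every entry for one tenant; returns how many were removed."""
        doomed = [k for k in self._entries if k[0] == tenant_id]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```

Keeping the tenant as the first element of the key means two tenants asking the same question never share an entry, and purging one tenant's cached responses is a filter on that element rather than a scan of cached content.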

What "done" looks like: your cost-per-active-user metric is flat or declining over the quarter as tenant count grows, not climbing.

Problem 4 — GDPR delete that's actually complete

User clicks "delete my data." What happens?

In an unprepared system: a row is flagged in the primary store. The vector index still has the user's embeddings. The graph has the entity nodes and edges. The derived wiki pages still summarize the deleted facts. Cached recall responses still contain the deleted content. The audit log retained for compliance still has the raw messages. You've technically failed the "right to be forgotten" requirement.

What to do about it: design delete as a propagating operation across every storage layer from day one. A delete request kicks off a saga that:

  • Removes the rows from the primary memory store
  • Removes the embeddings from the vector index
  • Removes the entity nodes (and any node now orphaned by edge removal) from the graph
  • Invalidates and regenerates the derived wiki pages
  • Purges the cached recall responses scoped to this user
  • Tombstones the audit log entries so they're irretrievable but the audit shape is preserved for compliance
  • Logs the delete itself as an immutable record of which user, what data, when — with no PII

Most of this happens asynchronously. The user-facing promise is "within 72 hours" (or whatever your DPA says), not "instantly."
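A minimal sketch of the propagating delete, assuming each storage layer exposes an idempotent delete callable. DeleteSaga and the step names are hypothetical. A real saga would persist its progress durably and schedule retries, which this in-memory version does not.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeleteSaga:
    # Ordered (name, callable) pairs, one per storage layer.
    # Each callable takes a user_id and must be safe to call more than once.
    steps: list[tuple[str, Callable[[str], None]]]
    # user_id -> names of steps that have already succeeded.
    completed: dict[str, set[str]] = field(default_factory=dict)

    def run(self, user_id: str) -> list[str]:
        """Run every pending step; return the names of steps that failed."""
        done = self.completed.setdefault(user_id, set())
        failed = []
        for name, step in self.steps:
            if name in done:
                continue  # finished on an earlier attempt
            try:
                step(user_id)
                done.add(name)
            except Exception:
                # Keep going so one unavailable layer does not block the rest.
                failed.append(name)
        return failed
```

Calling run again after a partial failure retries only the steps that did not complete, which is what lets the operation finish asynchronously inside the SLA window.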

What "done" looks like: a tester deletes a user, then queries every storage layer and every cache. After the SLA window, no system returns the deleted user's content. The audit log shows the delete event but not the deleted content.

Problem 5 — Per-tenant budget visibility

Tenant 47 (the noisy one from problem 2) is also unprofitable. You wouldn't know that until you have per-tenant cost visibility, which most early-stage memory stacks don't have. Without it, your gross-margin number is the average across every tenant — useful for board slides, useless for operating.

What to do about it: attribute compute costs to tenants at the call-path level.

  • Tag every embedding call, every retrieval call, every LLM call (if you do server-side extraction) with tenant_id
  • Aggregate nightly into per-tenant cost rows
  • Expose a per-tenant cost dashboard internally
  • Enforce a per-tenant budget cap (with a configurable soft and hard cap) so a runaway tenant can't take the rest of your margin with them

What "done" looks like: you can answer "is tenant X profitable at their current plan?" in less than a minute. You can name your three most expensive tenants on demand. A runaway tenant triggers an alert at 80% of cap and a soft cap at 100% before they hit the hard cap.
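Call-path attribution and the alert, soft cap, hard cap ladder can be sketched together. The names (TenantCostLedger, BudgetState) are hypothetical. The 80% alert threshold comes from the text above, while placing the hard cap at 120% of budget is purely an illustrative assumption.

```python
from collections import defaultdict
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    ALERT = "alert"
    SOFT_CAP = "soft_cap"
    HARD_CAP = "hard_cap"


class TenantCostLedger:
    def __init__(self, alert_ratio=0.8, hard_ratio=1.2):
        self._alert_ratio = alert_ratio
        self._hard_ratio = hard_ratio  # assumed value, tune per plan
        # tenant_id -> call kind ("embedding", "retrieval", "llm") -> cost
        self._by_kind = defaultdict(lambda: defaultdict(float))

    def record(self, tenant_id, kind, cost_usd):
        """Attribute one call's cost to a tenant at the call path."""
        if not tenant_id:
            raise ValueError("every cost record needs a tenant_id")
        self._by_kind[tenant_id][kind] += cost_usd

    def spend(self, tenant_id):
        return sum(self._by_kind.get(tenant_id, {}).values())

    def state(self, tenant_id, budget_usd):
        ratio = self.spend(tenant_id) / budget_usd
        if ratio >= self._hard_ratio:
            return BudgetState.HARD_CAP
        if ratio >= 1.0:
            return BudgetState.SOFT_CAP
        if ratio >= self._alert_ratio:
            return BudgetState.ALERT
        return BudgetState.OK

    def top_spenders(self, n=3):
        """Most expensive tenants first, as (tenant_id, total) pairs."""
        totals = [(t, self.spend(t)) for t in self._by_kind]
        return sorted(totals, key=lambda pair: pair[1], reverse=True)[:n]
```

What the application does in each state (notify, throttle, reject) is plan logic and stays outside the ledger.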

Who owns what — the stack division

Of these five problems, how many should sit in your application code vs the memory layer you're using? Honest answer: it depends on what the memory layer commits to. Here's the split worth asking about when evaluating any layer:

  • Isolation: should live in the memory layer. If user_id is a parameter and the layer can't demonstrate physical or strong logical partitioning under it, you own this problem.
  • Noisy neighbors — should mostly live in the memory layer (rate limits, per-tenant queues), with the application managing call shape. If the memory layer shares a single embedding service across all tenants with no isolation, you own this problem.
  • Recall cost — split. Caching can live in either layer; the application controls when to call recall at all. The memory layer owns retrieval-stack efficiency.
  • GDPR delete — should live in the memory layer end-to-end (delete should propagate across every system the layer manages). The application is responsible for issuing the delete; the layer is responsible for completing it.
  • Per-tenant budget — split. The memory layer should expose per-tenant cost metrics; the application owns the business logic around plan enforcement.

A six-question evaluation checklist

Before you commit to a memory layer for a 10k+-user product, ask the vendor (or the OSS docs) for direct answers:

  1. Is tenant isolation enforced at the storage layer? Show me the partitioning scheme.
  2. What happens to per-tenant p99 recall latency when one tenant writes 10k facts/minute? Show me the load test or commit to running one.
  3. What's the per-1k-recall cost breakdown — embedding, retrieval, rerank, optional LLM enrichment — and how does it scale with corpus size per tenant?
  4. What's the SLA for a hard delete, and which derived systems does the delete propagate through? Show me a diagram.
  5. Can I see per-tenant cost as a first-class metric, or do I have to derive it from logs?
  6. What's the contract for "graceful failure when a tenant exceeds budget"? (Drop, queue, alert, error?)

The answers separate products designed for multi-tenant scale from products that bolted multi-tenancy onto a single-tenant prototype.

Where Ricord stands

We're going to apply our own checklist to ourselves because it's the only honest thing to do:

  • Isolation: user_id is required on every save/recall API call; tenant data is partitioned at the storage layer; no recall path defaults to cross-tenant.
  • Noisy neighbors: per-API-key rate limits live at the edge; background extraction is per-tenant-cohort queued; recall calls are bounded in scan size with a hard timeout.
  • Recall cost: we cache recall responses with tenant-scoped keys; rerank lets us hold top-k small without losing relevance; the dashboard shows per-account recall volume.
  • GDPR delete: a delete propagates to the primary store, vector index, graph, derived wiki pages, and cached recall responses. Audit-log entries are tombstoned, so their content is irretrievable. The SLA is documented in the DPA.
  • Per-tenant budget: per-account usage and cost surface in the dashboard; ricord usage exposes it programmatically; hard-cap enforcement is at the plan boundary, not an after-the-fact alert.

If you're building a product that's going to need any of these on day 90, design for them on day 1 and pick a memory layer that already has them. Retrofitting isolation onto a system that was built without it is the most expensive engineering work you can do, and the cost compounds with every tenant.
