The Token Economics of Agent Memory: When a Memory Layer Pays for Itself

Most teams pick a memory layer the way they pick a logger — by feature, not by cost. But the deeper you go on agent products, the more memory becomes a unit-economics decision. Here's the math on when paying for a memory layer actually saves money, with a worked example at 1k, 10k, and 100k MAU.

The shape of the problem

An LLM call is a function. You pay for the input tokens (what you put into the context window) and you pay for the output tokens. The model has no memory of any prior call. If your agent needs to know something it learned yesterday, you have three choices:

  1. Stuff the entire prior conversation back into context (works until your context window fills up; costs you per-token forever)
  2. Re-derive what was learned by reading source material again (works until your sources change; costs you the full reading bill on every turn)
  3. Store distilled facts somewhere and recall them selectively (memory layer)

Option 1 is what most pre-memory agents do by default. Option 2 is what RAG does. Option 3 is what a memory layer is for. The question of this post is: at what scale does option 3 start saving you real money?

The naive baseline: stuff everything in context

Imagine a customer-support agent. On day 1, a new user sends 10 messages. Their conversation history is ~2,500 tokens by the end. Every turn, the agent reads the whole history plus the new message and produces a reply.

On day 30, the same user has had 300 messages of total interaction across many sessions. The history is now ~75,000 tokens. Every turn:

  • Read the system prompt (~500 tokens)
  • Read the full history (~75,000 tokens)
  • Read the new message (~50 tokens)
  • Generate the reply (~200 tokens output)

On a frontier model at $2.50/M input + $10/M output (typical 2026 pricing), that's ~$0.189 per turn on input alone. The user sends 30 turns this month. Their per-user inference bill is ~$5.67/month — and it grows linearly forever.
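The per-turn arithmetic above can be reproduced in a few lines. This is a minimal sketch using only the prices and token counts quoted in this post; the function name `turn_cost` is made up for illustration and is not any provider's API.

```python
# Prices quoted in the post: dollars per million tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def turn_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars for a single LLM call."""
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost, output_cost


# Day-30 turn: system prompt + full history + new message.
naive_input_tokens = 500 + 75_000 + 50
input_cost, output_cost = turn_cost(naive_input_tokens, 200)
monthly_input_cost = input_cost * 30  # 30 turns per month

print(f"input per turn:  ${input_cost:.3f}")
print(f"output per turn: ${output_cost:.3f}")
print(f"input per month: ${monthly_input_cost:.2f}")
```

Output tokens add about $0.002 per turn here, which is why the history, not the reply, dominates the bill.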

Prompt caching helps: most providers bill cached prefixes at a 75-90% discount. But caches expire, and partial-prefix invalidations (one new message at the end invalidates the next-message cache) are common. Assume caching removes a best-case 80% of the input bill and you're still at ~$1.13/user/month on input tokens alone.
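The ~$1.13 figure is what you get if 80% of the input bill disappears outright. A slightly finer model separates the cache hit rate from the discount on cached tokens. The sketch below is one way to write that down; `cached_monthly_cost` and its parameters are illustrative assumptions, not a provider's billing formula.

```python
def cached_monthly_cost(uncached_cost: float, hit_rate: float, discount: float) -> float:
    """Monthly input cost when a `hit_rate` share of input tokens is served
    from cache and cached tokens are billed at (1 - discount) of list price."""
    cached_share = hit_rate * (1.0 - discount)
    uncached_share = 1.0 - hit_rate
    return uncached_cost * (cached_share + uncached_share)


UNCACHED = 5.67  # naive per-user monthly input cost from the post

# 80% of the input bill eliminated entirely (the post's best case).
best_case = cached_monthly_cost(UNCACHED, hit_rate=0.80, discount=1.00)
# 80% hit rate, cached tokens billed at a 90% discount.
at_90 = cached_monthly_cost(UNCACHED, hit_rate=0.80, discount=0.90)
# 80% hit rate, cached tokens billed at a 75% discount.
at_75 = cached_monthly_cost(UNCACHED, hit_rate=0.80, discount=0.75)

for label, cost in [("best case", best_case), ("90% discount", at_90), ("75% discount", at_75)]:
    print(f"{label}: ${cost:.2f}/user/month")
```

Under these assumptions the cached bill lands between roughly $1.13 and $2.27 per user per month, so the best-case figure is a lower bound on what stuffing the window costs.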

The memory-augmented baseline

With a memory layer, the same conversation looks like:

  • System prompt (~500 tokens)
  • Recalled context: top-5 relevant facts (~600 tokens)
  • Current short conversation (~500 tokens for the last 5-10 turns)
  • New message (~50 tokens)
  • Reply (~200 tokens output)

Total per turn: ~1,650 input + 200 output = ~$0.006/turn. 30 turns/month = ~$0.18/user/month. Add the memory layer's own cost: a save plus 30 recalls on Ricord's $12/mo annual plan, amortized across ~100 active users, is roughly $0.12/user/month.

Net per-user cost: ~$0.30/month vs. ~$1.13/month without memory. The memory layer cuts the per-user bill by a factor of ~3.8×.
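The inference side of that comparison is easy to check. A minimal sketch, using the token budget and prices from this post; the dictionary keys are just labels chosen for illustration.

```python
INPUT_PRICE_PER_M = 2.50    # dollars per million input tokens (from the post)
OUTPUT_PRICE_PER_M = 10.00  # dollars per million output tokens (from the post)

# Per-turn input budget with a memory layer, in tokens.
prompt_budget = {
    "system_prompt": 500,
    "recalled_facts": 600,
    "recent_turns": 500,
    "new_message": 50,
}
OUTPUT_TOKENS = 200
TURNS_PER_MONTH = 30

input_tokens = sum(prompt_budget.values())
per_turn = (input_tokens * INPUT_PRICE_PER_M + OUTPUT_TOKENS * OUTPUT_PRICE_PER_M) / 1_000_000
per_month = per_turn * TURNS_PER_MONTH

print(f"input tokens per turn: {input_tokens}")
print(f"inference per turn:    ${per_turn:.4f}")
print(f"inference per month:   ${per_month:.2f}")
```

Note that the prompt no longer grows with the user's history: the budget is the same on day 300 as on day 30.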

When the math doesn't favor memory

We're going to be honest: memory is not always a win. Three patterns where the naive approach is fine:

  • One-shot agents. Classifiers, transformers, validators: the agent runs once per input and starts fresh. No state worth recalling. Don't add a memory layer.
  • Short-lived sessions. If every user session is <10 turns and history never carries between sessions, you're below the threshold where memory's overhead beats stuffing the window.
  • Tiny user base. At 50 active users, the memory layer's flat fixed cost dominates and the math is a wash. The break-even sits somewhere between 100 and 500 MAU depending on conversation shape.

Scaling curves — 1k, 10k, 100k MAU

Same customer-support agent. 30 turns/user/month average. Conversation history grows linearly to 75k tokens by month 6 then plateaus (because real users repeat themselves and the new content rate slows).

At 1,000 MAU

  • Naive (stuff history): ~$1,130/month inference
  • With memory: ~$180/month inference + $359/month memory = $539/month
  • Monthly savings: ~$591

At 10,000 MAU

  • Naive: ~$11,300/month inference
  • With memory: ~$1,800/month inference + ~$2,500/month memory (Plus tier with 10k users) = $4,300/month
  • Monthly savings: ~$7,000

At 100,000 MAU

  • Naive: ~$113,000/month inference
  • With memory: ~$18,000/month inference + ~$12,000/month memory (Max-tier) = $30,000/month
  • Monthly savings: ~$83,000

The savings compound. At 100k users, a memory layer is the difference between a ~$1.36M annual inference line and a ~$360k one. That's before you count the qualitative gains (better answers from focused context, lower latency from smaller prompts, easier debugging from a wiki view of what the agent knows).
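The three tiers can be tabulated with a short script. This sketch copies the monthly figures from the lists above rather than deriving them from any pricing API; `summarize` is a hypothetical helper.

```python
# (MAU, naive inference, memory-path inference, memory-layer fee), dollars per month.
# All figures are copied from the post.
scenarios = [
    (1_000, 1_130, 180, 359),
    (10_000, 11_300, 1_800, 2_500),
    (100_000, 113_000, 18_000, 12_000),
]


def summarize(naive: int, inference: int, memory_fee: int) -> tuple[int, int, float]:
    """Return (total with memory, monthly savings, savings as a share of naive)."""
    total = inference + memory_fee
    savings = naive - total
    return total, savings, savings / naive


for mau, naive, inference, fee in scenarios:
    total, savings, fraction = summarize(naive, inference, fee)
    print(
        f"{mau:>7,} MAU: ${total:,}/mo with memory, saves ${savings:,}/mo "
        f"({fraction:.0%}); annual ${naive * 12:,} vs ${total * 12:,}"
    )
```

The share of spend saved rises with scale in these numbers, from about half at 1k MAU to nearly three quarters at 100k.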

The hidden costs the math misses

The token math is the floor. The real cost of memoryless agents shows up in less obvious places:

  • Latency tax on every turn. Reading 75k tokens of history is slower than reading 1.6k. On frontier models the difference is 2-4 seconds per turn. Users feel it.
  • Context-window ceiling. Even with 2M-token windows, you hit the ceiling on long users. When you do, you're forced to truncate, and whatever truncation rule you pick decides what the agent forgets. Memory layers let you make that decision deliberately.
  • Quality cliff from buried context. A relevant fact buried at position 30,000 in a long history is harder for the model to attend to than the same fact in a short, focused recall block. Recall accuracy improves with shorter, denser context.
  • Cache miss propagation. Any system-prompt change invalidates the cached prefix for every user. The longer your prefix, the more expensive every prompt-engineering iteration becomes.

What this means for your product

Three rules of thumb:

  1. If average user conversation length will exceed ~5k tokens in a typical session, add memory before launch. The math starts favoring memory around there. Retrofitting after you have users is expensive.
  2. If you have 500+ active users and your per-user inference cost is >$0.30/month, run this math against your bill. Most teams discover they're leaving 60-80% of inference spend on the table.
  3. Don't over-recall. The cost-savings depend on top-k staying small (3-7 facts). Agents that call recall() on every turn with k=50 give the savings back. Recall when the model decides it needs context, not on every turn.
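To see how sensitive the bill is to top-k, here is a rough sketch. It assumes ~120 tokens per recalled fact (inferred from the ~600-token block for top-5 facts earlier in this post) and a recall on every turn; both the constant and the function `monthly_input_cost` are assumptions for illustration.

```python
INPUT_PRICE_PER_M = 2.50        # dollars per million input tokens (from the post)
TOKENS_PER_FACT = 600 / 5       # assumption: ~600 tokens for top-5 facts
FIXED_TOKENS = 500 + 500 + 50   # system prompt + recent turns + new message


def monthly_input_cost(k: int, turns: int = 30) -> float:
    """Per-user monthly input cost when every turn recalls k facts."""
    tokens = FIXED_TOKENS + k * TOKENS_PER_FACT
    return tokens * INPUT_PRICE_PER_M / 1_000_000 * turns


for k in (3, 5, 7, 50):
    print(f"k={k:>2}: ${monthly_input_cost(k):.3f}/user/month input")
```

With these assumptions, k=50 costs a little over four times the input spend of k=5, which is the effect rule 3 warns about.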

Numbers in this post are illustrative

Frontier pricing, conversation shapes, cache hit rates, and provider tiers vary. The math doesn't. To run it against your specific workload:

  1. Look at one week of your inference logs
  2. Measure: average input tokens/turn, output tokens/turn, turns/user, active users
  3. Calculate naive monthly cost
  4. Model the memory-augmented version: 1k system + 600 recall + 500 short history + 50 message = ~2.2k input/turn
  5. Compare
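The steps above can be sketched as a small calculator. Everything here is a placeholder to swap for your own log measurements: the `Workload` fields, the default prices, the ~2,150-token memory prompt from step 4, and the `memory_fee` you pass in. It ignores prompt caching, so treat the naive number as an upper bound.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    input_tokens_per_turn: float   # measured from your inference logs
    output_tokens_per_turn: float  # measured from your inference logs
    turns_per_user: float          # per month
    active_users: int


def monthly_cost(w: Workload, input_price_per_m: float = 2.50,
                 output_price_per_m: float = 10.00) -> float:
    """Total monthly inference cost in dollars for a workload."""
    per_turn = (w.input_tokens_per_turn * input_price_per_m
                + w.output_tokens_per_turn * output_price_per_m) / 1_000_000
    return per_turn * w.turns_per_user * w.active_users


def compare(measured: Workload, memory_input_tokens: float = 2_150,
            memory_fee: float = 0.0) -> tuple[float, float, float]:
    """Return (naive cost, memory-augmented cost, monthly savings)."""
    with_memory = Workload(memory_input_tokens, measured.output_tokens_per_turn,
                           measured.turns_per_user, measured.active_users)
    naive_cost = monthly_cost(measured)
    memory_cost = monthly_cost(with_memory) + memory_fee
    return naive_cost, memory_cost, naive_cost - memory_cost


# Placeholder measurements, not real data.
example = Workload(input_tokens_per_turn=20_000, output_tokens_per_turn=200,
                   turns_per_user=30, active_users=1_000)
naive_cost, memory_cost, savings = compare(example, memory_fee=359.0)
print(f"naive ${naive_cost:,.2f}  with memory ${memory_cost:,.2f}  savings ${savings:,.2f}")
```

If the savings come out negative for your workload, you are in one of the cases where the naive approach is fine.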

For most B2C and B2B-SaaS agent products at 1k+ MAU, the answer is clear before you finish the calculation. The real decision in production is which memory layer fits your stack (covered in our evaluation playbook), not whether to add one at all.

Where Ricord fits

We sell a hosted memory layer with a flat-rate paid plan structure ($12/mo annual on Pro). The recurring cost doesn't scale with your usage, which matters for the math above — your inference savings compound, your memory cost stays predictable.

More importantly: the recall block we return is small and focused (top-k=3-7 facts, conflict-resolved, no duplication). That's the variable the cost model is most sensitive to. A memory layer that returns 50-fact dumps gives back the savings; one that returns 5 well-chosen facts compounds them.
