There are two ways to win a memory benchmark. Spend more tokens, or retrieve smarter.
We chose the second. Contextier scored 91.6% on the full 500-question LongMemEval benchmark using 2,807 tokens of context per query on average, with answer generation running on the lowest-cost model tier available.
Put that next to the other serious memory systems on the same benchmark. Zep reports 90.2% at 4,408 tokens. Mem0’s latest algorithm reports 93.4% at 6,787 tokens. We beat Zep on accuracy while using fewer tokens, and we trail Mem0 by 1.8 points while using less than half their context on a cheaper model.
Best accuracy per token of the three. That trade matters more than the headline number, and we will show you why.
What LongMemEval Actually Tests
LongMemEval is an ICLR 2025 benchmark built to measure long-term memory in AI assistants. It is harder than most memory benchmarks because it does not reward dumping the whole conversation into context. It tests whether a system can extract the right facts, store them without losing or corrupting them, and retrieve exactly what a question needs across many sessions.
It spans 500 questions across six categories:
- single-session-assistant: recalling what the assistant said earlier
- single-session-user: recalling what the user said earlier
- single-session-preference: applying a stated preference
- multi-session: connecting facts spread across separate conversations
- temporal-reasoning: reasoning about when things happened and in what order
- knowledge-update: handling facts that change over time
The last three are where most memory systems fall apart.
The Numbers, Category by Category
| Category | Accuracy |
|---|---|
| single-session-assistant | 98.2% |
| single-session-preference | 96.7% |
| single-session-user | 94.3% |
| temporal-reasoning | 91.0% |
| multi-session | 90.2% |
| knowledge-update | 85.9% |
| Overall | 91.6% |
Temporal reasoning, the category we flagged as our weakest a few months ago, now sits at 91.0%. Multi-session recall, which requires stitching evidence together across separate conversations, is at 90.2%. That is the category that separates us from the field: Mem0 reports 86.5% and Zep reports 83.5% on multi-session. On the single hardest “connect the dots across sessions” category, we lead both.
The Part the Headline Number Hides: Token Cost
Accuracy is only half the story. The other half is what you pay for each answer.
| System | Accuracy | Reported tokens per query |
|---|---|---|
| Mem0 (latest) | 93.4% | 6,787 (mean) |
| Contextier | 91.6% | 2,807 (mean), 2,854 (median) |
| Zep | 90.2% | 4,408 (median) |
We use the smallest context budget of the three. Against Zep, that means higher accuracy and fewer tokens. Against Mem0, it means we give up 1.8 points of accuracy to spend less than half the tokens, on a cheaper model.
Every token in the context window is money and latency. At scale, a system that needs 6,787 tokens per query consumes about 2.4 times as much context as one that needs 2,807, which means a higher bill and slower responses. Multiply that across millions of requests and the gap is not academic. For most production workloads, near-frontier accuracy at the lowest context cost is the better deal.
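To make the arithmetic concrete, the snippet below recomputes the comparison from the table's reported figures. "Accuracy points per 1,000 tokens" is an illustrative metric chosen for this sketch, not an official benchmark measure, and the comparison inherits the table's mix of mean and median token counts.

```python
# Figures copied from the comparison table: (accuracy %, reported tokens per query).
# Note: Zep's token figure is a median; the other two are means.
systems = {
    "Mem0 (latest)": (93.4, 6787),
    "Contextier": (91.6, 2807),
    "Zep": (90.2, 4408),
}


def accuracy_per_1k_tokens(accuracy_pct, tokens):
    """Illustrative efficiency metric: accuracy points per 1,000 context tokens."""
    return accuracy_pct / (tokens / 1000)


def token_ratio(a, b):
    """How many times more context system `a` uses than system `b`."""
    return systems[a][1] / systems[b][1]


for name, (acc, tok) in systems.items():
    print(f"{name}: {accuracy_per_1k_tokens(acc, tok):.1f} points per 1k tokens")

print(f"Mem0 vs Contextier context: {token_ratio('Mem0 (latest)', 'Contextier'):.2f}x")
print(f"Zep vs Contextier context: {token_ratio('Zep', 'Contextier'):.2f}x")
```

Actual cost depends on per-token pricing for the model in use, so treat the ratios as a statement about context size rather than a billing estimate.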
The Model Stack Behind the Numbers
For full transparency, here is the stack that produced these results.
- Extraction: `gpt-5.4-mini` pulls facts out of each session. Extraction quality is what the whole pipeline rests on, so this is the one place in the pipeline where we spend on the mini tier.
- Answer generation: defaults to `gpt-5.4-nano`, the lowest-cost tier. This is the model reading the retrieved context and producing the final answer.
- Embeddings: `text-embedding-3-small` for vector retrieval.
- Judge: `gpt-5.4-mini` scores each answer against the gold label. We use the mini tier here because the nano tier was too strict and rejected correct answers, which would have understated our score.
The headline cost story holds because the expensive work is one-time extraction, not per-query answering. You extract a session once and answer against it many times.
How We Got Here
The jump to 91.6% came from changes to the extraction and retrieval pipeline, not from spending more tokens.
Independent extraction per session. We stopped feeding prior facts into the extractor. Each session is extracted on its own, and a separate conflict-resolution pass handles deduplication and supersession afterward. This single change moved knowledge-update from 55% to 86% and multi-session from 65% to 90%. When extraction sees old facts, it anchors to them and misses new information. Isolating each session fixed that.
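The shape of this change can be sketched in a few lines. This is a simplified illustration, not the production pipeline: the `Fact` fields, `extract_session`, and `resolve_conflicts` are hypothetical names, the "extractor" is a stand-in that receives pre-parsed triples instead of calling a model, and every attribute is assumed to hold a single value.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    # Hypothetical minimal schema for the sketch.
    subject: str
    attribute: str
    value: str
    session_index: int
    active: bool = True


def extract_session(session_index, triples):
    """Stand-in for the LLM extractor. It sees only this session's content,
    never facts from earlier sessions."""
    return [Fact(s, a, v, session_index) for (s, a, v) in triples]


def resolve_conflicts(facts):
    """Separate pass run after extraction: drop repeats of the current value
    and mark older values as superseded (assumes single-valued attributes)."""
    latest = {}
    resolved = []
    for fact in sorted(facts, key=lambda f: f.session_index):
        key = (fact.subject, fact.attribute)
        prev = latest.get(key)
        if prev is not None and prev.value == fact.value:
            continue  # same value already stored: deduplicate
        if prev is not None:
            prev.active = False  # newer value supersedes the older one
        latest[key] = fact
        resolved.append(fact)
    return resolved


sessions = [
    [("user", "city", "Berlin"), ("user", "pet", "cat")],
    [("user", "city", "Lisbon"), ("user", "pet", "cat")],
]
all_facts = [f for i, triples in enumerate(sessions) for f in extract_session(i, triples)]
resolved = resolve_conflicts(all_facts)
active = sorted((f.attribute, f.value) for f in resolved if f.active)
print(active)
```

The point of the split is that the extractor never has a chance to anchor on old facts; reconciliation happens only once every session has been read on its own terms.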
Date-aware temporal search. Temporal questions were drowning in irrelevant recent facts. We now filter the temporal retrieval strategy to return only facts that carry a date. Less noise reaching the fusion step means cleaner ranking.
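As a rough illustration of the filter, assume each stored fact has an optional date field. The `StoredFact` type and `temporal_candidates` function are hypothetical names for this sketch, not part of any published API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class StoredFact:
    text: str
    event_date: Optional[date]  # None when the fact carries no date


def temporal_candidates(ranked_facts: List[StoredFact]) -> List[StoredFact]:
    """Keep only dated facts, preserving the incoming rank order."""
    return [f for f in ranked_facts if f.event_date is not None]


ranked = [
    StoredFact("Likes hiking", None),
    StoredFact("Started new job", date(2024, 3, 4)),
    StoredFact("Prefers tea over coffee", None),
    StoredFact("Moved to Lisbon", date(2024, 6, 1)),
]
dated = temporal_candidates(ranked)
print([f.text for f in dated])
```

Undated facts remain available to the other retrieval strategies; the filter only narrows what the temporal strategy contributes to fusion.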
Tuned reciprocal rank fusion. We run several retrieval strategies in parallel and fuse them with reciprocal rank fusion. We rebalanced the weights to give more credit to exact keyword matches, which matter for entity-specific questions, and widened the candidate pool each strategy contributes before fusion.
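For readers unfamiliar with the technique, here is a minimal sketch of weighted reciprocal rank fusion, where each fact scores the sum of `weight / (k + rank)` over the strategies that returned it. The strategy names, the weights, the `pool_size` parameter, and the conventional `k = 60` constant are assumptions for illustration, not the values used in production.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60, pool_size=20):
    """Fuse ranked ID lists. `pool_size` caps how many candidates each
    strategy contributes before fusion."""
    scores = defaultdict(float)
    for strategy, ranked_ids in rankings.items():
        weight = weights.get(strategy, 1.0)
        for rank, fact_id in enumerate(ranked_ids[:pool_size], start=1):
            scores[fact_id] += weight / (k + rank)
    # Highest score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda fact_id: (-scores[fact_id], fact_id))


rankings = {
    "vector": ["a", "b", "c"],
    "graph": ["a", "c", "b"],
    "keyword": ["b", "c", "a"],
}

equal = weighted_rrf(rankings, {"vector": 1.0, "graph": 1.0, "keyword": 1.0})
keyword_heavy = weighted_rrf(rankings, {"vector": 1.0, "graph": 1.0, "keyword": 2.0})
print(equal)
print(keyword_heavy)
```

With equal weights, fact `a` wins because two strategies rank it first. Doubling the keyword weight lifts `b`, the exact keyword match, to the top, which is the kind of shift the rebalancing is meant to produce for entity-specific questions.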
Supersession wired through extraction. When the extractor detects that a fact has changed, it deactivates the old fact and carries the correct validity window forward. Knowledge that updates over time stays correct instead of piling up as contradictions.
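One way to picture a validity window is a pair of dates on each fact, closed off when a newer fact replaces it. The sketch below is a simplified model under that assumption; `TimedFact`, `supersede`, and `value_as_of` are hypothetical names, and the store is a plain list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class TimedFact:
    key: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still valid"
    active: bool = True


def supersede(store: List[TimedFact], new_fact: TimedFact) -> None:
    """Deactivate the current fact for this key and close its validity
    window at the moment the new fact takes effect."""
    for old in store:
        if old.key == new_fact.key and old.active:
            old.active = False
            old.valid_to = new_fact.valid_from
    store.append(new_fact)


def value_as_of(store: List[TimedFact], key: str, when: date) -> Optional[str]:
    """Return the value that was valid on `when`, if any."""
    for fact in store:
        in_window = fact.valid_from <= when and (
            fact.valid_to is None or when < fact.valid_to
        )
        if fact.key == key and in_window:
            return fact.value
    return None


store: List[TimedFact] = []
supersede(store, TimedFact("user.city", "Berlin", date(2023, 1, 10)))
supersede(store, TimedFact("user.city", "Lisbon", date(2024, 6, 1)))

print(value_as_of(store, "user.city", date(2023, 8, 1)))
print(value_as_of(store, "user.city", date(2024, 7, 1)))
```

Keeping the old fact with a closed window, instead of deleting it, is what lets "where did I live last year?" and "where do I live now?" both be answered from the same store.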
The result is a small, high-quality context window. The model sees the right facts instead of a wall of text, which is exactly why we can run on a cheaper model and still compete.
Why Less Context Wins
The instinct is to give the model everything: more context, more chances to find the answer. In practice, more context often makes answers worse.
A large context window buries the relevant fact among dozens of distractors. The model hedges, picks the wrong entity, or hallucinates a plausible answer. A small, precise context window gives it exactly what it needs and nothing to trip over.
This is the same lesson our LOCOMO results showed: a handful of well-chosen facts beat a hundred thousand tokens of raw history. LongMemEval confirms it. Better retrieval lets a smaller, cheaper model match systems that lean on far larger context budgets.
Where the Remaining Gap Lives
We are transparent about the questions we still miss. They cluster in three places:
- Temporal multi-hop: questions that require chaining several dated facts together, where the extractor misses one link.
- Off-by-one counting: “how many times did I…” questions where extraction captures most but not all of N items.
- Knowledge-update edge cases: subtle distinctions between similar entities as facts change.
These are extraction-recall problems, not retrieval problems. The facts that get stored are retrieved correctly. The gap is getting every relevant fact out of every message in the first place. That is the next frontier, and closing it is how we make up the remaining points on accuracy without giving up the token advantage.
How This Shows Up in Contextier
The pipeline behind these numbers is the same memory system available in Contextier. Every fact carries temporal bounds, confidence scores, entity links, and supersession history. Retrieval combines graph traversal, vector similarity, and full-text search, fused and reranked before anything reaches the model.
If you are building an AI product that needs to remember across sessions, reason about time, and stay correct as facts change, this is what the foundation looks like. You get near-frontier accuracy without the token bill, and every stored fact is governed, audited, and traceable.
This Is Just the Beginning
91.6% is a milestone, not a destination. We are building the most advanced AI memory system there is, and we are not close to done.
The reason is simple. A model is only as good as the context it is given. The same model, asked the same question, produces a brilliant answer with the right context and a useless one without it. Most of the gap between “impressive demo” and “reliable product” is not the model. It is the memory feeding it. Get the context right and everything downstream gets better: fewer hallucinations, sharper reasoning, answers that actually account for what happened three conversations ago.
That is the bet Contextier is built on. Better memory is the highest-leverage investment you can make in an AI product, and it is where we are pointing everything.
The Takeaway
Memory is not a search problem. It is a knowledge management problem: extract well, resolve conflicts, track time, and retrieve precisely. Do those four things right and you can come within two points of the leader on accuracy while spending less than half the tokens on a cheaper model.
91.6% on LongMemEval, at 2,807 tokens of context on average, on the lowest-cost model tier. Accuracy is one axis. Cost is the other. We are built to win on both, and we are just getting started.
Want to see how Contextier’s memory system performs on your use case? Reach out at [email protected].