LOCOMO (Long Conversations Memory benchmark) is the dataset most often cited when comparing memory layers. Each example is a multi-day conversation followed by factual probes that the agent can answer correctly only if it actually remembered. This lab provides the harness skeleton plus per-framework configs for Mem0, Letta, and Zep, so you can run a controlled bake-off after the event.
LOCOMO (released alongside the LOCOMO paper, dataset on Hugging Face Hub) contains multi-session conversations between two personas, stretched across many days, plus question-answer pairs that probe whether the agent retained earlier facts. The benchmark scores one thing: did your memory layer surface the right fact when the probe asked for it? In other words, it measures single-turn QA accuracy on questions that require multi-turn memory. The 2026 Mem0, Letta, and Zep marketing all report results on it, so the comparison is standardized at last.
The canonical dataset lives on Hugging Face. The datasets library handles the download and caching.
pip install datasets openai mem0ai letta-client zep-cloud
import sys
sys.stdout.reconfigure(encoding="utf-8")

from datasets import load_dataset

# the dataset id may differ slightly across mirrors; check the LOCOMO repo
# README for the current canonical path. snorkelai/locomo-10 is a common
# mirror that ships a 10-conversation slice ideal for a take-home run.
ds = load_dataset("snorkelai/locomo-10", split="test")
print(f"loaded {len(ds)} examples")
print("fields:", ds.features.keys() if hasattr(ds, "features") else "see schema")

example = ds[0]
print("\n--- EXAMPLE 0 ---")
print(f"sessions: {len(example.get('sessions', []))}")
qas = example.get("qa") or example.get("questions") or []
print(f"questions: {len(qas)}")
if qas:
    print(f"first Q: {qas[0]}")
If snorkelai/locomo-10 is missing, alternatives include locomo-team/locomo or the dataset card linked from the LOCOMO arXiv page. Schema names ("sessions" / "qa" / "questions") vary by mirror; the code above checks both common question-field names.
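Because field names vary by mirror, it can help to centralize the fallback logic instead of repeating `a or b or []` everywhere. The following is a minimal sketch; `first_present` is a hypothetical helper name, not part of the datasets library, and the sample record is made up for illustration.

```python
from typing import Any, Mapping, Sequence


def first_present(record: Mapping[str, Any], keys: Sequence[str], default: Any = None) -> Any:
    """Return the value of the first key in `keys` that holds a non-empty value."""
    for key in keys:
        value = record.get(key)
        if value:  # skips missing keys, None, "", [] and {}
            return value
    return default


# Made-up record shaped like one of the mirror variants described above.
sample = {"sessions": [{"dialog": []}], "questions": [{"q": "Who?", "a": "Alice"}]}
qas = first_present(sample, ("qa", "questions"), default=[])
print(f"questions: {len(qas)}")
```

The same helper can resolve "dialog" versus "turns" or "text" versus "content" inside an adapter, so a new mirror only means extending one tuple of candidate names.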
To eval N frameworks fairly, wrap each behind the same two methods: load_sessions(sessions) and answer(question). The bench script does not care what is under the hood.
import os
import sys
import time
import uuid

sys.stdout.reconfigure(encoding="utf-8")


class MemoryAdapter:
    name = "base"

    def load_sessions(self, sessions: list[dict]) -> None:
        raise NotImplementedError

    def answer(self, question: str) -> str:
        raise NotImplementedError

# --- Mem0 adapter -------------------------------------------------
class Mem0Adapter(MemoryAdapter):
    name = "mem0"

    def __init__(self) -> None:
        from mem0 import Memory
        from openai import OpenAI
        self.m = Memory()
        self.client = OpenAI()
        self.user_id = f"locomo-{uuid.uuid4().hex[:6]}"

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                self.m.add(f"{speaker}: {text}", user_id=self.user_id)

    def answer(self, question: str) -> str:
        hits = self.m.search(question, user_id=self.user_id, limit=8)
        ctx = "\n".join(f"- {r['memory']}" for r in hits.get("results", []))
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer using only the facts below.\n{ctx}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

# --- Letta adapter ------------------------------------------------
class LettaAdapter(MemoryAdapter):
    name = "letta"

    def __init__(self) -> None:
        from letta_client import Letta
        self.client = Letta(token=os.environ["LETTA_API_KEY"])
        self.agent = self.client.agents.create(
            name=f"locomo-{uuid.uuid4().hex[:6]}",
            memory_blocks=[{"label": "persona", "value": "You answer based on stored history."}],
            model="openai/gpt-4o-mini",
        )

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                # Keep the speaker prefix, as the Mem0 adapter does, so
                # "who said X" probes stay answerable.
                self.client.agents.messages.create(
                    agent_id=self.agent.id,
                    messages=[{"role": "user", "content": f"{speaker}: {text}"}],
                )

    def answer(self, question: str) -> str:
        r = self.client.agents.messages.create(
            agent_id=self.agent.id,
            messages=[{"role": "user", "content": question}],
        )
        # API returns a list of messages; grab the last assistant text.
        for msg in reversed(r.messages):
            if getattr(msg, "message_type", None) == "assistant_message":
                return getattr(msg, "content", "")
        return ""

# --- Zep adapter --------------------------------------------------
class ZepAdapter(MemoryAdapter):
    name = "zep"

    def __init__(self) -> None:
        from zep_cloud.client import Zep
        from openai import OpenAI
        self.zep = Zep(api_key=os.environ["ZEP_API_KEY"])
        self.client = OpenAI()
        self.user_id = f"locomo-{uuid.uuid4().hex[:6]}"
        self.session_id = f"sess-{uuid.uuid4().hex[:6]}"
        self.zep.user.add(user_id=self.user_id)
        self.zep.memory.add_session(session_id=self.session_id, user_id=self.user_id)

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            messages = []
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                messages.append({"role": speaker, "role_type": "user", "content": text})
            if messages:
                self.zep.memory.add(session_id=self.session_id, messages=messages)
        time.sleep(10)  # async graph extraction

    def answer(self, question: str) -> str:
        results = self.zep.graph.search(user_id=self.user_id, query=question, limit=8)
        ctx = "\n".join(f"- {r.fact}" for r in (results.edges or []))
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer using only the facts below.\n{ctx}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content
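Before spending API credits, it can be useful to check that the harness plumbing works with an adapter that needs no keys at all. The sketch below is a hypothetical baseline (the `KeywordAdapter` name and its word-overlap scoring are illustrative assumptions, not part of any of the three frameworks). It exposes the same two methods, load_sessions and answer, and returns the stored turn that shares the most words with the question.

```python
import re


class KeywordAdapter:
    """Hypothetical no-API baseline: keeps turns in a list, answers by word overlap."""

    name = "keyword"

    def __init__(self) -> None:
        self.turns: list[str] = []

    @staticmethod
    def _tokens(text: str) -> set[str]:
        # Lowercased word set; apostrophes stay inside words ("alice's").
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                self.turns.append(f"{speaker}: {text}")

    def answer(self, question: str) -> str:
        if not self.turns:
            return ""
        q = self._tokens(question)
        # Return the raw turn with the largest word overlap with the question.
        return max(self.turns, key=lambda t: len(q & self._tokens(t)))


# Made-up two-turn session for a quick smoke test.
adapter = KeywordAdapter()
adapter.load_sessions([{"dialog": [
    {"speaker": "Alice", "text": "My college roommate was Dana."},
    {"speaker": "Bob", "text": "The job interview went badly."},
]}])
print(adapter.answer("Who was Alice's roommate in college?"))
```

A baseline like this also gives the accuracy numbers a floor: if a memory framework barely beats raw keyword lookup on your subset, that is worth knowing before comparing frameworks with each other.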
Loop the dataset, run each adapter, compare predicted answer to ground truth via a simple LLM judge. Output one number per framework: accuracy.
import sys
sys.stdout.reconfigure(encoding="utf-8")

from datasets import load_dataset
from openai import OpenAI

from adapters import Mem0Adapter, LettaAdapter, ZepAdapter


def judge(predicted: str, gold: str, question: str) -> bool:
    """LLM judge: is the predicted answer factually consistent with gold?"""
    client = OpenAI()
    prompt = (
        "You are a strict grader. Given a question, a gold answer, and a "
        "predicted answer, output exactly one word: YES if the prediction "
        "is factually consistent with gold, otherwise NO.\n\n"
        f"Question: {question}\nGold: {gold}\nPrediction: {predicted}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content or ""
    return verdict.strip().upper().startswith("YES")


def run_adapter(adapter, example) -> tuple[int, int]:
    sessions = example.get("sessions", [])
    qas = example.get("qa") or example.get("questions") or []
    adapter.load_sessions(sessions)
    correct = 0
    total = 0
    for qa in qas:
        question = qa.get("question") or qa.get("q") or ""
        gold = qa.get("answer") or qa.get("a") or ""
        if not question:
            continue
        pred = adapter.answer(question)
        # Judge once and reuse the verdict, so the log line always matches
        # the score and each question costs one judge call, not two.
        ok = judge(pred, gold, question)
        if ok:
            correct += 1
        total += 1
        print(f"  [{adapter.name}] {'OK ' if ok else 'NO '} {question[:60]}")
    return correct, total


if __name__ == "__main__":
    ds = load_dataset("snorkelai/locomo-10", split="test")
    # subset for a take-home run; bump for a real eval.
    examples = list(ds)[:3]
    scores = {}
    for AdapterCls in (Mem0Adapter, LettaAdapter, ZepAdapter):
        c, t = 0, 0
        for ex in examples:
            # Fresh adapter per conversation, so memories loaded for one
            # example cannot leak into the answers for the next.
            adapter = AdapterCls()
            ec, et = run_adapter(adapter, ex)
            c += ec
            t += et
        scores[AdapterCls.name] = (c, t)
    print("\n--- RESULTS ---")
    for name, (c, t) in scores.items():
        acc = c / t if t else 0.0
        print(f"  {name:8s} {c}/{t}  acc={acc:.3f}")
Run it:
python bench.py
  [mem0] OK  Who is Alice's roommate in college?
  [mem0] NO  What did Bob say about his job interview?
  ...
  [letta] OK  Who is Alice's roommate in college?
  ...
  [zep] OK  Who is Alice's roommate in college?
  ...

--- RESULTS ---
  mem0     14/20  acc=0.700
  letta    16/20  acc=0.800
  zep      17/20  acc=0.850
Numbers are illustrative; your run will differ. The point of the harness is to make the comparison reproducible on your data, not to reproduce a leaderboard.
You need all three keys to run all three adapters. Drop one adapter from the bench loop if you only want to eval one.
export OPENAI_API_KEY="sk-..."
# Mem0 stores locally by default; for prod cloud add:
# export MEM0_API_KEY="..."

export LETTA_API_KEY="lk-..."   # from app.letta.com
export OPENAI_API_KEY="sk-..."  # the agent's model

export ZEP_API_KEY="z_..."      # from getzep.com
export OPENAI_API_KEY="sk-..."  # the synth model for the final answer
One run on three examples is a smoke test, not an eval. To say something defensible, run at least 50 examples per framework, use two random subsets per framework (a variance check), and fix the seed so re-runs are comparable. Budget roughly $10-30 of OpenAI tokens for a 50-example run across three frameworks. Latency varies hugely: Zep is slow on writes (async graph extraction) but fast on reads, Letta is the inverse, and Mem0 is fastest end-to-end but has the simplest memory model.
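The subset-and-seed advice can be sketched with the standard library alone. `seeded_subsets` is a hypothetical helper name, and the sizes below are placeholders; the only assumption is that you index examples by position.

```python
import random


def seeded_subsets(n_examples: int, subset_size: int,
                   n_subsets: int = 2, seed: int = 0) -> list[list[int]]:
    """Draw `n_subsets` random index lists, reproducibly for a given seed."""
    rng = random.Random(seed)  # private RNG; leaves the global random state alone
    size = min(subset_size, n_examples)  # never ask for more than exists
    return [sorted(rng.sample(range(n_examples), size)) for _ in range(n_subsets)]


subsets = seeded_subsets(n_examples=200, subset_size=50, n_subsets=2, seed=7)
for i, idx in enumerate(subsets):
    print(f"subset {i}: {len(idx)} examples, first five indices {idx[:5]}")
```

Each index list can then be passed to the dataset (for a Hugging Face Dataset, `ds.select(idx)`), and every framework should be scored on the same lists. If the two subsets give clearly different accuracies for the same framework, the gap between frameworks is probably within noise.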
The honest framing: LOCOMO accuracy is one number among many. Cost per write, p95 read latency, and operational simplicity also matter. The harness above measures only recall accuracy. For production tradeoffs, wrap time.perf_counter() around each load_sessions() and answer() call and chart latency too.
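One way to add that timing without touching the adapters is a thin wrapper that exposes the same two methods. This is a sketch under assumptions: `TimedAdapter`, `p95`, and `EchoAdapter` are hypothetical names, and the stand-in adapter exists only so the example runs without API keys.

```python
import time
from statistics import quantiles


class TimedAdapter:
    """Hypothetical wrapper: records wall-clock seconds for writes and reads."""

    def __init__(self, inner) -> None:
        self.inner = inner
        self.name = inner.name
        self.write_s: list[float] = []
        self.read_s: list[float] = []

    def load_sessions(self, sessions: list[dict]) -> None:
        start = time.perf_counter()
        try:
            self.inner.load_sessions(sessions)
        finally:
            self.write_s.append(time.perf_counter() - start)

    def answer(self, question: str) -> str:
        start = time.perf_counter()
        try:
            return self.inner.answer(question)
        finally:
            # Recorded even if the inner call raises.
            self.read_s.append(time.perf_counter() - start)


def p95(samples: list[float]) -> float:
    """95th percentile; falls back to the max (or 0.0) with fewer than 2 samples."""
    if len(samples) < 2:
        return max(samples, default=0.0)
    return quantiles(samples, n=20, method="inclusive")[-1]


class EchoAdapter:
    """Stand-in adapter so the sketch runs without API keys."""

    name = "echo"

    def load_sessions(self, sessions: list[dict]) -> None:
        pass

    def answer(self, question: str) -> str:
        return question


timed = TimedAdapter(EchoAdapter())
timed.load_sessions([])
for q in ("a", "b", "c"):
    timed.answer(q)
print(f"{timed.name}: writes={len(timed.write_s)} reads={len(timed.read_s)} "
      f"p95_read={p95(timed.read_s):.6f}s")
```

Because the wrapper has the same interface, it can be handed to run_adapter in place of the raw adapter. Note that load_sessions is timed as one call per conversation, so it measures total write time for that conversation rather than per-turn write latency.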
The same adapter interface works against any multi-session conversational eval. LongMemEval and PerLTQA are sibling datasets with different probe styles. Point load_dataset at the alternate path, map that dataset's session and question fields onto the names the harness reads, and the bench loop carries over unchanged because it talks to the adapters only through load_sessions and answer.
"DatasetNotFoundError: snorkelai/locomo-10". Mirror naming has drifted. Browse huggingface.co/datasets for the current LOCOMO paths; common variants include locomo, long-conversations-memory. Field names ("sessions", "dialog", "qa") can differ between mirrors too, so print one example's keys after switching and extend the fallbacks in the adapters if needed.
Letta agent creation rate-limits. Free tier limits agent creation per hour. Reuse one agent across examples by lifting the self.client.agents.create(...) call out of __init__ and into a fixture you reset between runs.
LLM judge is unreliable. Single-call YES/NO is noisy at the margin. For more robust scoring, call the judge twice with the gold and predicted answers in swapped order, and count YES only if both calls agree. Cost doubles; signal stabilizes.
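The two-order check can be sketched independently of any model client. In the example below, `judge_both_orders` and `fake_ask` are hypothetical names; `ask` stands for any function that sends a prompt to your judge model and returns its text reply, and the fake version only exists so the sketch runs offline.

```python
from typing import Callable


def judge_both_orders(ask: Callable[[str], str], question: str,
                      gold: str, predicted: str) -> bool:
    """Return True only if the judge says YES with gold first AND with prediction first."""
    header = (
        "You are a strict grader. Output exactly one word: YES if the "
        "prediction is factually consistent with gold, otherwise NO.\n\n"
        f"Question: {question}\n"
    )
    gold_line = f"Gold: {gold}\n"
    pred_line = f"Prediction: {predicted}\n"
    for body in (gold_line + pred_line, pred_line + gold_line):
        verdict = ask(header + body)
        if not verdict.strip().upper().startswith("YES"):
            return False  # first NO decides; the second call is skipped
    return True


def fake_ask(prompt: str) -> str:
    """Offline stand-in for a judge model: YES if the prediction line mentions Dana."""
    prediction = prompt.split("Prediction:")[1].splitlines()[0]
    return "YES" if "Dana" in prediction else "NO"


print(judge_both_orders(fake_ask, "Who was Alice's roommate?", "Dana", "Her roommate was Dana."))
print(judge_both_orders(fake_ask, "Who was Alice's roommate?", "Dana", "Bob"))
```

Returning early on the first NO means the doubled cost applies only to answers the judge initially accepts. With a real client, `ask` would wrap the same chat-completion call the single-shot judge uses.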
Predicted answer is too long for the judge to parse. Cap adapter outputs at ~200 tokens via the adapter prompt ("answer in one sentence") so the judge sees focused comparisons.
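As a sketch of that cap, the request can be assembled in one place so every retrieval-style adapter uses the same instruction and limit. `concise_request` is a hypothetical helper; the model name mirrors the adapters above, and `max_tokens` is the Chat Completions parameter used here as a hard backstop in case the model ignores the one-sentence instruction.

```python
def concise_request(ctx: str, question: str, max_tokens: int = 200) -> dict:
    """Build chat-completion keyword arguments that keep answers short."""
    system = (
        "Answer using only the facts below. Answer in one sentence.\n"
        f"{ctx}"
    )
    return {
        "model": "gpt-4o-mini",
        "max_tokens": max_tokens,  # hard cap on the generated answer length
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    }


request = concise_request("- Alice's roommate was Dana.", "Who was Alice's roommate?")
print(request["messages"][0]["content"])
```

Inside an adapter's answer method the dictionary would be unpacked into the existing call, for example `self.client.chat.completions.create(**request)`. The Letta adapter answers through its own agent, so there the one-sentence instruction would have to go into the question text or the persona block instead.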