lab 09 | take-home | ~15 min on a laptop, hours at full scale

Stop guessing which memory layer is better. Score it.

LOCOMO (Long Conversations Memory benchmark) is the dataset everyone now cites when comparing memory layers. Each example is a multi-day conversation followed by factual probes the agent should answer correctly only if it actually remembered. This lab is the harness shape plus per-framework configs for Mem0, Letta, and Zep, so you can run a controlled bake-off after the event.

take-home framing

Running the full LOCOMO benchmark eats a chunk of an evening and a non-trivial OpenAI bill, so this is the lab nobody finishes in the workshop window. We ship you the runnable shape. Bring it to work Monday, point it at your own production memory layer, and run a real bake-off when you have an hour and a budget.

what LOCOMO is, in one paragraph

LOCOMO (released alongside the LOCOMO paper, dataset on Hugging Face Hub) contains multi-session conversations between two personas, stretched across many days, plus question-answer pairs that probe whether the agent retained earlier facts. The benchmark asks one thing: did your memory layer surface the right fact when the probe asked for it? In practice that is single-turn QA accuracy on questions that can only be answered from multi-session memory. Mem0, Letta, and Zep all report LOCOMO numbers in their 2026 marketing, so the comparison is finally standardized.

"LOCOMO is the long-conversation memory benchmark we needed. It rewards systems that remember; it punishes systems that hallucinate around a forgotten fact."
arXiv:2402.17753 (LOCOMO) + 2026 follow-ups across Mem0, Letta, Zep blogs
step 1

Pull the dataset.

The canonical dataset lives on Hugging Face. datasets handles the download + caching.

install
pip install datasets openai mem0ai letta-client zep-cloud
load_locomo.py
import sys
sys.stdout.reconfigure(encoding="utf-8")

from datasets import load_dataset

# the dataset id may differ slightly across mirrors; check the LOCOMO repo
# README for the current canonical path. snorkelai/locomo-10 is a common
# mirror that ships a 10-conversation slice ideal for a take-home run.
ds = load_dataset("snorkelai/locomo-10", split="test")

print(f"loaded {len(ds)} examples")
print("fields:", ds.features.keys() if hasattr(ds, "features") else "see schema")

example = ds[0]
print("\n--- EXAMPLE 0 ---")
print(f"sessions: {len(example.get('sessions', []))}")
qas = example.get("qa") or example.get("questions") or []
print(f"questions: {len(qas)}")
if qas:
    print(f"first Q: {qas[0]}")

If snorkelai/locomo-10 is missing, alternatives include locomo-team/locomo or the dataset card linked from the LOCOMO arXiv page. Schema names ("sessions" / "qa" / "questions") vary by mirror; the code above checks both common names for the question list.
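
Because field names vary by mirror, one option is to normalize each example once, before it reaches the adapters or the bench loop. The sketch below is an illustration, not part of the lab files: normalize_example and first_present are hypothetical helpers, and the candidate key lists only cover the shapes mentioned above. Extend them after inspecting ds.features for the mirror you actually load.

```python
# Hypothetical helper, not part of the lab files. The candidate key names
# are the ones mentioned in this lab; adjust them per mirror.
SESSION_KEYS = ("sessions",)
QA_KEYS = ("qa", "questions")


def first_present(record: dict, keys: tuple[str, ...], default):
    """Return the value of the first key that exists and is non-empty."""
    for key in keys:
        value = record.get(key)
        if value:
            return value
    return default


def normalize_example(example: dict) -> dict:
    """Map one raw example onto a fixed {'sessions': [...], 'qa': [...]} shape."""
    qas = []
    for qa in first_present(example, QA_KEYS, []):
        question = qa.get("question") or qa.get("q") or ""
        answer = qa.get("answer") or qa.get("a") or ""
        if question:  # skip probes that carry no question text
            qas.append({"question": question, "answer": str(answer)})
    return {"sessions": first_present(example, SESSION_KEYS, []), "qa": qas}


if __name__ == "__main__":
    raw = {"sessions": [{"dialog": []}], "questions": [{"q": "Who?", "a": "Alice"}]}
    print(normalize_example(raw))
```

With this in place, downstream code can read example["sessions"] and example["qa"] without repeating the fallback chains.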

step 2

Adapter interface: every memory layer becomes one function pair.

To eval N frameworks fairly, wrap each behind the same two methods: load_sessions(sessions) and answer(question). The bench script does not care what is under the hood.

adapters.py
import os
import sys
import time
import uuid

sys.stdout.reconfigure(encoding="utf-8")


class MemoryAdapter:
    name = "base"

    def load_sessions(self, sessions: list[dict]) -> None:
        raise NotImplementedError

    def answer(self, question: str) -> str:
        raise NotImplementedError


# --- Mem0 adapter -------------------------------------------------
class Mem0Adapter(MemoryAdapter):
    name = "mem0"

    def __init__(self) -> None:
        from mem0 import Memory
        from openai import OpenAI
        self.m = Memory()
        self.client = OpenAI()
        self.user_id = f"locomo-{uuid.uuid4().hex[:6]}"

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                self.m.add(f"{speaker}: {text}", user_id=self.user_id)

    def answer(self, question: str) -> str:
        hits = self.m.search(question, user_id=self.user_id, limit=8)
        ctx = "\n".join(f"- {r['memory']}" for r in hits.get("results", []))
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer using only the facts below.\n{ctx}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content


# --- Letta adapter ------------------------------------------------
class LettaAdapter(MemoryAdapter):
    name = "letta"

    def __init__(self) -> None:
        from letta_client import Letta
        self.client = Letta(token=os.environ["LETTA_API_KEY"])
        self.agent = self.client.agents.create(
            name=f"locomo-{uuid.uuid4().hex[:6]}",
            memory_blocks=[{"label": "persona", "value": "You answer based on stored history."}],
            model="openai/gpt-4o-mini",
        )

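    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                # keep the speaker label, as the Mem0 and Zep adapters do,
                # so all three layers ingest the same information
                self.client.agents.messages.create(
                    agent_id=self.agent.id,
                    messages=[{"role": "user", "content": f"{speaker}: {text}"}],
                )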

    def answer(self, question: str) -> str:
        r = self.client.agents.messages.create(
            agent_id=self.agent.id,
            messages=[{"role": "user", "content": question}],
        )
        # API returns a list of assistant messages; grab the last text.
        for msg in reversed(r.messages):
            if getattr(msg, "message_type", None) == "assistant_message":
                return getattr(msg, "content", "")
        return ""


# --- Zep adapter --------------------------------------------------
class ZepAdapter(MemoryAdapter):
    name = "zep"

    def __init__(self) -> None:
        from zep_cloud.client import Zep
        from openai import OpenAI
        self.zep = Zep(api_key=os.environ["ZEP_API_KEY"])
        self.client = OpenAI()
        self.user_id = f"locomo-{uuid.uuid4().hex[:6]}"
        self.session_id = f"sess-{uuid.uuid4().hex[:6]}"
        self.zep.user.add(user_id=self.user_id)
        self.zep.memory.add_session(session_id=self.session_id, user_id=self.user_id)

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            messages = []
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                messages.append({"role": speaker, "role_type": "user", "content": text})
            if messages:
                self.zep.memory.add(session_id=self.session_id, messages=messages)
        time.sleep(10)  # async graph extraction

    def answer(self, question: str) -> str:
        results = self.zep.graph.search(user_id=self.user_id, query=question, limit=8)
        ctx = "\n".join(f"- {r.fact}" for r in (results.edges or []))
        resp = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Answer using only the facts below.\n{ctx}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content
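
A key-free stand-in adapter can smoke-test the bench loop before any tokens are spent. This is a sketch under assumptions: it reuses the session/turn field names from the adapters above, KeywordAdapter is a hypothetical name, and its word-overlap lookup is a toy baseline rather than a real memory layer. The base class is repeated only so the block runs on its own.

```python
import re


class MemoryAdapter:  # same two-method interface as in adapters.py
    name = "base"

    def load_sessions(self, sessions: list[dict]) -> None:
        raise NotImplementedError

    def answer(self, question: str) -> str:
        raise NotImplementedError


def _words(text: str) -> set[str]:
    """Lowercased word set used for the overlap score."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class KeywordAdapter(MemoryAdapter):
    """Toy baseline: returns the stored turn sharing the most words with the question."""
    name = "keyword"

    def __init__(self) -> None:
        self.turns: list[str] = []

    def load_sessions(self, sessions: list[dict]) -> None:
        for sess in sessions:
            for turn in sess.get("dialog", sess.get("turns", [])):
                speaker = turn.get("speaker", "user")
                text = turn.get("text", turn.get("content", ""))
                self.turns.append(f"{speaker}: {text}")

    def answer(self, question: str) -> str:
        if not self.turns:
            return ""
        q = _words(question)
        return max(self.turns, key=lambda t: len(q & _words(t)))


if __name__ == "__main__":
    adapter = KeywordAdapter()
    adapter.load_sessions([{"dialog": [
        {"speaker": "Alice", "text": "My roommate in college was Dana."},
        {"speaker": "Bob", "text": "I have a job interview on Friday."},
    ]}])
    print(adapter.answer("Who was Alice's roommate in college?"))
```

A baseline like this also gives the real frameworks a floor to beat: if a memory layer cannot outscore keyword overlap, the problem is probably in the adapter wiring.
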
step 3

The bench script.

Loop the dataset, run each adapter, compare predicted answer to ground truth via a simple LLM judge. Output one number per framework: accuracy.

bench.py
import sys
sys.stdout.reconfigure(encoding="utf-8")

from datasets import load_dataset
from openai import OpenAI

from adapters import Mem0Adapter, LettaAdapter, ZepAdapter


def judge(predicted: str, gold: str, question: str) -> bool:
    """LLM judge: is the predicted answer factually consistent with gold?"""
    client = OpenAI()
    prompt = (
        "You are a strict grader. Given a question, a gold answer, and a "
        "predicted answer, output exactly one word: YES if the prediction "
        "is factually consistent with gold, otherwise NO.\n\n"
        f"Question: {question}\nGold: {gold}\nPrediction: {predicted}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def run_adapter(adapter, example) -> tuple[int, int]:
    sessions = example.get("sessions", [])
    qas = example.get("qa") or example.get("questions") or []
    adapter.load_sessions(sessions)

    correct = 0
    total = 0
    for qa in qas:
        question = qa.get("question") or qa.get("q") or ""
        gold = qa.get("answer") or qa.get("a") or ""
        if not question:
            continue
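        pred = adapter.answer(question)
        # judge once and reuse the verdict: a second call doubles the cost
        # and can disagree with the first, so the log would not match the tally
        ok = judge(pred, gold, question)
        if ok:
            correct += 1
        total += 1
        print(f"  [{adapter.name}] {'OK ' if ok else 'NO '} {question[:60]}")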
    return correct, total


if __name__ == "__main__":
    ds = load_dataset("snorkelai/locomo-10", split="test")
    # subset for a take-home run; bump for a real eval.
    examples = list(ds)[:3]

    scores = {}
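    for AdapterCls in (Mem0Adapter, LettaAdapter, ZepAdapter):
        c, t = 0, 0
        for ex in examples:
            # fresh adapter per conversation so memories loaded for one
            # example cannot leak into answers for the next
            adapter = AdapterCls()
            ec, et = run_adapter(adapter, ex)
            c += ec
            t += et
        scores[AdapterCls.name] = (c, t)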

    print("\n--- RESULTS ---")
    for name, (c, t) in scores.items():
        acc = c / t if t else 0.0
        print(f"  {name:8s}  {c}/{t}  acc={acc:.3f}")

Run it:

terminal
python bench.py
expected output shape
  [mem0] OK  Who is Alice's roommate in college?
  [mem0] NO  What did Bob say about his job interview?
  ...
  [letta] OK  Who is Alice's roommate in college?
  ...
  [zep] OK  Who is Alice's roommate in college?
  ...

--- RESULTS ---
  mem0      14/20  acc=0.700
  letta     16/20  acc=0.800
  zep       17/20  acc=0.850

Numbers are illustrative; your run will differ. The point of the harness is to make the comparison reproducible on your data, not to reproduce a leaderboard.

step 4

Per-framework env vars.

You need all three keys to run all three adapters. Drop one adapter from the bench loop if you only want to eval one.

env | mem0
export OPENAI_API_KEY="sk-..."
# Mem0 stores locally by default; for prod cloud add:
# export MEM0_API_KEY="..."
env | letta
export LETTA_API_KEY="lk-..."           # from app.letta.com
export OPENAI_API_KEY="sk-..."          # the agent's model
env | zep
export ZEP_API_KEY="z_..."              # from getzep.com
export OPENAI_API_KEY="sk-..."          # the synth model for the final answer
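
To drop adapters automatically instead of editing the bench loop by hand, one approach is to check which keys are set and run only the adapters they cover. A minimal sketch: REQUIRED_ENV and runnable_adapters are hypothetical names, and the key-to-adapter mapping simply mirrors the env blocks above.

```python
import os

# Which environment variables each adapter needs, mirroring the env blocks above.
REQUIRED_ENV = {
    "mem0": ("OPENAI_API_KEY",),
    "letta": ("LETTA_API_KEY", "OPENAI_API_KEY"),
    "zep": ("ZEP_API_KEY", "OPENAI_API_KEY"),
}


def runnable_adapters(env=None) -> list[str]:
    """Names of adapters whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, keys in REQUIRED_ENV.items()
        if all(env.get(key) for key in keys)
    ]


if __name__ == "__main__":
    print("will run:", runnable_adapters())
```

In bench.py the returned names could filter the adapter tuple, for example by keeping only classes whose name attribute is in the list.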

what counts as a meaningful score

One run on three examples is a smoke test, not an eval. To say something defensible, run at least 50 examples per framework, draw two random subsets per framework as a variance check, and fix the sampling seed so re-runs are comparable. Budget roughly $10-30 of OpenAI tokens for a 50-example run across three frameworks. Latency varies hugely: Zep is slow on writes (async graph extraction) but fast on reads, Letta is the inverse, and Mem0 is fastest end-to-end but has the simplest memory model.
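
The seeded-subset idea can be sketched as follows. seeded_subsets and summarize are hypothetical helpers, the dataset size of 200 in the demo is a placeholder, and the accuracies passed to summarize are made-up inputs standing in for real bench results on each subset.

```python
import random
import statistics


def seeded_subsets(n_examples: int, size: int, n_subsets: int = 2, seed: int = 0) -> list[list[int]]:
    """Draw n_subsets random index lists; the same seed always yields the same lists."""
    if size > n_examples:
        raise ValueError("subset size exceeds dataset size")
    rng = random.Random(seed)  # private RNG so other code cannot disturb the draw
    return [sorted(rng.sample(range(n_examples), size)) for _ in range(n_subsets)]


def summarize(accuracies: list[float]) -> tuple[float, float]:
    """Mean accuracy and spread (max minus min) across subsets."""
    return statistics.mean(accuracies), max(accuracies) - min(accuracies)


if __name__ == "__main__":
    subsets = seeded_subsets(n_examples=200, size=50, n_subsets=2, seed=42)
    print([len(s) for s in subsets])
    mean, spread = summarize([0.70, 0.74])  # placeholder accuracies
    print(f"mean={mean:.3f} spread={spread:.3f}")
```

If the spread between subsets is as large as the gap between two frameworks, the run is too small to rank them.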

The honest framing: LOCOMO accuracy is one number among many. Cost per write, p95 read latency, and operational simplicity also matter. The harness above only measures recall accuracy. For production tradeoffs, time each load_sessions() and answer() call with time.perf_counter() and chart latency too.
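
One way to collect those timings without touching the adapters is a small wrapper around each call. This is a sketch: LatencyLog is a hypothetical class, and the percentile uses the nearest-rank method, which is one of several reasonable definitions.

```python
import math
import time
from collections import defaultdict


class LatencyLog:
    """Collects wall-clock durations per label and reports nearest-rank percentiles."""

    def __init__(self) -> None:
        self.samples: dict[str, list[float]] = defaultdict(list)

    def timed(self, label: str, fn, *args, **kwargs):
        """Call fn(*args, **kwargs), record its duration under label, return its result."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # recorded even if fn raises, so failures still show up in latency
            self.samples[label].append(time.perf_counter() - start)

    def percentile(self, label: str, pct: float) -> float:
        """Nearest-rank percentile (pct in 0-100) of the durations recorded under label."""
        data = sorted(self.samples[label])
        if not data:
            raise ValueError(f"no samples for {label!r}")
        rank = max(1, math.ceil(pct * len(data) / 100))
        return data[rank - 1]


if __name__ == "__main__":
    log = LatencyLog()
    for _ in range(5):
        log.timed("demo.sleep", time.sleep, 0.01)
    print(f"p95 = {log.percentile('demo.sleep', 95):.4f}s")
```

In the bench loop this would look like log.timed("mem0.answer", adapter.answer, question), with a separate label for the load_sessions write path.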

going further | swap the dataset

The same adapter interface works against any multi-session conversational eval. LongMemEval and PerLTQA are sibling datasets with different probe styles. Point load_dataset at the alternate path and map its fields onto the session, turn, and QA keys used above; the bench loop itself does not change, because it only ever calls load_sessions and answer.

troubleshooting

"DatasetNotFoundError: snorkelai/locomo-10". Mirror naming has drifted. Browse huggingface.co/datasets for the current LOCOMO paths; common variants include locomo, long-conversations-memory. The schema fields the code keys into ("sessions", "dialog", "qa") can also differ between mirrors, so if a run reports zero sessions or zero questions, print ds.features and adjust the .get() fallbacks.

Letta agent creation rate-limits. Free tier limits agent creation per hour. Reuse one agent across examples by lifting the self.client.agents.create(...) call out of __init__ and into a fixture you reset between runs.

LLM judge is unreliable. Single-call YES/NO is noisy at the margin. For more robust scoring: call the judge twice with swapped argument order, only count YES if both agree. Cost doubles; signal stabilizes.
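
The two-pass check can be sketched as a wrapper that takes any judge function with the same (predicted, gold, question) signature as judge in bench.py. agree_both_orders is a hypothetical name. The sketch skips the second call when the first verdict is already NO, so the cost doubles only for answers that pass the first check.

```python
from typing import Callable

# A judge takes (predicted, gold, question) and returns True for YES.
Judge = Callable[[str, str, str], bool]


def agree_both_orders(judge: Judge, predicted: str, gold: str, question: str) -> bool:
    """Count a prediction correct only if the judge says YES in both argument orders."""
    if not judge(predicted, gold, question):
        return False  # a NO on the first pass cannot become an agreed YES
    # second pass: swap which answer is presented as gold
    return judge(gold, predicted, question)


if __name__ == "__main__":
    # stand-in judge for a dry run: case-insensitive substring match
    def fake_judge(predicted: str, gold: str, question: str) -> bool:
        return predicted.lower() in gold.lower()

    print(agree_both_orders(fake_judge, "Dana", "dana", "Who is the roommate?"))
    print(agree_both_orders(fake_judge, "Dana", "Dana Smith", "Who is the roommate?"))
```

Requiring agreement in both directions is strict: a prediction that is correct but more detailed than gold may be rejected, so spot-check a few disagreements before trusting the stricter number.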

Predicted answer is too long for the judge to parse. Cap adapter outputs at ~200 tokens via the adapter prompt ("answer in one sentence") so the judge sees focused comparisons.
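
A possible shape for that cap: put the one-sentence instruction in the system prompt and keep a hard word limit as a backstop. Both helpers are hypothetical, and the default of 150 words is only a rough stand-in for ~200 tokens, since the real ratio depends on the tokenizer.

```python
def build_messages(context: str, question: str) -> list[dict]:
    """Chat messages that ask for a one-sentence answer grounded in the retrieved facts."""
    system = (
        "Answer using only the facts below. Answer in one sentence.\n"
        f"{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


def clip_words(text: str, max_words: int = 150) -> str:
    """Hard backstop: keep at most max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


if __name__ == "__main__":
    print(build_messages("- Alice's roommate was Dana.", "Who was Alice's roommate?"))
    print(clip_words("one two three four five", max_words=3))
```

The adapters' answer methods could pass build_messages(ctx, question) as the messages argument and run the model output through clip_words before returning it.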