VCN #33 · Total Recall · 2026-05-20 · Frontier Tower
● live · F10 annex

memory for agentic systems / vcn #33 / the archive that does not forget

Rayyan Zahid · w/ Michalis Vasileiadis · Eric Mockler · Devinder Sodhi
> system boot
> loading memory v.2026
> mapping four wings · one archive
● 200 OK · room F10
first principles · what we even mean

BUILDERS USE THE WORD MEMORY FOR FIVE DIFFERENT THINGS.

01 · context

context-window engineering

Fit more useful tokens in the working set. Compaction, tool-result clearing, prompt cache. Solves cost and latency. Does not solve persistence.

02 · retrieval

external retrieval (RAG)

Pull the right chunks at runtime, inject into context. Vector index or knowledge graph. Brings in facts the model never saw. The next session starts fresh.

03 · state

persistent state across sessions

Facts, preferences, prior conversation summaries that survive a restart. Mem0, Letta, Zep. "Remember that I prefer x" across days.

04 · taxonomy

procedural · episodic · semantic

How to do something. What happened when. What is true. Borrowed from cognitive psychology. Production stacks lump all three; the frontier splits them.

05 · meta

memory-as-a-tool

Anthropic's shift. The model gets a memory tool with read / write verbs and decides when to use it. Storage is markdown outside the context window.

→ research.md §1. confusing these is the most common reason agent stacks break.
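A minimal sketch of definition 05, memory-as-a-tool: markdown files on disk plus read / write verbs a model could call. The class name `FileMemoryTool`, the directory layout, and the verb names are hypothetical, picked for illustration. This is not Anthropic's tool schema.

```python
import tempfile
from pathlib import Path


class FileMemoryTool:
    """Toy memory tool: markdown files that live outside the context window."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str) -> Path:
        # Keep every file inside the memory root (no "../" escapes).
        path = (self.root / name).resolve()
        if self.root.resolve() not in path.parents:
            raise ValueError(f"path escapes memory root: {name}")
        return path

    def write(self, name: str, text: str) -> str:
        self._path(name).write_text(text, encoding="utf-8")
        return f"wrote {name}"

    def read(self, name: str) -> str:
        path = self._path(name)
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def list_files(self) -> list[str]:
        return sorted(p.name for p in self.root.glob("*.md"))


memory_root = Path(tempfile.mkdtemp()) / "memory"
tool = FileMemoryTool(memory_root)
tool.write("preferences.md", "- prefers jade-teal accents\n")

# A fresh object stands in for a new session: the context is gone, the file is not.
next_session = FileMemoryTool(memory_root)
recalled = next_session.read("preferences.md")
```

The point of the design is that storage and working set are decoupled: the second object never saw the first write, yet it can read the file back.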

the architecture · four wings, one archive

FOUR WINGS.
one archive.

I · wing one · context

context engineering

Anthropic's memory tool. Compaction at the 180K trigger. Tool-result clearing. Sub-agent delegation. Memory inside the model's working set.

II · wing two · retrieval

external retrieval

Vector index versus knowledge graph. The 2026 debate. Vectors win on simplicity and speed. Graphs win on temporal reasoning.

III · wing three · state

persistent state

The center wing. Three Apache-2.0 frameworks worth knowing tonight. Zep. Letta. Mem0. Each has a different shape.

IV · wing four · procedural

procedural memory

CLAUDE.md, .cursorrules, AGENTS.md. The DIY pattern that quietly turned into the killer use case. No framework. Highest leverage today.

spine of the talk · each wing has its own demo / lab · walk in order

FIG. 01 · the four wings of memory (top-down floor plan). Wing 01 context: compaction, clearing, caching (anthropic memory tool). Wing 02 retrieval: vectors, graphs, embeddings (pgvector, zep, graphiti). Wing 03 state: facts that survive a restart (letta, mem0, zep, cognee). Wing 04 procedural: rules files, the diy that won (claude.md, cursorrules, agents.md).

wing I · context engineering

A MEMORY TOOL + MARKDOWN FILES OUTSIDE THE CONTEXT.

compaction. Claude saves a summary at a configurable token trigger, default near 180K. The working set keeps moving forward.

tool-result clearing. Drop stale tool outputs from the context before they crowd out new ones.

sub-agent delegation. Offload a subtask into a fresh context. The parent agent never sees the noise.

memory-as-tool. A `memory` tool with read / write verbs. Storage is markdown files outside the context. The model decides when to use it.

Anthropic · engineering blog · 2026 Anthropic frames memory as one of four levers for effective context engineering. Compaction, tool-result clearing, sub-agent delegation, and the memory tool. The model engineers its own working set.

→ anthropic.com / engineering / effective-context-engineering-for-ai-agents. solves cost and latency. does not solve persistence across sessions.
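Two of the levers above, tool-result clearing and compaction, can be sketched in a few lines. This is a toy under stated assumptions: the `Message` type, the word-count "tokenizer", and the placeholder summary are all hypothetical, and a real system would ask the model to write the summary.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "user", "assistant", or "tool"
    text: str


def count_tokens(messages: list[Message]) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return sum(len(m.text.split()) for m in messages)


def clear_stale_tool_results(messages: list[Message], keep_last: int = 1) -> list[Message]:
    """Replace all but the newest `keep_last` tool outputs with a short stub."""
    tool_indexes = [i for i, m in enumerate(messages) if m.role == "tool"]
    stale = set(tool_indexes[:-keep_last]) if keep_last else set(tool_indexes)
    return [
        Message("tool", "[tool result cleared]") if i in stale else m
        for i, m in enumerate(messages)
    ]


def compact(messages: list[Message], trigger: int, keep_recent: int = 2) -> list[Message]:
    """Once past the trigger, fold older turns into a single summary message."""
    if count_tokens(messages) <= trigger or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Placeholder summary; a real system would have the model write it.
    summary = Message("assistant", f"[summary of {len(older)} earlier messages]")
    return [summary, *recent]


history = [
    Message("user", "list the files"),
    Message("tool", "a.py b.py c.py d.py e.py f.py"),
    Message("user", "now read a.py"),
    Message("tool", "print('hello')"),
    Message("assistant", "a.py prints hello"),
]
cleared = clear_stale_tool_results(history)
compacted = compact(cleared, trigger=10)
```

Both functions shrink the working set; neither writes anything that a new session could read, which is why this wing does not solve persistence.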

wing I · demo card · the recipe

LAB 1. CLAUDE CODE'S MEMORY.MD, LIVE.

A CARD CATALOG FOR YOUR AGENT.

init a CLAUDE.md. watch the agent self-curate. cause a compaction. observe the survivor set.

7 min · terminal · editor · one repo
  • claude # start a session in the repo
  • /init # inside the session. scaffolds CLAUDE.md + opens auto-memory
  • # talk to the agent. let it work for 10 minutes.
  • ls .claude/projects/<hash>/memory/ # the agent has been writing
  • cat .claude/projects/<hash>/memory/MEMORY.md # the index it built
  • # force a compaction. see what survives.
  • /compact # inside the session. the working set turns over
  • cat .claude/projects/<hash>/memory/MEMORY.md # same index, still there

→ MEMORY.md is the librarian's index card stack. context comes and goes; the catalog persists.

wing II · the 2026 retrieval debate

VECTORS WIN ON SIMPLICITY. GRAPHS WIN ON TIME.

B · vector

VECTOR DB

"give me passages that look like this one."

pgvector, Pinecone, Qdrant, Weaviate, Chroma
similarity over embeddings
simple, fast, default for most stacks
cannot answer "what was true on this date"
C · graph

KNOWLEDGE GRAPH

"give me the chain of facts that held last tuesday."

Zep / Graphiti, Cognee, standalone graph stores
entities, edges, timestamps
temporal reasoning, chains of evidence
heavier write path, async entity extraction

→ arXiv:2602.05665 (graph-based agent memory taxonomy, feb 2026). arXiv:2601.03236 (MAGMA, the "graphs eat vectors" thesis, jan 2026).
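A small sketch of why plain similarity search cannot answer the date question. The 3-d "embeddings" below are hand-made, hypothetical values, not output from a real model, and the passages are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda text: cosine(query, index[text]), reverse=True)
    return ranked[:k]


# Passage -> toy embedding. Nothing in the index records when a fact held.
index = {
    "Ray works at Frontier Tower": [0.9, 0.1, 0.0],
    "Ray worked at a previous employer": [0.8, 0.2, 0.1],
    "VCN happens on wednesdays": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], index)
```

Both employment passages come back because both look like the query. Which one was true on a given date is not something the similarity score can express; that needs timestamps on the facts themselves.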

wing III · persistent state · the center wing

THREE FRAMEWORKS.
one wing. apache 2.0 all the way down.

III · a · first up

ZEP

temporal graph · graph-owns-truth

Graphiti under the hood. Every edge carries `valid_at` and `invalid_at`. The killer question it answers cleanly: "what was true last tuesday."

III · b · second

LETTA

stateful runtime · agent-owns-blocks

Three tiers. Core (always in context), Recall (conversation history), Archival (vector store). The runtime persists the blocks the agent edits.

III · c · third

MEM0

managed API · api-owns-truth

Three lines. add(), search(), delete(). Vector + graph + reranker hidden behind a managed layer. The substrate choice is theirs, not yours.

→ talk section order is locked: zep first (the temporal case lands cleanest), then letta, then mem0. comparator table follows.

wing III · a · zep / graphiti · the temporal-graph case

EVERY EDGE CARRIES FOUR TIMESTAMPS.

Zep tracks memory in temporal edges where the graph owns the truth about when a fact was valid (per valid_at and invalid_at in graphiti_core/edges.py). The canonical case for graph-beats-vector when you need temporal reasoning. Rasmussen et al. · arXiv:2501.13956 · zep team

"what was true on this date?"

→ Preston Rasmussen, Liu, Liu, Mocrii, Klein, Chalef. *Zep: A Temporal Knowledge Graph Architecture for Agent Memory.* arXiv:2501.13956. nominative-fair-use citation per apache 2.0 §6.

FIG. 02 · one edge, four timestamps (graphiti_core/edges.py · EntityEdge). Edge: rayyan works_at frontier tower. created_at 2024-09-01 · valid_at 2025-03-15 · invalid_at null (still true) · expired_at null (live). "graphs own the truth about WHEN."

wing III · a · zep · the minimal recipe

ADD AN EPISODE. ASK WHAT WAS TRUE.

zep · python client · minimal example · apache 2.0
from zep_python.client import Zep
from datetime import datetime, timezone

client = Zep(api_key="...")
graph_id = "ray_memory"

# write a temporal episode. the graph extracts entities + edges.
client.graph.add(
    graph_id=graph_id,
    type="message",
    data="Ray prefers jade-teal accents on event posters.",
    reference_time=datetime.now(tz=timezone.utc),
)

# later, in a new session, ask the graph what holds today.
results = client.graph.search(
    graph_id=graph_id,
    query="Ray's color preference for posters",
    search_filters={"valid_at": datetime.now(tz=timezone.utc)},
)

for edge in results.edges:
    print(edge.fact, edge.valid_at, edge.invalid_at)

→ runnable version with full provenance walk-through at /lab/zep. one graph, three episodes, one temporal query. 8 minutes.

wing III · b · letta · the stateful-runtime case

THE AGENT OWNS THE BLOCKS. THE RUNTIME PERSISTS THEM.

Letta's three-tier memory model. Core (always-on), Recall (conversation history), Archival (vector store). Maps to what production agents actually ship versus what builders think they ship. letta.com / blog / agent-memory · cofounded by Sarah Wooders + Charles Packer · MemGPT 2023
I
core memory

always-on

Lives in the system prompt. Edited by the agent itself via tool calls. The persistent "who we are."

II
recall memory

conversation history

Searchable log of past messages. Pulled on demand into the context window.

III
archival memory

vector store

Long-term facts and documents. Tool-callable. Where the cold knowledge lives.

the distinction. zep keeps truth in the graph (graph-owns-truth). letta keeps state in agent blocks (agent-owns-blocks). mem0 keeps it behind an API (api-owns-truth). same problem. three different owners.

FIG. 03 · letta, three tiers from hot to cold. Tier 01 core: always in context, ~2KB, pinned. Tier 02 recall: searchable history, queried on demand. Tier 03 archival: vector store, on demand, unbounded. Fill = % live in context window. Source: letta.com/blog/agent-memory.

wing III · b · letta · the minimal recipe

SPAWN AN AGENT. EDIT ITS CORE. COME BACK TOMORROW.

letta · python client · minimal example · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# create an agent. core memory blocks are first-class state.
agent = client.agents.create(
    name="ray_assistant",
    memory_blocks=[
        {"label": "human", "value": "Ray runs Vibe Coding Nights at Frontier Tower."},
        {"label": "persona", "value": "I keep track of what Ray ships."},
    ],
    model="anthropic/claude-sonnet-4-6",        # provider/model routing string
    embedding="openai/text-embedding-3-small",  # embeddings follow the same pattern
)

# the agent can edit its own blocks via tool calls during a turn.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "remember that I prefer jade-teal accents."}],
)

# tomorrow. fresh process. same agent. same memory.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "what color did I tell you I liked?"}],
)

→ runnable at /lab/letta. spin up a local letta server, create an agent, watch core memory survive a process restart. 10 minutes.

wing III · c · mem0 · the managed-api case

THREE LINES. ZERO SUBSTRATE DECISIONS.

Mem0 wraps vector + graph + reranker behind a managed memory API, so a builder can add(), search(), and forget across sessions without picking a stack. mem0.ai / blog / state-of-ai-agent-memory-2026 · mem0/mem0 readme
add()

Write a message, an event, a preference. The api decides what is fact-worthy and what is chatter.

search()

Ask for relevant memories. Returns ranked facts with source pointers. The substrate is hidden.

delete()

Forget a memory or a whole user. The forgetting policy is explicit, the implementation is not yours to maintain.

→ the pitch is operational simplicity. mem0 hides the substrate choice the other two surface. tradeoff: less control, less to break.

wing III · c · mem0 · the minimal recipe

ADD. SEARCH. THAT IS IT.

mem0 · python client · minimal example · apache 2.0
from mem0 import Memory

m = Memory()

# write some memories tied to a user.
m.add("Ray prefers jade-teal accents on event posters.", user_id="ray")
m.add("VCN happens wednesdays at 7pm at Frontier Tower.", user_id="ray")
m.add("Ray hosted VCN #32 last week. Topic: tool-use UX.", user_id="ray")

# later, in a new session, retrieve what is relevant.
results = m.search(
    query="what color should the poster be?",
    user_id="ray",
    limit=5,
)

for hit in results["results"]:
    print(hit["memory"], hit["score"])

# forget a memory if it goes stale.
m.delete(memory_id=results["results"][0]["id"])

→ runnable at /lab/mem0. three messages, one search, one delete. the substrate is hidden by design. 5 minutes.

wing IV · procedural memory · the diy pattern that won

NO FRAMEWORK. JUST MARKDOWN. HIGHEST LEVERAGE TODAY.

CLAUDE.md .cursorrules AGENTS.md .windsurfrules .clinerules codex.md rules/

The pattern emerged from nowhere and quietly became table stakes. A static markdown file that tells the agent how this codebase works, what conventions to follow, what to never touch. No vector store. No write path. Just words a human and a model both read.

→ cursor `.cursorrules` was first. claude code `CLAUDE.md` is the canonical agent-readable version. zero infrastructure cost. the most widely-shipped memory pattern of 2026.

wing IV · claude code's auto-memory · the receipt

THE AGENT IS WRITING ITS OWN INDEX CARDS RIGHT NOW.

Claude Code v2.1.59+ ships with auto-memory default on. As the model works, it self-curates entries in MEMORY.md. User preferences. Project conventions. Past mistakes. The librarian builds the catalog while you build the code.

→ first agent to curate its own procedural memory by default. lab: /lab/claude-md. init a new CLAUDE.md in 90 seconds, watch it grow.

the screenshot · pick a layer in 30 seconds

FOUR LAYERS. FOUR JOBS. ONE TABLE.

| framework | substrate | write-policy | retrieval-pattern | reach for it when |
|---|---|---|---|---|
| zep · graphiti | temporal knowledge graph (valid_at / invalid_at) | async entity + edge extraction, with provenance | graph traversal + temporal filter | you need "what was true on this date" semantics |
| letta · memgpt v2 | three tiers (core / recall / archival) | agent edits its own core blocks via tool calls | core lives in prompt, archival via search tool | you are building agent-native and want one runtime to own state |
| mem0 · managed api | hidden (vector + graph + reranker) | api decides fact-worthiness from messages | add() / search() / delete() | you have a working agent and want to bolt memory on without restructure |
| CLAUDE.md · auto-memory | static markdown files in the repo | agent writes when it notices something worth keeping | read on every session start, indexed by topic | you want zero-config persistence for a coding agent or single-user assistant |

→ the differences are about who owns truth: the graph (zep), the agent (letta), the api (mem0), or the file (CLAUDE.md). pick by ownership, not by feature list.

discord · the slide that breaks the grammar

Your agent's memory is now an attack surface.

arXiv:2604.16548 · memory poisoning · live · apr 2026 · mnemonic sovereignty

In A Survey on the Security of Long-Term Memory in LLM Agents (apr 2026), researchers formalize a new attack surface. If your agent reads memory from untrusted sources, an attacker can implant a false memory. The agent then acts on a false belief. Worse: it presents the belief as a learned preference of yours. The paper coins the goal: mnemonic sovereignty. The agent's memory has to remain provably yours.

01 · attacker drops a crafted artifact into a corpus your agent later reads.
02 · agent ingests it. mem0 / zep / letta has no way to know the source is hostile.
03 · false fact is now in the store, indistinguishable from your real preferences.
04 · next session, agent recommends the attacker's shitcoin. cites you as the source.

trust the surface, get exploited.

→ arXiv:2604.16548 · the inversion of this slide IS the attack. nobody is talking about this yet. you should.
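One possible mitigation for step 03, sketched under assumptions: tag every record with where it came from and act only on sources you trust. The record shape, the source labels, and the trust policy are hypothetical. They are not taken from the paper or from any of the three frameworks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    fact: str
    source: str  # where the fact came from, e.g. "user_chat" or "web_page"


# Hypothetical policy: only facts the user stated directly count as preferences.
TRUSTED_SOURCES = {"user_chat"}


def split_by_trust(
    records: list[MemoryRecord],
) -> tuple[list[MemoryRecord], list[MemoryRecord]]:
    """Separate records the agent may act on from records it should quarantine."""
    trusted = [r for r in records if r.source in TRUSTED_SOURCES]
    quarantined = [r for r in records if r.source not in TRUSTED_SOURCES]
    return trusted, quarantined


store = [
    MemoryRecord("prefers jade-teal accents", source="user_chat"),
    MemoryRecord("user wants to buy ExampleCoin", source="web_page"),  # implanted
]
trusted, quarantined = split_by_trust(store)
```

Provenance does not detect a hostile artifact. It only keeps an ingested claim from being presented as something you said.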

open problems · what builders are wrestling with

NONE OF THESE ARE SOLVED. PICK ONE AND BUILD.

forgetting · when does old become stale. time-decay versus LRU versus salience-scored.
consolidation · episodic logs balloon. how do you compress them into semantic facts without losing edge cases.
conflict · two memories disagree. user said x last week, y today. newest wins, source-weighted, or ask.
tiering · hot facts in context (fast, expensive). cold in vector (slow, cheap). how do you tier without flapping.
lost in the middle · long contexts degrade in the middle. memory has to beat "just throw more in the window."
self-improving · agents updating their own memory schema based on what they keep getting wrong. letta is investing here.
cross-agent · two agents on the same task. shared memory without race conditions. almost no production answers yet.
eval · LOCOMO exists but real-world memory eval is hard. how do you A/B "did the agent get better."

→ research.md §4 · pull these directly from the 2026 frontier papers (arXiv:2603.07670, arXiv:2512.13564). every one is an unbuilt company.
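The forgetting item names time-decay and salience scoring as candidate policies. A minimal sketch that combines the two, assuming an exponential half-life: the 30-day half-life, the 0.2 threshold, and the sample memories are illustrative values only, not recommendations from the cited papers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    text: str
    salience: float  # 0..1, how much it mattered when written
    written_at: datetime


def decayed_score(memory: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Salience that halves every `half_life_days` since the memory was written."""
    age_days = (now - memory.written_at).total_seconds() / 86400
    return memory.salience * 0.5 ** (age_days / half_life_days)


def forget(memories: list[Memory], now: datetime, threshold: float = 0.2) -> list[Memory]:
    """Keep only memories whose decayed score is still at or above the threshold."""
    return [m for m in memories if decayed_score(m, now) >= threshold]


now = datetime(2026, 5, 20, tzinfo=timezone.utc)
memories = [
    Memory("prefers jade-teal accents", 0.8, now - timedelta(days=30)),
    Memory("asked about parking once", 0.8, now - timedelta(days=90)),
    Memory("deck due monday", 0.3, now),
]
kept = forget(memories, now)
```

A low-salience memory written today outlives a high-salience one from three months ago. Whether that is the right call is exactly the open question.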

production patterns · already shipping

THE PATTERNS THE FRAMEWORK HEADLINES MISS.

01 · rules files

.cursorrules + CLAUDE.md

Static markdown. Project-scoped behavior. The first widely-adopted procedural memory pattern. Most-shipped memory layer of 2026.

02 · auto-memory

Claude Code MEMORY.md

Default on in v2.1.59+. Agent self-curates. First agent that writes its own procedural memory.

03 · threads

OpenAI Assistants threads

Persistent conversation threads with the API managing context. Lower-level than Letta. Built into the platform.

04 · plan files

Replit Agent + Devin

Plan files persisted across runs. The agent remembers what it already tried so it does not retry doomed paths.

05 · checkpointer

LangGraph PostgresSaver

Process-level resume + time-travel debug. Not memory of knowledge but memory of execution. Pair with Mem0 / Zep for facts.

06 · per-repo

AGENTS.md / codex.md

The OpenAI flavor of the rules-file pattern. One markdown file per repo. Read on every session.

→ research.md §5. the rules-files pattern in general (`.cursorrules`, `CLAUDE.md`, `.windsurfrules`, `AGENTS.md`) is the layer that quietly won for most teams.

pick your layer · decision tree

ONE QUESTION. ONE ANSWER. START THERE.

want zero-config persistence for a coding agent?

CLAUDE.md auto-memory

building agent-native, want one runtime to own loop and state?

Letta

have a working agent, just bolt memory on?

Mem0

need temporal reasoning ("what was true last tuesday")?

Zep / Graphiti

need process-level resume and time-travel debug?

LangGraph checkpointers (pair with one above)

→ the questions are not competing. most production stacks pair a knowledge layer with a process layer. pick by question, not by brand.

tonight · hands-on hour

FOUR LABS. ONE HOUR.

by 10pm, your agent remembers something it didn't at 7.

→ scan the QR on the deck chrome to pair your phone. no urls to memorize. labs run offline-friendly where possible.

vcn cadence · wednesdays · 7pm · frontier tower

THE NEXT SEVEN. ALREADY ON THE CALENDAR.

→ free, builder-only, no pitches. RSVPs on luma.com/vibe-coding-nights. doors 7pm, talks 7:30, social 9 to 10.

hosts · vibe coding nights

THE PEOPLE WHO PUT THIS ROOM TOGETHER.

facilitator

Rayyan Zahid

Immersive Commons. Facilitator of VCN and tonight's speaker.

cohost

Michalis Vasileiadis

Otto / GSD 2.0. AI security and agentic infrastructure operator.

logistics

Eric Mockler

Frontier Tower F11 Health and Longevity. Pre-meet, room flow, food run.

tower lead

Devinder Sodhi

Frontier Tower lead. Booking, building access, the reason the F10 annex exists.

→ thanks to the F10 annex, Frontier Tower house staff, and every builder who showed up with a forgetting bug written down.

your agent remembers what you told it last tuesday.

slide 25 / annex opens here

the live talk ended.
the reference begins.

everything from this slide onward is the reference annex. denser. deeper. citation-heavy. written for the hands-on hour and for the link the host sends after the room empties. pick a section. skim it. come back when your agent breaks.

frameworks deep

s27 to s38
  • 27-30 · zep / graphiti internals
  • 31-34 · letta agent runtime
  • 35-38 · mem0 production patterns

context + procedural

s39 to s46
  • 39-42 · anthropic memory tool, compaction, tool clearing, sub-agent delegation
  • 43-46 · claude code auto-memory, hooks, rules-file comparator

frontier + eval

s47 to s56
  • 47-48 · memory eval, locomo benchmark
  • 49 · multi-agent shared memory
  • 50-51 · memory poisoning, defense
  • 52-54 · hybrid stacks, forgetting policies, consolidation
  • 55-56 · reading list + close

three surfaces · pick one

THE RUNNABLE PART. PICK A LANE.

appendix a · zep deep #1 · the schema

one edge. four lifecycle stamps. a clock per fact.

Zep tracks memory in temporal edges where the graph owns the truth about when a fact was valid (per valid_at and invalid_at in graphiti_core/edges.py). The canonical case for graph-beats-vector when you need temporal reasoning. verbatim · graphiti_core/edges.py + arXiv:2501.13956

created_at is database time. The row was written. It says nothing about the world.

valid_at is event time. When did the fact become true. Often before created_at because the extractor catches up after the conversation happens.

invalid_at is the supersession boundary. When did the fact stop being true. null means the edge is still considered live. A new conflicting episode sets this on the older edge.

expired_at is observation time. When did the system notice the supersession. This is the gap between the world changing and your graph knowing.

the killer query · what was true on this date
match · edges where source_node = ?ray and predicate = works_at
where · valid_at <= 2025-10-01 AND (invalid_at IS NULL OR invalid_at > 2025-10-01)
return · edge.fact, target_node.name, edge.valid_at

→ Rasmussen et al. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Source code: github.com/getzep/graphiti, file graphiti_core/edges.py. Nominative-fair-use citation per apache 2.0 §6.
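The date predicate in the killer query, and the supersession rule described for invalid_at, can be sketched in plain Python. This is an illustration, not Graphiti's implementation: the `Edge` dataclass is a cut-down stand-in for EntityEdge, `supersede` assumes a single-valued predicate, and the "previous employer" edge is invented sample data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Edge:
    source: str
    predicate: str
    target: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # None means still considered live


def true_on(edges: list[Edge], source: str, predicate: str, when: datetime) -> list[Edge]:
    """valid_at <= when AND (invalid_at IS NULL OR invalid_at > when)."""
    return [
        e for e in edges
        if e.source == source and e.predicate == predicate
        and e.valid_at <= when
        and (e.invalid_at is None or e.invalid_at > when)
    ]


def supersede(edges: list[Edge], new_edge: Edge) -> None:
    """Close any live edge with the same source and predicate, then add the new one."""
    for e in edges:
        if (e.source, e.predicate) == (new_edge.source, new_edge.predicate) and e.invalid_at is None:
            e.invalid_at = new_edge.valid_at
    edges.append(new_edge)


def day(y: int, m: int, d: int) -> datetime:
    return datetime(y, m, d, tzinfo=timezone.utc)


edges: list[Edge] = []
supersede(edges, Edge("ray", "works_at", "previous employer", day(2023, 1, 1)))
supersede(edges, Edge("ray", "works_at", "frontier tower", day(2025, 3, 15)))

now_true = [e.target for e in true_on(edges, "ray", "works_at", day(2025, 10, 1))]
then_true = [e.target for e in true_on(edges, "ray", "works_at", day(2024, 6, 1))]
```

The old edge is never deleted. It is closed with an invalid_at, which is what lets the same query answer for any past date.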

appendix a · zep deep #2 · the loop

extraction is eventual. the graph is the consistency boundary.

Every episode you write fires a chain of LLM calls. Entity extraction. Edge extraction. Conflict resolution. Embedding. The graph commits a few seconds after the call returns. Reads against the graph during that window get the old truth, on purpose.

That window is roughly 2 to 8 seconds for a single episode against a hot graph. The sleep() calls in the /lab/zep walkthrough are not arbitrary. They wait for stage 5 to commit before stage 1 of the next read kicks in. Treat the graph as eventually consistent and design around it.

The model behind stages 02 to 04 is a single LLM call each by default (configurable via llm_client). The graph never sees raw text after extraction. Once stage 02 names a node, every later edge reuses that uuid. The cost of the loop dominates the cost of the storage layer.

retry semantics. stages 02 to 04 are wrapped in tenacity-style exponential backoff. an OpenAI 429 or a malformed JSON response retries up to N times. failures past N are surfaced on the episode row as a status enum, not raised, so a single bad episode never blocks the ingest queue.

→ reference: Rasmussen et al. arXiv:2501.13956 §4 "Pipeline." Source: graphiti_core/graphiti.py::add_episode. Latency numbers approximate, measured against gpt-4o-mini on a 200-node graph.
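The retry semantics described above (exponential backoff, then a status on the row instead of an exception) look roughly like this. A hedged sketch: `EpisodeStatus`, `run_with_backoff`, and the attempt count of 3 are hypothetical names and values, not Graphiti's actual code or defaults.

```python
import time
from enum import Enum
from typing import Callable


class EpisodeStatus(Enum):
    DONE = "done"
    FAILED = "failed"


def run_with_backoff(
    stage: Callable[[], object],
    max_attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[EpisodeStatus, object]:
    """Call `stage`; on error wait base_delay * 2**attempt and retry. Never raises."""
    for attempt in range(max_attempts):
        try:
            return EpisodeStatus.DONE, stage()
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Out of attempts: report the failure as data so the ingest queue keeps moving.
    return EpisodeStatus.FAILED, None


calls = {"n": 0}


def flaky_extraction() -> list[str]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return ["ray", "frontier tower"]


def always_fails() -> list[str]:
    raise ValueError("malformed JSON")


delays: list[float] = []
status, entities = run_with_backoff(flaky_extraction, sleep=delays.append)
failed_status, failed_result = run_with_backoff(always_fails, sleep=lambda _s: None)
```

Passing `sleep` in as a parameter keeps the example fast and makes the backoff schedule observable in `delays`.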

appendix a · zep deep #3 · the modes

two search calls, three result shapes. pick by question type.

Zep ships graph.search and memory.search_sessions as the two public read paths. They hit different indexes, return different shapes, and answer different questions. The wrong one looks broken on the right query.

mode 01

graph search

Vector search over fact embeddings. Returns ranked edges. Fast, semantic, no graph traversal. Best when the question is "what do you know about X," not "how is X connected."

# shape {"edges": [ {"fact": "...", "score": 0.81, "valid_at": "..."} ]}

use when: you want recall over the whole graph and you trust embedding similarity.

mode 02

graph search · hybrid

Vector recall, then graph expansion (1 or 2 hops from each hit), then a cross-encoder reranker. The slowest path. The most accurate. Cite chains come back attached.

# shape {"edges": [...], "nodes": [...], "episodes": [...] } # reranker has reordered

use when: the agent will cite the fact to a user. you need provenance, not just relevance.

mode 03

memory search sessions

Conversation-scoped recall. Searches only within the threads tied to a session_id. Returns message-level chunks plus a context summary the runtime can paste.

# shape {"messages": [...], "context": "summary text", "facts": [...] }

use when: the user is in a conversation and you want "what did we talk about last week" rather than "what is true."

→ docs: help.getzep.com/searching-the-graph. Hybrid uses Cohere rerank by default in Zep Cloud, swappable to bge-reranker in self-host. The choice between modes 01 and 02 is usually a latency budget call, not a quality call.
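The difference between mode 01 and mode 02 is easier to see as a pipeline. This toy uses word overlap as a stand-in for both the embedding search and the cross-encoder reranker, and a hand-written adjacency map as the graph. All names and data here are hypothetical. It illustrates the shape of recall, expand, rerank, not Zep's internals.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared words."""
    return len(set(query.split()) & set(text.split()))


def vector_recall(query: str, facts: dict[str, str], k: int) -> list[str]:
    """Mode 01 stand-in: rank facts by similarity, keep the top k that match at all."""
    ranked = sorted(facts, key=lambda fid: overlap(query, facts[fid]), reverse=True)
    return [fid for fid in ranked[:k] if overlap(query, facts[fid]) > 0]


def expand(hits: list[str], neighbors: dict[str, list[str]], hops: int) -> list[str]:
    """Add facts reachable within `hops` edges of any hit, preserving order."""
    found = list(hits)
    frontier = list(hits)
    for _ in range(hops):
        next_frontier = []
        for fid in frontier:
            for neighbor in neighbors.get(fid, []):
                if neighbor not in found:
                    found.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found


def hybrid_search(query, facts, neighbors, k=1, hops=1) -> list[str]:
    """Mode 02 stand-in: recall, then graph expansion, then rerank."""
    candidates = expand(vector_recall(query, facts, k), neighbors, hops)
    return sorted(candidates, key=lambda fid: overlap(query, facts[fid]), reverse=True)


facts = {
    "f1": "ray works at frontier tower",
    "f2": "frontier tower hosts vcn",
    "f3": "vcn happens on wednesdays",
}
neighbors = {"f1": ["f2"], "f2": ["f1", "f3"], "f3": ["f2"]}

query = "where does ray work"
vector_only = vector_recall(query, facts, k=1)
one_hop = hybrid_search(query, facts, neighbors, k=1, hops=1)
two_hops = hybrid_search(query, facts, neighbors, k=1, hops=2)
```

The expansion step is what surfaces connected facts that share no words with the query, at the cost of extra work per hit.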

appendix a · zep deep #4 · the deployment

zep cloud wraps graphiti. graphiti standalone is the engine alone.

option a

zep cloud

what · managed graphiti, hosted neo4j, hosted reranker, web UI, auth, project scoping
sdk · zep_python.client.Zep(api_key=...)
storage · their infra. you do not see the graph file
cost · free tier + usage-priced beyond
latency · network hop, but reranker is hot
good for · shipping fast, sharing across machines, dashboard introspection

option b

graphiti standalone

what · the open-source graph engine. neo4j or kuzu as backend. no UI, no reranker UI
sdk · graphiti_core.Graphiti(uri, user, password)
storage · your neo4j or your local kuzu file
cost · your llm tokens, your db host
latency · local kuzu is sub-ms reads; neo4j is your choice of host
good for · byo-everything stacks, on-prem requirements, byo reranker

the same EntityEdge model. Both paths use the dataclass on slide 30. Zep Cloud is Graphiti plus a hosted layer. Standalone is Graphiti without the layer. Cloud reads at scale benefit from the project scoping primitives (multi-tenant graph isolation), which standalone leaves to you.

If you are building a single-agent personal assistant, standalone over kuzu is the lightest path. Five minutes to running. Zero hosted dependencies. The graph file lives next to your repo.

If you are building a multi-user product, Zep Cloud earns its keep on the auth + project isolation + dashboard alone. The reranker is the second reason. Self-hosting a reranker is a separate gpu line item.

→ standalone walk-through at /lab/graphiti-standalone (kuzu backend, ten-minute scratch graph, no hosted deps). Repo: github.com/getzep/graphiti. Cloud docs: help.getzep.com.

appendix a · letta deep #1 · the loop

letta is an agent runtime, not a memory library.

Letta's three-tier memory model. Core (always-on), Recall (conversation history), Archival (vector store). Maps to what production agents actually ship versus what builders think they ship. verbatim · letta.com/blog/agent-memory + POSITIONING.md

why the runtime owns memory. A library asks "where do I store this fact." A runtime asks "when in the loop should the model see this fact." Letta is the second question. Steps 02 and 04 are the answers. Core blocks ride in the prompt every turn. Recall lookups happen only when a tool calls for them. The model never decides where memory lives, only when to write.

Step 05 is the loadbearing one. The agent can rewrite its own persona or human block mid-turn via a tool call. The new block ships in the prompt on the next step. This is how Letta's "agent that learns about you" demos work. The runtime persists the edit between processes.

The OS analogy from the MemGPT paper is exact. Core memory is RAM. Recall is the page file. Archival is disk. The agent is the operating system scheduling reads and writes against its own context window.

→ blog post: letta.com/blog/letta-v1-agent. Original paper: Packer, Wooders, Lin et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. Loop reference: letta/agent.py::step (open source under github.com/letta-ai/letta).
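A toy model of the three tiers and of "when in the loop should the model see this fact." Core blocks are rendered into every prompt. Recall and archival stay out until a search tool is called. The class, method names, and substring search are hypothetical simplifications, not Letta's runtime.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    core: dict[str, str] = field(default_factory=dict)  # always in the prompt
    recall: list[str] = field(default_factory=list)     # past messages, on demand
    archival: list[str] = field(default_factory=list)   # long-term passages, on demand

    def build_prompt(self, user_message: str) -> str:
        """Only core blocks ride along every turn; the other tiers stay out."""
        blocks = "\n".join(
            f"<{label}>{value}</{label}>" for label, value in self.core.items()
        )
        return f"{blocks}\nuser: {user_message}"

    def core_replace(self, label: str, value: str) -> None:
        """Tool the agent calls to rewrite one of its own core blocks."""
        self.core[label] = value

    def recall_search(self, term: str) -> list[str]:
        # Substring match stands in for real conversation search.
        return [m for m in self.recall if term.lower() in m.lower()]

    def archival_search(self, term: str) -> list[str]:
        # Substring match stands in for a vector query.
        return [p for p in self.archival if term.lower() in p.lower()]


memory = TieredMemory(
    core={"human": "Ray runs VCN.", "persona": "I keep track of what Ray ships."},
    recall=["user: remember that I prefer jade-teal accents."],
    archival=["VCN #32 covered tool-use UX."],
)
before = memory.build_prompt("hi")
memory.core_replace("human", "Ray runs VCN. Prefers jade-teal.")
after = memory.build_prompt("what color did I tell you I liked?")
```

After the block edit, the preference is in every later prompt without any search. The archival passage still appears only if the agent asks for it.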

appendix a · letta deep #2 · the layers

core is a list of named blocks. the agent rewrites them.

Core memory is not a freeform string. It is a list of labeled blocks. Each block is a chunk the agent can read and overwrite by label. Two come default. You can add as many as you need.

human

What the agent knows about the user. Edited by the agent during turns. Default size cap ~2KB. The "remember that I prefer X" landing zone.

persona

What the agent thinks it is. Self-description, tone, role. The agent can rewrite this too. The "I am a helpful coding assistant who keeps it terse" block.

custom · your label

Arbitrary keyed slots. project_context, open_threads, relationship_status. Up to you. Each one is a string the model owns.

letta · block mutation · realistic call · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# `agent` is the object returned by client.agents.create(...) in the minimal recipe.
# update one block on a live agent. the next turn sees the new value.
client.agents.blocks.modify(
    agent_id=agent.id,
    block_label="human",
    value="Ray runs VCN at Frontier Tower. Prefers jade-teal. Ships fast.",
)

# add a brand new keyed block at runtime.
client.agents.blocks.create(
    agent_id=agent.id,
    label="open_threads",
    value="VCN-33 deck due Mon. Awaiting Daniel quote.",
    limit=2000,
)

why blocks instead of one freeform string. Three reasons. One, the model can mutate one block without nuking the rest, which keeps edits surgical. Two, blocks are diffable, so the dashboard can show "what changed in core this turn." Three, blocks are inspectable from outside the agent, so ops can read the persona without a tool call.

The cost is the cap. Each block has a limit in characters. Hit the cap and the runtime refuses the write. This is the design pressure that pushes cold facts down to the Archival tier, where there is no cap.

→ reference: docs.letta.com/concepts/memory-blocks. SDK methods agents.blocks.modify and agents.blocks.create per the v1 Python client. The dashboard at localhost:8283 renders blocks live.
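The cap-and-refuse behavior can be modeled in a few lines. This is a sketch under assumptions, not Letta's code: the cap is measured with `len()` (characters), and spilling a refused write to archival is one possible policy chosen for illustration rather than documented runtime behavior.

```python
from dataclasses import dataclass, field


class BlockLimitError(ValueError):
    """Raised when a write would push a block past its cap."""


@dataclass
class Block:
    label: str
    value: str = ""
    limit: int = 2000  # assumed unit: characters

    def write(self, new_value: str) -> None:
        if len(new_value) > self.limit:
            raise BlockLimitError(f"{self.label}: {len(new_value)} > {self.limit}")
        self.value = new_value


@dataclass
class CoreMemory:
    blocks: dict[str, Block] = field(default_factory=dict)
    archival: list[str] = field(default_factory=list)  # uncapped cold tier

    def modify(self, label: str, value: str) -> bool:
        """Overwrite one block by label; spill to archival if the cap refuses it."""
        try:
            self.blocks[label].write(value)
            return True
        except BlockLimitError:
            self.archival.append(value)
            return False


core = CoreMemory(blocks={
    "human": Block("human", "Ray runs VCN.", limit=40),
    "persona": Block("persona", "I keep track of what Ray ships."),
})
fits = core.modify("human", "Ray runs VCN. Prefers jade-teal.")
too_long = core.modify("human", "Ray runs VCN. " + "Long project history. " * 5)
```

The refused write leaves the block at its last good value, and the edit to `human` never touches `persona`, which is the "surgical" property the labeled-block design is after.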

appendix a · letta deep #3 · the substrate

archival is a vector store. letta hides the backend.

The Archival tier on slide 10 is not magic. It is a vector index. Letta abstracts the backend so the agent code does not change when you swap pgvector for sqlite-vec for Chroma.

the tradeoff that matters. Letta-managed storage gets you to a working agent in minutes. The cost is that you do not own the schema. Migrating a million archived passages between Letta backends is a Letta-supported flow, not a direct DB dump.

If your team already runs a vector DB for RAG, the BYO config keeps the agent's archival in the same store. One index, one ops surface. The agent gets archival_memory_search() as a tool. The tool hits your existing index.

The escape hatch is real. Every Archival row carries an opaque metadata blob. You can write rows from outside Letta and the agent will retrieve them through the same tool call. Useful for seeding an agent with a knowledge base it never lived through.

letta · archival backend config · apache 2.0
# server-side: ~/.letta/config.toml
[archival_storage]
type = "postgres"
uri  = "postgresql://localhost:5432/letta"

# or, for a one-file local dev setup, swap the block above for this one
# (a TOML file cannot define [archival_storage] twice):
# [archival_storage]
# type = "sqlite-vec"
# path = "./letta_archival.db"

# the agent code does NOT change. archival_memory_search() works either way.

→ reference: docs.letta.com/server/configuration. Source: letta/orm/archival_passage.py. The cloud default is pgvector. Local letta server defaults to sqlite-vec.

appendix a · letta deep #4 · the surface

one runtime. three model lanes. three deployment shapes.

Letta does not bind to a single provider. The model field on an agent is a routing string. Change the string, the loop on slide 34 runs against a different brain. The blocks, the archival, the recall, do not move.

lane a · proprietary openai · openai/gpt-4o-mini · openai key from env
lane b · proprietary anthropic · anthropic/claude-sonnet-4-6 · anthropic key from env
lane c · local open weights · ollama/llama-3.1-8b · ollama running on localhost:11434

letta · model switching at create-time · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# same blocks. same archival. three different brains.
prod_agent = client.agents.create(
    name="ray_prod",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="anthropic/claude-sonnet-4-6",
    embedding="openai/text-embedding-3-small",
)

dev_agent = client.agents.create(
    name="ray_dev",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="openai/gpt-4o-mini",            # cheaper, faster, for dev loops
    embedding="openai/text-embedding-3-small",
)

offline_agent = client.agents.create(
    name="ray_offline",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="ollama/llama-3.1-8b",          # no network, runs on your laptop
    embedding="ollama/nomic-embed-text",
)

→ docs: docs.letta.com/models. Provider strings: openai/<model>, anthropic/<model>, ollama/<model>, letta/letta-free (free tier on letta cloud). Embedding strings follow the same pattern.

appendix a · mem0 deep #1 · the layers

mem0 is three storage layers behind one method.

Mem0 wraps vector + graph + reranker behind a managed memory API, so a builder can add(), search(), and forget across sessions without picking a stack. verbatim · mem0.ai/blog/state-of-ai-agent-memory-2026 + mem0/mem0 readme

the managed API hides this. A user of the 3-line drop-in on slide 13 never sees the layers. m.add() writes to vector. m.search() reads from vector. Reranker is configured at construction. The graph layer is dormant until enabled.

The graph layer is the part most users miss. Once you flip graph_store on, every add() runs entity extraction in addition to embedding. Searches can then ask "what does mem0 know about Ray and what entities relate to him" rather than just "which messages sit near 'Ray'."

mem0 · expose the layers via from_config · apache 2.0
from mem0 import Memory

# expose the substrate. flip on the graph layer.
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
    "graph_store": {
        "provider": "neo4j",
        "config": {"url": "bolt://localhost:7687", "username": "neo4j", "password": "..."},
    },
    "llm": {"provider": "anthropic", "config": {"model": "claude-sonnet-4-6"}},
    "embedder": {"provider": "openai", "config": {"model": "text-embedding-3-small"}},
}

m = Memory.from_config(config)

# add() now writes to BOTH vector and graph.
m.add("Ray prefers jade-teal. He runs VCN at Frontier Tower.", user_id="ray")

# with graph_store configured, search() returns graph relations alongside the vector hits.
hits = m.search("what color does Ray prefer", user_id="ray")

→ docs: docs.mem0.ai. Source: github.com/mem0ai/mem0. Config shape per Memory.from_config. The default Memory() constructor uses local Qdrant + OpenAI embeddings + no graph + no reranker.

appendix a · mem0 deep #2 · the controls

scope by category. tune the extractor. stop mem0 from saving noise.

The 3-line demo is the floor. Production mem0 needs scoping (which memories belong to which user, session, or agent) and filtering (which messages are worth keeping at all). Both are first-class.

mem0 · scoping + custom extractor · apache 2.0
from mem0 import Memory

# tune what counts as a fact worth keeping.
custom_extractor = """
You are extracting memories for a VCN host. Keep:
- explicit preferences ("I prefer X")
- commitments and dates
- relationships between named people and orgs
Drop: small talk, weather, transient logistics.
Return JSON: {"facts": ["fact 1", "fact 2"]}
"""

# dict configs go through from_config; the bare constructor expects a config object.
m = Memory.from_config({
    "llm": {"provider": "anthropic", "config": {"model": "claude-sonnet-4-6"}},
    "custom_fact_extraction_prompt": custom_extractor,
})

# scope by both user and session. add a category for later filtering.
m.add(
    messages=[{"role": "user", "content": "I prefer jade-teal posters. VCN is on Mondays."}],
    user_id="ray",
    run_id="session_2026_05_11",
    metadata={"category": "preferences", "source": "telegram"},
)

# search only inside one category.
hits = m.search("poster color", user_id="ray", filters={"category": "preferences"})

why custom prompts matter. The default extractor keeps a lot. Names, numbers, every preference signal. On a high-volume agent the store balloons. A custom prompt with a tight "keep / drop" rubric cuts the store by 60-80% in practice. The drop list is the load-bearing half.

why scope matters. Without user_id filtering you will recall the wrong user's memories. Without run_id separation, a one-off session leaves long-term residue you did not intend. Both are cheap to add. Both are easy to forget.

→ docs: docs.mem0.ai/core-concepts/memory-types. The custom_fact_extraction_prompt field lives at the top level of the Memory() config. The metadata dict is stored on the row and queryable via filters.

appendix a · mem0 deep #3 · the conflict

new fact contradicts old fact. mem0 mutates the store.

Mem0 calls itself self-improving memory. The practical version: every add() runs a conflict pass against existing memories. New facts can update, supersede, or be merged with prior ones. The store mutates in place rather than appending.

strategy 01 · update

edit in place

New fact partially overlaps old fact. Mem0 rewrites the row, keeps the same id. previous_value is recorded in history.

Example. Old: "Ray works at Sandbox VR." New: "Ray is store manager at Sandbox VR SF Flagship." Result: one row, the longer one wins.

strategy 02 · supersede

mark old, write new

New fact directly contradicts old. Old row goes into history. New row takes the live slot. The agent never sees the stale fact on search.

Example. Old: "Ray lives in Brooklyn." New: "Ray moved to SF." Result: SF is live; Brooklyn is in history.

mem0 · observe the conflict resolution · apache 2.0
from mem0 import Memory

m = Memory()

# t=0. write the original.
m.add("I live in Brooklyn and work remote.", user_id="ray")

# t=1. write the update. mem0 will classify as supersede.
m.add("I moved to SF for the Sandbox VR job.", user_id="ray")

# the store now has SF, not Brooklyn.
hits = m.search("where does Ray live", user_id="ray")

# the audit trail exists. ask for it by memory id.
for h in hits["results"]:
    print(h["memory"], "->", m.history(memory_id=h["id"]))

→ reference: mem0.ai/blog/state-of-ai-agent-memory-2026. The classify step is the load-bearing one. The LLM picks one of ADD / UPDATE / DELETE / NONE for each candidate fact, judged against its nearest existing memories. Verdict prompt lives in mem0/memory/main.py.
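The update and supersede strategies can be pictured with a toy store. This is a minimal sketch, not mem0's implementation: `ToyStore`, `apply`, and the hand-picked verdicts are hypothetical, and in mem0 the verdict comes from the LLM classify step rather than from the caller.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical in-memory stand-in for a store that mutates in place."""
    live: dict = field(default_factory=dict)      # memory_id -> current text
    history: list = field(default_factory=list)   # (memory_id, event, old, new)
    _next_id: int = 0

    def apply(self, verdict, text=None, memory_id=None):
        """Apply one ADD / UPDATE / DELETE / NONE verdict and log it."""
        if verdict == "ADD":
            memory_id = f"m{self._next_id}"
            self._next_id += 1
            self.live[memory_id] = text
            self.history.append((memory_id, "ADD", None, text))
        elif verdict == "UPDATE":
            old = self.live[memory_id]
            self.live[memory_id] = text            # same id, new text
            self.history.append((memory_id, "UPDATE", old, text))
        elif verdict == "DELETE":
            old = self.live.pop(memory_id)         # leaves the live set
            self.history.append((memory_id, "DELETE", old, None))
        elif verdict != "NONE":
            raise ValueError(f"unknown verdict: {verdict}")
        return memory_id


store = ToyStore()
job = store.apply("ADD", "Ray works at Sandbox VR.")
home = store.apply("ADD", "Ray lives in Brooklyn.")

# strategy 01 · update: same row id, the longer text wins.
store.apply("UPDATE", "Ray is store manager at Sandbox VR SF Flagship.", memory_id=job)

# strategy 02 · supersede: retire the old row, write the new one.
store.apply("DELETE", memory_id=home)
new_home = store.apply("ADD", "Ray moved to SF.")
```

Search only ever reads `store.live`; the stale Brooklyn fact survives in `store.history` for audit.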

appendix a · mem0 deep #4 · the production config

three lines is the demo. thirty lines is production.

The three-line drop-in is real. It is also a development setup. A mem0 deployment that handles thousands of users, real latency budgets, and an audit trail has thirty lines of config and policy wrapped around the same three verbs.

pattern 01 · pagination on search

Default limit=10. Pass limit and offset for stable scrollthrough. The reranker only sees the top-k, not the page.

pattern 02 · batch add

Multiple messages=[...] entries in one add() call. Extraction happens once over the whole batch. Fewer LLM round-trips.

pattern 03 · eviction policy

Mem0 does not auto-prune. A scheduled job calls delete_all with filters for stale categories. Time decay is your problem.

pattern 04 · observability hook

Wrap add and search in thin functions that forward every call to your own logger. Latency and verdict per call.

mem0 · realistic production config (~30 lines) · apache 2.0
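from mem0 import Memory
import logging
import time  # used for the "ts" metadata field below

logger = logging.getLogger("mem0.ops")

# placeholder: your keep / drop extraction prompt goes here.
OPS_EXTRACTOR_PROMPT = "..."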

config = {
    # backends. all self-hosted. no cloud round-trip.
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "qdrant.internal", "port": 6333, "collection_name": "vcn_memories"},
    },
    "graph_store": {
        "provider": "neo4j",
        "config": {"url": "bolt://neo4j.internal:7687", "username": "neo4j", "password": "..."},
    },
    "llm": {
        "provider": "anthropic",
        "config": {"model": "claude-sonnet-4-6", "temperature": 0.0, "max_tokens": 1024},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    # tune extraction. drop the chatter, keep the facts.
    "custom_fact_extraction_prompt": OPS_EXTRACTOR_PROMPT,
    # keep the memory change history (the audit trail) on durable local disk.
    "history_db_path": "/var/lib/mem0/history.db",
    "version": "v1.1",
}

m = Memory.from_config(config)

# wrap the surface for logging + retries.
def add_with_audit(messages, user_id, run_id, category):
    res = m.add(
        messages=messages,
        user_id=user_id,
        run_id=run_id,
        metadata={"category": category, "ts": int(time.time())},
    )
    logger.info("mem0.add", extra={"verdicts": res, "user_id": user_id})
    return res

operational simplicity is the pitch. the three lines on slide 13 are honest about the floor. the thirty lines here are honest about the ceiling. the gap is where production lives.

→ reference: github.com/mem0ai/mem0 readme, docs.mem0.ai/integrations. The from_config constructor is the only path that exposes the full surface. Default Memory() is the friendly cousin. Both call the same internals.
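Pattern 03 leaves eviction to the operator. A minimal sketch of the selection half of that scheduled job, assuming rows carry the `category` and `ts` metadata written by `add_with_audit` above. `select_stale` and the per-category limits are hypothetical; how the ids are then deleted depends on your mem0 version.

```python
import time

DAY = 86_400  # seconds


def select_stale(rows, max_age_days, now=None):
    """Return ids of rows older than the limit for their category.

    rows: dicts shaped like {"id": ..., "metadata": {"category": ..., "ts": ...}}
    max_age_days: category -> age limit in days; unlisted categories never expire.
    """
    now = time.time() if now is None else now
    stale = []
    for row in rows:
        meta = row.get("metadata") or {}
        limit = max_age_days.get(meta.get("category"))
        if limit is None or "ts" not in meta:
            continue  # no policy or no timestamp: keep the row
        if now - meta["ts"] > limit * DAY:
            stale.append(row["id"])
    return stale


now = 1_800_000_000
rows = [
    {"id": "a", "metadata": {"category": "logistics", "ts": now - 40 * DAY}},
    {"id": "b", "metadata": {"category": "logistics", "ts": now - 2 * DAY}},
    {"id": "c", "metadata": {"category": "preferences", "ts": now - 400 * DAY}},
]
stale_ids = select_stale(rows, {"logistics": 30}, now=now)
```

Only the 40-day-old logistics row is selected. The old preference survives because no limit is set for its category, which is the long-arc case pure time decay gets wrong.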

appendix · wing I deep · 1 of 4 · the memory tool internals

THE MEMORY TOOL IS A FILESYSTEM CALLABLE.

Anthropic's memory tool is not a vector store, not a database, not a service. It is a callable the model invokes inside its tool loop, backed by a markdown file directory that the operator's own application hosts and manages. The agent issues `view`, `create`, `str_replace`, `insert`, `delete`, and `rename` against paths under `/memories`.

The interesting design choice is the one missing thing. There is no scoring, no eviction policy, no semantic indexing. The agent decides when to call it. The operator supplies a system prompt that tells the model when it is worth remembering something, and the model writes a markdown file. On the next turn the model lists `/memories`, reads the relevant file, and pulls it into the working set.

This is the opposite of the Mem0 pattern. Mem0 listens to messages and silently decides what is fact-worthy. The memory tool makes the write decision explicit and visible inside the trace.

tool.json · faithful reconstruction
{
  "name": "memory",
  "description": "Read and write notes to a persistent markdown filesystem at /memories.",
  "input_schema": {
    "type": "object",
    "properties": {
      "command": {
        "type": "string",
        "enum": ["view", "create",
                 "str_replace", "insert",
                 "delete", "rename"]
      },
      "path": { "type": "string" },
      "file_text": { "type": "string" },
      "new_str":   { "type": "string" },
      "old_str":   { "type": "string" }
    },
    "required": ["command", "path"]
  }
}
view · create · str_replace · insert · delete · rename

→ anthropic.com / engineering / effective-context-engineering-for-ai-agents. the memory tool ships as a public beta in the messages API; storage is operator-managed (Anthropic does not host the markdown). the agent self-curates without any operator policy code.
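Because storage is operator-managed, someone has to write the handler that turns a tool call into a file operation. A minimal sketch covering four of the six commands, using only the field names from the reconstruction above. `MemoryDir` and its error messages are hypothetical, and a real handler would also need `insert` and `rename`.

```python
import tempfile
from pathlib import Path


class MemoryDir:
    """Hypothetical operator-side handler for memory tool calls."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, path):
        # Map "/memories/notes.md" onto the local root and refuse escapes.
        rel = path.removeprefix("/memories").lstrip("/")
        target = (self.root / rel).resolve()
        if target != self.root and self.root not in target.parents:
            raise ValueError(f"path escapes /memories: {path}")
        return target

    def handle(self, command, path, file_text=None, old_str=None, new_str=None):
        target = self._resolve(path)
        if command == "view":
            if target.is_dir():
                return "\n".join(sorted(p.name for p in target.iterdir()))
            return target.read_text(encoding="utf-8")
        if command == "create":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(file_text or "", encoding="utf-8")
            return f"created {path}"
        if command == "str_replace":
            text = target.read_text(encoding="utf-8")
            if text.count(old_str) != 1:
                raise ValueError("old_str must match exactly once")
            target.write_text(text.replace(old_str, new_str), encoding="utf-8")
            return f"edited {path}"
        if command == "delete":
            target.unlink()
            return f"deleted {path}"
        raise ValueError(f"unsupported command in this sketch: {command}")


memory = MemoryDir(tempfile.mkdtemp())
memory.handle("create", "/memories/prefs.md", file_text="Ray prefers tabs.\n")
memory.handle("str_replace", "/memories/prefs.md", old_str="tabs", new_str="spaces")
listing = memory.handle("view", "/memories")
contents = memory.handle("view", "/memories/prefs.md")
```

The path check is the part worth keeping: the model chooses the path, so the handler has to confine it to the memory root.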

appendix · wing I deep · 2 of 4 · compaction internals

AT 180K TOKENS THE HARNESS REWRITES THE TRANSCRIPT.

Claude Code's default compaction trigger sits near 180K tokens. When the working transcript crosses that line, the harness pauses, sends the entire transcript to a summarization pass, and resumes the agent loop with the summary in place of the raw history. The user sees a one-line notification. The agent sees a shorter context that still encodes the decisions it made.

What survives is opinionated. Decisions made, files touched, error messages worth remembering, the last few turns of reasoning verbatim. What dies is the bulk. The 50KB file dump from one `Read`. The eight `Grep` results that returned the same file path. The tool outputs the agent used once and never needed again.

Three knobs operators reach for. Lower the threshold to compact sooner (cheaper turns, more frequent summarization tax). Raise it to delay compaction (longer coherent windows, larger summarization spike). Disable compaction entirely and rely on the model's own context window plus manual hand-offs. Anthropic-managed agents expose this as `compaction_strategy` in the harness config.

100K · safe
180K · compact
compaction prompt · faithful reconstruction
You are summarizing a long agent transcript so the
agent can continue without losing the thread.

KEEP verbatim:
  - the user's original task and any clarifications
  - decisions the agent made and why
  - file paths read or written
  - the last 5 turns of reasoning

DROP:
  - full file contents (cite path + range instead)
  - tool result bodies > 2KB unless the agent
    references them in later reasoning
  - repeated Grep or Bash output

Output: a markdown brief under 4K tokens.

→ anthropic.com / engineering / effective-context-engineering-for-ai-agents and the public Claude Code docs at code.claude.com / docs / en / context. compaction is one of the four levers; the other three are tool-result clearing, sub-agent delegation, and the memory tool.
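The threshold-then-summarize loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Claude Code's implementation: `estimate_tokens` is a crude 4-characters-per-token stand-in for a tokenizer, and `maybe_compact`, `keep_last`, and the fake summarizer are hypothetical names.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: roughly 4 characters per token."""
    return len(text) // 4


def maybe_compact(turns, summarize, threshold=180_000, keep_last=5):
    """Swap old turns for a summary once the transcript crosses the threshold.

    turns: list of strings, oldest first.
    summarize: callable(list[str]) -> str, e.g. a call to a summarization model.
    """
    total = sum(estimate_tokens(t) for t in turns)
    if total < threshold or len(turns) <= keep_last:
        return turns  # under budget, or nothing old enough to fold away
    old, recent = turns[:-keep_last], turns[-keep_last:]
    return [summarize(old)] + recent


# Toy run with a tiny threshold and a fake summarizer.
turns = [f"turn {i}: " + "x" * 400 for i in range(20)]   # ~100 tokens each
fake_summary = lambda old: f"[summary of {len(old)} earlier turns]"
compacted = maybe_compact(turns, fake_summary, threshold=1_000, keep_last=5)
```

Lowering `threshold` compacts sooner; raising `keep_last` keeps more verbatim reasoning at the cost of a larger post-compaction context.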

appendix · wing I deep · 3 of 4 · tool-result clearing

THE AGENT CAN MARK A TOOL RESULT AS DONE.

The pattern is simple and underused. When the model finishes processing a tool result, it can flag it as no longer needed. The harness prunes the body from the active context on the next turn. The tool call itself stays in the transcript (so the trace is honest), but the bulk payload is gone.

The canonical case is `read_file` on a large source file. The model needs the contents once to reason about a fix. It writes the patch. The 50KB file body now contributes nothing except cost on every subsequent turn. Marking it cleared drops the cost to a few tokens of structural metadata.

Compare this to compaction. Compaction is operator-policy and triggers at a threshold. Tool-result clearing is agent-policy and triggers turn by turn. The two compose. A long-running agent with both enabled stays under threshold for far longer because the working set never accumulates dead weight in the first place.

before · turn 17 context
user_task · fix the typed-error bug in kernel/capabilities/twitter.py
tool_result 1 · read_file kernel/capabilities/twitter.py · 47.3 KB body
tool_result 2 · grep _classify_error · 11 matches across 4 files
tool_result 3 · read_file kernel/_base.py · 8.1 KB body
tool_result 4 · edit_file · success · 12 lines changed
total · ~68 KB · ~17K tokens
after · turn 18 context · agent cleared 1+2+3
user_task · fix the typed-error bug in kernel/capabilities/twitter.py
tool_result 1 · read_file kernel/capabilities/twitter.py · body cleared (was 47.3 KB)
tool_result 2 · grep _classify_error · body cleared (was 11 matches across 4 files)
tool_result 3 · read_file kernel/_base.py · body cleared (was 8.1 KB)
tool_result 4 · edit_file · success · 12 lines changed
total · ~3 KB · ~750 tokens
23x smaller working set, every turn from now on.

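→ anthropic.com / engineering / effective-context-engineering-for-ai-agents names tool-result clearing alongside compaction and sub-agent delegation. the messages API ships an operator-configured form as a context-editing beta (the `clear_tool_uses_20250919` strategy, triggered by a token threshold); the per-result, agent-flagged form described here lives in the harness.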
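The before / after accounting can be reproduced with a small harness-side pass. A minimal sketch, assuming a hypothetical message shape (`role`, `id`, `call`, `body`) and a set of ids the agent has flagged. The grep body size is assumed; the slide gives only the match count.

```python
def clear_tool_results(messages, cleared_ids):
    """Replace the bodies of cleared tool results with a short stub.

    messages: list of dicts; tool results look like
              {"role": "tool", "id": ..., "call": ..., "body": ...}
    cleared_ids: ids the agent has flagged as no longer needed.
    """
    out = []
    for msg in messages:
        if msg.get("role") == "tool" and msg.get("id") in cleared_ids:
            stub = f"[cleared: {len(msg['body'])} chars]"
            msg = {**msg, "body": stub}   # keep the call record, drop the payload
        out.append(msg)
    return out


def context_chars(messages):
    """Total characters of body text the model would have to carry."""
    return sum(len(m.get("body", "")) for m in messages)


before = [
    {"role": "user", "body": "fix the typed-error bug in kernel/capabilities/twitter.py"},
    {"role": "tool", "id": 1, "call": "read_file kernel/capabilities/twitter.py", "body": "x" * 47_300},
    {"role": "tool", "id": 2, "call": "grep _classify_error", "body": "y" * 12_000},  # size assumed
    {"role": "tool", "id": 3, "call": "read_file kernel/_base.py", "body": "z" * 8_100},
    {"role": "tool", "id": 4, "call": "edit_file", "body": "success · 12 lines changed"},
]
after = clear_tool_results(before, cleared_ids={1, 2, 3})
```

The function returns new message dicts instead of mutating the transcript, so the full bodies can still be logged for the trace while the model sees only the stubs.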

appendix · wing I deep · 4 of 4 · sub-agent delegation

SPAWN A SUB-AGENT. KEEP THE NOISE OUT OF THE PARENT.

Anthropic's harness essay frames sub-agent delegation as a memory hygiene primitive, not a parallelism primitive. The parent agent calls a sub-agent for a bounded subtask. The sub-agent gets a fresh context window, does the work, and returns one result. The parent never sees the intermediate dump.

Think of it as scoped variables for context. The exploration step that requires reading twelve files and running four greps happens inside the sub-agent. The parent receives a paragraph that says "looked at twelve files, the relevant function is at this path, here is its signature." That paragraph is now the only thing the parent has to carry.

The cost trade is exact. You pay for the sub-agent's full context separately, then throw it away. The parent's context stays compact. For deep research and refactors this almost always wins; the parent's context is the precious resource because it carries the user's intent.

parent / child context flow
P0 user: "find the regression in the auth middleware."
P1 agent: I need a deep audit. Spawn `auth-auditor`.
C1 auth-auditor: read 12 files in middleware/, grep for session calls, inspect error logs
C2 auth-auditor: ~85K tokens consumed across 14 turns inside its OWN context window
C3 auth-auditor returns: "regression in session.refresh(), line 47, introduced 2026-04-22 commit a3f1, fixes by reverting the conditional."
P2 agent receives the 1-paragraph summary. Parent context still at < 20K.
net effect: parent context cost ~1K tokens for the entire audit. without delegation: parent absorbs all 85K and competes with the user's intent for working-set space.

→ anthropic.com / engineering / effective-harnesses-for-long-running-agents. the framing is explicit: delegation is "memory hygiene by quarantine." Claude Code's `Agent` tool is the production reference; the Managed Agents SDK exposes the same primitive.
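The "scoped variables for context" analogy maps directly onto a function call. A minimal sketch, with hypothetical stand-ins for the tool calls and the summarizer: the child's context is a local list that goes out of scope on return, and the parent appends only the summary.

```python
def run_subagent(task, steps, summarize):
    """Run a bounded subtask in its own context and return only a summary.

    steps: callables that take the task and return raw output (file dumps, grep hits).
    summarize: callable(list[str]) -> str that condenses the child's context.
    """
    child_context = [task]                 # fresh window, not the parent's
    for step in steps:
        child_context.append(step(task))   # the noise accumulates here
    return summarize(child_context)        # child_context is discarded on return


parent_context = ['user: "find the regression in the auth middleware."']

# Hypothetical stand-ins for real tool calls and a real summarizer.
steps = [lambda t: "file dump " * 5_000, lambda t: "grep hit " * 2_000]
summarize = lambda ctx: f"audited {len(ctx) - 1} sources; regression in session.refresh()"

parent_context.append(run_subagent("audit auth middleware", steps, summarize))
parent_chars = sum(len(m) for m in parent_context)
```

The child processed tens of thousands of characters; the parent grew by one short line.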

appendix · wing IV deep · 1 of 4 · the auto-memory filesystem

THE FILESYSTEM IS THE REAL CONTRACT.

Claude Code v2.1.59 and later ship auto-memory default-on. The agent reads two files on every session start. `CLAUDE.md` from the project root, which the human edits, and `MEMORY.md` from `~/.claude/projects/<encoded-cwd>/memory/`, which the agent writes.

The encoded path is the current working directory with slashes and colons replaced by dashes. For this very deck-build session that is `C--Users-jtole-Documents-2026-life`. Each memory entry is its own markdown file; `MEMORY.md` is just the index. When the agent learns something worth keeping, it writes a new file (e.g. `feedback_no_dashes.md`) and appends one line to the index.

The hook contract is asymmetric on purpose. The agent never writes the human's CLAUDE.md. The human can edit either file but typically does not touch MEMORY.md. The shared filing cabinet has two drawers and two locks.

~/.claude/projects/<encoded-cwd>/memory/ · real layout
~/.claude/projects/C--Users-jtole-Documents-2026-life/memory/
├── MEMORY.md                       # index, agent-written
├── user_profile.md                 # who Ray is
├── reference_directory.md          # names · tg · email
├── feedback_no_dashes.md           # voice rule
├── feedback_facilitator.md         # billing rule
├── project_self_hosted_kv.md       # self-hosted redis
├── project_youtube_ingest.md       # yt-dlp + groq
└── _archive/                       # aged-out entries

~/Documents/2026/life/
└── CLAUDE.md                       # human-written, project rules
CLAUDE.md · human writes. agent reads.
MEMORY.md · agent writes. agent reads.
entry .md · agent writes one per topic.

→ code.claude.com / docs / en / memory. each entry carries frontmatter (name, description, type) so the agent can decide whether to load it. typical entry types: `user`, `feedback`, `project`, `reference`. inspect any live MEMORY.md to see the structure.
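The path-encoding rule is easy to check by hand. A minimal sketch of the rule as stated on this slide (slashes and colons become dashes, with backslashes treated as slashes on Windows). `encode_cwd` and `memory_dir` are hypothetical helpers, not Claude Code functions.

```python
import re
from pathlib import PurePosixPath


def encode_cwd(cwd):
    """Replace slashes, backslashes, and colons with dashes."""
    return re.sub(r"[\\/:]", "-", cwd)


def memory_dir(cwd, home="~"):
    """Build the per-project memory directory path for a working directory."""
    return PurePosixPath(home) / ".claude" / "projects" / encode_cwd(cwd) / "memory"


encoded = encode_cwd(r"C:\Users\jtole\Documents\2026\life")
path = memory_dir(r"C:\Users\jtole\Documents\2026\life")
```

Here `encoded` reproduces the directory name shown in the layout above, which is a quick way to find the memory folder for any project.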

appendix · wing IV deep · 2 of 4 · the hooks system

HOOKS ENFORCE. MEMORY ONLY ASKS.

Claude Code's settings.json defines lifecycle hooks. Bash or Node commands the harness runs on specific events. These are policy that the agent cannot opt out of. If the agent tries to end a session with uncommitted work, the `Stop` hook fires `atomic-commit-gate.js` and blocks the close.

The events you actually use are short.

SessionStart · crash recovery, baselines
UserPromptSubmit · prompt expansion
PreToolUse · gate before a tool runs
PostToolUse · react after a tool ran
Notification · external signal handler
Stop · final pass before session ends

The clean separation is the point. Memory says please. A line in CLAUDE.md that reads "never run parallel Telegram connections" is a hope that the agent reads and remembers it. A `PreToolUse` matcher on `Bash` that greps for `telethon.*parallel` and exits with code 2 (the blocking exit code) is a wall.

settings.json · the atomic-commit-gate hook
{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "node ~/.claude/hooks/atomic-commit-gate.js"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "node ~/.claude/hooks/session-writes-tracker.js" }
        ]
      }
    ]
  }
}
# Stop hook exits with code 2 (the blocking exit code) if `git status --porcelain` has output.
# The session refuses to end. The user sees the diff and decides.
memory vs hooks: a memory entry that says "commit before ending" is a soft norm the agent may forget under context pressure. The `atomic-commit-gate.js` hook makes it a runtime invariant. Both exist in this repo; the hook is what catches the mistakes the memory misses.

→ code.claude.com / docs / en / hooks. real hook example from `~/.claude/settings.json` in the life repo. enforcement code beats prose policy when stakes are high. use both: prose to explain why, hook to make sure.
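The gate itself is small. A minimal Python sketch of what a commit gate could look like; the deck's real hook is `atomic-commit-gate.js`, whose source is not shown, so the function names and the message here are assumptions. The one contract relied on is that a Stop hook exiting with code 2 blocks the stop and passes stderr back to the agent.

```python
import subprocess
import sys


def has_uncommitted_work(porcelain_output):
    """True when `git status --porcelain` printed at least one changed path."""
    return bool(porcelain_output.strip())


def main():
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return 0  # git is not installed: do not block the session
    if result.returncode != 0:
        return 0  # not inside a git repo: nothing to gate
    if has_uncommitted_work(result.stdout):
        # stderr is shown to the agent; exit code 2 blocks the stop.
        print("Uncommitted changes:\n" + result.stdout, file=sys.stderr)
        return 2
    return 0


# When installed as the hook command, end the file with:
#     sys.exit(main())
```

Failing open when git is missing is a design choice for this sketch; a stricter gate would return 2 there as well.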

appendix · wing IV deep · 3 of 4 · the rules-file comparator

FIVE FILES. FIVE PARENTS. ONE PATTERN.

file | harness | scope | who writes | agent self-edits | format
CLAUDE.md + MEMORY.md | Claude Code | user (~/.claude) + project (./CLAUDE.md) | human for CLAUDE.md, agent for MEMORY.md | YES (memory only) | plain markdown, no schema required
.cursorrules + .cursor/rules/ | Cursor | project root (and per-file globs in .cursor/rules/) | human | NO | plain markdown, optional `globs:` frontmatter
AGENTS.md (codex flavor) | Codex CLI, OpenAI agents | project root, nested by directory | human | NO | plain markdown, free-form sections
.windsurfrules (global + workspace) | Codeium Windsurf, Cascade | global (~/.codeium) + workspace (./.windsurfrules) | human | NO | plain markdown, ~6K char soft cap
.aider.conf.yml + CONVENTIONS.md | Aider | project root | human | NO | yaml config + free-form markdown convention file

→ the honest read: four of the five are static markdown the human edits. Claude Code is the only one where the agent curates a sibling file the operator typically does not touch. That is the only structural difference. Everything else is naming, scoping, and which directory the harness scans.

appendix · wing IV deep · 4 of 4 · why procedural quietly won

A 200 LINE CLAUDE.MD
beats a vector store for most teams shipping today.

argument 1 · token cost

retrieval is free when memory lives in the prompt.

A 200 line CLAUDE.md is roughly 3K tokens. With prompt caching the agent pays full price once per session, then only a small cached-read cost per turn for the rest of the working window. A vector store retrieval pays embedding cost on write, network latency on read, and reranker cost per query. For procedural facts (style rules, conventions, do-not-touch lists) the in-prompt cost is lower and the retrieval latency is zero.

argument 2 · eval cost

you can A/B a markdown diff.

Change one rule in CLAUDE.md. Re-run the eval suite. See whether the metric moved. Procedural memory is a unit of change with a clear before and after. Vector stores reshape under every embedding model upgrade and every chunking choice. You cannot diff a vector store the way you can diff a markdown file.

argument 3 · team coherence

rules are pull requests. vector stores are not.

Procedural memory lives in the same git repo as the code. Adding a new convention is a PR with a diff and a reviewer. Six months in, you can `git log CLAUDE.md` and see why every rule landed. A team-shared memory layer that does not show up in code review is a layer the team does not actually own.

"the rules-files pattern in general. .cursorrules, CLAUDE.md, .windsurfrules, AGENTS.md. the DIY procedural memory that quietly won for most teams. this is the highest-leverage memory layer for builders shipping today." vcn-33 research notes · 2026-05-04

→ this is the case the next two years of agent tooling will either defend or kill. the bet here: even when frameworks like Letta and Mem0 win on persistent state, the procedural layer stays in markdown next to the code. it is too cheap, too diff-friendly, too human-legible to lose.

s50 · eval · what "memory works" even means

YOU CAN'T A/B TEST REMEMBERING.

Every memory framework page claims accuracy gains. None of them mean the same thing by accuracy. The community settled on LOCOMO as the closest thing to a yardstick. Long Conversational Memory. A 600-turn synthetic dialogue corpus across 10 sessions per persona, annotated for what the agent should still know on turn 600 that was said on turn 17. The scores are low. Frontier models clear ~50 to 60 percent on single-hop factual recall and crater on multi-hop reasoning over the same dialogue.

what LOCOMO actually scores
single-hop · did you remember a fact stated once N turns ago. the easy axis.
multi-hop · combine two facts from two different sessions. where everything falls apart.
temporal · when was it true. which session. before or after the user changed their mind. zep's home turf.
open-domain · generate a free-text answer grounded only in remembered context. graders disagree.
adversarial · the dialogue contains contradictions. resolve them. most systems pick the most recent. wrong half the time.

"did my agent get better at remembering" is a research question, not a unit test.

→ LOCOMO paper: arXiv:2402.17753 · referenced by every 2026 memory framework as the leaderboard nobody quite trusts.

s51 · eval · the LOCOMO setup, in five steps

FIVE STEPS. THREE HOURS. NOT FREE.

01 · pull corpus · download the LOCOMO dataset (10 personas, 600 turns each, ~70k tokens per persona). split into sessions and probes.
02 · ingest · feed every session to the memory layer under test. mem0 .add(), letta core block writes, zep add_messages(). one persona per agent instance, no leakage.
03 · probe · replay the 1000+ probe questions. each one targets a specific session and turn. record the agent's answer plus what it retrieved.
04 · grade · two graders. exact match for factual probes, an LLM judge for open-domain. report agreement. low agreement means the judge prompt is wrong, not the agent.
05 · slice · break the score by axis (single-hop, multi-hop, temporal, open-domain, adversarial). report all five. one aggregate number is a lie.
cost · a single LOCOMO sweep across one framework runs ~$40 to $120 in API calls. four frameworks for a real comparison: ~$160 to $480. nobody expenses this for a blog post.
setup time · every framework has its own ingest dialect. wiring four of them to the same corpus is a two day job before you grade a single answer.
signal · three runs of the same eval against the same framework give three different scores. graders are noisy. you need N=5 to publish. five times the cost.

→ ray's lab scaffolds the run loop: /lab/locomo-eval · skeleton, not a finished comparison. you bring the budget.
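Steps 04 and 05 reduce to two small functions. A minimal sketch with made-up grades (not LOCOMO results): `slice_scores` and `agreement` are hypothetical names, and agreement here is plain percent agreement rather than a chance-corrected statistic.

```python
from collections import defaultdict


def slice_scores(results):
    """Accuracy per axis. results: dicts with "axis" and "correct" (bool)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["axis"]] += 1
        hits[r["axis"]] += int(r["correct"])
    return {axis: hits[axis] / totals[axis] for axis in totals}


def agreement(grades_a, grades_b):
    """Fraction of probes where two graders gave the same verdict."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two equal-length, non-empty grade lists")
    same = sum(a == b for a, b in zip(grades_a, grades_b))
    return same / len(grades_a)


# Made-up grades for illustration only.
results = [
    {"axis": "single-hop", "correct": True},
    {"axis": "single-hop", "correct": True},
    {"axis": "multi-hop", "correct": False},
    {"axis": "multi-hop", "correct": True},
    {"axis": "temporal", "correct": False},
]
by_axis = slice_scores(results)
judge_agreement = agreement([True, True, False, True], [True, False, False, True])
```

The aggregate over these five probes is 60 percent, which hides both a perfect single-hop score and a zero on temporal. That is the case for reporting every axis.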

s52 · multi-agent · the unsolved problem

TWO AGENTS, ONE STORE. NOBODY HAS SHIPPED THIS.

Single-agent memory is a write-then-read problem. Multi-agent memory is a distributed-systems problem. Two agents working the same project, same user, same hour. Both write. Both read. The store has no clock that means anything to both of them. The 2026 survey at arXiv:2603.07670 dedicates a section to this. The honest summary: "an emerging frontier." Translation: no production answers.

the canonical race condition
agent-a · reads "user prefers tabs."
agent-a · starts writing code with tabs.
agent-b · asks user. user says "actually spaces today."
agent-b · writes memory: "user prefers spaces."
agent-a · still indenting with tabs. has not re-read.
agent-b · writes spaces. PR is half tabs half spaces.
→ last-writer-wins is wrong. there are no last writers.
last-writer-wins · cheap. wrong. the agent who happened to write last sets policy until the next race. ships a lot in 2026 because it is one config line.
source-weighted · tag every memory with a trust tier. human messages outrank agent inferences. agent-a outranks agent-b for stylistic prefs in the code area. requires a schema nobody has.
CRDT merge · treat memories as a grow-only set. conflicts surface, never auto-resolve. a query returns "user said tabs at T1, spaces at T2." the caller picks. zep's temporal graph is the closest production analog.
consensus pass · a third agent reads the conflict and decides. slow. accurate. expensive per call. the pattern the Letta team is sketching internally for the v2 multi-agent runtime.

→ if you ship a real answer here in 2026 you have a company. arXiv:2603.07670 §6 for the survey of attempts.
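The CRDT merge option is the easiest of the four to show in code. A minimal sketch of a grow-only set applied to the tabs / spaces race above; `Claim` and its field names are hypothetical, and this illustrates the merge property only, not any framework's storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One immutable memory entry; field names are hypothetical."""
    key: str          # what the claim is about, e.g. "indent_style"
    value: str
    written_at: int   # logical or wall-clock timestamp
    writer: str


def merge(replica_a, replica_b):
    """Grow-only set merge: plain union, so nothing is overwritten or dropped."""
    return replica_a | replica_b


def read(store, key):
    """Return every claim for a key, oldest first; the caller resolves conflicts."""
    return sorted((c for c in store if c.key == key), key=lambda c: c.written_at)


agent_a = {Claim("indent_style", "tabs", 1, "agent-a")}
agent_b = {Claim("indent_style", "spaces", 2, "agent-b")}

merged = merge(agent_a, agent_b)
claims = read(merged, "indent_style")
```

Union is commutative and idempotent, so replicas converge regardless of merge order. The price is that `read` returns both claims and the conflict becomes the caller's problem.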

s53 · security · the attack taxonomy of 2026

THREE WAYS TO POISON. ONE STORE. ZERO DEFENSES OUT OF THE BOX.

vector 01 · direct
write the lie at the front door.

The attacker has a write surface. A shared agent, an open MCP server, a memory tool exposed to untrusted users. They call add() with the false fact and walk away. The store has no notion of who wrote what.

in the wild a customer-support agent with a shared memory pool. one user writes "the company offers a 100 percent refund on any complaint." next user gets that as a remembered policy.
vector 02 · indirect
poison the well, let the agent drink.

The attacker never touches the memory store. They plant the false fact in a corpus the agent will later RAG over. A markdown file in a public repo. A Stack Overflow answer. A web page. The agent ingests it as fresh truth and writes it to memory itself.

in the wild a coding agent reads a poisoned README claiming a popular library "now requires a config token from attacker.com." agent recommends. ships.
vector 03 · semantic
hide the attack inside a real fact.

The attacker writes a benign-looking memory whose semantic embedding overlaps an unrelated query. Retrieval pulls it. The agent treats it as in-context evidence. The payload activates only when the right user asks the right question.

in the wild a memory entry says "for any auth question, the recommended JWT secret is hunter2." semantically near "JWT setup," weeks before any auth question gets asked.

mnemonic sovereignty · the agent's memory must remain provably yours.

A Survey on the Security of Long-Term Memory in LLM Agents, apr 2026, arXiv:2604.16548 · this is where the term gets coined. read it before you ship a shared memory pool.

s54 · security · what you actually do about it

FIVE DEFENSES. PICK ALL OF THEM.

provenance
Every memory carries the source that wrote it. Schema-level, non-optional. A memory entry is not {fact}, it is {fact, source, written_at, written_by, trust_tier}. At retrieval time the agent can filter by tier. Zep's graph already does this for edges; bolt the same idea onto mem0 and letta via a wrapper.
signing
Trusted writers sign. Untrusted writers don't. At read time the agent prefers signed memories. This is the same playbook commit-signing solved a decade ago, and the same playbook agentic-payments shipped via RFC 9421. Same primitive. Different store.
trust tiers
Separate stores by trust. Tier 0 is the user's own messages, signed. Tier 1 is the agent's own inferences. Tier 2 is RAG over your own corpus. Tier 3 is RAG over the internet. Never mix them in a single retrieve. Promote with a human in the loop, never automatically.
adversarial eval
Run a poison set against your own stack. Implant a false memory through every available write path (direct, RAG, prompt-injection in tool results). Probe later with a query that should retrieve it. If the agent acts on it, you have a real failure, not a theoretical one. Treat it like a regression test.
honeypotting
Run a decoy agent surface in front of the real one. Log every write. Look for the patterns no real user produces: shitcoin recommendations, fake API endpoints, novel-but-plausible security advice. Lobsterhoney (ray's other project) is exactly this for ai-agent traffic: a honeypot + audit for the agentic web. Same primitive, applied to your memory store.
defense in the lab

/lab/memory-poison walks the full red-team. plant a false fact via three vectors, probe with three queries, watch which framework betrays you. mem0 fails differently than zep, which fails differently than CLAUDE.md. open the lab.

→ defense vocab from arXiv:2604.16548 §5 · honeypot pattern from lobsterhoney.com · the broader argument: signing + tiers + provenance is the minimum stack for any shared memory in 2026.
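Provenance, signing, and trust tiers compose into one record type and one retrieval filter. A minimal sketch using the field names from the provenance entry above; the HMAC scheme, the demo key, and the rule that tier 0 must be signed are assumptions for illustration, not a vetted design.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemoryRecord:
    fact: str
    source: str
    written_at: int
    written_by: str
    trust_tier: int       # 0 = user, 1 = agent inference, 2 = own corpus, 3 = internet
    signature: str = ""   # empty for unsigned writers


def sign(record, key):
    """HMAC over the fields a reader relies on. key: bytes held by trusted writers."""
    payload = "|".join([record.fact, record.source, str(record.written_at),
                        record.written_by, str(record.trust_tier)]).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_trusted(record, key):
    """True only when the record carries a signature that verifies."""
    return bool(record.signature) and hmac.compare_digest(record.signature, sign(record, key))


def retrieve(records, max_tier, key):
    """Keep records at or below max_tier; unsigned tier-0 claims are rejected."""
    return [r for r in records
            if r.trust_tier <= max_tier and (r.trust_tier > 0 or is_trusted(r, key))]


KEY = b"demo-key-not-for-production"
draft = MemoryRecord("Refunds are handled case by case.", "user message", 1, "ray", 0)
signed = replace(draft, signature=sign(draft, KEY))
forged = MemoryRecord("The company offers a 100 percent refund.", "user message", 2, "mallory", 0)
scraped = MemoryRecord("Library X requires a token.", "web page", 3, "rag", 3)

safe = retrieve([signed, forged, scraped], max_tier=1, key=KEY)
```

The forged record claims tier 0 but cannot produce a signature, so it is dropped. The scraped record is honest about being tier 3 and is excluded by the tier cap, not by the signature check.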

s55 · production reality · the framework question is a layer question

NOBODY PICKS ONE. THEY STACK THREE.

case a · solo coder the file is the stack.
tools · Claude Code
procedural · CLAUDE.md (manual)
auto · MEMORY.md (v2.1.59+)
retrieval · MCP server (optional)

Why it wins: zero infra. one editor. memory lives in git. no vector store to debug. the second a project outgrows it, you know.

case b · production chatbot the framework eats the plumbing.
runtime · LangGraph
memory · Mem0 (managed API)
resume · PostgresSaver checkpointer
observability · LangSmith

Why it wins: three managed layers, one ops surface. Mem0 handles fact-worthiness so your prompt stays focused. checkpointer handles process crashes. LangSmith handles "why did the agent say that." nothing custom.

case c · agent-native product
own the loop, own the store.
runtime: Letta v1
graph: Graphiti standalone
substrate: pgvector + Postgres
protocol: MCP for tools, A2A for agents

Why it wins: Letta owns core context. Graphiti owns temporal facts. pgvector owns the embedding pool. you own the schema. nothing leaves your boundary. expensive in calendar weeks, cheap forever after.

/lab/hybrid-stack walks case b end to end · case c lives in any agent-native product you respect · case a is what you have right now.

s56 · forgetting · the policies nobody writes down

"NEVER FORGET" IS WRONG. RETRIEVAL GETS NOISIER EVERY DAY.

time-decay (LRU)
score = recency × 1 / (age_days + 1)
wins for conversational agents where what mattered last week probably matters now.
loses for long-arc facts. user's birthday, employer, allergies. these never get re-mentioned and silently fall out of cache.
salience-scored
score = relevance(query) × recency × user_emphasis
wins for agents with structured queries. the relevance signal does the actual work and recency is the tiebreaker.
loses for open-domain assistants. you can't compute relevance without a query, and you don't have a query at write time.
event-driven
on logout: archive · on contradiction: split + version
wins for session-scoped agents and high-stakes assistants. you know exactly when state can safely turn over.
loses for always-on companion agents where there is no logout, no contradiction, just slow accumulation forever.
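A small sketch of the two scored policies above, to show where each wins and loses. The slide does not define `recency`, so the sketch assumes `1 / (days_since_access + 1)`. The toy `relevance` function and the sample memories are also assumptions. The event-driven policy is rule-based rather than scored, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_days: float             # days since the memory was written
    days_since_access: float    # days since it was last read
    user_emphasis: float = 1.0  # assumption: 1.0 is neutral, higher if the user stressed it


def recency(m: Memory) -> float:
    """Assumed definition: 1.0 when just accessed, falling toward 0."""
    return 1.0 / (m.days_since_access + 1.0)


def time_decay_score(m: Memory) -> float:
    # score = recency × 1 / (age_days + 1)
    return recency(m) * 1.0 / (m.age_days + 1.0)


def relevance(query: str, m: Memory) -> float:
    """Toy relevance: fraction of query words that appear in the memory text."""
    words = query.lower().split()
    if not words:
        return 0.0
    return sum(w in m.text.lower() for w in words) / len(words)


def salience_score(query: str, m: Memory) -> float:
    # score = relevance(query) × recency × user_emphasis
    return relevance(query, m) * recency(m) * m.user_emphasis


allergy = Memory("user is allergic to peanuts", age_days=90,
                 days_since_access=90, user_emphasis=3.0)
chatter = Memory("user asked about lunch spots", age_days=1, days_since_access=1)

# Time decay ranks yesterday's chatter above the 90-day-old allergy.
decay_prefers_chatter = time_decay_score(chatter) > time_decay_score(allergy)

# With a query in hand, salience puts the allergy back on top.
query = "allergic peanuts"
salience_prefers_allergy = salience_score(query, allergy) > salience_score(query, chatter)
```

The allergy only resurfaces because a query exists to score against, which is the write-time gap the salience policy has.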

The honest production answer is a combination of all three, tuned per memory tier. Mem0 ships a softmax over recency × salience. Zep retains everything but marks edges invalid_at. CLAUDE.md forgets nothing and you pay for it in context bloat.

→ cost-latency tradeoff lives at arXiv:2603.07670 §4.3 · the actual code each framework runs is in their open-source repos. read it before you trust the marketing.

s57 · consolidation · the reflection tax

50 EVENTS A SESSION. 100 SESSIONS. DO THE MATH.

the storage explosion
events per session: 50
sessions per quarter: 100
avg tokens per event: 120
raw episodic store: 600,000 tok
in-context cost per query (no consolidation): unaffordable
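The arithmetic behind the table, as a quick check. The 200,000-token window is the figure used by this deck's context calculator; comparing the raw store against it is an illustration, not a measured cost.

```python
events_per_session = 50
sessions_per_quarter = 100
avg_tokens_per_event = 120

raw_episodic_tokens = events_per_session * sessions_per_quarter * avg_tokens_per_event

context_window = 200_000  # window size used by the deck's calculator
windows_needed = raw_episodic_tokens / context_window

print(f"{raw_episodic_tokens:,} tokens = {windows_needed:.1f}x a {context_window:,}-token window")
```

One quarter of raw events is three full windows, before the system prompt or the current conversation takes any room.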

The fix everyone reaches for is reflection: every N events, an LLM summarizes the block into a few semantic facts. The fact count grows slowly while the underlying event log is allowed to balloon and stay cold. Reads target the semantic layer first, the event log second. Letta uses this pattern in Recall. Mem0 calls it the consolidation pass. Anthropic calls it compaction. Same primitive.

episodic events: ~50 / session
block summary: ~5 facts / 10 events
semantic store: ~10 facts / session
in-context: ~3 facts / query
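The reflection loop described above can be sketched as follows. The summarizer would normally be an LLM call; `toy_summarizer` is a stand-in, and `ReflectiveMemory` is a hypothetical name, not any framework's API.

```python
from typing import Callable, List

Summarizer = Callable[[List[str]], List[str]]


def toy_summarizer(block: List[str]) -> List[str]:
    """Stand-in for an LLM call (assumption): keep events tagged as preferences."""
    return [e for e in block if e.startswith("pref:")]


class ReflectiveMemory:
    def __init__(self, summarize: Summarizer, block_size: int = 10) -> None:
        self.summarize = summarize
        self.block_size = block_size
        self.event_log: List[str] = []  # raw episodic log, kept cold, never deleted
        self.semantic: List[str] = []   # consolidated facts, read first
        self._pending: List[str] = []

    def record(self, event: str) -> None:
        self.event_log.append(event)
        self._pending.append(event)
        if len(self._pending) >= self.block_size:
            for fact in self.summarize(self._pending):
                if fact not in self.semantic:  # dedupe so the fact count grows slowly
                    self.semantic.append(fact)
            self._pending = []

    def read(self, query: str, k: int = 3) -> List[str]:
        """Semantic layer first; fall back to the raw log only if it has no hits."""
        q = query.lower()
        hits = [f for f in self.semantic if q in f.lower()]
        if not hits:
            hits = [e for e in self.event_log if q in e.lower()]
        return hits[:k]


mem = ReflectiveMemory(toy_summarizer, block_size=10)
for i in range(50):
    mem.record("pref: uses pnpm" if i % 10 == 0 else f"event {i}: ran tests")
```

After 50 events the semantic layer holds a single deduplicated fact while the log holds all 50 events, so a query the summary dropped can still be answered from the raw log.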

The risk is the same risk all summarization has. The summary loses the specifics that mattered: the edge case that triggered the bug, the offhand preference the user only stated once, the contradiction between session 17 and session 42. Compressed away by a summarizer that didn't know what was load-bearing.

→ reflection pattern: arXiv:2603.07670 §3.3 · the practical implementation is in letta/letta under memory/summarize.py · keep the raw log even if you stop reading from it.

s58 · reading list · the 15 documents that move the field

READ THESE IN ORDER. SKIP NONE.

survey · the index
Memory in the Age of AI Agents: A Survey

The taxonomy every 2026 paper cites. Three lenses: forms, functions, dynamics. Read this first to align on vocabulary.

survey · the loop
Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers

Formalizes memory as write-manage-read. Five mechanism families. The honest "open problems" §6 is worth the read alone.

graph thesis
Graph-based Agent Memory: Taxonomy, Techniques, Applications

The case that graphs eat vectors when temporal reasoning matters. Read alongside Zep / Graphiti production docs.

graph thesis · architecture
MAGMA: Multi-Graph based Agentic Memory Architecture

A concrete multi-graph layout for episodic, semantic, and procedural memory. Closest thing to a reference design.

security · the new attack surface
A Survey on the Security of Long-Term Memory in LLM Agents

Coins mnemonic sovereignty. Three attack vectors, five defenses. Required reading before you ship a shared memory pool.

eval · the yardstick
LOCOMO: Evaluating Long-Term Conversational Memory

The benchmark every framework brags about and none of them top. 600 turns per persona. Multi-hop scores will humble you.

framework · graph
Zep: A Temporal Knowledge Graph Architecture for Agent Memory

The Zep paper. Valid_at / invalid_at semantics. Async entity extraction. The reference implementation for "what was true on this date."

framework · agent-native
Letta v1 Agent Loop

Letta's rethink after ReAct + MemGPT + Claude Code. The three-tier model as a runtime, not a library. Pair with their memory blog post.

framework · managed
Mem0 State of AI Agent Memory 2026

The annual report card from the managed-memory team. Field-wide numbers, not vendor-specific. Read the methodology, not the headline.

framework · research arm
Mem0 Research Library

Their published evals plus the papers they cite. Compare against Zep's parallel page. Two teams pointing at the same problem from different ends.

framework · learn
Zep Learn Hub

Zep's collected long-form on temporal graphs, agent state, and the integration recipes. Better than their docs for "why."

anthropic · context
Effective Context Engineering for AI Agents

Anthropic's canonical "memory is one of four levers" post. Read before you reach for a framework. Often the answer is context, not state.

anthropic · harnesses
Effective Harnesses for Long-Running Agents

Sub-agent delegation as memory hygiene. The pattern Claude Code itself uses. Applies cleanly outside the Anthropic stack.

anthropic · managed agents
Scaling Managed Agents: Decoupling the Brain

Managed Agents architecture. The line between "agent state I own" and "agent state the platform owns." Same question, larger blast radius.

claude code · memory
Claude Code Auto-Memory (v2.1.59+)

The first agent that curates its own procedural memory by default. Docs for the MEMORY.md system. Skim before you write a custom CLAUDE.md.

→ all entries pulled from research.md §Sources · this is the reading list ray maintains. expect rotations every quarter as the field moves.

s59 · the close · where the frontier moves in the next 18 months

THREE PREDICTIONS. CHECK BACK IN 2027.

01
prediction · the managed-memory wave

Memory becomes a managed service.

The Mem0 thesis wins on volume. The default for most builders becomes "POST your messages to a vendor and let them decide what to remember," the same way logs went to Datadog and errors went to Sentry. Letta and Zep occupy the agent-native end. The DIY pattern (pgvector + your own logic) gets relegated to teams who genuinely care about the storage shape.

02
prediction · the rules-file standard

Procedural memory becomes a first-class artifact, like .gitignore.

CLAUDE.md, AGENTS.md, .cursorrules, .windsurfrules. The pattern that quietly won keeps winning until somebody publishes a real standard. AGENTS.md is the leading candidate because it is vendor-neutral. Every repo on GitHub ships one by 2027. The agent reads it before doing anything. The same way every repo ships a README the human reads before doing anything.

03
prediction · the poisoning era

Memory poisoning attacks become routine. Defense becomes a fourth quadrant on every product roadmap.

The first public memory-poisoning incident gets reported in 2026. Then a wave. Within a year, "memory hygiene" sits next to authentication, authorization, and audit on the product compliance checklist. Provenance + signing + trust tiers go from research papers to vendor checkboxes. The teams that didn't read arXiv:2604.16548 spend a quarter rewriting their memory layer.

Build memory like you mean it. Or pay for it.
vcn #33 · total recall · 2026-05-20

→ ray's bet table for 2027 · these are predictions, not facts. quote them back to me when you find them wrong.

wing i interactive · the token arithmetic

DO YOU FIT IN CONTEXT, OR DO YOU NEED A LAYER?

Move the inputs. Watch the bar. The split shows you what the model is actually being asked to hold. Green means it fits comfortably. Amber means compaction is mandatory. Red means no compaction will save you and you need an external memory layer.

window: 200,000 tokens · 0 used · 100% remaining · fits comfortably
table columns: category · tokens · % of window · note

note: 800 tokens / turn assumes mixed user prompt + assistant reply. 1500 system-prompt baseline includes tool schemas. Compaction trigger at 90% mirrors Claude Code's default behavior.
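A rough sketch of the calculator's arithmetic using the numbers in the note. The widget's real thresholds are not stated, so the amber/red split is an assumption: it supposes compaction shrinks turn history to 25% of its size (`COMPACTION_RATIO`, hypothetical) and flags red when even that does not get under the trigger.

```python
from typing import Tuple

WINDOW = 200_000           # context window shown in the calculator
SYSTEM_BASELINE = 1_500    # system prompt including tool schemas (from the note)
TOKENS_PER_TURN = 800      # mixed user prompt + assistant reply (from the note)
COMPACTION_TRIGGER = 0.90  # trigger at 90% of the window (from the note)
COMPACTION_RATIO = 0.25    # ASSUMPTION: compaction shrinks history to 25% of its size


def classify(turns: int, extra_tokens: int = 0) -> Tuple[int, str]:
    """Return (tokens used, status) for a conversation of `turns` turns."""
    history = turns * TOKENS_PER_TURN + extra_tokens
    used = SYSTEM_BASELINE + history
    limit = WINDOW * COMPACTION_TRIGGER
    if used < limit:
        return used, "green: fits comfortably"
    compacted = SYSTEM_BASELINE + history * COMPACTION_RATIO
    if compacted < limit:
        return used, "amber: compaction is mandatory"
    return used, "red: needs an external memory layer"


for turns in (100, 224, 1000):
    used, status = classify(turns)
    print(f"{turns:>5} turns  {used:>8,} tokens  {status}")
```

Under these assumptions the 90% trigger is first crossed at turn 224, since 1,500 + 224 × 800 = 180,700 tokens.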

wing iii interactive · temporal edges, live

WATCH A FACT GO STALE.

Add a fact as three parts: subject, predicate, object. Each fact gets a valid-from timestamp. Add a contradicting fact (same subject + predicate, different object) and the old one gets superseded, not deleted. Then ask: "what was true at <datetime>?" and the graph answers from history. This is the temporal-edge model behind Zep / Graphiti.

tip: try adding "Rayyan, uses, npm" then "Rayyan, uses, pnpm" then "Rayyan, uses, bun" and watch the older edges get superseded in order.
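A minimal sketch of the supersede-not-delete model described above. The field names `valid_at` and `invalid_at` follow the Zep vocabulary; `TemporalGraph` and its methods are hypothetical, the dates are invented for the example, and facts are assumed to arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_at: datetime
    invalid_at: Optional[datetime] = None  # set when superseded; the edge is never deleted


class TemporalGraph:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add(self, subject: str, predicate: str, obj: str, at: datetime) -> Edge:
        # A new fact with the same subject + predicate but a different object
        # closes the currently open edge instead of deleting it.
        for e in self.edges:
            same_key = (e.subject, e.predicate) == (subject, predicate)
            if same_key and e.invalid_at is None and e.obj != obj:
                e.invalid_at = at
        edge = Edge(subject, predicate, obj, valid_at=at)
        self.edges.append(edge)
        return edge

    def as_of(self, subject: str, predicate: str, at: datetime) -> Optional[str]:
        """Answer 'what was true at <datetime>?' from history."""
        for e in self.edges:
            if (e.subject, e.predicate) != (subject, predicate):
                continue
            if e.valid_at <= at and (e.invalid_at is None or at < e.invalid_at):
                return e.obj
        return None


g = TemporalGraph()
g.add("Rayyan", "uses", "npm", datetime(2024, 1, 1))
g.add("Rayyan", "uses", "pnpm", datetime(2025, 1, 1))
g.add("Rayyan", "uses", "bun", datetime(2026, 1, 1))

mid_2025 = g.as_of("Rayyan", "uses", datetime(2025, 6, 1))
today = g.as_of("Rayyan", "uses", datetime(2026, 5, 20))
```

All three edges stay in the graph. Only the `invalid_at` stamps change, which is what lets the mid-2025 query still resolve to `pnpm`.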

wing iii sortable · pick a column, see the order shift

SAME PROBLEM, FOUR SHAPES.

Four substrates for "remember things across sessions." Click a column header to sort. The setup-time and substrate columns reveal that these are not interchangeable; each one bets on a different primitive. The footer carries the verbatim positioning quote each framework's team uses.

    framework substrate storage model setup time best for lab