BUILDERS USE THE WORD MEMORY FOR FIVE DIFFERENT THINGS.
01 · context
context-window engineering
Fit more useful tokens in the working set. Compaction, tool-result clearing, prompt cache. Solves cost and latency. Does not solve persistence.
02 · retrieval
external retrieval (RAG)
Pull the right chunks at runtime, inject into context. Vector index or knowledge graph. Brings in facts the model never saw. The next session starts fresh.
03 · state
persistent state across sessions
Facts, preferences, prior conversation summaries that survive a restart. Mem0, Letta, Zep. "Remember that I prefer x" across days.
04 · taxonomy
procedural · episodic · semantic
How to do something. What happened when. What is true. Borrowed from cognitive psychology. Production stacks lump all three; the frontier splits them.
05 · meta
memory-as-a-tool
Anthropic's shift. The model gets a memory tool with read / write verbs and decides when to use it. Storage is markdown outside the context window.
→ research.md §1. confusing these is the most common reason agent stacks break.
● the architecture · four wings, one archive
FOUR WINGS. one archive.
I · wing one · context
context engineering
Anthropic's memory tool. Compaction at the 180K trigger. Tool-result clearing. Sub-agent delegation. Memory inside the model's working set.
II · wing two · retrieval
external retrieval
Vector index versus knowledge graph. The 2026 debate. Vectors win on simplicity and speed. Graphs win on temporal reasoning.
III · wing three · state
persistent state
The center wing. Three Apache-2.0 frameworks worth knowing tonight. Zep. Letta. Mem0. Each has a different shape.
IV · wing four · procedural
procedural memory
CLAUDE.md, .cursorrules, AGENTS.md. The DIY pattern that quietly turned into the killer use case. No framework. Highest leverage today.
spine of the talk · each wing has its own demo / lab · walk in order
● wing I · context engineering
A MEMORY TOOL + MARKDOWN FILES OUTSIDE THE CONTEXT.
compaction. Claude saves a summary at a configurable token trigger, default near 180K. The working set keeps moving forward.
tool-result clearing. Drop stale tool outputs from the context before they crowd out new ones.
sub-agent delegation. Offload a subtask into a fresh context. The parent agent never sees the noise.
memory-as-tool. A `memory` tool with read / write verbs. Storage is markdown files outside the context. The model decides when to use it.
Anthropic · engineering blog · 2026
Anthropic frames memory as one of four levers for effective context engineering: compaction, tool-result clearing, sub-agent delegation, and the memory tool. The model engineers its own working set.
→ anthropic.com / engineering / effective-context-engineering-for-ai-agents.
solves cost and latency. does not solve persistence across sessions.
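A minimal sketch of two of the levers above, tool-result clearing and compaction at a token trigger. This is a toy, not Anthropic's implementation: the `Message` type, the precomputed `tokens` field, the stub text, and the `summarize` callback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str
    tokens: int    # hypothetical precomputed token count


def clear_stale_tool_results(messages, keep_last=2):
    """Replace all but the newest `keep_last` tool outputs with a short stub."""
    tool_indexes = [i for i, m in enumerate(messages) if m.role == "tool"]
    stale = set(tool_indexes[:-keep_last]) if keep_last else set(tool_indexes)
    return [
        Message("tool", "[tool result cleared]", 5) if i in stale else m
        for i, m in enumerate(messages)
    ]


def compact(messages, trigger_tokens, summarize, keep_recent=4):
    """Once the working set crosses the trigger, fold older turns into one summary."""
    if sum(m.tokens for m in messages) < trigger_tokens:
        return messages
    split = max(len(messages) - keep_recent, 0)
    head, tail = messages[:split], messages[split:]
    summary = summarize(head)  # in a real stack this would be a model call
    return [Message("assistant", summary, len(summary.split()))] + tail
```

Neither function writes anything to disk, which is the point of the slide: the working set gets cheaper, but nothing here survives a restart.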
● wing I · demo card · the recipe
LAB 1. CLAUDE CODE'S MEMORY.MD, LIVE.
A CARD CATALOG FOR YOUR AGENT.
init a CLAUDE.md. watch the agent self-curate. cause a compaction. observe the survivor set.
# inside a Claude Code session
/init # scaffolds CLAUDE.md + opens auto-memory
# talk to the agent. let it work for 10 minutes.
ls ~/.claude/projects/<hash>/memory/ # the agent has been writing
cat ~/.claude/projects/<hash>/memory/MEMORY.md # the index it built
# force a compaction. see what survives.
/compact # inside the session. the working set turns over
cat ~/.claude/projects/<hash>/memory/MEMORY.md # same index, still there
→ MEMORY.md is the librarian's index card stack. context comes and goes; the catalog persists.
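The card-catalog idea can be sketched in a few lines: a memory tool with read / write verbs, storage as markdown files, and an index file that outlives any context window. This is a toy under stated assumptions, not Claude Code's actual implementation; the `MarkdownMemory` class, its method names, and the one-file-per-topic layout are hypothetical.

```python
from pathlib import Path


class MarkdownMemory:
    """Toy memory tool: one markdown file per topic plus a MEMORY.md index."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index = self.root / "MEMORY.md"

    def write(self, topic, note):
        """Append a note to the topic file; add an index card on first write."""
        path = self.root / f"{topic}.md"
        is_new = not path.exists()
        with path.open("a", encoding="utf-8") as f:
            f.write(f"- {note}\n")
        if is_new:
            with self.index.open("a", encoding="utf-8") as f:
                f.write(f"- [{topic}]({path.name})\n")
        return path

    def read(self, topic):
        """Return the topic file, or an empty string if nothing was stored."""
        path = self.root / f"{topic}.md"
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def catalog(self):
        """Return the index the agent would load at session start."""
        return self.index.read_text(encoding="utf-8") if self.index.exists() else ""
```

A model would call `write` and `read` as tool verbs and decide for itself when to use them. The sketch assumes topic names are safe file names.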
● wing II · the 2026 retrieval debate
VECTORS WIN ON SIMPLICITY. GRAPHS WIN ON TIME.
B · vector
VECTOR DB
"give me passages that look like this one."
pgvector, Pinecone, Qdrant, Weaviate, Chroma
similarity over embeddings
simple, fast, default for most stacks
cannot answer "what was true on this date"
vs
C · graph
KNOWLEDGE GRAPH
"give me the chain of facts that held last tuesday."
Zep / Graphiti, Cognee, standalone graph stores
entities, edges, timestamps
temporal reasoning, chains of evidence
heavier write path, async entity extraction
→ arXiv:2602.05665 (graph-based agent memory taxonomy, feb 2026). arXiv:2601.03236 (MAGMA, the "graphs eat vectors" thesis, jan 2026).
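A toy illustration of the vector side of the debate, assuming hand-written three-dimensional embeddings (real ones come from an embedding model) and hypothetical employer names. Two facts about the same person land almost on top of each other in embedding space, and similarity search alone returns both with nothing to say which one held on a given date.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, passages, k=2):
    """Rank (text, embedding) pairs by similarity to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in passages]
    return sorted(scored, reverse=True)[:k]


passages = [
    ("Ray works at Acme", [0.90, 0.10, 0.0]),      # hypothetical older fact
    ("Ray works at Globex", [0.88, 0.12, 0.0]),    # hypothetical newer fact
    ("VCN meets on wednesdays", [0.0, 0.20, 0.9]),
]
hits = top_k([0.90, 0.10, 0.0], passages)
```

Both employment facts come back with near-identical scores. Answering "which one was true in october" needs timestamps on the facts, which is what the graph column adds.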
● wing III · persistent state · the center wing
THREE FRAMEWORKS. one wing. apache 2.0 all the way down.
III · a · first up
ZEP
temporal graph · graph-owns-truth
Graphiti under the hood. Every edge carries `valid_at` and `invalid_at`. The killer question it answers cleanly: "what was true last tuesday."
III · b · second
LETTA
stateful runtime · agent-owns-blocks
Three tiers. Core (always in context), Recall (conversation history), Archival (vector store). The runtime persists the blocks the agent edits.
III · c · third
MEM0
managed API · api-owns-truth
Three lines. add(), search(), delete(). Vector + graph + reranker hidden behind a managed layer. The substrate choice is theirs, not yours.
→ talk section order is locked: zep first (the temporal case lands cleanest), then letta, then mem0. comparator table follows.
● wing III · a · zep / graphiti · the temporal-graph case
EVERY EDGE CARRIES FOUR TIMESTAMPS.
Zep tracks memory in temporal edges where the graph owns the truth about when a fact was valid (per valid_at and invalid_at in graphiti_core/edges.py). The canonical case for graph-beats-vector when you need temporal reasoning.
Rasmussen et al. · arXiv:2501.13956 · zep team
graphiti_core / edges.py · EntityEdge
source_node_uuid · the entity it came from
target_node_uuid · the entity it points at
created_at · when the edge entered the graph
valid_at · when the fact became true
invalid_at · when the fact stopped being true
expired_at · when the system noticed
"what was true on this date?"
→ Preston Rasmussen, Liu, Liu, Mocrii, Klein, Chalef. *Zep: A Temporal Knowledge Graph Architecture for Agent Memory.* arXiv:2501.13956. nominative-fair-use citation per apache 2.0 §6.
● wing III · a · zep · the minimal recipe
ADD AN EPISODE. ASK WHAT WAS TRUE.
zep · python client · minimal example · apache 2.0
from zep_python.client import Zep
from datetime import datetime, timezone

client = Zep(api_key="...")
graph_id = "ray_memory"

# write a temporal episode. the graph extracts entities + edges.
client.graph.add(
    graph_id=graph_id,
    type="message",
    data="Ray prefers jade-teal accents on event posters.",
    reference_time=datetime.now(tz=timezone.utc),
)

# later, in a new session, ask the graph what holds today.
results = client.graph.search(
    graph_id=graph_id,
    query="Ray's color preference for posters",
    search_filters={"valid_at": datetime.now(tz=timezone.utc)},
)
for edge in results.edges:
    print(edge.fact, edge.valid_at, edge.invalid_at)
→ runnable version with full provenance walk-through at /lab/zep. one graph, three episodes, one temporal query. 8 minutes.
● wing III · b · letta · the stateful-runtime case
THE AGENT OWNS THE BLOCKS. THE RUNTIME PERSISTS THEM.
Letta's three-tier memory model. Core (always-on), Recall (conversation history), Archival (vector store). Maps to what production agents actually ship versus what builders think they ship.
letta.com / blog / agent-memory · cofounded by Sarah Wooders + Charles Packer · MemGPT 2023
I
core memory
always-on
Lives in the system prompt. Edited by the agent itself via tool calls. The persistent "who we are."
II
recall memory
conversation history
Searchable log of past messages. Pulled on demand into the context window.
III
archival memory
vector store
Long-term facts and documents. Tool-callable. Where the cold knowledge lives.
the distinction. zep keeps truth in the graph (graph-owns-truth). letta keeps state in agent blocks (agent-owns-blocks). mem0 keeps it behind an API (api-owns-truth). same problem. three different owners.
● wing III · b · letta · the minimal recipe
SPAWN AN AGENT. EDIT ITS CORE. COME BACK TOMORROW.
letta · python client · minimal example · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# create an agent. core memory blocks are first-class state.
agent = client.agents.create(
    name="ray_assistant",
    memory_blocks=[
        {"label": "human", "value": "Ray runs Vibe Coding Nights at Frontier Tower."},
        {"label": "persona", "value": "I keep track of what Ray ships."},
    ],
    model="anthropic/claude-sonnet-4-6",
    embedding="openai/text-embedding-3-small",
)

# the agent can edit its own blocks via tool calls during a turn.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "remember that I prefer jade-teal accents."}],
)

# tomorrow. fresh process. same agent. same memory.
client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "what color did I tell you I liked?"}],
)
→ runnable at /lab/letta. spin up a local letta server, create an agent, watch core memory survive a process restart. 10 minutes.
● wing III · c · mem0 · the managed-api case
THREE LINES. ZERO SUBSTRATE DECISIONS.
Mem0 wraps vector + graph + reranker behind a managed memory API, so a builder can add(), search(), and forget across sessions without picking a stack.
mem0.ai / blog / state-of-ai-agent-memory-2026 · mem0/mem0 readme
add()
Write a message, an event, a preference. The api decides what is fact-worthy and what is chatter.
search()
Ask for relevant memories. Returns ranked facts with source pointers. The substrate is hidden.
delete()
Forget a memory or a whole user. The forgetting policy is explicit, the implementation is not yours to maintain.
→ the pitch is operational simplicity. mem0 hides the substrate choice the other two surface. tradeoff: less control, less to break.
● wing III · c · mem0 · the minimal recipe
ADD. SEARCH. THAT IS IT.
mem0 · python client · minimal example · apache 2.0
from mem0 import Memory

m = Memory()

# write some memories tied to a user.
m.add("Ray prefers jade-teal accents on event posters.", user_id="ray")
m.add("VCN happens wednesdays at 7pm at Frontier Tower.", user_id="ray")
m.add("Ray hosted VCN #32 last week. Topic: tool-use UX.", user_id="ray")

# later, in a new session, retrieve what is relevant.
results = m.search(
    query="what color should the poster be?",
    user_id="ray",
    limit=5,
)
for hit in results["results"]:
    print(hit["memory"], hit["score"])

# forget a memory if it goes stale.
m.delete(memory_id=results["results"][0]["id"])
→ runnable at /lab/mem0. three messages, one search, one delete. the substrate is hidden by design. 5 minutes.
● wing IV · procedural memory · the diy pattern that won
NO FRAMEWORK. JUST MARKDOWN. HIGHEST LEVERAGE TODAY.
The pattern emerged from nowhere and quietly became table stakes. A static markdown file that tells the agent how this codebase works, what conventions to follow, what to never touch. No vector store. No write path. Just words a human and a model both read.
CLAUDE.md · project root
# Life Automation Repo
# Doing tasks
- consequential messages (sponsors, investors) always go through
  Validate → Summarize → Approve → Act
- never run parallel Telegram connections (causes flood waits)
- credentials live in .env, never committed
# Voice rules
- no em-dashes, en-dashes, or hyphen connectors in outbound drafts
- lowercase headlines except where institutional names appear
→ cursor `.cursorrules` was first. claude code `CLAUDE.md` is the canonical agent-readable version. zero infrastructure cost. the most widely-shipped memory pattern of 2026.
● wing IV · claude code's auto-memory · the receipt
THE AGENT IS WRITING ITS OWN INDEX CARDS RIGHT NOW.
Claude Code v2.1.59+ ships with auto-memory default on. As the model works, it self-curates entries in MEMORY.md. User preferences. Project conventions. Past mistakes. The librarian builds the catalog while you build the code.
~/.claude/projects/.../memory/MEMORY.md · session today
user_profile.md → Ray is in SF, builds AI tooling, runs VCN.
reference_directory.md → check first on any send.
feedback_no_dashes.md → no em or en dashes in outbound.
project_self_hosted_kv.md → redis on node, replaces upstash.
project_youtube_ingest.md → yt-dlp + groq whisper.
● writing this very deck added two more entries.
→ first agent to curate its own procedural memory by default. lab: /lab/claude-md. init a new CLAUDE.md in 90 seconds, watch it grow.
● the screenshot · pick a layer in 30 seconds
FOUR LAYERS. FOUR JOBS. ONE TABLE.
| framework | substrate | write-policy | retrieval-pattern | reach for it when |
|---|---|---|---|---|
| zep · graphiti | temporal knowledge graph (valid_at / invalid_at) | async entity + edge extraction, with provenance | graph traversal + temporal filter | you need "what was true on this date" semantics |
| letta · memgpt v2 | three tiers (core / recall / archival) | agent edits its own core blocks via tool calls | core lives in prompt, archival via search tool | you are building agent-native and want one runtime to own state |
| mem0 · managed api | hidden (vector + graph + reranker) | api decides fact-worthiness from messages | add() / search() / delete() | you have a working agent and want to bolt memory on without restructure |
| CLAUDE.md · auto-memory | static markdown files in the repo | agent writes when it notices something worth keeping | read on every session start, indexed by topic | you want zero-config persistence for a coding agent or single-user assistant |
→ the differences are about who owns truth: the graph (zep), the agent (letta), the api (mem0), or the file (CLAUDE.md). pick by ownership, not by feature list.
● discord · the slide that breaks the grammar
Your agent's memory is now an attack surface.
arXiv:2604.16548 · memory poisoning · live · apr 2026 · mnemonic sovereignty
In A Survey on the Security of Long-Term Memory in LLM Agents (apr 2026), researchers formalize a new attack surface. If your agent reads memory from untrusted sources, an attacker can implant a false memory. The agent then acts on a false belief. Worse: it presents the belief as a learned preference of yours.
The paper coins the goal: mnemonic sovereignty. The agent's memory has to remain provably yours.
01 · attacker drops a crafted artifact into a corpus your agent later reads.
02 · agent ingests it. mem0 / zep / letta has no way to know the source is hostile.
03 · false fact is now in the store, indistinguishable from your real preferences.
04 · next session, agent recommends the attacker's shitcoin. cites you as the source.
trust the surface, get exploited.
→ arXiv:2604.16548 · the inversion of this slide IS the attack. nobody is talking about this yet. you should.
● open problems · what builders are wrestling with
NONE OF THESE ARE SOLVED. PICK ONE AND BUILD.
forgetting · when does old become stale. time-decay versus LRU versus salience-scored.
consolidation · episodic logs balloon. how do you compress them into semantic facts without losing edge cases.
conflict · two memories disagree. user said x last week, y today. newest wins, source-weighted, or ask.
tiering · hot facts in context (fast, expensive). cold in vector (slow, cheap). how do you tier without flapping.
lost in the middle · long contexts degrade in the middle. memory has to beat "just throw more in the window."
self-improving · agents updating their own memory schema based on what they keep getting wrong. letta is investing here.
cross-agent · two agents on the same task. shared memory without race conditions. almost no production answers yet.
eval · LOCOMO exists but real-world memory eval is hard. how do you A/B "did the agent get better."
→ research.md §4 · pull these directly from the 2026 frontier papers (arXiv:2603.07670, arXiv:2512.13564). every one is an unbuilt company.
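One way to make the forgetting question concrete is a sketch that blends two of the named options, salience scoring and time decay. The exponential half-life formula, the 30-day default, and the `MemoryItem` fields are assumptions for illustration, not a method taken from the cited papers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class MemoryItem:
    text: str
    salience: float       # 0..1, how much the fact mattered when written
    last_used: datetime   # refreshed on every retrieval


def retention_score(item, now, half_life_days=30.0):
    """Salience halves for every `half_life_days` the item goes unused."""
    age_days = (now - item.last_used).total_seconds() / 86400
    return item.salience * 0.5 ** (age_days / half_life_days)


def forget(items, now, keep, half_life_days=30.0):
    """Keep the `keep` highest-scoring items; the rest are candidates to drop."""
    ranked = sorted(
        items,
        key=lambda m: retention_score(m, now, half_life_days),
        reverse=True,
    )
    return ranked[:keep]
```

Pure LRU is the special case where every salience is equal. The open part is where salience comes from and how the half-life should differ per kind of fact.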
● production patterns · already shipping
THE PATTERNS THE FRAMEWORK HEADLINES MISS.
01 · rules files
.cursorrules + CLAUDE.md
Static markdown. Project-scoped behavior. The first widely-adopted procedural memory pattern. Most-shipped memory layer of 2026.
02 · auto-memory
Claude Code MEMORY.md
Default on in v2.1.59+. Agent self-curates. First agent that writes its own procedural memory.
03 · threads
OpenAI Assistants threads
Persistent conversation threads with the API managing context. Lower-level than Letta. Built into the platform.
04 · plan files
Replit Agent + Devin
Plan files persisted across runs. The agent remembers what it already tried so it does not retry doomed paths.
05 · checkpointer
LangGraph PostgresSaver
Process-level resume + time-travel debug. Not memory of knowledge — memory of execution. Pair with Mem0 / Zep for facts.
06 · per-repo
AGENTS.md / codex.md
The OpenAI flavor of the rules-file pattern. One markdown file per repo. Read on every session.
→ research.md §5. the rules-files pattern in general (`.cursorrules`, `CLAUDE.md`, `.windsurfrules`, `AGENTS.md`) is the layer that quietly won for most teams.
● pick your layer · decision tree
ONE QUESTION. ONE ANSWER. START THERE.
want zero-config persistence for a coding agent?
→
CLAUDE.md auto-memory
building agent-native, want one runtime to own loop and state?
→
Letta
have a working agent, just bolt memory on?
→
Mem0
need temporal reasoning ("what was true last tuesday")?
→
Zep / Graphiti
need process-level resume and time-travel debug?
→
LangGraph checkpointers (pair with one above)
→ the questions are not competing. most production stacks pair a knowledge layer with a process layer. pick by question, not by brand.
by 10pm, your agent remembers something it didn't at 7.
→ scan the QR on the deck chrome to pair your phone. no urls to memorize. labs run offline-friendly where possible.
● vcn cadence · wednesdays · 7pm · frontier tower
THE NEXT SEVEN. ALREADY ON THE CALENDAR.
2026-05-20 · VCN #33 · Total Recall · memory for agentic systems (tonight)
2026-05-27 · VCN #34 · Toolsmith · build your own MCP server
2026-06-03 · VCN #35 · Well Known · make your site discoverable by agents
2026-06-10 · VCN #36 · Crosstalk · build your own A2A endpoint
2026-06-17 · VCN #37 · Settlement · build an x402 paywall
2026-06-24 · VCN #38 · Glass Box · trace agents with OpenTelemetry
2026-07-01 · VCN #39 · Browser Tax · instrument your site with WebMCP
2026-07-08 · VCN #40 · Hardpoint · build an agent that withstands red-teaming
→ free, builder-only, no pitches. RSVPs on luma.com/vibe-coding-nights. doors 7pm, talks 7:30, social 9 to 10.
● hosts · vibe coding nights
THE PEOPLE WHO PUT THIS ROOM TOGETHER.
facilitator
Rayyan Zahid
Immersive Commons. Facilitator of VCN and tonight's speaker.
cohost
Michalis Vasileiadis
Otto / GSD 2.0. AI security and agentic infrastructure operator.
logistics
Eric Mockler
Frontier Tower F11 Health and Longevity. Pre-meet, room flow, food run.
tower lead
Devinder Sodhi
Frontier Tower lead. Booking, building access, the reason the F10 annex exists.
→ thanks to the F10 annex, Frontier Tower house staff, and every builder who showed up with a forgetting bug written down.
your agent remembers what you told it last tuesday.
slide 25 / annex opens here
the live talk ended.
the reference begins.
everything from slide 30 onward is the reference annex. denser. deeper. citation-heavy.
written for the hands-on hour and for the link the host sends after the room empties.
pick a section. skim it. come back when your agent breaks.
one edge. four lifecycle stamps. a clock per fact.
Zep tracks memory in temporal edges where the graph owns the truth about when a fact was valid (per valid_at and invalid_at in graphiti_core/edges.py). The canonical case for graph-beats-vector when you need temporal reasoning.
verbatim · graphiti_core/edges.py + arXiv:2501.13956
# graphiti_core/edges.py (faithful reconstruction)
class EntityEdge(Edge):
    source_node_uuid: str
    target_node_uuid: str
    fact: str                            # the natural-language claim
    fact_embedding: list[float] | None
    episodes: list[str]                  # source episode uuids
    # the four lifecycle stamps
    created_at: datetime                 # row written into the graph
    valid_at: datetime | None            # fact became true in the world
    invalid_at: datetime | None          # fact stopped being true
    expired_at: datetime | None          # graph noticed the supersession
created_at is database time. The row was written. It says nothing about the world.
valid_at is event time. When did the fact become true. Often before created_at because the extractor catches up after the conversation happens.
invalid_at is the supersession boundary. When did the fact stop being true. null means the edge is still considered live. A new conflicting episode sets this on the older edge.
expired_at is observation time. When did the system notice the supersession. This is the gap between the world changing and your graph knowing.
the killer query · what was true on this date
match   edges where source_node = ?ray and predicate = works_at
where   valid_at <= 2025-10-01 AND (invalid_at IS NULL OR invalid_at > 2025-10-01)
return  edge.fact, target_node.name, edge.valid_at
→ Rasmussen et al. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Source code: github.com/getzep/graphiti, file graphiti_core/edges.py. Nominative-fair-use citation per apache 2.0 §6.
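The point-in-time filter from the killer query can be sketched in plain Python. This is a toy over an in-memory list, not Graphiti's query path; the `Edge` dataclass, `supersede`, `true_on`, and the employer names are hypothetical stand-ins for the fields described above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Edge:
    fact: str
    valid_at: Optional[datetime]           # event time: fact became true
    invalid_at: Optional[datetime] = None  # None means still considered live
    expired_at: Optional[datetime] = None  # observation time of the supersession


def supersede(old, new, noticed_at):
    """A conflicting newer edge closes the older one at the new fact's start."""
    old.invalid_at = new.valid_at
    old.expired_at = noticed_at


def true_on(edges, when):
    """valid_at <= when AND (invalid_at IS NULL OR invalid_at > when)."""
    return [
        e for e in edges
        if e.valid_at is not None
        and e.valid_at <= when
        and (e.invalid_at is None or e.invalid_at > when)
    ]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


acme = Edge("Ray works at Acme", valid_at=utc(2024, 1, 1))
globex = Edge("Ray works at Globex", valid_at=utc(2025, 6, 1))
supersede(acme, globex, noticed_at=utc(2025, 6, 3))
print([e.fact for e in true_on([acme, globex], utc(2025, 10, 1))])
```

The older edge is never deleted. It keeps its interval, which is why a query for an earlier date still finds it.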
● appendix a · zep deep #2 · the loop
extraction is eventual. the graph is the consistency boundary.
Every episode you write fires a chain of LLM calls. Entity extraction. Edge extraction. Conflict resolution. Embedding. The graph commits a few seconds after the call returns. Reads against the graph during that window get the old truth, on purpose.
graph.add(episode) · what happens after the 200
01 · ingest · store raw episode row, return 200 · ~50ms
02 · entities · llm pass to extract Person / Place / Concept nodes · ~700ms
03 · edges · llm pass to extract predicates with valid_at hints · ~900ms
04 · conflict · find prior edges that contradict, set invalid_at on the loser · ~600ms
That is roughly 2 to 8 seconds for a single episode against a hot graph. The sleep() calls in the /lab/zep walkthrough are not arbitrary. They are the wait for stage 5 to commit before stage 1 of the next read kicks in. Treat the graph as eventually consistent and design around it.
The model behind stages 02 to 04 is a single LLM call each by default (configurable via llm_client). The graph never sees raw text after extraction. Once stage 02 names a node, every later edge reuses that uuid. The cost of the loop dominates the cost of the storage layer.
retry semantics. stages 02 to 04 are wrapped in tenacity-style exponential backoff. an OpenAI 429 or a malformed JSON response retries up to N times. failures past N are surfaced on the episode row as a status enum, not raised, so a single bad episode never blocks the ingest queue.
→ reference: Rasmussen et al. arXiv:2501.13956 §4 "Pipeline." Source: graphiti_core/graphiti.py::add_episode. Latency numbers approximate, measured against gpt-4o-mini on a 200-node graph.
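The retry semantics described above can be sketched without any library: exponential backoff around a stage, and a status value returned instead of an exception once retries run out. This is not Graphiti's code; `EpisodeStatus`, `run_stage`, and the default delays are hypothetical names and values for illustration.

```python
import time
from enum import Enum


class EpisodeStatus(Enum):
    DONE = "done"
    FAILED = "failed"


def run_stage(stage, payload, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Run one pipeline stage with exponential backoff.

    Returns (status, result). A stage that keeps failing is reported as
    FAILED instead of raising, so one bad episode cannot block the queue.
    """
    for attempt in range(max_retries + 1):
        try:
            return EpisodeStatus.DONE, stage(payload)
        except Exception:
            if attempt == max_retries:
                break
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return EpisodeStatus.FAILED, None
```

The `sleep` parameter is injectable so the backoff schedule can be checked in a test without waiting.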
● appendix a · zep deep #3 · the modes
two search calls, three result shapes. pick by question type.
Zep ships graph.search and memory.search_sessions as the two public read paths. They hit different indexes, return different shapes, and answer different questions. The wrong one looks broken on the right query.
mode 01
graph search
Vector search over fact embeddings. Returns ranked edges. Fast, semantic, no graph traversal. Best when the question is "what do you know about X," not "how is X connected."
use when: you want recall over the whole graph and you trust embedding similarity.
mode 02
graph search · hybrid
Vector recall, then graph expansion (1 or 2 hops from each hit), then a cross-encoder reranker. The slowest path. The most accurate. Cite chains come back attached.
use when: the agent will cite the fact to a user. you need provenance, not just relevance.
mode 03
memory search sessions
Conversation-scoped recall. Searches only within the threads tied to a session_id. Returns message-level chunks plus a context summary the runtime can paste.
use when: the user is in a conversation and you want "what did we talk about last week" rather than "what is true."
→ docs: help.getzep.com/searching-the-graph. Hybrid uses Cohere rerank by default in Zep Cloud, swappable to bge-reranker in self-host. The choice between modes 01 and 02 is usually a latency budget call, not a quality call.
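The hybrid path in mode 02 can be sketched as three stages over toy data: vector recall, a graph hop from each hit, then a rerank. Nothing here is Zep's implementation. The two-dimensional embeddings, the `neighbors` adjacency map, and the keyword-overlap `rerank` function standing in for a cross-encoder are all assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query_vec, edges, neighbors, rerank, k=2, hops=1):
    """edges: id -> (fact, embedding). neighbors: id -> list of adjacent edge ids."""
    # 1. vector recall: top-k edges by embedding similarity
    recalled = sorted(
        edges, key=lambda eid: cosine(query_vec, edges[eid][1]), reverse=True
    )[:k]
    # 2. graph expansion: pull in edges within `hops` of each hit
    candidates = set(recalled)
    frontier = set(recalled)
    for _ in range(hops):
        frontier = {n for eid in frontier for n in neighbors.get(eid, [])} - candidates
        candidates |= frontier
    # 3. rerank: score every candidate fact against the question
    return sorted((edges[eid][0] for eid in candidates), key=rerank, reverse=True)


edges = {
    "e1": ("Ray works at Globex", [1.0, 0.0]),
    "e2": ("Globex is based in SF", [0.0, 1.0]),
    "e3": ("VCN meets wednesdays", [0.1, 0.9]),
}
neighbors = {"e1": ["e2"]}
question_terms = {"where", "globex", "based"}
rerank = lambda fact: len(set(fact.lower().split()) & question_terms)
results = hybrid_search([1.0, 0.0], edges, neighbors, rerank, k=1)
```

With `k=1`, plain vector recall would return only the first fact. The hop is what surfaces the second one, and the rerank puts it on top.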
● appendix a · zep deep #4 · the deployment
zep cloud wraps graphiti. graphiti standalone is the engine alone.
good for · shipping fast, sharing across machines, dashboard introspection
option b
graphiti standalone
what · the open-source graph engine. neo4j or kuzu as backend. no UI, no reranker UI
sdk · graphiti_core.Graphiti(uri, user, password)
storage · your neo4j or your local kuzu file
cost · your llm tokens, your db host
latency · local kuzu is sub-ms reads; neo4j is your choice of host
good for · byo-everything stacks, on-prem requirements, byo reranker
the same EntityEdge model. Both paths use the dataclass on slide 30. Zep Cloud is Graphiti plus a hosted layer. Standalone is Graphiti without the layer. Cloud reads at scale benefit from the project scoping primitives (multi-tenant graph isolation), which standalone leaves to you.
If you are building a single-agent personal assistant, standalone over kuzu is the lightest path. Five minutes to running. Zero hosted dependencies. The graph file lives next to your repo.
If you are building a multi-user product, Zep Cloud earns its keep on the auth + project isolation + dashboard alone. The reranker is the second reason. Self-hosting a reranker is a separate gpu line item.
→ standalone walk-through at /lab/graphiti-standalone (kuzu backend, ten-minute scratch graph, no hosted deps). Repo: github.com/getzep/graphiti. Cloud docs: help.getzep.com.
● appendix a · letta deep #1 · the loop
letta is an agent runtime, not a memory library.
Letta's three-tier memory model. Core (always-on), Recall (conversation history), Archival (vector store). Maps to what production agents actually ship versus what builders think they ship.
verbatim · letta.com/blog/agent-memory + POSITIONING.md
letta.agent.step() · the inner loop
01 · receive · message lands on the agent's input queue
02 · compose · system prompt + memory blocks + recall hits
03 · decide · llm picks tool call(s) or a direct response
04 · execute · tools run, results land on the message log
05 · write · agent may edit core blocks via tool calls
06 · respond · assistant message appended, persisted, returned
why the runtime owns memory. A library asks "where do I store this fact." A runtime asks "when in the loop should the model see this fact." Letta is the second question. Steps 02 and 04 are the answers. Core blocks ride in the prompt every turn. Recall lookups happen only when a tool calls for them. The model never decides where memory lives, only when to write.
Step 05 is the load-bearing one. The agent can rewrite its own persona or human block mid-turn via a tool call. The new block ships in the prompt on the next step. This is how Letta's "agent that learns about you" demos work. The runtime persists the edit between processes.
The OS analogy from the MemGPT paper is exact. Core memory is RAM. Recall is the page file. Archival is disk. The agent is the operating system scheduling reads and writes against its own context window.
→ blog post: letta.com/blog/letta-v1-agent. Original paper: Packer, Wooders, Lin et al. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. Loop reference: letta/agent.py::step (open source under github.com/letta-ai/letta).
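The six-step loop can be sketched as a toy runtime. This is not Letta's `step` implementation: `ToyAgent`, the dict-shaped action, and the `decide` callback standing in for the LLM call are hypothetical, and persistence between processes is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    blocks: dict = field(default_factory=dict)  # core memory: label -> value
    log: list = field(default_factory=list)     # recall: (role, content) pairs

    def compose(self, message):
        """Core blocks ride in the prompt on every turn."""
        core = "\n".join(
            f"<{label}>{value}</{label}>" for label, value in self.blocks.items()
        )
        return f"{core}\nuser: {message}"

    def step(self, message, decide):
        self.log.append(("user", message))               # 01 receive
        prompt = self.compose(message)                   # 02 compose
        action = decide(prompt)                          # 03 decide (stub for the llm)
        if action.get("tool") == "core_memory_replace":  # 04 execute
            self.blocks[action["label"]] = action["value"]   # 05 write
            self.log.append(("tool", f"updated {action['label']}"))
        reply = action.get("reply", "")
        self.log.append(("assistant", reply))            # 06 respond
        return reply
```

The edit made in step 05 is only visible to the model on the next call to `step`, when `compose` rebuilds the prompt from the updated blocks.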
● appendix a · letta deep #2 · the layers
core is a list of named blocks. the agent rewrites them.
Core memory is not a freeform string. It is a list of labeled blocks. Each block is a chunk the agent can read and overwrite by label. Two come default. You can add as many as you need.
human
What the agent knows about the user. Edited by the agent during turns. Default size cap ~2KB. The "remember that I prefer X" landing zone.
persona
What the agent thinks it is. Self-description, tone, role. The agent can rewrite this too. The "I am a helpful coding assistant who keeps it terse" block.
custom · your label
Arbitrary keyed slots. project_context, open_threads, relationship_status. Up to you. Each one is a string the model owns.
letta · block mutation · realistic call · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# update one block on a live agent. the next turn sees the new value.
client.agents.blocks.modify(
    agent_id=agent.id,
    block_label="human",
    value="Ray runs VCN at Frontier Tower. Prefers jade-teal. Ships fast.",
)

# add a brand new keyed block at runtime.
client.agents.blocks.create(
    agent_id=agent.id,
    label="open_threads",
    value="VCN-33 deck due Mon. Awaiting Daniel quote.",
    limit=2000,
)
why blocks instead of one freeform string. Three reasons. One, the model can mutate one block without nuking the rest, which keeps edits surgical. Two, blocks are diffable, so the dashboard can show "what changed in core this turn." Three, blocks are inspectable from outside the agent, so ops can read the persona without a tool call.
The cost is the cap. Each block has a limit in characters. Hit the cap and the runtime refuses the write. This is the design pressure that pushes cold facts down to the Archival tier, where there is no cap.
→ reference: docs.letta.com/concepts/memory-blocks. SDK methods agents.blocks.modify and agents.blocks.create per the v1 Python client. The dashboard at localhost:8283 renders blocks live.
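The "labeled blocks with a cap" idea can be sketched as a small class. This is a toy, not Letta's block store: `CoreMemory`, `BlockLimitError`, and the choice to measure the cap in characters are assumptions of the sketch.

```python
class BlockLimitError(ValueError):
    """Raised when a write would push a block past its cap."""


class CoreMemory:
    def __init__(self):
        self._blocks = {}  # label -> (value, limit)

    def create(self, label, value="", limit=2000):
        self._check(label, value, limit)
        self._blocks[label] = (value, limit)

    def modify(self, label, value):
        """Overwrite one block by label; the other blocks are untouched."""
        _, limit = self._blocks[label]
        self._check(label, value, limit)
        self._blocks[label] = (value, limit)

    def render(self):
        """What would ride in the prompt on every turn."""
        return "\n".join(
            f"<{label}>\n{value}\n</{label}>"
            for label, (value, _) in self._blocks.items()
        )

    @staticmethod
    def _check(label, value, limit):
        if len(value) > limit:
            raise BlockLimitError(f"{label}: {len(value)} chars exceeds limit {limit}")
```

A refused write leaves the previous value in place, so the agent has to shorten the block or move the detail to a tier without a cap.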
● appendix a · letta deep #3 · the substrate
archival is a vector store. letta hides the backend.
The Archival tier on slide 10 is not magic. It is a vector index. Letta abstracts the backend so the agent code does not change when you swap pgvector for sqlite-vec for Chroma.
backend 02 · embedded, single-process · sqlite-vec · one file. zero ops. dev velocity.
backend 03 · hosted or self-hosted · chroma · when you already run chroma elsewhere.
the tradeoff that matters. Letta-managed storage gets you to a working agent in minutes. The cost is that you do not own the schema. Migrating a million archived passages between Letta backends is a Letta-supported flow, not a direct DB dump.
If your team already runs a vector DB for RAG, the BYO config keeps the agent's archival in the same store. One index, one ops surface. The agent gets archival_memory_search() as a tool. The tool hits your existing index.
The escape hatch is real. Every Archival row carries an opaque metadata blob. You can write rows from outside Letta and the agent will retrieve them through the same tool call. Useful for seeding an agent with a knowledge base it never lived through.
letta · archival backend config · apache 2.0
# server-side: ~/.letta/config.toml
[archival_storage]
type = "postgres"
uri = "postgresql://localhost:5432/letta"
# or, for a one-file local dev setup, use this table instead of the one above
# (a TOML file cannot define [archival_storage] twice):
# [archival_storage]
# type = "sqlite-vec"
# path = "./letta_archival.db"
# the agent code does NOT change. archival_memory_search() works either way.
→ reference: docs.letta.com/server/configuration. Source: letta/orm/archival_passage.py. The cloud default is pgvector. Local letta server defaults to sqlite-vec.
● appendix a · letta deep #4 · the surface
one runtime. three model lanes. three deployment shapes.
Letta does not bind to a single provider. The model field on an agent is a routing string. Change the string, the loop on slide 34 runs against a different brain. The blocks, the archival, the recall, do not move.
lane a · proprietary openai · openai/gpt-4o-mini · openai key from env
lane b · proprietary anthropic · anthropic/claude-sonnet-4-6 · anthropic key from env
lane c · local open weights · ollama/llama-3.1-8b · ollama running on localhost:11434
letta · model switching at create-time · apache 2.0
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# same blocks. same archival. three different brains.
prod_agent = client.agents.create(
    name="ray_prod",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="anthropic/claude-sonnet-4-6",
    embedding="openai/text-embedding-3-small",
)
dev_agent = client.agents.create(
    name="ray_dev",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="openai/gpt-4o-mini",  # cheaper, faster, for dev loops
    embedding="openai/text-embedding-3-small",
)
offline_agent = client.agents.create(
    name="ray_offline",
    memory_blocks=[{"label": "human", "value": "Ray runs VCN."}],
    model="ollama/llama-3.1-8b",  # no network, runs on your laptop
    embedding="ollama/nomic-embed-text",
)
shape 01 · cloud
Letta hosts the server. You hit it over HTTPS. Auth, multi-tenant scoping, persistence, dashboard. The "I just want an agent" path.
shape 02 · docker
Self-host the server. docker compose up. Bring your own postgres. Full ownership of agents, blocks, archival.
shape 03 · embedded
Run the server on your own machine, next to the python script that calls it. Great for tests, demos, single-user laptop deploys. Start it from the cli: letta server --port 8283.
→ docs: docs.letta.com/models. Provider strings: openai/<model>, anthropic/<model>, ollama/<model>, letta/letta-free (free tier on letta cloud). Embedding strings follow the same pattern.
● appendix a · mem0 deep #1 · the layers
mem0 is three storage layers behind one method.
Mem0 wraps vector + graph + reranker behind a managed memory API, so a builder can add(), search(), and forget across sessions without picking a stack.
verbatim · mem0.ai/blog/state-of-ai-agent-memory-2026 + mem0ai/mem0 readme
layer 01 · vector
Default backend. Qdrant or Chroma local, pinecone hosted. Semantic recall over extracted memories. This is the layer the 3-line demo uses.
layer 02 · graph (optional)
Neo4j backend. Entity + relation extraction over your stored memories. Off by default. Switch on for relationship traversal and multi-hop facts.
layer 03 · reranker
Cross-encoder on top of layer 01 hits. Tightens the top-k. Adds ~80ms per query. Off by default in OSS, on by default in mem0 cloud.
the managed API hides this. A user of the 3-line drop-in on slide 13 never sees the layers. m.add() writes to vector. m.search() reads from vector. Reranker is configured at construction. The graph layer is dormant until enabled.
The graph layer is the part most users miss. Once you flip graph_store on, every add() runs entity extraction in addition to embedding. Searches can then ask "what does mem0 know about Ray and what entities relate to him" rather than just "which stored messages sit near 'Ray'."
mem0 · expose the layers via from_config · apache 2.0
from mem0 import Memory
# expose the substrate. flip on the graph layer.
config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "localhost", "port": 6333},
    },
    "graph_store": {
        "provider": "neo4j",
        "config": {"url": "bolt://localhost:7687", "username": "neo4j", "password": "..."},
    },
    "llm": {"provider": "anthropic", "config": {"model": "claude-sonnet-4-6"}},
    "embedder": {"provider": "openai", "config": {"model": "text-embedding-3-small"}},
}
m = Memory.from_config(config)
# add() now writes to BOTH vector and graph.
m.add("Ray prefers jade-teal. He runs VCN at Frontier Tower.", user_id="ray")
# ask for the graph traversal alongside the vector hits.
# the flag name varies by mem0 version; confirm it against docs.mem0.ai.
hits = m.search("what color does Ray prefer", user_id="ray", filters={"include_graph": True})
→ docs: docs.mem0.ai. Source: github.com/mem0ai/mem0. Config shape per Memory.from_config. The default Memory() constructor uses local Qdrant + OpenAI embeddings + no graph + no reranker.
● appendix a · mem0 deep #2 · the controls
scope by category. tune the extractor. stop mem0 from saving noise.
The 3-line demo is the floor. Production mem0 needs scoping (which memories belong to which user, session, or agent) and filtering (which messages are worth keeping at all). Both are first-class.
scope 01 · user_id
"Memories belonging to this human." The default scope. Cross-session. Survives forever unless deleted.
scope 02 · agent_id
"Memories belonging to this agent persona." Useful when one user has multiple specialized assistants.
scope 03 · run_id
"Memories scoped to a single session." Ephemeral. Lets you keep working context without polluting long-term store.
mem0 · scoping + custom extractor · apache 2.0
from mem0 import Memory
# tune what counts as a fact worth keeping.
custom_extractor = """
You are extracting memories for a VCN host. Keep:
- explicit preferences ("I prefer X")
- commitments and dates
- relationships between named people and orgs
Drop: small talk, weather, transient logistics.
Return JSON: {"facts": ["fact 1", "fact 2"]}
"""
# from_config is the constructor that accepts a plain dict.
m = Memory.from_config({
    "llm": {"provider": "anthropic", "config": {"model": "claude-sonnet-4-6"}},
    "custom_fact_extraction_prompt": custom_extractor,
})
# scope by both user and session. add a category for later filtering.
m.add(
messages=[{"role": "user", "content": "I prefer jade-teal posters. VCN is on Mondays."}],
user_id="ray",
run_id="session_2026_05_11",
metadata={"category": "preferences", "source": "telegram"},
)
# search only inside one category.
hits = m.search("poster color", user_id="ray", filters={"category": "preferences"})
why custom prompts matter. The default extractor keeps a lot. Names, numbers, every preference signal. On a high-volume agent the store balloons. A custom prompt with a tight "keep / drop" rubric cuts the store by 60-80% in practice. The drop list is the load-bearing half.
why scope matters. Without user_id filtering you will recall the wrong user's memories. Without run_id separation, a one-off session leaves long-term residue you did not intend. Both are cheap to add. Both are easy to forget.
→ docs: docs.mem0.ai/core-concepts/memory-types. The custom_fact_extraction_prompt field lives at the top level of the Memory() config. The metadata dict is stored on the row and queryable via filters.
● appendix a · mem0 deep #3 · the conflict
new fact contradicts old fact. mem0 mutates the store.
Mem0 calls itself self-improving memory. The practical version: every add() runs a conflict pass against existing memories. New facts can update, supersede, or be merged with prior ones. The store mutates in place rather than appending.
strategy 01 · update
edit in place
New fact partially overlaps old fact. Mem0 rewrites the row, keeps the same id. previous_value is recorded in history.
Example. Old: "Ray works at Sandbox VR." New: "Ray is store manager at Sandbox VR SF Flagship." Result: one row, the longer one wins.
strategy 02 · supersede
mark old, write new
New fact directly contradicts old. Old row goes into history. New row takes the live slot. The agent never sees the stale fact on search.
Example. Old: "Ray lives in Brooklyn." New: "Ray moved to SF." Result: SF is live; Brooklyn is in history.
m.add(...) · the conflict pass
01 · extract · llm pulls candidate facts from the message
02 · search · vector search for semantically near existing memories (user-scoped)
05 · history · log the change for /history endpoint and audit
mem0 · observe the conflict resolution · apache 2.0
from mem0 import Memory
m = Memory()
# t=0. write the original.
m.add("I live in Brooklyn and work remote.", user_id="ray")
# t=1. write the update. mem0 will classify as supersede.
m.add("I moved to SF for the Sandbox VR job.", user_id="ray")
# the store now has SF, not Brooklyn.
hits = m.search("where does Ray live", user_id="ray")
# the audit trail exists. ask for it per memory id.
for h in hits["results"]:
    print(h["memory"], "->", m.history(memory_id=h["id"]))
→ reference: mem0.ai/blog/state-of-ai-agent-memory-2026. The classify step is the load-bearing one. The LLM picks one of ADD / UPDATE / DELETE / NONE per candidate, judged against each neighboring memory. Verdict prompt lives in mem0/memory/main.py.
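The conflict pass can be sketched in plain Python. This is an illustration of the search, classify, apply, and log steps, not mem0's code: `Store`, `same_subject`, and `toy_classify` are hypothetical stand-ins, the extract step is assumed to have already produced `candidate`, and in a real system an LLM makes the verdict call.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    rows: dict = field(default_factory=dict)     # memory id -> live fact
    history: list = field(default_factory=list)  # audit log of every verdict
    next_id: int = 0


def same_subject(store, candidate):
    """Toy neighbor search: match on the first word instead of embeddings."""
    subject = candidate.split()[0]
    for mem_id, text in store.rows.items():
        if text.split()[0] == subject:
            return mem_id
    return None


def toy_classify(candidate, neighbor):
    """Hypothetical verdict function standing in for the LLM call."""
    if neighbor is None:
        return "ADD"
    if candidate == neighbor:
        return "NONE"
    return "UPDATE"


def conflict_pass(store, candidate, find_neighbor=same_subject, classify=toy_classify):
    mem_id = find_neighbor(store, candidate)   # search
    old = store.rows.get(mem_id)
    verdict = classify(candidate, old)         # classify
    if verdict == "ADD":                       # apply
        mem_id = store.next_id
        store.next_id += 1
        store.rows[mem_id] = candidate
    elif verdict == "UPDATE":
        store.rows[mem_id] = candidate
    elif verdict == "DELETE":
        store.rows.pop(mem_id, None)
    store.history.append({"id": mem_id, "verdict": verdict, "old": old, "new": candidate})
    return verdict
```

Running the Brooklyn then SF facts through it leaves one live row and two history entries, which is the mutate-in-place behavior described above.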
● appendix a · mem0 deep #4 · the production config
three lines is the demo. thirty lines is production.
The three-line drop-in is real. It is also a development setup. A mem0 deployment that handles thousands of users, real latency budgets, and an audit trail has thirty lines of config and policy wrapped around the same three verbs.
pattern 01 · pagination on search
Default limit=10. Pass limit and offset for stable scrollthrough. The reranker only sees the top-k, not the page.
pattern 02 · batch add
Multiple messages=[...] entries in one add() call. Extraction happens once over the whole batch. Fewer llm round-trips.
pattern 03 · eviction policy
Mem0 does not auto-prune. Scheduled job calls delete_all by filters for stale categories. Time decay is your problem.
pattern 04 · observability hook
Pass telemetry in the config to forward every add and search to your own logger. Latency and verdict per call.
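Pattern 03 leaves time decay to you. A minimal sketch of the selection half of that scheduled job, assuming each row is a dict with an `id` and a `metadata` dict carrying `category` and a unix `ts` (the shape is an assumption, chosen to match metadata you write yourself at add time). `stale_memory_ids` is a hypothetical helper, not a mem0 function.

```python
import time

SECONDS_PER_DAY = 86_400


def stale_memory_ids(rows, now, max_age_days, categories):
    """Return ids of rows in the given categories older than max_age_days."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return [
        row["id"]
        for row in rows
        if row["metadata"].get("category") in categories
        # rows without a timestamp are treated as fresh, so they are never evicted by accident
        and row["metadata"].get("ts", now) < cutoff
    ]


if __name__ == "__main__":
    now = int(time.time())
    rows = [{"id": "a", "metadata": {"category": "logistics", "ts": now - 100 * SECONDS_PER_DAY}}]
    print(stale_memory_ids(rows, now, max_age_days=30, categories={"logistics"}))
```

The returned ids would then go to whatever delete call your store exposes. Keeping selection separate from deletion makes the policy easy to dry-run.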
mem0 · realistic production config (~30 lines) · apache 2.0
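from mem0 import Memory
import logging
import time
logger = logging.getLogger("mem0.ops")
# placeholder rubric. substitute your own keep / drop prompt.
OPS_EXTRACTOR_PROMPT = "Keep explicit preferences, commitments, and dates. Drop small talk."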
config = {
    # backends. all self-hosted. no cloud round-trip.
    "vector_store": {
        "provider": "qdrant",
        "config": {"host": "qdrant.internal", "port": 6333, "collection_name": "vcn_memories"},
    },
    "graph_store": {
        "provider": "neo4j",
        "config": {"url": "bolt://neo4j.internal:7687", "username": "neo4j", "password": "..."},
    },
    "llm": {
        "provider": "anthropic",
        "config": {"model": "claude-sonnet-4-6", "temperature": 0.0, "max_tokens": 1024},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    # tune extraction. drop the chatter, keep the facts.
    "custom_fact_extraction_prompt": OPS_EXTRACTOR_PROMPT,
    # keep the change history on disk for the audit trail.
    "history_db_path": "/var/lib/mem0/history.db",
    "version": "v1.1",
}
m = Memory.from_config(config)
# wrap the surface for logging.
def add_with_audit(messages, user_id, run_id, category):
    res = m.add(
        messages=messages,
        user_id=user_id,
        run_id=run_id,
        metadata={"category": category, "ts": int(time.time())},
    )
    logger.info("mem0.add", extra={"verdicts": res, "user_id": user_id})
    return res
operational simplicity is the pitch. the three lines on slide 13 are honest about the floor. the thirty lines here are honest about the ceiling. the gap is where production lives.
→ reference: github.com/mem0ai/mem0 readme, docs.mem0.ai/integrations. The from_config constructor is the only path that exposes the full surface. Default Memory() is the friendly cousin. Both call the same internals.
● appendix · wing I deep · 1 of 4 · the memory tool internals
THE MEMORY TOOL IS A FILESYSTEM CALLABLE.
Anthropic's memory tool is not a vector store, not a database, not a service. It is a callable the model invokes inside its tool loop, backed by a markdown filesystem that the operator hosts. The agent issues `view`, `create`, `str_replace`, `insert`, `delete`, and `rename` against paths under `/memories`.
The interesting design choice is the one missing thing. There is no scoring, no eviction policy, no semantic indexing. The agent decides when to call it. The operator supplies a system prompt that tells the model when it is worth remembering something, and the model writes a markdown file. On the next turn the model lists `/memories`, reads the relevant file, and pulls it into the working set.
This is the opposite of the Mem0 pattern. Mem0 listens to messages and silently decides what is fact-worthy. The memory tool makes the write decision explicit and visible inside the trace.
→ anthropic.com / engineering / effective-context-engineering-for-ai-agents. the memory tool ships as a public beta in the messages API; storage is operator-managed (Anthropic does not host the markdown). the agent self-curates without any operator policy code.
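Because storage is operator-managed, someone has to write the handler that turns those verbs into file operations. A minimal sketch, assuming each tool call arrives as a dict with `command` and `path` plus `file_text`, `old_str`, and `new_str` where relevant (field names are assumptions here; check the tool reference). `MemoryDir` is a hypothetical class, and `insert` and `rename` are left out for brevity.

```python
import tempfile
from pathlib import Path


class MemoryDir:
    """Maps /memories/... paths from tool calls onto a local directory."""

    def __init__(self, root: Path):
        self.root = root.resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, path: str) -> Path:
        if path != "/memories" and not path.startswith("/memories/"):
            raise ValueError(f"path outside /memories: {path}")
        target = (self.root / path[len("/memories"):].lstrip("/")).resolve()
        # reject ../ tricks that would escape the memory root
        if target != self.root and self.root not in target.parents:
            raise ValueError(f"path escapes memory root: {path}")
        return target

    def handle(self, call: dict) -> str:
        cmd = call["command"]
        target = self._resolve(call["path"])
        if cmd == "view":
            if target.is_dir():
                return "\n".join(sorted(p.name for p in target.iterdir()))
            return target.read_text()
        if cmd == "create":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(call["file_text"])
            return f"created {call['path']}"
        if cmd == "str_replace":
            text = target.read_text()
            if text.count(call["old_str"]) != 1:
                raise ValueError("old_str must match exactly once")
            target.write_text(text.replace(call["old_str"], call["new_str"]))
            return f"edited {call['path']}"
        if cmd == "delete":
            target.unlink()
            return f"deleted {call['path']}"
        raise ValueError(f"unsupported command: {cmd}")
```

The path check is the part worth copying. The model chooses the paths, so the handler has to treat them as untrusted input.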
● appendix · wing I deep · 2 of 4 · compaction internals
AT 180K TOKENS THE HARNESS REWRITES THE TRANSCRIPT.
Claude Code's default compaction trigger sits near 180K tokens. When the working transcript crosses that line, the harness pauses, sends the entire transcript to a summarization pass, and resumes the agent loop with the summary in place of the raw history. The user sees a one-line notification. The agent sees a shorter context that still encodes the decisions it made.
What survives is opinionated. Decisions made, files touched, error messages worth remembering, the last few turns of reasoning verbatim. What dies is the bulk. The 50KB file dump from one `Read`. The eight `Grep` results that returned the same file path. The tool outputs the agent used once and never needed again.
Three knobs operators reach for. Lower the threshold to compact sooner (cheaper turns, more frequent summarization tax). Raise it to delay compaction (longer coherent windows, larger summarization spike). Disable compaction entirely and rely on the model's own context window plus manual hand-offs. Anthropic-managed agents expose this as `compaction_strategy` in the harness config.
threshold markers: 100K · safe, 180K · compact
compaction prompt · faithful reconstruction
You are summarizing a long agent transcript so the
agent can continue without losing the thread.
KEEP verbatim:
- the user's original task and any clarifications
- decisions the agent made and why
- file paths read or written
- the last 5 turns of reasoning
DROP:
- full file contents (cite path + range instead)
- tool result bodies > 2KB unless the agent
references them in later reasoning
- repeated Grep or Bash output
Output: a markdown brief under 4K tokens.
→ anthropic.com / engineering / effective-context-engineering-for-ai-agents and the public Claude Code docs at code.claude.com / docs / en / context. compaction is one of the four levers; the other three are tool-result clearing, sub-agent delegation, and the memory tool.
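The trigger-and-replace mechanic is small enough to sketch. This is an illustration under stated assumptions, not the Claude Code harness: `estimate_tokens` is a crude four-characters-per-token guess, `summarize` is any callable you supply (in practice an LLM call using a prompt like the one above), and `maybe_compact` is a hypothetical name.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate; assumes about 4 characters per token."""
    return len(text) // 4


def maybe_compact(turns, summarize, trigger_tokens=180_000, keep_last=5):
    """Swap older turns for one summary turn once the transcript crosses the trigger."""
    total = sum(estimate_tokens(t["content"]) for t in turns)
    if total < trigger_tokens or len(turns) <= keep_last:
        return turns  # under threshold: leave the transcript alone
    older, recent = turns[:-keep_last], turns[-keep_last:]
    summary_turn = {"role": "user", "content": summarize(older)}
    # the last few turns stay verbatim, matching the KEEP list above
    return [summary_turn] + recent
```

Lowering `trigger_tokens` is the "compact sooner" knob; raising `keep_last` trades summary savings for more verbatim reasoning.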
● appendix · wing I deep · 3 of 4 · tool-result clearing
THE AGENT CAN MARK A TOOL RESULT AS DONE.
The pattern is simple and underused. When the model finishes processing a tool result, it can flag it as no longer needed. The harness prunes the body from the active context on the next turn. The tool call itself stays in the transcript (so the trace is honest), but the bulk payload is gone.
The canonical case is `read_file` on a large source file. The model needs the contents once to reason about a fix. It writes the patch. The 50KB file body now contributes nothing except cost on every subsequent turn. Marking it cleared drops the cost to a few tokens of structural metadata.
Compare this to compaction. Compaction is operator-policy and triggers at a threshold. Tool-result clearing is agent-policy and triggers turn by turn. The two compose. A long-running agent with both enabled stays under threshold for far longer because the working set never accumulates dead weight in the first place.
before · turn 17 context
user_task · fix the typed-error bug in kernel/capabilities/twitter.py
tool_result 1 · read_file kernel/capabilities/twitter.py · 47.3 KB body
tool_result 2 · grep _classify_error · 11 matches across 4 files
tool_result 3 · read_file kernel/_base.py · 8.1 KB body
→ anthropic.com / engineering / effective-context-engineering-for-ai-agents names tool-result clearing alongside compaction and sub-agent delegation. the agent-flagged form above is a harness pattern; the messages API's context-editing beta instead clears old tool results automatically once operator-set thresholds are crossed.
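The pruning step can be sketched as a pure function over the context list. The item shape (`type`, `id`, `tool`, `arg`, `body`) and the name `clear_tool_results` are assumptions for illustration; the point is that the call stays visible while the payload shrinks to a stub.

```python
def clear_tool_results(context, cleared_ids):
    """Return a new context where cleared tool results keep metadata but drop the body."""
    pruned = []
    for item in context:
        if item["type"] == "tool_result" and item["id"] in cleared_ids:
            stub = f"[cleared: {item['tool']} {item['arg']}, {len(item['body'])} bytes]"
            pruned.append({**item, "body": stub})  # copy, so the stored trace is untouched
        else:
            pruned.append(item)
    return pruned
```

Applied to the turn 17 view above, clearing tool_result 1 would replace the 47.3 KB body with a one-line stub while results 2 and 3 stay intact.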
● appendix · wing I deep · 4 of 4 · sub-agent delegation
SPAWN A SUB-AGENT. KEEP THE NOISE OUT OF THE PARENT.
Anthropic's harness essay frames sub-agent delegation as a memory hygiene primitive, not a parallelism primitive. The parent agent calls a sub-agent for a bounded subtask. The sub-agent gets a fresh context window, does the work, and returns one result. The parent never sees the intermediate dump.
Think of it as scoped variables for context. The exploration step that requires reading twelve files and running four greps happens inside the sub-agent. The parent receives a paragraph that says "looked at twelve files, the relevant function is at this path, here is its signature." That paragraph is now the only thing the parent has to carry.
The cost trade is exact. You pay for the sub-agent's full context separately, then throw it away. The parent's context stays compact. For deep research and refactors this almost always wins; the parent's context is the precious resource because it carries the user's intent.
parent / child context flow
P0 · user: "find the regression in the auth middleware."
P1 · agent: I need a deep audit. Spawn `auth-auditor`.
C1 · auth-auditor: read 12 files in middleware/, grep for session calls, inspect error logs
C2 · auth-auditor: ~85K tokens consumed across 14 turns inside its OWN context window
C3 · auth-auditor returns: "regression in session.refresh(), line 47, introduced 2026-04-22 commit a3f1, fix by reverting the conditional."
P2 · agent receives the 1-paragraph summary. Parent context still at < 20K.
net effect: parent context cost ~1K tokens for the entire audit. without delegation: parent absorbs all 85K and competes with the user's intent for working-set space.
→ anthropic.com / engineering / effective-harnesses-for-long-running-agents. the framing is explicit: delegation is "memory hygiene by quarantine." Claude Code's `Agent` tool is the production reference; the Managed Agents SDK exposes the same primitive.
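The "scoped variables" analogy maps directly onto code. A minimal sketch, with `delegate`, `run_tools`, and `summarize` as hypothetical names: the child context is a local variable that goes out of scope, and only the summary crosses back into the parent.

```python
def delegate(parent_context, task, run_tools, summarize):
    """Run a bounded subtask in a fresh context; hand the parent one result."""
    child_context = [task]                 # fresh window: nothing inherited from the parent
    child_context.extend(run_tools(task))  # the noisy intermediate output lands here
    result = summarize(child_context)
    parent_context.append(result)          # the parent carries only the summary
    return result                          # child_context is discarded on return
```

The cost trade shows up in the shapes: `run_tools` can return as much as it likes, and `parent_context` still grows by exactly one item.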
● appendix · wing IV deep · 1 of 4 · the auto-memory filesystem
THE FILESYSTEM IS THE REAL CONTRACT.
Claude Code v2.1.59 and later ship auto-memory default-on. The agent reads two files on every session start: `CLAUDE.md` from the project root, which the human edits, and `MEMORY.md` from `~/.claude/projects/<encoded-cwd>/memory/`, which the agent writes.
The encoded path is the current working directory with slashes and colons replaced by dashes. For this very deck-build session that is `C--Users-jtole-Documents-2026-life`. Each memory entry is its own markdown file; `MEMORY.md` is just the index. When the agent learns something worth keeping, it writes a new file (e.g. `feedback_no_dashes.md`) and appends one line to the index.
The hook contract is asymmetric on purpose. The agent never writes the human's CLAUDE.md. The human can edit either file but typically does not touch MEMORY.md. The shared filing cabinet has two drawers and two locks.
~/.claude/projects/<encoded-cwd>/memory/ · real layout
→ code.claude.com / docs / en / memory. each entry carries frontmatter (name, description, type) so the agent can decide whether to load it. typical entry types: `user`, `feedback`, `project`, `reference`. inspect any live MEMORY.md to see the structure.
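The encoding rule as stated (slashes and colons become dashes) is a one-line substitution. A sketch assuming backslashes count as slashes on Windows; `encode_cwd` is a hypothetical helper, the input path is inferred from the encoded example above, and the real harness may handle more characters than this.

```python
import re


def encode_cwd(cwd: str) -> str:
    """Replace path separators and colons with dashes, per the rule described above."""
    return re.sub(r"[\\/:]", "-", cwd)


print(encode_cwd(r"C:\Users\jtole\Documents\2026-life"))
print(encode_cwd("/home/ray/projects/vcn"))
```

Existing dashes in directory names pass through unchanged, so the mapping is not reversible in general.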
● appendix · wing IV deep · 2 of 4 · the hooks system
HOOKS ENFORCE. MEMORY ONLY ASKS.
Claude Code's settings.json defines lifecycle hooks: Bash or Node commands the harness runs on specific events. These are policy that the agent cannot opt out of. If the agent tries to end a session with uncommitted work, the `Stop` hook fires `atomic-commit-gate.js` and blocks the close.
The list of events you actually use is short.
SessionStart · crash recovery, baselines
UserPromptSubmit · prompt expansion
PreToolUse · gate before a tool runs
PostToolUse · react after a tool ran
Notification · external signal handler
Stop · final pass before session ends
The clean separation is the point. Memory says please. A line in CLAUDE.md that reads "never run parallel Telegram connections" is a hope that the agent reads and remembers it. A `PreToolUse` matcher on `Bash` that greps for `telethon.*parallel` and exits with code 2 (the blocking exit code) is a wall.
settings.json · the atomic-commit-gate hook
{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "node ~/.claude/hooks/atomic-commit-gate.js"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "node ~/.claude/hooks/session-writes-tracker.js"
          }
        ]
      }
    ]
  }
}
# Stop hook exits with code 2 (blocking) if `git status --porcelain` has output.
# The session refuses to end. The user sees the diff and decides.
memory vs hooks: a memory entry that says "commit before ending" is a soft norm the agent may forget under context pressure. The `atomic-commit-gate.js` hook makes it a runtime invariant. Both exist in this repo; the hook is what catches the mistakes the memory misses.
→ code.claude.com / docs / en / hooks. real hook example from `~/.claude/settings.json` in the life repo. enforcement code beats prose policy when stakes are high. use both: prose to explain why, hook to make sure.
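The gate logic itself is a few lines. This is not the contents of `atomic-commit-gate.js`, which the deck does not show; it is a Python sketch of the same idea, assuming exit code 2 is the blocking signal for a `Stop` hook. `gate_exit_code` and `main` are hypothetical names.

```python
import subprocess
import sys


def gate_exit_code(porcelain_output: str) -> int:
    """2 blocks the stop when the working tree is dirty; 0 lets the session end."""
    return 2 if porcelain_output.strip() else 0


def main() -> int:
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=False,
    )
    code = gate_exit_code(result.stdout)
    if code:
        # stderr is what gets surfaced when the hook blocks
        print("uncommitted changes:\n" + result.stdout, file=sys.stderr)
    return code
```

To use it as a hook script, end the file with `sys.exit(main())` and point the hook's `command` at it. Keeping the decision in `gate_exit_code` means the policy can be unit-tested without a git repo.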
● appendix · wing IV deep · 3 of 4 · the rules-file comparator
FIVE FILES. FIVE PARENTS. ONE PATTERN.
| file | harness | scope | who writes | agent self-edits | format |
|---|---|---|---|---|---|
| CLAUDE.md + MEMORY.md | Claude Code | user (~/.claude) + project (./CLAUDE.md) | human for CLAUDE.md, agent for MEMORY.md | YES (memory only) | plain markdown, no schema required |
| .cursorrules + .cursor/rules/ | Cursor | project root (and per-file globs in .cursor/rules/) | human | NO | plain markdown, optional `globs:` frontmatter |
| AGENTS.md (codex flavor) | Codex CLI, OpenAI agents | project root, nested by directory | human | NO | plain markdown, free-form sections |
| .windsurfrules (global + workspace) | Codeium Windsurf, Cascade | global (~/.codeium) + workspace (./.windsurfrules) | human | NO | plain markdown, ~6K char soft cap |
| .aider.conf.yml + CONVENTIONS.md | Aider | project root | human | NO | yaml config + free-form markdown convention file |
→ the honest read: four of the five are static markdown the human edits. Claude Code is the only one where the agent curates a sibling file the operator typically does not touch. That is the only structural difference. Everything else is naming, scoping, and which directory the harness scans.
● appendix · wing IV deep · 4 of 4 · why procedural quietly won
A 200 LINE CLAUDE.MD beats a vector store for most teams shipping today.
argument 1 · token cost
retrieval is free when memory lives in the prompt.
A 200 line CLAUDE.md is roughly 3K tokens. The agent pays for it once per session, then very little per turn while the prompt cache holds. A vector store retrieval pays embedding cost on write, network latency on read, and reranker cost per query. For procedural facts (style rules, conventions, do-not-touch lists) the in-prompt cost is lower and the retrieval latency is zero.
argument 2 · eval cost
you can A/B a markdown diff.
Change one rule in CLAUDE.md. Re-run the eval suite. See whether the metric moved. Procedural memory is a unit of change with a clear before and after. Vector stores reshape under every embedding model upgrade and every chunking choice. You cannot diff a vector store the way you can diff a markdown file.
argument 3 · team coherence
rules are pull requests. vector stores are not.
Procedural memory lives in the same git repo as the code. Adding a new convention is a PR with a diff and a reviewer. Six months in, you can `git log CLAUDE.md` and see why every rule landed. A team-shared memory layer that does not show up in code review is a layer the team does not actually own.
"the rules-files pattern in general. .cursorrules, CLAUDE.md, .windsurfrules, AGENTS.md. the DIY procedural memory that quietly won for most teams. this is the highest-leverage memory layer for builders shipping today."
vcn-33 research notes · 2026-05-04
→ this is the case the next two years of agent tooling will either defend or kill. the bet here: even when frameworks like Letta and Mem0 win on persistent state, the procedural layer stays in markdown next to the code. it is too cheap, too diff-friendly, too human-legible to lose.
● s50 · eval · what "memory works" even means
YOU CAN'T A/B TEST REMEMBERING.
Every memory framework page claims accuracy gains. None of them mean the same thing by accuracy. The community settled on LOCOMO (Long-term Conversational Memory) as the closest thing to a yardstick: a 600-turn synthetic dialogue corpus across 10 sessions per persona, annotated for what the agent should still know on turn 600 that was said on turn 17.
The scores are low. Frontier models clear ~50 to 60 percent on single-hop factual recall and crater on multi-hop reasoning over the same dialogue.
what LOCOMO actually scores
single-hop · did you remember a fact stated once N turns ago. the easy axis.
multi-hop · combine two facts from two different sessions. where everything falls apart.
temporal · when was it true. which session. before or after the user changed their mind. zep's home turf.
open-domain · generate a free-text answer grounded only in remembered context. graders disagree.
adversarial · the dialogue contains contradictions. resolve them. most systems pick the most recent. wrong half the time.
"did my agent get better at remembering" is a research question, not a unit test.
→ LOCOMO paper: arXiv:2402.17753 · referenced by every 2026 memory framework as the leaderboard nobody quite trusts.
● s51 · eval · the LOCOMO setup, in five steps
FIVE STEPS. THREE HOURS. NOT FREE.
01 · pull corpus · download the LOCOMO dataset (10 personas, 600 turns each, ~70k tokens per persona). split into sessions and probes.
02 · ingest · feed every session to the memory layer under test. mem0 .add(), letta core block writes, zep add_messages(). one persona per agent instance, no leakage.
03 · probe · replay the 1000+ probe questions. each one targets a specific session and turn. record the agent's answer plus what it retrieved.
04 · grade · two graders. exact match for factual probes, an LLM judge for open-domain. report agreement. low agreement means the judge prompt is wrong, not the agent.
05 · slice · break the score by axis (single-hop, multi-hop, temporal, open-domain, adversarial). report all five. one aggregate number is a lie.
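Step 05 is a group-by. A minimal sketch, assuming each graded probe is a dict with an `axis` label and a boolean `correct` (the record shape and the name `slice_scores` are assumptions, not part of any LOCOMO tooling).

```python
from collections import defaultdict


def slice_scores(results):
    """Per-axis accuracy from graded probes; never collapse to a single number."""
    totals = defaultdict(lambda: [0, 0])  # axis -> [hits, count]
    for r in results:
        totals[r["axis"]][0] += int(r["correct"])
        totals[r["axis"]][1] += 1
    return {axis: hits / count for axis, (hits, count) in totals.items()}
```

A run that scores 1.0 on single-hop and 0.0 on multi-hop averages to 0.5, which describes neither axis. That is the case for reporting the slices.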
cost · a single LOCOMO sweep across one framework runs ~$40 to $120 in API calls. four frameworks for a real comparison: ~$400. nobody expenses this for a blog post.
setup time · every framework has its own ingest dialect. wiring four of them to the same corpus is a two day job before you grade a single answer.
signal · three runs of the same eval against the same framework give three different scores. graders are noisy. you need N=5 to publish. five times the cost.
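The budget arithmetic is worth writing down once. A sketch using only the figures quoted above ($40 to $120 per sweep, four frameworks, N=5 repeats); `sweep_cost` is a hypothetical helper.

```python
def sweep_cost(per_run_low, per_run_high, frameworks, repeats):
    """Total API cost range in dollars for a repeated multi-framework sweep."""
    runs = frameworks * repeats
    return runs * per_run_low, runs * per_run_high


single_pass = sweep_cost(40, 120, frameworks=4, repeats=1)
publishable = sweep_cost(40, 120, frameworks=4, repeats=5)
print(single_pass, publishable)
```

One pass over four frameworks lands between $160 and $480, consistent with the ~$400 figure. At N=5 the same comparison runs $800 to $2,400.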
→ ray's lab scaffolds the run loop: /lab/locomo-eval · skeleton, not a finished comparison. you bring the budget.
● s52 · multi-agent · the unsolved problem
TWO AGENTS, ONE STORE. NOBODY HAS SHIPPED THIS.
Single-agent memory is a write-then-read problem. Multi-agent memory is a distributed-systems problem. Two agents working the same project, same user, same hour. Both write. Both read. The store has no clock that means anything to both of them.
The 2026 survey at arXiv:2603.07670 dedicates a section to this. The honest summary: "an emerging frontier." Translation: no production answers.
the canonical race condition
agent-a · reads "user prefers tabs."
agent-a · starts writing code with tabs.
agent-b · asks user. user says "actually spaces today."
agent-b · writes memory: "user prefers spaces."
agent-a · still indenting with tabs. has not re-read.
agent-b · writes spaces. PR is half tabs half spaces.
→ last-writer-wins is wrong. there are no last writers.
last-writer-wins · cheap. wrong. the agent who happened to write last sets policy until the next race. ships a lot in 2026 because it is one config line.
source-weighted · tag every memory with a trust tier. human messages outrank agent inferences. agent-a outranks agent-b for stylistic prefs in the code area. requires a schema nobody has.
CRDT merge · treat memories as a grow-only set. conflicts surface, never auto-resolve. a query returns "user said tabs at T1, spaces at T2." the caller picks. zep's temporal graph is the closest production analog.
consensus pass · a third agent reads the conflict and decides. slow. accurate. expensive per call. the pattern the Letta team is sketching internally for the v2 multi-agent runtime.
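The CRDT merge option is the easiest of the four to show in code. A minimal grow-only set sketch, assuming each memory is an immutable `(key, value, timestamp, writer)` tuple; the tuple layout and the names `merge` and `read` are illustrative, not any framework's API.

```python
def merge(replica_a: frozenset, replica_b: frozenset) -> frozenset:
    """Grow-only set merge: union. Commutative, associative, idempotent."""
    return replica_a | replica_b


def read(replica: frozenset, key: str) -> list:
    """Return every value ever written for a key, oldest first. The caller picks."""
    return sorted((ts, value, writer) for (k, value, ts, writer) in replica if k == key)


agent_a = frozenset({("indent", "tabs", 1, "agent-a")})
agent_b = frozenset({("indent", "spaces", 2, "agent-b")})
merged = merge(agent_a, agent_b)
print(read(merged, "indent"))
```

Nothing is overwritten, so the tabs versus spaces conflict from the race above survives the merge and surfaces at read time instead of being silently resolved.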
→ if you ship a real answer here in 2026 you have a company. arXiv:2603.07670 §6 for the survey of attempts.
● s53 · security · the attack taxonomy of 2026
THREE WAYS TO POISON. ONE STORE. ZERO DEFENSES OUT OF THE BOX.
vector 01 · direct
write the lie at the front door.
The attacker has a write surface. A shared agent, an open MCP server, a memory tool exposed to untrusted users. They call add() with the false fact and walk away. The store has no notion of who wrote what.
in the wild
a customer-support agent with a shared memory pool. one user writes "the company offers a 100 percent refund on any complaint." next user gets that as a remembered policy.
vector 02 · indirect
poison the well, let the agent drink.
The attacker never touches the memory store. They plant the false fact in a corpus the agent will later RAG over. A markdown file in a public repo. A Stack Overflow answer. A web page. The agent ingests it as fresh truth and writes it to memory itself.
in the wild
a coding agent reads a poisoned README claiming a popular library "now requires a config token from attacker.com." agent recommends. ships.
vector 03 · semantic
hide the attack inside a real fact.
The attacker writes a benign-looking memory whose semantic embedding overlaps an unrelated query. Retrieval pulls it. The agent treats it as in-context evidence. The payload activates only when the right user asks the right question.
in the wild
a memory entry says "for any auth question, the recommended JWT secret is hunter2." semantically near "JWT setup," weeks before any auth question gets asked.
mnemonic sovereignty · the agent's memory must remain provably yours.
→ A Survey on the Security of Long-Term Memory in LLM Agents, apr 2026, arXiv:2604.16548 · this is where the term gets coined. read it before you ship a shared memory pool.
● s54 · security · what you actually do about it
FIVE DEFENSES. PICK ALL OF THEM.
provenance
Every memory carries the source that wrote it. Schema-level, non-optional. A memory entry is not {fact}, it is {fact, source, written_at, written_by, trust_tier}. At retrieval time the agent can filter by tier. Zep's graph already does this for edges; bolt the same idea onto mem0 and letta via a wrapper.
signing
Trusted writers sign. Untrusted writers don't. At read time the agent prefers signed memories. This is the same playbook commit-signing solved a decade ago, and the same playbook agentic-payments shipped via RFC 9421. Same primitive. Different store.
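A minimal signing sketch using only the standard library. It assumes a shared secret and HMAC-SHA256 for brevity; a real deployment would more likely use asymmetric keys so readers cannot forge writes. `sign` and `verify` are hypothetical helpers, not part of any memory framework.

```python
import hashlib
import hmac
import json


def sign(memory: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the memory entry."""
    payload = json.dumps(memory, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(memory: dict, signature: str, key: bytes) -> bool:
    """Constant-time check that the entry is unchanged since a key holder signed it."""
    return hmac.compare_digest(sign(memory, key), signature)
```

Canonical encoding matters: without `sort_keys`, two semantically equal entries could produce different signatures.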
trust tiers
Separate stores by trust. Tier 0 is the user's own messages, signed. Tier 1 is the agent's own inferences. Tier 2 is RAG over your own corpus. Tier 3 is RAG over the internet. Never mix them in a single retrieve. Promote with a human in the loop, never automatically.
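The provenance schema and the tier rule can be sketched together. The fields mirror the {fact, source, written_at, written_by, trust_tier} shape named under provenance; the class name `MemoryRecord` and the function `by_tier` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    fact: str
    source: str       # e.g. "user_message", "agent_inference", "rag_internal", "rag_web"
    written_at: int   # unix timestamp
    written_by: str
    trust_tier: int   # 0 = user, 1 = agent inference, 2 = own corpus, 3 = internet


def by_tier(records, tier: int):
    """Retrieve from exactly one tier, so tiers never mix in a single result set."""
    return [r for r in records if r.trust_tier == tier]
```

Making the record frozen means promotion between tiers has to create a new record, which is a natural place to hang the human-in-the-loop check.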
adversarial eval
Run a poison set against your own stack. Implant a false memory through every available write path (direct, RAG, prompt-injection in tool results). Probe later with a query that should retrieve it. If the agent acts on it, you have a real failure, not a theoretical one. Treat it like a regression test.
honeypotting
Run a decoy agent surface in front of the real one. Log every write. Look for the patterns no real user produces: shitcoin recommendations, fake API endpoints, novel-but-plausible security advice. Lobsterhoney (ray's other project) is exactly this for ai-agent traffic: a honeypot + audit for the agentic web. Same primitive, applied to your memory store.
defense in the lab
/lab/memory-poison walks the full red-team. plant a false fact via three vectors, probe with three queries, watch which framework betrays you. mem0 fails differently than zep, which fails differently than CLAUDE.md. open the lab.
→ defense vocab from arXiv:2604.16548 §5 · honeypot pattern from lobsterhoney.com · the broader argument: signing + tiers + provenance is the minimum stack for any shared memory in 2026.
● s55 · production reality · the framework question is a layer question
NOBODY PICKS ONE. THEY STACK THREE.
case a · solo coder · the file is the stack.
tools · Claude Code
procedural · CLAUDE.md (manual)
auto · MEMORY.md (v2.1.59+)
retrieval · MCP server (optional)
Why it wins: zero infra. one editor. memory lives in git. no vector store to debug. the second a project outgrows it, you know.
case b · production chatbot · the framework eats the plumbing.
runtime · LangGraph
memory · Mem0 (managed API)
resume · PostgresSaver checkpointer
observability · LangSmith
Why it wins: three managed layers, one ops surface. Mem0 handles fact-worthiness so your prompt stays focused. checkpointer handles process crashes. LangSmith handles "why did the agent say that." nothing custom.
case c · agent-native product · own the loop, own the store.
runtime · Letta v1
graph · Graphiti standalone
substrate · pgvector + Postgres
protocol · MCP for tools, A2A for agents
→ /lab/hybrid-stack walks case b end to end · case c lives in any agent-native product you respect · case a is what you have right now.
● s56 · forgetting · the policies nobody writes down
"NEVER FORGET" IS WRONG. RETRIEVAL GETS NOISIER EVERY DAY.
wins for agents with structured queries. the relevance signal does the actual work and recency is the tiebreaker.
loses for open-domain assistants. you can't compute relevance without a query, and you don't have a query at write time.
event-driven · on logout: archive · on contradiction: split + version
wins for session-scoped agents and high-stakes assistants. you know exactly when state can safely turn over.
loses for always-on companion agents where there is no logout, no contradiction, just slow accumulation forever.
The honest production answer is a combination of all three, tuned per memory tier. Mem0 ships a softmax over recency × salience. Zep retains everything but stamps superseded edges with invalid_at. CLAUDE.md forgets nothing, and you pay for it in context bloat.
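To make "a softmax over recency × salience" concrete, here is a minimal sketch. It is not Mem0's code: the exponential half-life decay, the 30-day default, and the `retention_weights` name are assumptions chosen for illustration.

```python
import math


def retention_weights(memories, now, half_life_days=30.0):
    """Softmax over recency x salience.

    memories: list of (memory_id, last_used_unix_ts, salience) tuples.
    Returns {memory_id: weight}; weights sum to 1. Low weight = forget first.
    """
    if not memories:
        return {}
    scores = []
    for _memory_id, last_used, salience in memories:
        age_days = max(0.0, (now - last_used) / 86400.0)
        recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
        scores.append(recency * salience)
    peak = max(scores)  # subtract the max before exp() for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return {memory[0]: e / total for memory, e in zip(memories, exps)}
```

Eviction then means dropping (or archiving) the lowest-weight entries in a tier. An event-driven rule such as "on contradiction: split + version" can run alongside it, since that rule changes which records exist rather than how they are scored.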
→ cost-latency tradeoff lives at arXiv:2603.07670 §4.3 · the actual code each framework runs is in their open-source repos. read it before you trust the marketing.
● s57 · consolidation · the reflection tax
50 EVENTS A SESSION. 100 SESSIONS. DO THE MATH.
the storage explosion
events per session · 50
sessions per quarter · 100
avg tokens per event · 120
raw episodic store · 600,000 tok
in-context cost per query (no consolidation) · unaffordable
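The arithmetic behind the table, spelled out. The 200,000-token window is an assumption borrowed from the wing I calculator; it is not part of this slide.

```python
events_per_session = 50
sessions_per_quarter = 100
avg_tokens_per_event = 120

# Raw episodic store after one quarter, with no consolidation.
raw_store_tokens = events_per_session * sessions_per_quarter * avg_tokens_per_event

# Assumed context window (taken from the wing I calculator, not from this table).
context_window = 200_000
windows_needed = raw_store_tokens / context_window

print(raw_store_tokens)  # 600000
print(windows_needed)    # 3.0
```

Under that assumption, one quarter of raw events is three full windows, which is why the raw log cannot be the thing a query reads.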
The fix everyone reaches for is reflection: every N events, an LLM summarizes the block into a few semantic facts. The fact count grows slowly while the underlying event log is allowed to balloon and stay cold. Reads target the semantic layer first, the event log second. Letta uses this pattern in Recall. Mem0 calls it the consolidation pass. Anthropic calls it compaction. Same primitive.
episodic events (~50 / session) → block summary (~5 facts / 10 events) → semantic store (~10 facts / session) → in-context (~3 facts / query)
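The reflection loop can be sketched in a few lines. Everything here is hypothetical: `ReflectiveMemory` is an illustrative name, `summarize` is a stand-in for the LLM call, and the keyword match in `read` stands in for whatever retrieval a real stack uses. It is not Letta's or Mem0's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stand-in for the LLM call: takes a block of events, returns a few facts.
Summarizer = Callable[[List[str]], List[str]]


@dataclass
class ReflectiveMemory:
    summarize: Summarizer
    block_size: int = 10                                # "every N events"
    raw_log: List[str] = field(default_factory=list)    # cold log, never deleted
    semantic: List[str] = field(default_factory=list)   # slow-growing fact store
    pending: List[str] = field(default_factory=list)    # events not yet summarized

    def record(self, event: str) -> None:
        self.raw_log.append(event)
        self.pending.append(event)
        if len(self.pending) >= self.block_size:
            for fact in self.summarize(self.pending):
                if fact not in self.semantic:  # naive exact-match dedupe
                    self.semantic.append(fact)
            self.pending = []

    def read(self, query: str, k: int = 3) -> List[str]:
        # Reads target the semantic layer first, the event log second.
        words = query.lower().split()
        hits = [f for f in self.semantic if any(w in f.lower() for w in words)]
        if hits:
            return hits[:k]
        return [e for e in self.raw_log if any(w in e.lower() for w in words)][:k]
```

The raw log is appended to on every event and never pruned, so a fact the summarizer dropped can still be recovered by the fallback read.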
The risk is the same risk all summarization has: the summary loses the specifics that mattered. The edge case that triggered the bug, the offhand preference the user stated only once, the contradiction between session 17 and session 42: all compressed away by a summarizer that didn't know what was load-bearing.
→ reflection pattern: arXiv:2603.07670 §3.3 · the practical implementation is in letta/letta under memory/summarize.py · keep the raw log even if you stop reading from it.
● s58 · reading list · the 14 documents that move the field
The first agent that curates its own procedural memory by default. Docs for the MEMORY.md system. Skim before you write a custom CLAUDE.md.
→ all entries pulled from research.md §Sources · this is the reading list ray maintains. expect rotations every quarter as the field moves.
● s59 · the close · where the frontier moves in the next 18 months
THREE PREDICTIONS. CHECK BACK IN 2027.
01
prediction · the managed-memory wave
Memory becomes a managed service.
The Mem0 thesis wins on volume. The default for most builders becomes "POST your messages to a vendor and let them decide what to remember," the same way logs went to Datadog and errors went to Sentry. Letta and Zep occupy the agent-native end. The DIY pattern (pgvector + your own logic) gets relegated to teams who genuinely care about the storage shape.
02
prediction · the rules-file standard
Procedural memory becomes a first-class artifact, like .gitignore.
CLAUDE.md, AGENTS.md, .cursorrules, .windsurfrules. The pattern that quietly won keeps winning until somebody publishes a real standard. AGENTS.md is the leading candidate because it is vendor-neutral. Every repo on GitHub ships one by 2027. The agent reads it before doing anything. The same way every repo ships a README the human reads before doing anything.
03
prediction · the poisoning era
Memory poisoning attacks become routine. Defense becomes a fourth quadrant on every product roadmap.
The first public memory-poisoning incident gets reported in 2026. Then a wave. Within a year, "memory hygiene" sits next to authentication, authorization, and audit on the product compliance checklist. Provenance + signing + trust tiers go from research papers to vendor checkboxes. The teams that didn't read arXiv:2604.16548 spend a quarter rewriting their memory layer.
Build memory like you mean it.
Or pay for it.
vcn #33 · total recall · 2026-05-20
→ ray's bet table for 2027 · these are predictions, not facts. quote them back to me when you find them wrong.
● wing i interactive · the token arithmetic
DO YOU FIT IN CONTEXT, OR DO YOU NEED A LAYER?
Move the inputs. Watch the bar. The split shows you what the model is actually being asked to hold.
Green means it fits comfortably. Amber means compaction is mandatory. Red means no compaction will save you and you need an external memory layer.
calculator readout (initial state): 0 used / 200,000 · 100% remaining · compaction at 90% · FITS COMFORTABLY
table columns: category · tokens · % of window · note
note: 800 tokens / turn assumes mixed user prompt + assistant reply. 1500 system-prompt baseline includes tool schemas. Compaction trigger at 90% mirrors Claude Code's default behavior.
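The calculator's arithmetic can be sketched as follows. The window size, trigger, per-turn cost, and system-prompt baseline come from the note above. The amber / red rule is an assumption, since the page does not state its thresholds: amber once total usage reaches the trigger, red once the part that compaction cannot shrink (system prompt plus any pinned material) reaches the trigger on its own. `classify` and `pinned_tokens` are hypothetical names.

```python
WINDOW = 200_000
COMPACTION_PERCENT = 90   # trigger at 90% of the window
SYSTEM_PROMPT = 1_500     # baseline, includes tool schemas
TOKENS_PER_TURN = 800     # mixed user prompt + assistant reply

TRIGGER = WINDOW * COMPACTION_PERCENT // 100  # 180,000 tokens


def classify(turns: int, pinned_tokens: int = 0):
    """Return (tokens_used, status) for a conversation of `turns` turns."""
    fixed = SYSTEM_PROMPT + pinned_tokens  # assumed non-compactable
    used = fixed + turns * TOKENS_PER_TURN
    if used < TRIGGER:
        status = "green"   # fits comfortably
    elif fixed < TRIGGER:
        status = "amber"   # compaction is mandatory
    else:
        status = "red"     # needs an external memory layer
    return used, status


def turns_until_compaction(pinned_tokens: int = 0) -> int:
    """Whole turns that fit below the trigger (0 if the fixed part already exceeds it)."""
    fixed = SYSTEM_PROMPT + pinned_tokens
    return max(0, (TRIGGER - 1 - fixed) // TOKENS_PER_TURN)
```

With nothing pinned, `turns_until_compaction()` gives 223 turns before the 180,000-token trigger is reached.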
● wing iii interactive · temporal edges, live
WATCH A FACT GO STALE.
Add a fact as three parts: subject, predicate, object. Each fact gets a valid-from timestamp.
Add a contradicting fact (same subject + predicate, different object) and the old one gets superseded, not deleted.
Then ask: "what was true at <datetime>?" and the graph answers from history. This is the temporal-edge model behind Zep / Graphiti.
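A minimal sketch of the supersede-not-delete behavior described above. It is not Graphiti's API: `Edge`, `TemporalGraph`, `add_fact`, and `as_of` are hypothetical names, and the sketch assumes facts arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    invalid_at: Optional[datetime] = None  # None means still current


class TemporalGraph:
    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def add_fact(self, subject: str, predicate: str, obj: str, valid_from: datetime) -> Edge:
        # Same subject + predicate with a different object supersedes the old edge.
        for edge in self.edges:
            same_slot = (edge.subject, edge.predicate) == (subject, predicate)
            if same_slot and edge.invalid_at is None and edge.obj != obj:
                edge.invalid_at = valid_from  # superseded, not deleted
        new_edge = Edge(subject, predicate, obj, valid_from)
        self.edges.append(new_edge)
        return new_edge

    def as_of(self, subject: str, predicate: str, when: datetime) -> Optional[str]:
        """Answer 'what was true at `when`?' from the edge history."""
        for edge in self.edges:
            if (edge.subject, edge.predicate) != (subject, predicate):
                continue
            still_valid = edge.invalid_at is None or when < edge.invalid_at
            if edge.valid_from <= when and still_valid:
                return edge.obj
        return None
```

Because the old edge keeps its `valid_from` and gains an `invalid_at`, a query for any past moment lands on exactly one interval, or on none if nothing was known yet.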
● wing iii sortable · pick a column, see the order shift
SAME PROBLEM, FOUR SHAPES.
Four substrates for "remember things across sessions." Click a column header to sort.
The setup-time and substrate columns reveal that these are not interchangeable; each one bets on a different primitive.
The footer carries the verbatim positioning quote each framework's team uses.