Memory Is Governance

What the Memory Discourse Is Measuring Instead of Politics

May 2026 · Phill Clapham

I. Memory Looks Like Infrastructure

Imagine an agent on day nine of a multi-day research project. It has been compressing context every few hours, surfacing past findings into new lines of inquiry, organizing what it has learned into something queryable. On day three it ran across a contradiction in its notes — two reliable sources disagreeing on a quantitative claim. It demoted one of them as stale. Today, on day nine, the contradiction would have changed its conclusion. The agent does not know this. The demotion is in an audit log somewhere, if there is an audit log at all. The conclusion ships.

That’s the operation memory is performing now. Not retrieval. Not retention. Transformation. And every transformation is a judgment that, until just now, nobody was measuring.

Memory looks like infrastructure. Vector stores. Retrieval latency. Recall at k. We measure it the way we measure disk arrays, and that is exactly the problem. The vocabulary is doing political work the frame does not name.

The memory discourse has converged on one axis: does the system store information and produce it when asked. ChatGPT remembers your preferences. Claude has memory now. The vector store grew this quarter; retrieval p99 came down. All of this is measurement of storage. None of it touches what the day-nine agent’s demotion did to its own future conclusion.

In a ten-day window in May 2026, four benchmarks landed on arXiv that all measure something else, against a taxonomy paper from late 2025 that gave the field language for what they’re measuring. Put them next to each other and they describe a structural shift the discourse has not named: the load-bearing problem is not retention. It is governance.

And governance, by construction, is political.

That’s the claim. The dominant memory frame depoliticizes what is constitutively political. The five operations that actually make up a living memory — compression, selection, organization, attribution, demotion — are not engineering operations with policy side-effects. They are policy, executed at runtime as engineering. Each one corresponds to a body of political theory the technical literature on agent memory has not named as political and has not cited.

The benchmarks have arrived. The vocabulary has not.

(The claim is not that all of engineering is political. The claim is that these five specific operations are political because they are governance operations executed without governance vocabulary. The boundary is sharp; the operations are named; the political-theory homes are specific.)

II. The Storage Frame Measures a Static Archive

The storage frame is so naturalized it doesn’t look like a frame. It looks like the way you measure memory because it’s the way you have always measured memory in human-computer systems. Recall at k. Hit rate. Persistence across sessions. Vector dimension. Embedding similarity score. Retrieval latency p99.

These are real measurements of a real thing. Good engineering. The objection here isn’t that they’re wrong. The objection is that they measure a static archive — a thing where you put information in and ask for information back — and pretend the measurement reaches an architecture. It doesn’t reach an architecture. It measures input-output equality on retrieval and stops.

This worked for an era. Agents were single-session conversational assistants. “Does it remember my name across sessions” was a real product question. “What’s the recall@5 on this retrieval system” was a real engineering question. The storage frame matched what agents actually did, because agents actually did storage and lookup.

That era is over.

The agents now in production run long-horizon autonomous loops. Coding agents that maintain project context across hours of execution. Research agents that traverse hundreds of documents over multi-day workflows. Planning systems that fan out sub-agents and consolidate results across days. These agents make hundreds of memory operations per session — compressions, selections, organizations, attributions, demotions — and at the end of each operation the memory is different from how it started. Not different in the trivial sense (one more item added). Different in the substantive sense: the memory has been re-shaped. Items have been promoted or demoted. Structure has shifted. Vocabulary has migrated. The system has transformed the memory.

The storage frame can’t see this. The storage frame measures what came out compared to what went in. It is blind, by construction, to what happened between.

What happens between is load-bearing for governance.

III. The Four Benchmarks (And the Taxonomy That Frames Them)

May 14 through May 22, 2026. Four benchmarks were posted or revised on arXiv. Different teams. Different problem framings. All of them measuring transformation, not storage. Late 2025 gave the field the taxonomy paper that named the structure; spring 2026 gave the field the empirical work.

MemGym (arXiv:2605.20833, submitted May 20). MemGym moves evaluation out of chat-recall into live long-horizon execution: tool dialogue, deep research, coding, web and computer-use tracks. The methodological move is memory-isolated scoring — paired no-memory versus memory runs under a fixed reasoner — plus MemRM, a reward model that scores compression events for safety. The question MemGym is asking is not “does the system store?” It’s “does a memory transformation preserve future task behavior?” That’s a different question. It’s a transformation-correctness question.
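The paired-run idea is simple enough to sketch. What follows is a minimal illustration of memory-isolated scoring under stated assumptions, not MemGym's actual harness: the function name, the input format, and the scores are all hypothetical. The only thing carried over from the paper's description is the shape of the comparison, the same reasoner run on the same task once without memory and once with it.

```python
from statistics import mean


def memory_isolated_effect(paired_scores):
    """Summarize what memory changed, holding the reasoner fixed.

    paired_scores: list of (score_without_memory, score_with_memory),
    one pair per task (hypothetical input format).
    """
    deltas = [with_mem - without for without, with_mem in paired_scores]
    return {
        "mean_delta": mean(deltas),
        "helped": sum(d > 0 for d in deltas),
        "hurt": sum(d < 0 for d in deltas),
        "unchanged": sum(d == 0 for d in deltas),
    }


# Made-up scores for four tasks, for illustration only (not results from any paper).
example = [(0.5, 0.75), (1.0, 0.5), (0.25, 0.25), (0.0, 1.0)]
report = memory_isolated_effect(example)
```

The "hurt" count is the part a storage metric never produces: a task where recall succeeded and the outcome still got worse.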

EvoMemBench (arXiv:2605.18421, submitted May 18). Fifteen memory methods evaluated adversarially. Headline finding: long-context baselines remain highly competitive against many sophisticated memory products. Memory helps most when context is insufficient, tasks are difficult, or stored experience matches the target decision process — and memory can hurt when it injects irrelevant evidence, strips execution details, or transfers mismatched procedures. The thing the field learned from this paper: memory is not always net-positive. A hostile reading is that bigger context windows vindicate raw retention over engineered transformation. The structural reading is that selection-as-judgment can fail, and when it fails it fails on relevance-to-task — a judgment criterion the storage frame doesn’t measure because it can only ask “did the recall succeed.”

StructMemEval (arXiv:2602.11243, revised May 22). The paper measures memory agents against structured tasks — ledgers, trees, state-tracking, preference aggregation — and finds something that should have changed the discourse immediately: the hint/no-hint gap can exceed the gap between memory products. Retrieval-only systems fail on structured tasks. Memory agents improve dramatically when given explicit organization hints. The structure-imposing decision dominates the retrieval-mechanism decision. Structure is the substrate; retrieval is the surface. The storage frame had no place for this finding.

GroupMemBench (arXiv:2605.14498, v2 May 16). Multi-party memory systems collapse. Best system in the evaluation: 46.0% average accuracy. Knowledge update: 27.1%. Term ambiguity: 37.7%. BM25 — a retrieval algorithm from the 1990s, no agent memory at all — matches or beats most agent-memory systems. The paper’s polite phrasing is that “entity attribution must be first-class at ingestion, retrieval, and consolidation.” The honest phrasing is that multi-party memory is a governance problem, agent-memory systems are not built for governance, and the result is that they collapse under any non-trivial multi-actor load.

Against these four, the Memory in the Age of AI Agents survey (arXiv:2512.13564, revised January 13, 2026) sits in a different register: it is the taxonomy paper that the May benchmarks are quietly executing against. The survey gave the field a vocabulary that treats memory as a first-class primitive. Memory spans forms, functions, and dynamics; the frontiers are automation, RL integration, multimodal memory, multi-agent and shared memory, and trustworthiness. The benchmarks turn the taxonomy into measurement. The survey said these are the primitives. The four May benchmarks said here is what happens when you try to measure them.

Four benchmarks. Ten days. Different teams. Different framings.

All converging on the same shift.

The hard problem is not retention. It is governance of memory transformations.

IV. Five Operations. Five Bodies of Political Theory the Technical Literature Has Not Named.

The four benchmarks, taken together, are measuring transformations. What’s a transformation, operationally? It’s one of five operations every long-horizon memory system performs continuously. Each one is a judgment. And each one corresponds to a body of political theory that has been studying the same judgment, in a different vocabulary, for anywhere from a few decades to twenty-five hundred years.

I’m going to walk the five from most-visibly-political to most-structurally-buried. Demotion is what archival politics has been about explicitly for centuries. Attribution is the governance failure GroupMemBench just made empirically visible. Compression and Selection sit between — the operations that look most like engineering but are doing the most quiet political work. Organization is the deepest — the operation that determines what political questions can even be asked of the memory.

Demotion ↔ Archival Politics

The most politically loaded of the five. Demotion is deletion-with-audit-trail — adjudication of whether a memory was wrong, stale, superseded, or correct-but-out-of-favor. Each judgment has consequences. And demotion is the operation by which the agent rewrites its own history.

This is the operation political theorists have spent the longest with. How Societies Remember (Connerton). Silencing the Past (Trouillot). Archival politics is a literature. The question of who controls the archive — what gets preserved, what gets demoted to the footnote, what gets removed, what gets noted-as-retired, what gets purged without a trace — is one of the oldest political questions in the philosophy of history. Memory holes versus preserved records versus annotated demotions are different governance regimes. They have different consequences.

The agent-memory field is performing demotions at runtime without an archival politics. Most memory systems demote silently. Some annotate the demotion. A vanishingly small number maintain an audit chain that lets you reconstruct what was demoted, when, and why. The political analogue is direct: memory holes versus annotated revisions versus preserved-with-context. The technical literature on agent memory has not named demotion as archival politics. Not in the survey. Not in the benchmarks. Not in the product literature.
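To make the three regimes concrete, here is a minimal sketch of the third one: demotion that retains the demoted content and writes a hash-chained audit entry. Every name in it (AuditedMemory, demote, why_demoted, verify_chain) is hypothetical and the stored claims are invented; it is one possible shape for an audit chain, not a description of any existing memory product.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class AuditedMemory:
    active: dict = field(default_factory=dict)     # item_id -> content
    demoted: dict = field(default_factory=dict)    # item_id -> content, retained
    audit_log: list = field(default_factory=list)  # append-only, hash-chained

    def add(self, item_id, content):
        self.active[item_id] = content

    def demote(self, item_id, reason, step):
        """Move an item out of active memory, keeping its content and the reason."""
        self.demoted[item_id] = self.active.pop(item_id)
        prev_hash = self.audit_log[-1]["hash"] if self.audit_log else ""
        entry = {"item_id": item_id, "reason": reason, "step": step, "prev": prev_hash}
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self.audit_log.append(entry)

    def why_demoted(self, item_id):
        """Reconstruct what was demoted, when, and why."""
        return [e for e in self.audit_log if e["item_id"] == item_id]

    def verify_chain(self):
        """True if every entry still matches its hash and links to its predecessor."""
        prev_hash = ""
        for entry in self.audit_log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode("utf-8")
            if entry["prev"] != prev_hash:
                return False
            if hashlib.sha256(payload).hexdigest() != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True


# Invented example: two sources disagree, one is demoted on step 3.
mem = AuditedMemory()
mem.add("source_a", "claim: 12%")
mem.add("source_b", "claim: 19%")
mem.demote("source_b", reason="judged stale", step=3)
```

A silent-demotion regime is the same code with the last five lines of demote removed. The difference between a memory hole and an annotated revision is that small at the design surface, and that large in consequence.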

Attribution ↔ Social Epistemology and Testimony

When multiple parties contribute to a memory, the system has to handle competing claims, vocabulary drift, contradiction, authority asymmetry. It has to know whose belief changed, when, and on what evidence. It has to handle the case where two contributors disagree, where one contributor’s vocabulary means something different to another, where a contribution was made under different epistemic conditions than the current state.

GroupMemBench’s collapse-under-multi-party-load is not an engineering bug. It’s a governance failure. The systems have no principled way to do any of this. They were not built to do any of this. They were built for single-party memory and dropped into multi-party load.

The body of theory studying this judgment is the social epistemology of testimony. Coady’s Testimony. The Goldberg lineage. The whole post-1990s renaissance in social epistemology that takes seriously the question of how groups establish what they jointly know, how testimony transmits warrant, how authority is conferred and contested. Note the gap between the operational problem and the theory: source-tracking (the engineering operation) is one layer; testimonial warrant (the political-theory operation) is another. Solving source-tracking alone leaves the warrant problem untouched, which is consistent with GroupMemBench’s result that BM25, with no theory of testimony at all, matches or beats most agent-memory systems. The technical fix without the political-theory layer doesn’t fix the actual problem.
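A small sketch helps locate the line between the two layers. The code below covers only the source-tracking layer: who claimed what, when, and citing which evidence. The Claim record, the helper names, and the example data are all hypothetical. Notice what it cannot do: it can report that two contributors disagree, but it has no way to say whose claim carries warrant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    topic: str
    value: str
    contributor: str
    step: int
    evidence: str


def belief_history(claims, contributor, topic):
    """Whose belief changed, when, and on what evidence."""
    mine = [c for c in claims if c.contributor == contributor and c.topic == topic]
    return sorted(mine, key=lambda c: c.step)


def contested(claims):
    """Topics where the latest claims of different contributors disagree."""
    latest = {}
    for c in sorted(claims, key=lambda c: c.step):
        latest[(c.topic, c.contributor)] = c.value
    by_topic = {}
    for (topic, _contributor), value in latest.items():
        by_topic.setdefault(topic, set()).add(value)
    return sorted(t for t, values in by_topic.items() if len(values) > 1)


# Invented example: one contributor updates, the other does not.
claims = [
    Claim("deadline", "Friday", "ana", 1, "kickoff notes"),
    Claim("deadline", "Friday", "raj", 2, "kickoff notes"),
    Claim("deadline", "Monday", "ana", 5, "client email"),
]
history = belief_history(claims, "ana", "deadline")
disputes = contested(claims)
```

Deciding between "Friday" and "Monday" requires a policy about evidence and authority. That policy is the testimony layer, and nothing in this data structure supplies it.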

Compression — Or: The Editorial Function Made Continuous

Editorial decisions used to require an editorial board. A roomful of humans deciding what enters the record, what gets the cover, what gets cut for length, what gets buried on page nineteen. The decisions were political, the political nature was visible, and the literature on the editorial function — from David Manning White’s mid-century gatekeeping work to sixty years of media-sociology after him — was about who decides and on what authority.

Now editorial decisions happen every time an agent summarizes long context into a few hundred tokens. The summary encodes a theory of importance — what to preserve, what to discard. The discard is the judgment. The discard is what makes compression compression — if you preserved everything, you wouldn’t have compressed anything. There is no value-neutral way to compress. Compression presupposes a theory of what matters in the context being compressed. Whose theory? The model’s? The harness designer’s? The principal’s? The compressor’s training data’s? In every existing memory system, this judgment is implicit. The summary comes out; the user doesn’t see the discarded material; the discard rule is not articulated.
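One way to see what "articulating the discard rule" could mean in practice is a compressor that takes its theory of importance as an explicit argument and returns the discard alongside the summary. This is a toy sketch under stated assumptions: the function, the result type, the scoring rule, and the notes are all invented for illustration, and a real compressor would be a model call rather than a sort.

```python
from dataclasses import dataclass


@dataclass
class CompressionResult:
    kept: list
    discarded: list
    rule: str  # the stated theory of importance


def compress(notes, budget, importance, rule_description):
    """Keep the `budget` highest-scoring notes and record the rest as discarded.

    `importance` is supplied by the caller, so the theory of importance is
    an explicit argument rather than a hidden default.
    """
    top = sorted(notes, key=importance, reverse=True)[:budget]
    # Preserve the original order inside each partition.
    kept = [n for n in notes if n in top]
    discarded = [n for n in notes if n not in top]
    return CompressionResult(kept=kept, discarded=discarded, rule=rule_description)


notes = [
    "source A reports 12%",
    "source B reports 19%",
    "meeting moved to Tuesday",
    "A and B disagree on the figure",
]
# Hypothetical rule: notes that mention a source count as more important.
result = compress(
    notes,
    budget=2,
    importance=lambda n: n.count("source"),
    rule_description="prefer notes that mention a source",
)
```

Under this rule the note recording the disagreement is among the discarded. The rule looked reasonable, and it still cut the one line the day-nine agent needed. With the discard and the rule returned, that is at least a finding someone can make.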

MemGym’s compression-event trajectories are the first empirical apparatus the field has built for asking the editorial question. Whose theory of importance is encoded in this compressor, and is the encoding auditable? The benchmark doesn’t phrase it this way. The benchmark’s compression-event trajectories are exactly that question, executed mechanically. The technical literature on agent memory has not named compression as the editorial function instantiated at scale.

Selection ↔ Principal-Agent Relations

Who is the agent working for?

That’s the question selection answers, every time, structurally. When an agent picks which past experience to surface for a new task, it’s prioritizing some past states over others. The selection encodes a theory of relevance. “Relevance” is never relevance in the abstract. It is always relevance to whose objective.

EvoMemBench’s headline finding — memory can hurt — is the empirical observation that selection-as-judgment can be wrong relative to a goal. But goals don’t come from nowhere. Goals come from principals. The agent is acting on behalf of someone or something; the selection is judgment in service of that someone-or-something’s purpose.

The body of theory that’s been studying this judgment is principal-agent relations. Forty years of organizational economics. The alignment of agents with principals — when goals diverge, how the misalignment is detected, what governance reduces it, what governance amplifies it. The agent-memory field is having the principal-agent conversation at runtime without using the principal-agent vocabulary. EvoMemBench is measuring whether selection serves task performance under a fixed principal; the field has not yet asked the harder question, which is what happens when an agent’s principal is contested. When two parties have access to the same agent, which one’s “relevance” governs selection? The benchmark can’t ask this question. The technical literature on agent memory has not named selection as principal-agent.
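The contested-principal case can be shown in a few lines. The sketch below is hypothetical throughout (the select function, the term-overlap score, the two principals, and the memories are invented), and term overlap is a stand-in for whatever relevance model a real system uses. The point is only that the same memory returns different "most relevant" items depending on whose objective is passed in, and that the selection can carry a record of that.

```python
def select(memories, objective_terms, principal, k=1):
    """Rank memories by overlap with the principal's stated objective terms.

    Returns the selected items together with a record of whose objective
    governed the ranking, so the criterion is inspectable after the fact.
    """
    def score(text):
        return len(set(text.lower().split()) & objective_terms)

    ranked = sorted(memories, key=score, reverse=True)
    return {
        "principal": principal,
        "criterion": sorted(objective_terms),
        "selected": ranked[:k],
    }


memories = [
    "cut cloud costs by moving to spot instances",
    "spot instances caused two outages last quarter",
]
# Two hypothetical principals with different objectives over the same memory.
finance = select(memories, {"costs", "cut"}, principal="finance")
oncall = select(memories, {"outages"}, principal="on-call")
```

Neither selection is wrong. Each is correct relative to a principal, which is why a benchmark that fixes the principal cannot see the conflict.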

Organization ↔ Epistemic Infrastructure

The deepest of the five. Structure determines findability. A ledger makes financial reasoning trivial and narrative reasoning impossible. A tree makes hierarchical reasoning trivial and lateral reasoning expensive. A graph makes some traversals cheap and other traversals catastrophic. The choice of structure is a theory of what queries matter — a pre-decision about what kinds of reasoning will be cheap and what kinds will be foreclosed.
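The ledger case can be made concrete with a toy example. The entries and field names below are invented. The same three records are indexed by account, which makes the per-account total a single lookup and leaves the narrative question with no index to lean on.

```python
from collections import defaultdict

entries = [
    {"account": "travel", "day": 1, "amount": 120, "note": "flight to client"},
    {"account": "meals", "day": 1, "amount": 30, "note": "dinner with client"},
    {"account": "travel", "day": 2, "amount": 80, "note": "train home"},
]

# Ledger organization: index by account. Per-account totals are one lookup away.
ledger = defaultdict(list)
for e in entries:
    ledger[e["account"]].append(e)

travel_total = sum(e["amount"] for e in ledger["travel"])

# A narrative query ("what happened on day 1?") gets no help from the
# ledger index: it has to scan every account and reassemble the story.
day_one = [e["note"] for acct in ledger.values() for e in acct if e["day"] == 1]
```

At three records the scan costs nothing. At agent scale the scan is the query that never gets asked, because the structure made it expensive before anyone thought to ask it.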

StructMemEval’s finding that the organization-hint gap exceeds the memory-product gap is the empirical observation that this pre-decision dominates everything that comes after it. The structure you impose on memory determines what the memory can be asked. Different structures foreclose different questions. The foreclosure is real and it is structural.

The body of theory that’s been studying this judgment is epistemic infrastructure. Foucault’s concept of the historical a priori, most fully developed in The Order of Things — the prior conditions of knowledge that determine what questions can even be asked in a given period. The scale of Foucault’s argument is epochs of European thought; the scale of agent memory is a single system’s queryable surface. But the structural homology is real: the organization of an archive determines what questions are answerable from it, and that determination is prior to any query the user actually issues. Foucault’s argument is not that knowledge is constructed (the popular vulgarization); it is that the structure of the archive determines what counts as a knowable question. The archive’s organization is not neutral with respect to inquiry. It enables some inquiries and forecloses others.

Once agents are organizing their own memory under long-horizon autonomy, the agent is making a political decision about what queries it will be answerable to. It is constructing its own historical a priori at agent-scale. The technical literature on agent memory has not named organization as epistemic infrastructure. The closest the field has come is naming “memory architecture” as a design choice. That misses the political layer underneath.

Five operations. Five judgments. Five bodies of political theory.

The storage frame can see none of them. The storage frame measures input and output and treats the gap between them as engineering. The gap between them is governance. The benchmarks have started measuring the gap. The technical literature has not yet named what the measurement is of.

It is governance. It is, by construction, political.

V. Why the Storage Frame Held, and What Just Changed

Storage was the right frame for the era when agents were single-session conversational assistants. The frame held because it matched what agents did. The vocabulary worked because the operations were storage operations.

What changed is the agents. Long-horizon autonomous loops perform memory operations continuously, hundreds per session. A coding agent maintaining project context across an eight-hour session does selection operations every few minutes (which past file to reference), compression operations every time it summarizes a previous exchange into working state, organization operations every time it decides which file structure to maintain, attribution operations every time it integrates feedback or new requirements, and demotion operations every time it discards an approach that didn’t work. Multi-day research workflows transform their memory more times than they retrieve from it. Planning systems with sub-agent fan-out perform attribution and consolidation operations at every turn. The ratio of transformation to retrieval has inverted.

The question stopped being “did the agent remember.” The question became “did the agent transform correctly.”

The storage frame can’t see correctness at the transformation layer. It can only see input-output equality on retrieval. That’s a fine measurement of a static archive. It is not a measurement of a living memory. A living memory, by construction, is making policy decisions every time it transforms. The storage frame’s blindness to those policy decisions is the depoliticization. The depoliticization was structural, not malicious — the frame held because the era held, and the era is over.

The May 2026 benchmarks are the field noticing the era is over. They’re measuring the transformation layer because the transformation layer is where the agents now live. None of the benchmarks have named the political layer. The political layer is where the next frame lives.

VI. Measurement Is Governance

What gets measured gets funded — because product development priorities follow benchmark performance, and benchmark performance follows what the benchmarks decide to measure. What gets measured also encodes a politics. The storage-frame benchmarks of the previous era funded a generation of memory products: RAG systems, vector databases, retrieval-augmented architectures. These are real products built with good engineering. They are not, by construction, evidence about transformation governance. They measured what was tractable. They also depoliticized what was always policy.

There is no politically-neutral memory product. RAG systems are not a neutral baseline. They are a particular governance regime. Their compression policy is chunking — fixed-size text windows split without regard for semantic boundaries. Their theory of relevance is embedding similarity — vector distance in a learned representation space. Their epistemic infrastructure is embedding geometry — the spatial layout of meaning under the embedding model’s training distribution. Their attribution model is none — chunks lose source identity unless explicitly preserved. Their archival politics is none — chunks don’t get demoted, retired, or annotated as superseded; they get retrieved or not. The regime looks neutral. The neutrality is a property of the storage frame, not the system. The system is doing politics. The frame is hiding the politics.
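The regime described above fits in about twenty lines, which is part of why it reads as a default. The sketch below is a deliberately bare caricature, not any particular RAG library: the bag-of-words "embedding" is a stand-in for a learned model, the chunk size is arbitrary, and the two documents are invented. Each function is one of the governance choices named in the paragraph.

```python
import math
from collections import Counter


def chunk_fixed(text, size):
    """Compression policy: fixed-size character windows, ignoring boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system uses a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks):
    """Theory of relevance: highest cosine similarity wins."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


doc_a = "alice says the launch is in march"
doc_b = "bob says the launch is in june"
# Attribution policy: none. Chunks from both documents go into one pool,
# and nothing records which document a chunk came from.
pool = chunk_fixed(doc_a, 16) + chunk_fixed(doc_b, 16)
best = retrieve("launch in june", pool)
```

The retrieved chunk contains the date and not the speaker. The window boundary cut the name off, and the pool kept no record of where the chunk came from. Retrieval succeeded; attribution was never attempted.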

This is the storage frame’s most consequential effect. It makes governance choices look like engineering defaults. The RAG architecture is not “the default architecture from which other memory designs depart.” It is a specific governance regime, and the field treats it as the default because the storage frame cannot see the alternatives as governance.

EvoMemBench’s “memory can hurt” finding is the memory-substrate analogue of Vaccaro et al.’s “human-AI worse than AI alone” result, the 2024 meta-analysis showing human-AI combinations underperformed AI-alone on judgment-and-decision tasks. Both findings read as kill shots on architecture; both are actually kill shots on architecture measured-on-the-wrong-axis. The architecture layer is still there. The literature has not yet reached it at the political layer.

Memory benchmarks are political artifacts. What they measure encodes a theory of what memory is for. The May 2026 benchmarks are valuable not because they are neutral but because they are closer to the governance question — transformation correctness, multi-party attribution, structural fit — than the storage question. The benchmark-design choices are governance design, executed by the benchmark authors. The benchmark authors are doing political theory whether they call it that or not.

The funding will follow the measurement. The next generation of memory products will be built to win the May 2026 benchmarks. That’s good, on balance — those benchmarks are closer to the question that matters than the previous generation’s. But there’s a second move waiting. The benchmarks measure transformation correctness. They don’t yet measure transformation governance — the meta-level question of whether the transformation regime is itself accountable, auditable, contestable. The first generation of products to address transformation governance directly will define the frame for the rest of the decade.

That’s not a regulatory point. It’s an architectural point. Memory products that are governance-aware by construction will be possible to audit, contest, and improve. Memory products that remain governance-implicit will not. The difference is whether the political operations are visible at the design surface.

VII. Refuse the Map. Build the Governance.

I told you to refuse the storage map. Now I’m handing you a five-operation map with five political theories attached. That looks like installing a new categorical map while telling you to refuse one. The objection is fair on its face. Here’s why it doesn’t land.

The storage map depoliticizes by construction. It hides the political operations underneath an engineering vocabulary, making governance choices look like defaults. The governance lens does the opposite: it makes the political operations visible by construction, naming them so they can be designed, audited, and contested. The storage map is a categorical closure — it tells you what memory IS (storage) and forecloses asking otherwise. The governance lens is an operational frame that can hold many specific governance regimes — different policies on compression, different theories of relevance for selection, different attribution chains, different demotion audit trails. Refusing the storage map and adopting the governance lens are not symmetrical moves. The first hides what’s happening. The second makes it visible.

The categorical map says memory is storage and the measurement is retention. The architecture says memory is transformation and the measurement is governance. The four May-2026 benchmarks have started naming the transformation layer empirically, and a parallel discourse has started naming the engineering layer of memory governance — MemArchitect on policy-driven memory governance layers, SSGM on Stability and Safety Governed Memory frameworks, MemGovern on governed code-agent learning, the enterprise-compliance literature on Right-to-be-Forgotten enforcement and audit trails. That work is real and it is good. It is also operating at a different layer: governance-as-engineering-policy, governance-as-compliance, governance-as-safety-rails. The political-theory layer underneath engineering governance — the recognition that compression is the editorial function, selection is principal-agent, organization is epistemic infrastructure, attribution is testimony, demotion is archival politics — has not been named. The engineering-governance layer is being built without knowing it is reinventing political theory. That is the actual frame-naming opportunity. The benchmarks named the operational shift. The engineering-policy discourse named one layer of the governance shift. The political-theory layer that gives the engineering its constitutional shape has no name yet.

The frame is: memory transformations, not memory storage. Governance, not retention. Five operations with political-theory homes the technical literature has not named. The operations are visible empirically; the politics is visible architecturally; the discourse has the first and lacks the second.

The prescription is not regulation. It is constitutional design. Audit trails on demotion — so the agent’s history-rewrites are reconstructable. Transparency on compression policies — so the editorial discard rule is articulable. Explicit attribution chains — so multi-party contributions can be tracked through testimony-and-warrant rather than collapsed into source-strings. Selection criteria stated rather than embedded — so the principal’s objective is auditable. Organization choices documented as choices — so the structural foreclosures the agent makes about its own knowability surface are inspectable. These are not regulatory burdens. They are the engineering version of constitutional design — the design of a system that admits, at the design surface, that it is making governance choices and that those choices are accountable.

This isn’t a future-tense argument. The infrastructure already exists in public code. I’ve been maintaining anneal-memory for nearly two months (open on PyPI, first commit March 31, 2026), and it ships wrap-compression with audit trails, demotion that is annotated and retains the demoted material’s provenance, citation constraints for promotion of patterns to durable knowledge, and per-pattern attribution chains that survive demotion. (Cross-agent attribution chains are the planned Layer 2 extension; the current shipping primitives are single-agent.) The commits are dated. The architecture predates the four May benchmarks. Look at it if you want.

The point of the citation is to demonstrate that governance-aware memory is buildable. The political claim of this essay is not that memory should be governed. The political claim is that memory is governed, that the governance is currently invisible because the frame is depoliticizing, and that making the governance visible is a design choice anyone can make today.

The benchmarks have arrived. The vocabulary has not. The architecture exists in public code, ahead of the discourse on substrate, behind on naming. Whoever names the political layer earns the discourse handle for the question the memory field will be answering for the rest of the architectural era that frontier-capability AI memory is now entering.

Memory was always policy. Build it like one.