
Infinite Memory: How nClaw Recalls a Conversation From a Year Ago


The context window is not memory. It's working memory. Real recall across months requires server-side thread storage, a memory extraction pipeline, summary pyramids, an alias matrix, and hybrid retrieval. Here's how nClaw builds all of it.

Earlier this week I asked nClaw something I hadn't thought about since last August. I was deep in a different nSelf sprint and needed to remember a specific decision I'd made about how the plugin-auth flow should handle workspace tokens. I typed the question, hit enter, and it pulled the exact thread, the exact context, the reasoning I'd written out at the time. Eight months old. Not paraphrased. The actual thing.

That only works because of a specific architecture. And for most of my 19 years writing software, I wouldn't have known how to build it. So this post is the breakdown.

Most people frame the memory problem wrong. They look at a 200K-token context window and think: if we just make that bigger, the AI will remember everything. That framing is a red herring. The context window is not memory. It's working memory, a short-lived execution cache that vanishes the moment the session ends. Real memory is retrieval. And retrieval is an engineering problem, not a model scaling problem.

I've been building nClaw, a personal AI workstation that runs as an nSelf plugin, for the better part of a year. One of the things I want it to do, and one of the reasons I started building it, is to recall a conversation from twelve months ago. Not summarize it vaguely. Recall it: the specific claim I made, the entity I was researching, the decision I almost took. This post is about how the architecture actually works when you try to build that for real.

The first honest thing I have to say is that the base nClaw you can install today does not do this. Threads and projects in the current public build are stored in browser localStorage. That's a fine starting point for a single-device, single-session tool. It's not a foundation for recall across a year. The production move, which I'm implementing now, is to make Postgres the source of truth for every thread, every message, every piece of extracted memory, and every archived document.

Server-side thread state changes the problem completely. Instead of a session-scoped object in the browser, a thread becomes a row in nclaw.thread with a UUID, a taxonomy path, a stored summary, and a foreign key to a project and workspace. Every message in that thread is a row in nclaw.message. Nothing lives in the browser except the UI state for the current view.
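Concretely, a thread row might look like this. Only the fields named above (the UUID, taxonomy path, stored summary, and the project and workspace keys) come from the design; the remaining columns and types are illustrative:

```sql
-- Hypothetical sketch of nclaw.thread. Requires the ltree extension,
-- which nSelf already enables. Referenced tables are assumed to exist.
CREATE TABLE nclaw.thread (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id  uuid NOT NULL REFERENCES nclaw.workspace(id),
    project_id    uuid REFERENCES nclaw.project(id),
    taxonomy_path ltree,
    summary       text,
    created_at    timestamptz NOT NULL DEFAULT now()
);
```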

The schema matters here. Each message row carries four representations of its text: the raw markdown, the plain text, a normalized form produced by a stored function that lowercases, strips accents, and collapses non-alphanumeric characters to spaces, and a generated tsvector column built from that normalized form. The tsvector column gets a GIN index. The normalized column gets a trigram index via pg_trgm. These are not redundant. Each one serves a different query shape, and you need all of them.
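A minimal sketch of that message schema follows, with two caveats. unaccent is only STABLE, so the normalization function is declared IMMUTABLE as the usual pragmatic workaround for use in generated columns. And since one generated column can't reference another in Postgres, the tsvector re-applies the function rather than reading content_norm:

```sql
-- Normalization: lowercase, strip accents, collapse non-alphanumerics
-- to spaces. Declared IMMUTABLE so generated columns can call it.
CREATE FUNCTION nclaw.norm_text(t text) RETURNS text
    LANGUAGE sql IMMUTABLE AS $$
        SELECT regexp_replace(lower(unaccent(t)), '[^a-z0-9]+', ' ', 'g')
    $$;

CREATE TABLE nclaw.message (
    id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    thread_id    uuid NOT NULL REFERENCES nclaw.thread(id),
    workspace_id uuid NOT NULL,
    content_md   text NOT NULL,   -- raw markdown
    content_text text NOT NULL,   -- plain text
    content_norm text GENERATED ALWAYS AS (nclaw.norm_text(content_text)) STORED,
    content_fts  tsvector GENERATED ALWAYS AS
                 (to_tsvector('simple', nclaw.norm_text(content_text))) STORED
);

CREATE INDEX message_fts_idx  ON nclaw.message USING gin (content_fts);
CREATE INDEX message_trgm_idx ON nclaw.message USING gin (content_norm gin_trgm_ops);
```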

Semantic search alone will miss too much. If I ask nClaw about 'Ch35' and the note from fourteen months ago says 'Chapter 35', a pure vector search may not bridge that gap cleanly. The trigram index will. If the note uses 'ch. 35' and the search uses 'chapter thirty-five', the alias matrix closes the gap before the query even reaches the retrieval layer. This is why I keep coming back to the same point: building on pgvector alone is not enough. The combination of full-text search, trigram similarity, and vector distance is what makes recall reliable.
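To make the 'Ch35' case concrete: after normalization both sides reduce to lowercase alphanumerics, and a word-similarity query over the trigram index does the bridging. The threshold here is an illustrative value to tune, not a recommendation:

```sql
-- word_similarity() looks for a fuzzy substring match; the %> operator
-- form is accelerated by the trigram index on content_norm.
SET pg_trgm.word_similarity_threshold = 0.4;  -- illustrative

SELECT id, word_similarity('ch35', content_norm) AS sim
FROM nclaw.message
WHERE content_norm %> 'ch35'
ORDER BY sim DESC
LIMIT 20;
```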

The memory extraction pipeline is the part that makes recall feel intelligent rather than mechanical. After every conversation turn is persisted, an asynchronous worker, nclaw-memory-extractor, reads the new messages and does this: identifies entities (people, projects, concepts, technical terms), extracts facts as subject-predicate-object triples, and pulls out preferences and tasks. None of this is synchronous. The turn is persisted immediately; the extraction follows through a transactional outbox and a Redis Streams consumer group.

The outbox pattern is worth spelling out because it solves a real failure mode. If I wrote to the message table and then tried to publish to Redis in the same request, I'd have two separate write operations with no atomicity guarantee. A crash between them loses the job. The outbox works differently: the row in event_outbox is written in the same database transaction as the message. If the transaction commits, the event exists. A separate relay process reads the outbox and publishes to the Redis Stream. At-least-once delivery is exactly the right tradeoff for background indexing work.
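Sketched in SQL, with placeholders for application-bound parameters and an illustrative outbox shape:

```sql
CREATE TABLE nclaw.event_outbox (
    id           bigserial PRIMARY KEY,
    aggregate_id uuid NOT NULL,
    event_type   text NOT NULL,
    payload      jsonb NOT NULL,
    published_at timestamptz          -- NULL until the relay publishes it
);

-- One transaction, two rows: if it commits, the extraction job exists.
BEGIN;
WITH new_message AS (
    INSERT INTO nclaw.message (thread_id, workspace_id, content_md, content_text)
    VALUES (:thread_id, :workspace_id, :raw_md, :plain_text)
    RETURNING id
)
INSERT INTO nclaw.event_outbox (aggregate_id, event_type, payload)
SELECT id, 'message.persisted', jsonb_build_object('message_id', id)
FROM new_message;
COMMIT;
```

The relay reads rows where published_at is NULL, publishes each to the Redis Stream, then marks it. A crash between publish and mark means a duplicate delivery, which the consumers tolerate.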

Facts land in nclaw.fact_triple as subject entity, predicate, object entity or text, confidence score, source message ID, and optional validity range. The confidence score is important because the extraction model is not always right. A fact extracted with 0.4 confidence is still useful; it can surface as a candidate during retrieval, but it should not crowd out a fact extracted with 0.9 confidence from three corroborating sources. The supersedes_id field on memory items handles updates: when a later message revises an earlier fact, the new item points back to the old one rather than deleting it.
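A sketch of the table around the fields named above; the entity table and the exact types are assumptions:

```sql
CREATE TABLE nclaw.fact_triple (
    id                uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    subject_entity_id uuid NOT NULL REFERENCES nclaw.entity(id),
    predicate         text NOT NULL,
    object_entity_id  uuid REFERENCES nclaw.entity(id),  -- either this...
    object_text       text,                              -- ...or this
    confidence        real NOT NULL CHECK (confidence BETWEEN 0 AND 1),
    source_message_id uuid REFERENCES nclaw.message(id),
    valid_range       tstzrange,                         -- optional validity window
    CHECK (object_entity_id IS NOT NULL OR object_text IS NOT NULL)
);
```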

Preferences and tasks are treated as memory_item rows with a kind enum. A preference might be 'user prefers detailed code comments over inline documentation'. A task might be 'follow up with the Vercel team about edge config limits by end of month'. Both are queryable, both carry embeddings, both can be retrieved in context when relevant. The distinction matters less than the fact that they're structured, indexed, and retrievable rather than buried in a wall of prose.
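As a sketch, with the kinds this post names (summaries join them in the next section) and the supersedes_id revision chain from above; everything else is illustrative:

```sql
CREATE TYPE nclaw.memory_kind AS ENUM ('preference', 'task', 'summary', 'fact');

CREATE TABLE nclaw.memory_item (
    id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id  uuid NOT NULL,
    thread_id     uuid REFERENCES nclaw.thread(id),
    kind          nclaw.memory_kind NOT NULL,
    content       text NOT NULL,
    salience      real,                                    -- scored at extraction time
    supersedes_id uuid REFERENCES nclaw.memory_item(id),   -- revision chain
    created_at    timestamptz NOT NULL DEFAULT now()
);
```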

The summary pyramid is how nClaw handles scale without scanning millions of rows on every retrieval. After each conversation turn, a per-turn summary is written. Once a day, those turn summaries roll up into a daily summary. Weekly summaries aggregate the dailies. Monthly summaries aggregate the weeklies. Each level carries an embedding. The pyramid is not just a cost optimization. It's a different kind of memory. The per-turn level is specific and concrete. The monthly level is thematic and general. When I ask nClaw about my work on a project last spring, the monthly summary surfaces the broad context while the weekly and daily levels narrow it down.

The summaries are not free text appended to a log. They're stored as memory_item rows with kind='summary', attached to a thread or project, with an embedding in the memory_embedding_1024 table. The embedding model I use as the default is BAAI/bge-m3: 1024 dimensions, multilingual, supports inputs up to 8192 tokens. The HNSW index on the embedding table makes approximate nearest-neighbor search fast enough to be a normal part of every retrieval query.
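The embedding table and its index, sketched with pgvector; the HNSW parameters shown are just pgvector's defaults, and cosine distance assumes the usual normalized bge-m3 vectors:

```sql
CREATE TABLE nclaw.memory_embedding_1024 (
    memory_item_id uuid PRIMARY KEY REFERENCES nclaw.memory_item(id),
    embedding      vector(1024) NOT NULL   -- BAAI/bge-m3 output
);

CREATE INDEX memory_embedding_hnsw_idx
    ON nclaw.memory_embedding_1024
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);    -- pgvector defaults
```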

Now the taxonomy. Every thread and every project carries a taxonomy_path column of type ltree. The ltree extension in Postgres is designed for hierarchical path data. A path like 'work.nself.nclaw.architecture' lets me query everything under work.nself with a single index scan rather than a recursive join. The taxonomy_node table stores the full hierarchy with labels and metadata. When I create a new thread, I place it in the taxonomy manually or let an extraction pass suggest a path. Over time, the taxonomy becomes a navigable map of my thinking.
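The query shape is the payoff:

```sql
CREATE INDEX thread_taxonomy_idx ON nclaw.thread USING gist (taxonomy_path);

-- Everything under work.nself: one index scan, no recursive join.
SELECT id, summary
FROM nclaw.thread
WHERE taxonomy_path <@ 'work.nself'::ltree;
```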

The alias matrix is one of those pieces of the system that only becomes essential after you've been frustrated by its absence. The problem is entity name drift. I might refer to the same person as 'Ali', 'Ali C.', 'ali@camarata.com', or 'Aric's business partner Ali' across different conversations and documents. Without an alias system, these look like different entities. With an alias matrix, they resolve to a single canonical entity, and every mention of any variant retrieves the same set of facts.

The alias_variant table stores the mapping from variant to canonical. The variant_norm column holds the normalized form of each variant, and a trigram index lets the retrieval layer do fuzzy matching at query time. The alias expansion happens in the pre_retrieve hook, before any FTS or vector query is issued. The raw query comes in, the normalization function runs, the alias table is scanned for trigram-similar variants of each named entity, and the canonical forms are injected into the retrieval subqueries. The query 'Ali's feedback on the auth flow' becomes a retrieval that finds all messages and facts related to the canonical entity, regardless of which name variant appeared in the original text.
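The expansion query itself is small. A sketch, with an illustrative column name for the canonical pointer and a similarity threshold that is a tuning assumption:

```sql
-- Resolve a query-time mention ('Ali') to canonical entities via
-- trigram similarity over the normalized variants.
SELECT DISTINCT av.canonical_entity_id
FROM nclaw.alias_variant av
WHERE similarity(av.variant_norm, nclaw.norm_text('Ali')) > 0.45;
```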

Permission-aware retrieval is not optional if nClaw is going to handle real personal data. The Postgres schema uses row-level security. Every significant table has a workspace_id column, and the RLS policies enforce that a query running under a given workspace context can only see rows belonging to that workspace. This is structural isolation, not application-level filtering. Hasura mirrors these permissions in its metadata: the same rules that govern what Postgres returns also govern what the GraphQL layer exposes. A cross-tenant memory leak is not just unlikely. It's structurally impossible under normal operation.
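A minimal version of one such policy; the session setting name is illustrative, and in practice Hasura supplies the workspace context per request:

```sql
ALTER TABLE nclaw.message ENABLE ROW LEVEL SECURITY;

-- Rows are visible only under the matching workspace context.
CREATE POLICY message_workspace_isolation ON nclaw.message
    USING (workspace_id = current_setting('nclaw.workspace_id')::uuid);
```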

The eviction policy, or rather the deliberate absence of one, is a design choice. I don't evict memory from Postgres. Old memories are archived to cold storage in MinIO, with the metadata and embeddings remaining in Postgres. The full text moves to an object in MinIO, pointed to by the source_document table. This way, retrieval still works against archived content. The embedding and FTS surfaces are still live, but the raw text lives in cheap object storage rather than occupying Postgres disk. For most personal workstation use cases, the archived content set will be small enough that this distinction barely matters. But the architecture supports it cleanly.

Let me walk through the 'year ago' recall flow end to end. The setup: fourteen months ago I had a conversation about a specific technical decision in an nSelf plugin. I want to retrieve it today.

The query arrives at nclaw-gateway as a normal chat completion request. The pre_route hook runs first. It normalizes the language, estimates the intent (this is a memory recall query, not a code generation request), and checks whether retrieval is needed. It is.

The pre_retrieve hook expands aliases. The query contains 'nSelf plugin' and possibly a person's name. The alias table is scanned for trigram-similar variants of each named entity. Canonical forms are identified.

nclaw-retriever fires three parallel queries. The lexical query hits the GIN index on content_fts across messages, memory items, and document chunks, filtering to the requesting user's workspace. The vector query hits the HNSW index on chunk_embedding_1024 and memory_embedding_1024. The exact-hit query pulls thread summaries and entity facts linked to the identified canonical entities.
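Sketches of the lexical and vector legs; the exact-hit leg is plain key lookups against the alias and fact tables, and workspace filtering is enforced by RLS regardless of what the query says:

```sql
-- Lexical leg: GIN-indexed full-text search over normalized message text.
SELECT id, ts_rank_cd(content_fts, q) AS score
FROM nclaw.message,
     websearch_to_tsquery('simple', 'nself plugin auth workspace tokens') AS q
WHERE content_fts @@ q
ORDER BY score DESC
LIMIT 50;

-- Vector leg: HNSW-indexed cosine distance against the query embedding,
-- bound by the application as a vector(1024) parameter.
SELECT memory_item_id, embedding <=> :query_embedding AS dist
FROM nclaw.memory_embedding_1024
ORDER BY dist
LIMIT 50;
```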

Reciprocal Rank Fusion merges the three candidate lists. RRF is the right choice here because the scores from lexical and vector retrieval are not on the same scale. You can't simply add them. RRF converts each list to ranks and combines the ranks with a harmonic formula. The result is a single ranked list of candidates where items that appear high in multiple lists score well, regardless of their raw distances or BM25 scores.
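In SQL, fusing two of the lists looks roughly like this, assuming the previous step materialized lexical_hits(id, score) and vector_hits(id, dist); k = 60 is the conventional constant from the original RRF paper:

```sql
WITH lex AS (
    SELECT id, row_number() OVER (ORDER BY score DESC) AS r
    FROM lexical_hits
),
sem AS (
    SELECT id, row_number() OVER (ORDER BY dist ASC) AS r
    FROM vector_hits
)
-- Items missing from a list contribute 0; items ranked high in both win.
SELECT id,
       COALESCE(1.0 / (60 + lex.r), 0)
     + COALESCE(1.0 / (60 + sem.r), 0) AS rrf_score
FROM lex
FULL OUTER JOIN sem USING (id)
ORDER BY rrf_score DESC
LIMIT 50;
```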

The cross-encoder reranker, BAAI/bge-reranker-v2-m3, takes the top fifty candidates from the fused list and scores each one against the original query jointly. This is the expensive step. The bi-encoder embeddings that powered the vector search compress the query and each document independently; the cross-encoder reads them together, which is much more accurate for final ranking. We only run it on fifty candidates, not on the full corpus.

The context packer assembles the final prompt. It pulls the top eight thread-or-memory hits, eight project-level hits, twelve document chunk hits, and the relevant summary pyramid entries for the relevant time period. The context window is now used well: it contains dense, relevant, pre-ranked material, not a random dump of recent conversation. The generator, Qwen3-14B or whichever local model is registered for the general role, produces the answer with citations back to the source message IDs.

The post_generate hook runs after the answer is produced. It writes the new turn to the database, submits the turn for async memory extraction, updates the thread summary, and records everything to audit_log. The conversation's memory grows by one more turn.

That is the complete loop. The fourteen-month-old conversation was reachable because the messages were stored in Postgres server-side, not lost when the browser session ended. The entities in that conversation were indexed in the alias matrix. The monthly summary for that period survived in the pyramid. And the hybrid retrieval system found the relevant chunks through both their semantic similarity and their lexical overlap with the query.

There's a version of this architecture that skips the memory extraction pipeline and just stores raw messages. It would still work for exact-phrase recall. But it would fail for the subtler cases: 'what did I decide about authentication?', 'what is my preferred approach to database migrations?', 'what tasks did I leave open during the nSelf sprint in February?'. These questions require structured memory (facts, preferences, tasks), not just full-text search over prose.

There's also a version that uses a separate vector database like Qdrant or Weaviate. I considered it. The argument for a dedicated vector store is query performance at very large scale. The argument against it is operational complexity and the loss of transactional consistency with the rest of the data. For a personal workstation handling millions of messages rather than billions, pgvector with an HNSW index is fast enough and dramatically simpler to operate. One database, one backup strategy, one permission model.

The routing rules deserve a brief mention because they affect memory quality indirectly. The nclaw.routing_rule table stores when_json predicates against model slugs. A code-heavy query with a long context budget routes to Qwen3-Coder-30B-A3B-Instruct, which has a 256K native context window. A general question routes to Qwen3-14B-FP8. A query flagged as requiring verification routes to the 32B reasoner as a second pass. The model registry is declarative and stored in Postgres; changing the inference backend does not require a code deploy.
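The table behind this can be small; everything beyond when_json and the model slug is illustrative:

```sql
CREATE TABLE nclaw.routing_rule (
    id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    priority   int NOT NULL,     -- first matching rule wins
    when_json  jsonb NOT NULL,   -- e.g. {"intent": "code", "min_context": 131072}
    model_slug text NOT NULL     -- e.g. 'qwen3-coder-30b-a3b-instruct'
);
```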

The nSelf infrastructure underpins all of this. Hasura sits in front of Postgres and handles the GraphQL surface and permission metadata. The nclaw schema runs inside the same Postgres instance that nSelf already provisions, with the pgvector, pg_trgm, unaccent, and ltree extensions enabled. The nSelf nginx proxies handle TLS and routing. The monitoring stack (Prometheus, Grafana, Loki, Tempo) is already running via nSelf's monitoring bundle. nClaw is not a standalone product that needs its own infrastructure. It's a plugin that extends the infrastructure I already have.

Now the limits, because there are real ones.

False-positive recall is the most common failure mode in production memory systems. The hybrid retrieval returns a result that is semantically similar to the query but factually unrelated. A conversation about database indexing surfaces when I ask about a personal project because both involve the word 'schema'. The cross-encoder reranker catches many of these, but not all. The solution is confidence-weighted presentation: show the top result with a citation, show the retrieval trace on request, and let me confirm before the answer is treated as authoritative.

Summary drift is subtler and harder to fix. As the pyramid rolls up, turn to day to week to month, each summarization step introduces a small model-dependent distortion. Over time, a monthly summary may emphasize themes that were present but secondary, or flatten nuances that mattered. I check for this by periodically retrieving the original turns that contributed to a suspect summary and comparing them manually. There is no automated solution here that I trust completely.

The 'what was important?' problem is the deepest one. Memory extraction assigns a salience score to each memory item. But salience at extraction time (based on position in the conversation, length, assertion strength) is a poor proxy for salience at retrieval time. Something I mentioned briefly in passing might turn out to be the most consequential decision I made that month. The memory extractor doesn't know that. Only time and subsequent context reveal it. This is why the archive-not-evict policy matters: keeping everything means you can re-evaluate salience later with better context, rather than having permanently lost a low-salience item that turned out to matter.

I'm also cautious about the entity resolution step. The alias matrix is hand-curated and extraction-augmented, but it's not perfect. Two different people with similar names will occasionally merge if the normalization and trigram similarity thresholds are too loose. Two references to the same concept will occasionally split if the normalization is too strict. Tuning the weight and similarity threshold parameters in alias_variant is an ongoing process.

What I've described here is not a finished product. It's a stable kernel, the BIOS of a personal AI workstation. The schema is production-ready. The retrieval contract is defined. The eviction policy is explicit. The permission model is structural. Everything else, better summarization models, smarter salience scoring, richer taxonomy, additional memory kinds, plugs into this kernel without changing its foundations. That's the point. Building the right thing slowly beats building the wrong thing quickly.

If I had to name the single most important architectural decision in this whole system, it would be this: treat the context window as execution cache, not memory. Memory lives in Postgres. The context window is where you put the retrieved pieces that are relevant to this specific question, right now. Once that distinction is clear, the rest of the design follows naturally.
