
Hybrid Retrieval: Why pgvector Alone Isn't Enough

16 min read

Pure vector search fails on exact terms, codes, and identifiers. Building a retrieval system that actually works means combining tsvector lexical recall, pgvector semantic search, pg_trgm fuzzy matching, an alias matrix, and Reciprocal Rank Fusion, with cross-encoder reranking on top.

I've spent the better part of two decades building search into enterprise systems (AT&T, Capgemini, clients like Credit Karma and Cigna) and I made the same mistake each time: I trusted one retrieval signal too much. Full-text search when that was the hot thing, vector similarity once embeddings got good. The gap I kept hitting was this: users would search for something specific, something they knew was in the system, and the retriever would hand back plausible-looking results that weren't it. Building nClaw's memory layer finally forced me to solve the problem properly instead of papering over it.

There's a class of recall failure that embarrasses every RAG system eventually. You ask for something specific (a person's name, an acronym, a coded reference like "Ch35") and the retriever comes back with thematically adjacent content that completely misses what you wanted. The problem isn't the embedding model. The problem is that semantic similarity and exact recall are fundamentally different operations, and pure vector search only does one of them.

I ran into this while designing the memory retrieval layer for nClaw. The goal was simple: a personal AI workstation that could recall anything I'd fed it (emails, notes, chat exports, documents) with the kind of reliability I'd expect from a search engine, not from a drunk friend trying to remember something I said six months ago. That goal pushed me deep into hybrid retrieval, Reciprocal Rank Fusion, and eventually cross-encoder reranking. This post covers what I learned and why each piece exists.

Start with the retrieval problem statement. When you embed a query and search a vector index, you're finding content that is semantically close in the embedding space. That works well for topical queries: "what were we discussing about the deployment pipeline?" will surface relevant context because the embedding captures meaning. But it fails for exact queries. "Ch35" (a common reference to Chapter 35 of the Code of Federal Regulations, which governs VA vocational rehabilitation) has an embedding that shares space with other regulatory references, other numbered sections, other legal documents. The cosine distance to the exact string "Ch35" in a document titled "Chapter 35 Notes" is not zero. It might not even be in the top 10 results.

This isn't a hypothetical failure mode. Acronyms fail the same way: "VR&E" and "vocational rehabilitation and employment" live in nearby but not identical embedding neighborhoods, and a query using one won't reliably surface documents using the other. Person names are even worse. Common first names like "Ali" share embedding space with a hundred other contextual uses of that string. Medical codes, case numbers, ISBN references, git commit hashes: anything that's more like an identifier than a concept will silently miss on pure vector retrieval.

The academic framing here is the distinction between "recall by meaning" and "recall by term." Sparse retrieval systems like BM25 and TF-IDF were built for the latter. They don't understand meaning; they count tokens and weight them by frequency statistics. For years the IR community poured enormous energy into replacing sparse retrieval with dense retrieval. What the last several years of production experience have shown is that you don't want to replace one with the other. You want both, fused intelligently.

PostgreSQL gives you both in one database, which is why I made it the retrieval backbone for nClaw instead of a separate search service. The relevant extensions are pgvector for dense retrieval, tsvector/tsquery with GIN indexes for lexical retrieval, pg_trgm for fuzzy matching, and unaccent for normalization. Let me walk through each one.

The tsvector type stores a preprocessed lexical representation of a document. When you write a generated column like this:

```sql
content_fts tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('simple', coalesce(content_norm, '')), 'A')
) STORED
```

PostgreSQL preprocesses the text at write time and stores the lexeme set. A GIN index on that column turns full-text queries into fast bitmap lookups. The 'simple' dictionary skips stemming and stopword removal, which matters for exact domain terms, because stemming 'Ch35' to some reduced form is the last thing you want.

The tsquery syntax lets you express boolean retrieval: `to_tsquery('simple', 'Ch35 | chapter35 | ch_35')` hits any document containing any of those tokens. This is precise in a way that vector search isn't. If the string "Ch35" appears in a document, this query will find it. That's the guarantee you need for exact recall.
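
Putting the pieces together, the exact-recall lookup is a straight indexed match. A minimal sketch, assuming the `content_fts` column from above lives on the `nclaw.message` table:

```sql
-- Exact recall: any message containing any of the token variants.
-- The GIN index on content_fts drives the lookup; ts_rank_cd orders matches.
SELECT id,
       ts_rank_cd(content_fts, q) AS score
FROM nclaw.message,
     to_tsquery('simple', 'ch35 | chapter35') AS q
WHERE content_fts @@ q
ORDER BY score DESC
LIMIT 10;
```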

pgvector handles the other half. It stores high-dimensional float vectors and supports nearest-neighbor search. The two index types it offers have different tradeoff profiles. IVFFlat partitions the vector space into Voronoi cells (the `lists` parameter), assigns each vector to its nearest centroid, and queries by searching the nearest `probes` lists. It's fast to build but its recall degrades if the query vector lands near a cell boundary, and you only find out it missed when the system returns confidently wrong answers.
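
For comparison, a minimal IVFFlat setup looks like this; the `lists` and `probes` values are illustrative starting points, not tuned numbers:

```sql
-- IVFFlat: partition the vector space into 100 cells at build time.
-- A common rule of thumb for smaller tables is lists ≈ rows / 1000.
CREATE INDEX idx_chunk_embedding_ivfflat
    ON nclaw.chunk_embedding_1024
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Search more cells per query to trade latency for recall (default is 1).
SET ivfflat.probes = 10;
```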

HNSW (Hierarchical Navigable Small World) builds a layered proximity graph during indexing. Queries traverse the graph in roughly logarithmic time, and the recall curve is much better than IVFFlat's at equivalent query speed. The index takes longer to build and uses more memory (the `m` parameter controls edge count per node; `ef_construction` controls the quality of the index-time search). For a personal workstation where the dataset fits in RAM and you'd rather have reliable recall than marginally faster index builds, HNSW is the right default:

```sql
CREATE INDEX idx_chunk_embedding_hnsw
    ON nclaw.chunk_embedding_1024
    USING hnsw (embedding vector_cosine_ops);
```
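
If you do want to tune the build, the parameters go in the index definition; the values shown here are pgvector's defaults:

```sql
-- m: edges per node; ef_construction: candidate list size during build.
CREATE INDEX idx_chunk_embedding_hnsw
    ON nclaw.chunk_embedding_1024
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Query-time candidate list size (default 40); raise for better recall.
SET hnsw.ef_search = 100;
```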

The distance metric choice matters too. Cosine similarity normalizes by vector magnitude, which means it compares direction rather than magnitude. For text embeddings that encode semantic meaning, direction is what you care about. A long document and a short document about the same topic should score similarly regardless of the magnitude difference their different token counts produce. L2 (Euclidean) distance is the right choice for embeddings where magnitude carries information, like image feature vectors. For text RAG, use cosine.
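
In pgvector, the metric is encoded in both the query operator and the index operator class, and the two have to agree for the index to be used. A quick sketch:

```sql
-- Cosine distance: <=> pairs with vector_cosine_ops (used above).
SELECT chunk_id
FROM nclaw.chunk_embedding_1024
ORDER BY embedding <=> $1::vector
LIMIT 10;

-- L2 distance: <-> pairs with vector_l2_ops instead.
SELECT chunk_id
FROM nclaw.chunk_embedding_1024
ORDER BY embedding <-> $1::vector
LIMIT 10;
```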

pg_trgm works by breaking strings into overlapping trigrams (sequences of three characters) and building an index over those character n-grams. A GIN index on a trigram column turns approximate string matching into an indexed operation. The similarity function returns a float between 0 and 1, where 1 is an exact match:

```sql
CREATE INDEX idx_message_norm_trgm
    ON nclaw.message
    USING gin (content_norm gin_trgm_ops);

-- The % operator is the index-assisted match (threshold defaults to 0.3);
-- similarity() exposes the raw score when you want an explicit cutoff.
SELECT *
FROM nclaw.message
WHERE content_norm % 'chaptr 35'
  AND similarity(content_norm, 'chaptr 35') > 0.3;
```

This catches typos, OCR artifacts, and casual spelling variations that neither FTS nor vector search would find. 'chaptr 35' has an edit distance of 1 from 'chapter 35'. BM25 misses it entirely. A vector embedding might or might not surface it depending on training data. Trigram similarity finds it reliably.

The unaccent extension normalizes away diacritical marks before indexing. This matters more than it sounds for a personal workstation that ingests email and documents from multiple sources: 'naïve' and 'naive', 'résumé' and 'resume', 'Aric Camarata' in a document that was OCR'd from a scanned letter versus typed. Unaccent makes these variants equivalent at index time. The pattern is to run normalization through a stored function that lower-cases, strips accents, and collapses non-alphanumeric characters:

```sql
CREATE FUNCTION nclaw.normalize_text(input_text text)
RETURNS text
LANGUAGE sql IMMUTABLE
AS $$
    SELECT trim(
        regexp_replace(
            lower(unaccent(coalesce(input_text, ''))),
            '[^a-z0-9]+', ' ', 'g'
        )
    );
$$;
```

Store the normalized form as a generated column alongside the raw text. Index the normalized form for fuzzy matching. Run both the query and the indexed values through the same normalization function.
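
Wiring the function in might look like this; the raw `content` column on `nclaw.message` is an assumption about the base table. Postgres accepts the function inside a generated column because it's declared IMMUTABLE:

```sql
-- Materialize the normalized form once at write time, then index it.
ALTER TABLE nclaw.message
    ADD COLUMN content_norm text
    GENERATED ALWAYS AS (nclaw.normalize_text(content)) STORED;
```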

Now the alias problem. Most retrieval discussions gloss over this, but it's a first-order practical issue. Users reference the same entity in different ways: "Chapter 35", "Ch35", "ch.35", "38 CFR Chapter 35", "VR&E Chapter 35". These all point to the same thing. Without an alias system, you're counting on users to query the exact term that appears in the indexed document, which they won't do.

The right model is a relational alias table, not a heuristic. Something like this:

```sql
CREATE TABLE nclaw.alias_variant (
    id             bigserial PRIMARY KEY,
    workspace_id   uuid NOT NULL REFERENCES nclaw.workspace(id),
    entity_id      uuid REFERENCES nclaw.entity(id),
    variant        text NOT NULL,
    variant_norm   text NOT NULL,
    canonical_text text NOT NULL,
    canonical_norm text NOT NULL,
    source         text NOT NULL DEFAULT 'manual',
    weight         real NOT NULL DEFAULT 1.0
);

CREATE INDEX idx_alias_variant_trgm
    ON nclaw.alias_variant
    USING gin (variant_norm gin_trgm_ops);
```

The query pipeline expands the user's input against this table before retrieval. If the query matches a known variant, you retrieve using the canonical form instead. This is deterministic, auditable, and maintainable. When you discover that users write "VRE" and "VR&E" interchangeably, you add both as variants of the canonical entity. The system immediately improves. No retraining, no re-embedding. Just a row in a table.
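
Growing the matrix really is just inserts. A sketch with the VR&E example (the values are illustrative):

```sql
INSERT INTO nclaw.alias_variant
    (workspace_id, variant, variant_norm, canonical_text, canonical_norm)
VALUES
    ($1, 'VR&E', nclaw.normalize_text('VR&E'),
     'vocational rehabilitation and employment',
     nclaw.normalize_text('vocational rehabilitation and employment')),
    ($1, 'VRE', nclaw.normalize_text('VRE'),
     'vocational rehabilitation and employment',
     nclaw.normalize_text('vocational rehabilitation and employment'));
```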

The alias expansion query looks like this:

```sql
WITH q AS (
    SELECT $1::text AS raw_query,
           nclaw.normalize_text($1::text) AS norm_query
),
alias_hits AS (
    SELECT av.canonical_text, av.canonical_norm, av.weight
    FROM nclaw.alias_variant av, q
    WHERE av.variant_norm % q.norm_query
       OR av.variant_norm = q.norm_query
    ORDER BY similarity(av.variant_norm, q.norm_query) DESC,
             av.weight DESC
    LIMIT 20
)
SELECT * FROM alias_hits;
```

The `%` operator is pg_trgm's similarity threshold operator. You get both exact and fuzzy alias matching in a single indexed lookup.

With all three retrieval surfaces ready (lexical FTS, semantic pgvector, and exact/alias), you need a way to combine them. This is where Reciprocal Rank Fusion enters.

Cormack, Clarke, and Buettcher introduced RRF in their 2009 SIGIR paper "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." The core insight is elegant: rather than trying to normalize and compare scores across different retrieval systems (which have incompatible score distributions), you combine the rank positions. The formula is

```
score(d) = Σ_i 1 / (k + rank_i(d))
```

where `k` is a smoothing constant (typically 60), `rank_i(d)` is the rank of document `d` in retrieval system `i`, and the sum runs over all retrieval systems. Documents not present in a system's top-K get an implicit rank of K+1 or are omitted.

The reason you use reciprocal rank instead of raw scores is that scores from different systems aren't comparable. A BM25 score of 12.7 and a cosine similarity of 0.84 tell you nothing useful about which document is more relevant. Rank positions are comparable: being ranked 3rd by the lexical system and 1st by the vector system is meaningful regardless of what the underlying scores were. The constant k=60 was chosen empirically in the original paper to prevent high-ranked documents from dominating excessively. A document ranked 1st contributes 1/(60+1) ≈ 0.0164; a document ranked 10th contributes 1/(60+10) ≈ 0.0143. The difference is real but not catastrophic.

The teams behind the Anserini and Pyserini retrieval toolkits have published extensive benchmarks comparing fusion strategies across TREC, BEIR, and MS MARCO datasets. The consistent finding is that RRF improves over any single retrieval method on recall-oriented benchmarks, and that performance is surprisingly insensitive to the exact choice of k. You can set k=60 and forget it. Weighted-sum combinations, by contrast, require you to calibrate weights per dataset, which makes them nearly impossible to tune for a general-purpose system processing heterogeneous personal documents.

A concrete SQL implementation of the fusion looks like this:

```sql
WITH
-- Lexical retrieval: top 100 by FTS rank
lexical AS (
    SELECT c.id,
           row_number() OVER (ORDER BY ts_rank_cd(c.chunk_fts, q.q) DESC) AS lex_rank
    FROM nclaw.source_chunk c,
         to_tsquery('simple', $1) AS q
    WHERE c.chunk_fts @@ q.q
      AND c.workspace_id = $2
    ORDER BY ts_rank_cd(c.chunk_fts, q.q) DESC
    LIMIT 100
),
-- Semantic retrieval: top 100 by cosine distance
semantic AS (
    SELECT c.id,
           row_number() OVER (ORDER BY e.embedding <=> $3::vector ASC) AS sem_rank
    FROM nclaw.source_chunk c
    JOIN nclaw.chunk_embedding_1024 e ON e.chunk_id = c.id
    WHERE c.workspace_id = $2
    ORDER BY e.embedding <=> $3::vector
    LIMIT 100
),
-- Collect all candidates from either list
candidates AS (
    SELECT id FROM lexical
    UNION
    SELECT id FROM semantic
),
-- RRF fusion with k = 60
fused AS (
    SELECT c.id,
           COALESCE(1.0 / (60 + l.lex_rank), 0)
         + COALESCE(1.0 / (60 + s.sem_rank), 0) AS rrf_score
    FROM candidates c
    LEFT JOIN lexical  l ON l.id = c.id
    LEFT JOIN semantic s ON s.id = c.id
)
SELECT f.id, f.rrf_score, sc.chunk_text
FROM fused f
JOIN nclaw.source_chunk sc ON sc.id = f.id
ORDER BY f.rrf_score DESC
LIMIT 50;
```

The $3 parameter is the query embedding vector. You compute it once at query time and pass it in. The union of lexical and semantic candidates gives you a pool that neither system would produce alone.

BGE-M3 is relevant here not just as an embedding model but as a validation of the architecture. The model card for BAAI/bge-m3 explicitly recommends hybrid retrieval combined with reranking as the expected usage pattern. It even provides a reference retrieval pipeline. The 1024-dimensional embeddings, multilingual support, and 8192-token document limit make it the right default for a personal workstation ingesting heterogeneous sources. Using BGE-M3 for both dense retrieval and as a feature source for reranking is a design choice with direct support from the model's own documentation.

After RRF fusion, you have a ranked list of 50 candidates. Those candidates were selected by combining lexical and semantic evidence, which is good. But the ranking is still based on independent retrieval signals, not on the joint relevance of the query and the full document text. Cross-encoder reranking fixes this.

A cross-encoder takes the query and each candidate document as a pair and produces a relevance score by attending across both simultaneously. This is much more expensive than bi-encoder retrieval, which pre-encodes documents independently, and that expense is exactly why you don't use it for the initial retrieval stage. On 50 candidates it's fast enough to be practical. The reranker sees the full query in context with each candidate and scores them jointly.

BAAI/bge-reranker-v2-m3 is the natural pairing with BGE-M3 for this architecture. The model card positions it as a lightweight multilingual reranker for exactly this cross-encoder step. The reranking call is straightforward: pass each (query, candidate_text) pair to the reranker's scoring endpoint and re-sort the 50 candidates by the output score. The top 20-30 candidates after reranking become the context window for generation.

The full pipeline, from query to context pack, looks like this:

```
function Retrieve(query, scope):
    nq       = normalize(query)
    expanded = alias_expand(nq, scope)               # relational lookup
    exact    = thread_and_entity_hits(nq, scope)     # exact match, rank first
    lexical  = fts_candidates(expanded, scope, top_k=100)
    vector   = vector_candidates(embed(query), scope, top_k=100)
    fused    = rrf_fuse([exact, lexical, vector], k=60)
    reranked = cross_encoder_rerank(query, fused.top(50))
    return context_pack(
        local_thread = reranked.filter(source='thread').top(8),
        project      = reranked.filter(source='project').top(8),
        docs         = reranked.filter(source='document').top(12),
        summaries    = summary_pyramid(scope)
    )
```

One piece that doesn't get enough attention: why does the context pack differentiate by source layer? The local thread hits are the most recent, most contextually specific memories. They should appear first in the context window. Project-level hits are still scoped but potentially older. Document hits are broader. Summary entries from the summary pyramid give the model a coarse orientation before the fine-grained evidence. That ordering matters because transformer attention isn't perfectly uniform. Positioning your most relevant evidence in the first and last positions of the context window gets slightly better utilization than burying it in the middle.

ltree adds a dimension that none of the above handles: hierarchical recall. If a project has a taxonomy path of `work.veterans.education.ch35`, an ltree query using the `<@` operator (left path is a descendant of the right) will match anything under `work.veterans`. This is how you implement "show me everything from the veterans project" without scanning all workspace content. It's also how you build topic-scoped retrieval: narrow the candidate set by taxonomy path before running lexical and vector search, reducing both noise and query time. The GiST index on ltree paths makes ancestor/descendant queries fast even on large taxonomies.
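
A sketch of the scoping pieces, assuming a hypothetical `taxonomy_path` ltree column on the chunk table:

```sql
-- GiST makes ancestor/descendant checks indexed operations.
CREATE INDEX idx_chunk_taxonomy
    ON nclaw.source_chunk
    USING gist (taxonomy_path);

-- Everything under the veterans project, at any depth:
SELECT id
FROM nclaw.source_chunk
WHERE taxonomy_path <@ 'work.veterans'::ltree;
```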

The combination of all five mechanisms is not over-engineering: tsvector for lexical recall, pgvector for semantic recall, pg_trgm for fuzzy tolerance, unaccent for normalization, and ltree for hierarchical scoping. Each one handles a failure mode that the others miss. FTS finds exact terms that vector search would miss. Vector search finds conceptually relevant content that FTS would miss. Trigrams catch the spelling drift and OCR noise that neither FTS nor vector handles. The alias matrix handles the term-variant problem systematically. ltree scopes queries to avoid cross-project noise. RRF combines the evidence without requiring score normalization. The cross-encoder reranks the fused results with full joint attention. Each layer has a specific job.

There are real reasons this architecture might be overkill for your use case. If your corpus is small (under 50,000 chunks), the recall difference between pure vector search and hybrid retrieval probably doesn't matter in practice. You'll get high recall either way because the index is small enough that cosine search covers most of it. If your users only ask natural-language questions and never use codes, identifiers, or abbreviations, the FTS layer won't add much. If your content is homogeneous (all prose, one language, no domain jargon), the alias and normalization layers are unnecessary overhead. The hybrid architecture pays off specifically when the corpus is large, the queries are mixed (sometimes exact, sometimes semantic), and the content has domain-specific terminology.

There are also situations where this architecture is still not enough. The most common is temporal context: "what was I thinking about this project six months ago" is a query where recency should weight heavily, but RRF doesn't know about time. You need a post-fusion temporal filter or a separate summary pyramid that materializes period-level summaries. The other gap is multi-hop reasoning: "find the email that referenced the note I wrote about Ch35" requires linking across sources, which single-stage retrieval can't do. That calls for a multi-round retrieval loop where the first round surfaces candidate entities and the second round uses those entities as anchors for deeper search. Neither problem is solved by better fusion; they require a different orchestration pattern on top of the retrieval layer.
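
For the temporal gap, one option is an exponential recency decay applied after fusion. A sketch that would replace the final SELECT of the fusion query earlier in this post; the 90-day half-life and the `created_at` column are assumptions:

```sql
-- Halve the fused score for every 90 days of age.
SELECT f.id,
       f.rrf_score
         * exp(-ln(2.0) * extract(epoch FROM now() - sc.created_at)
               / (86400.0 * 90)) AS decayed_score
FROM fused f
JOIN nclaw.source_chunk sc ON sc.id = f.id
ORDER BY decayed_score DESC
LIMIT 50;
```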

The central point is that retrieval in a personal workstation is fundamentally harder than retrieval in a web search index. Web search operates on well-formed text documents from known sources, optimized for crawling. A personal workstation ingests emails with quoted history, OCR'd scans, exported chat logs, code files, calendar events, and voice-to-text transcriptions. The content quality is lower, the term variation is higher, and the tolerance for misses is lower. If you ask for a specific thing you know you stored, getting thematically adjacent noise is not acceptable. Hybrid retrieval with RRF and cross-encoder reranking is the architecture that gets you from "probably finds relevant content" to "reliably finds the exact thing you asked for." That difference matters more in a personal context than almost anywhere else.
