AI Infrastructure
May 12, 2026 12 min read

Why Hybrid Retrieval Beats Pure Vector Search in Production RAG

The Semantic Similarity Trap

In the early stages of building our Hybrid AI System, we relied exclusively on vector embeddings for retrieval. The promise was simple: "Capture the meaning, not just the keywords." However, once we moved from curated benchmarks to production data, we hit a wall.

Naive vector search failed consistently on three fronts:

  1. Specific Identifiers: UUIDs, product codes, and version numbers (e.g., "v2.4.1") often land almost on top of near-identical strings, such as other versions, in embedding space, so the retriever returns context for the wrong one.
  2. Domain-Specific Jargon: Rare technical terms are often poorly represented in the embedding model, so they are not retrieved reliably and lose out to common synonyms.
  3. The "Out-of-Distribution" Problem: New terms not present in the embedding model's training data often result in erratic vector placements.

Architecture: The Dual-Pipeline Strategy

To solve this, we moved to a Hybrid Retrieval Architecture. Instead of a single query, every user request triggers two parallel processes:

1. The Semantic Pipeline (Dense Retrieval)

Using ChromaDB with text-embedding-3-small, we retrieve the top 50 documents based on cosine similarity. This captures the "intent" and "nuance" of the query. We found that setting the chunk size to 512 tokens with a 10% overlap provided the best balance between context retention and retrieval precision.
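
The chunking setup described above can be sketched as a sliding window. This is a minimal illustration, not our production chunker: the function name `chunk_tokens` is hypothetical, and it assumes the text has already been tokenized with the same tokenizer the embedding model uses (the example just uses a plain list of strings).

```python
from typing import List


def chunk_tokens(
    tokens: List[str],
    chunk_size: int = 512,
    overlap_ratio: float = 0.10,
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # 51 tokens for 512 at 10%
    step = chunk_size - overlap                # how far each window advances

    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each window starts 461 tokens after the previous one, so the last 51 tokens of one chunk are repeated at the start of the next.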

2. The Keyword Pipeline (Sparse Retrieval)

For exact-term matching, we use PostgreSQL Full Text Search with a TSVECTOR column and a GIN index.

-- Example keyword search ranked by cover density (ts_rank_cd)
SELECT id,
       ts_rank_cd(text_vector, query, 32 /* normalization: rank / (rank + 1) */) AS rank
FROM documents, to_tsquery('english', 'distributed & systems') query
WHERE text_vector @@ query
ORDER BY rank DESC
LIMIT 50;
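
To show how the two pipelines can run side by side, here is a minimal sketch using asyncio. The functions `dense_search` and `sparse_search` are hypothetical stand-ins that return hard-coded document IDs. A real version would call the vector store and run the SQL above, and a blocking client would need to be wrapped, for example with `asyncio.to_thread`.

```python
import asyncio
from typing import List, Tuple


async def dense_search(query: str, limit: int = 50) -> List[str]:
    """Hypothetical stand-in for the vector-store lookup; returns ranked document IDs."""
    await asyncio.sleep(0.01)  # simulates network / database latency
    return ["doc-3", "doc-1", "doc-7"][:limit]


async def sparse_search(query: str, limit: int = 50) -> List[str]:
    """Hypothetical stand-in for the full-text query; returns ranked document IDs."""
    await asyncio.sleep(0.01)  # simulates network / database latency
    return ["doc-1", "doc-9", "doc-3"][:limit]


async def retrieve_both(query: str, limit: int = 50) -> Tuple[List[str], List[str]]:
    """Run both pipelines concurrently and return (dense_ranking, sparse_ranking)."""
    dense, sparse = await asyncio.gather(
        dense_search(query, limit),
        sparse_search(query, limit),
    )
    return dense, sparse
```

Calling `asyncio.run(retrieve_both("distributed systems"))` returns the two ranked lists together, so the total wait is roughly that of the slower pipeline rather than the sum of both.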

The Fusion Logic: Reciprocal Rank Fusion (RRF)

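The hardest part of hybrid search isn't retrieving results; it's merging them. Since vector similarity scores and keyword ranking scores (BM25 or PostgreSQL's ts_rank_cd) are computed on different scales and are not directly comparable, we implemented Reciprocal Rank Fusion, which ignores the raw scores and uses only each document's position in each result list.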

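The formula we used: Score(d) = Σ_p 1 / (k + rank(d, p)), where the sum runs over the pipelines p, rank(d, p) is the 1-based position of document d in pipeline p's result list, and k = 60. A document that is missing from a list contributes nothing for that pipeline.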

This ensures that a document appearing at the top of either list is given significant weight, while documents appearing in both lists are prioritized as the "gold standard" context. We also experimented with Cross-Encoders for re-ranking the top 10 results, which further improved accuracy but added ~200ms to the latency. We decided that tradeoff was worth it for "Analytical" queries but skipped re-ranking for "Conversational" ones.
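
The fusion formula above can be sketched in a few lines. This is an illustrative version rather than our exact implementation: the function name `reciprocal_rank_fusion` is hypothetical, ranks are assumed to be 1-based, and each input list is assumed to contain a document ID at most once.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs into (doc_id, score) pairs, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # rank is 1-based: the top document in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc_id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, fusing `["a", "b", "c"]` with `["b", "d", "a"]` ranks "b" first and "a" second, because both appear in the two lists, followed by "d" and "c", which each appear in only one.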

Engineering Tradeoffs: Latency vs. Precision

Implementing hybrid retrieval added approximately 18ms of overhead to our p95 latency.

  • The Cost: Increased compute on the DB layer and additional orchestration logic.
  • The Gain: A 40% increase in precision and a near-total elimination of "hallucinations of omission", where the model claimed it didn't have information that was clearly present in the database and matched the query as an exact keyword.

Failure Cases & Mitigation: The "Keyword Bomb" Problem

During the rollout, we noticed that high-frequency terms in the keyword pipeline were polluting the RRF scores. PostgreSQL's built-in English stopword list removes words like "the", but not terms that are merely common in our domain, so we had to implement a custom stopword filtering layer before the PG query to ensure that common terms like "system" didn't drown out specific terms like "Kafka". Additionally, we implemented Query Expansion using a small local LLM to generate synonyms before the keyword search, significantly improving recall for users who didn't know the exact technical terminology.
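
A filtering layer of this kind might look like the sketch below. Everything here is an assumption for illustration: the `DOMAIN_STOPWORDS` list, the `build_tsquery` name, and the `synonyms` mapping, which stands in for whatever the local LLM returns. Tokens are restricted to lowercase letters and digits, so the output contains no tsquery operators of its own, though it should still be passed to `to_tsquery` as a bound parameter.

```python
import re
from typing import Dict, FrozenSet, List, Optional

# Hypothetical domain stopword list; the real list depends on the corpus.
DOMAIN_STOPWORDS: FrozenSet[str] = frozenset({"system", "systems", "service", "data"})

_TOKEN = re.compile(r"[a-z0-9]+")


def build_tsquery(
    user_query: str,
    stopwords: FrozenSet[str] = DOMAIN_STOPWORDS,
    synonyms: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Drop domain stopwords, OR each term with its synonyms, and AND the terms together."""
    synonyms = synonyms or {}
    terms: List[str] = []
    seen = set()
    for token in _TOKEN.findall(user_query.lower()):
        if token in stopwords or token in seen:
            continue
        seen.add(token)
        # Keep only synonyms that are single alphanumeric words.
        alternatives = [token] + [
            s for s in synonyms.get(token, []) if _TOKEN.fullmatch(s)
        ]
        if len(alternatives) > 1:
            terms.append("(" + " | ".join(alternatives) + ")")
        else:
            terms.append(token)
    return " & ".join(terms)
```

With this sketch, `build_tsquery("Kafka system lag", synonyms={"lag": ["delay", "latency"]})` produces `kafka & (lag | delay | latency)`. An empty result means every term was filtered out, in which case the caller should skip the keyword pipeline. Note that this simple tokenizer splits identifiers such as "v2.4.1" into separate pieces, so a production version would need a rule that keeps them intact.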

Final Takeaway: Systems Over Models

Vector search is not a replacement for keyword search; it is an augmentation. For production-grade AI infrastructure, the reliability of exact matching combined with the intelligence of semantic search is the only way to achieve "Senior-level" retrieval accuracy. The future of RAG isn't better models; it's better retrieval engineering.