Recipe: Hybrid (BM25 + vector) retrieval
Combine sparse lexical scoring with dense semantic embeddings for recall that neither method achieves alone.
Why hybrid
BM25 excels at exact keyword matches and rare terms. Dense vectors capture paraphrases and conceptual similarity. A linear combination of the two scores, tuned with a single weight α, often outperforms either pipeline alone on standard IR benchmarks.
Architecture
┌──────────┐    ┌──────────┐
│   BM25   │    │  Vector  │
│  index   │    │  index   │
└────┬─────┘    └────┬─────┘
     │ score_d       │ score_v
     └───────┬───────┘
             ▼
final = α·score_d + (1-α)·score_v
             │
             ▼
        ┌──────────┐
        │  Merge   │
        │  top-k   │
        └──────────┘
Tuning α
- α = 0.3 — vector-heavy, best for conversational queries
- α = 0.5 — balanced default for general collections
- α = 0.7 — BM25-heavy, best for code or exact identifier search
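To make the effect of α concrete, here is a minimal Python sketch of the blend and top-k merge. It follows the formula in the diagram, where α weights the BM25 score. The function name `blend_top_k`, the dict-based score maps, and the toy scores are illustrative assumptions, not part of any particular library. It also assumes both score maps are already scaled to [0, 1], and that a document missing from one index contributes 0.0 on that side.

```python
import heapq


def blend_top_k(bm25_scores, vector_scores, alpha=0.5, k=10):
    """Blend two {doc_id: score} maps and return the k best (doc_id, score) pairs.

    Assumes both maps already hold scores scaled to [0, 1].
    A document absent from one map gets 0.0 for that side (an assumption
    made for this sketch, not a fixed rule).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    doc_ids = set(bm25_scores) | set(vector_scores)
    blended = {
        doc_id: alpha * bm25_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * vector_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # nlargest returns the pairs sorted from highest to lowest blended score.
    return heapq.nlargest(k, blended.items(), key=lambda item: item[1])


# Toy, made-up scores: "a" is a strong lexical match, "d" a strong semantic one.
bm25 = {"a": 1.0, "b": 0.2, "c": 0.0}
vec = {"a": 0.2, "b": 0.9, "d": 1.0}

for alpha in (0.3, 0.5, 0.7):
    ranked = blend_top_k(bm25, vec, alpha=alpha, k=2)
    print(alpha, [doc_id for doc_id, _ in ranked])
```

With these toy numbers, the vector-heavy setting ranks the semantic match "d" first, while the balanced and BM25-heavy settings rank the lexical match "a" first. In practice α is usually chosen by sweeping a few values against a labeled query set rather than picked by hand.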
Score normalization
BM25 scores are unbounded; cosine similarity lies in [−1, 1]. Normalize both to [0, 1] with min-max scaling over the candidate set before the linear blend. Without normalization the two scores sit on different scales, so the larger-magnitude one (typically BM25) dominates the sum and α no longer controls the balance.
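Min-max scaling over a candidate set can be sketched as follows. The function name `min_max_normalize` and the sample scores are hypothetical, and mapping an all-tied candidate set to 0.0 is one possible convention chosen for this sketch, not a standard.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: raw_score} map to [0, 1] over this candidate set."""
    if not scores:
        return {}
    lo = min(scores.values())
    hi = max(scores.values())
    if hi == lo:
        # Every candidate has the same score, so there is no range to scale by.
        # Returning 0.0 for all is an assumption of this sketch.
        return {doc_id: 0.0 for doc_id in scores}
    span = hi - lo
    return {doc_id: (score - lo) / span for doc_id, score in scores.items()}


# Made-up raw scores on very different scales.
raw_bm25 = {"a": 12.0, "b": 3.0, "c": 7.5}
raw_cosine = {"a": -0.2, "b": 0.8, "c": 0.3}

print(min_max_normalize(raw_bm25))
print(min_max_normalize(raw_cosine))
```

Because the minimum and maximum come from the candidate set of a single query, the normalized values are only comparable within that query. Min-max scaling is also sensitive to outliers: one unusually high BM25 score compresses every other candidate toward 0.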
Next step: Recipe: Cross-encoder reranking — layer a transformer on top of hybrid candidates for final precision.