Recipe
Reranker eval harness
Build a repeatable offline evaluation loop for your reranker so you can measure MRR and NDCG before shipping a new model to production.
Ingredients
- A labeled query-document relevance dataset (MS MARCO or in-house)
- Your reranker model served behind a local HTTP endpoint
- A Python script that issues requests and scores results
- pytrec_eval or a hand-rolled MRR/NDCG calculator
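The request side of the scoring script might look like the sketch below. The URL and the JSON schema (a `query` string and a `documents` list in, a `scores` list out) are assumptions made for illustration, not a known API; adjust `RERANK_URL`, `build_payload`, and `parse_scores` to match whatever your server actually exposes.

```python
import json
import urllib.request

# Hypothetical endpoint and JSON schema; change both to match your server.
RERANK_URL = "http://localhost:8000/rerank"


def build_payload(query, docs):
    """Encode one query and its candidate texts as a JSON request body."""
    return json.dumps({"query": query, "documents": docs}).encode("utf-8")


def parse_scores(body, expected):
    """Decode the response body and check there is one score per document."""
    scores = json.loads(body)["scores"]
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    return [float(s) for s in scores]


def score_documents(query, docs, url=RERANK_URL, timeout=30.0):
    """POST one query with its candidates and return the reranker scores."""
    request = urllib.request.Request(
        url,
        data=build_payload(query, docs),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_scores(response.read(), expected=len(docs))
```

Checking the score count up front catches a server that silently drops or truncates documents, which would otherwise misalign scores and document IDs.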
Steps
- Freeze a test set. Pull 500–1000 queries with at least 10 judged documents each. Never train on these queries.
- Retrieve candidates. Use BM25 or your first-stage retriever to fetch the top-100 documents per query. This is the pool your reranker will reorder.
- Rerank. Send each query-document pair to your reranker endpoint. Collect the relevance scores.
- Sort and truncate. Sort each query's documents by reranker score in descending order and keep the top k (k=10 is standard).
- Evaluate. Feed the ranked lists and qrels into pytrec_eval. Track MRR@10 and NDCG@10 as your primary metrics.
- Compare baselines. Run the same eval, on the same test set and candidate pool, for BM25-only and for your previous reranker version. A 2% NDCG lift is meaningful.
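The rerank, sort, and evaluate steps above can be sketched with a hand-rolled calculator. This is a minimal illustration, not pytrec_eval's implementation: `score_fn` is a stand-in for any callable that returns one score per document (for example a wrapper around your endpoint), the function names are hypothetical, and NDCG here uses linear gain with a log2 discount. Tools that use an exponential gain will give different numbers.

```python
import math


def rerank_top_k(query, candidates, score_fn, k=10):
    """Score (doc_id, text) candidates, sort by score descending, keep top k IDs."""
    doc_ids = [doc_id for doc_id, _ in candidates]
    scores = score_fn(query, [text for _, text in candidates])
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def reciprocal_rank(ranked_ids, rels, k=10):
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if rels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg(ranked_ids, rels, k=10):
    """NDCG@k; documents missing from rels (unjudged) count as gain 0."""
    gains = [rels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(rels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def evaluate(runs, qrels, k=10):
    """Average MRR@k and NDCG@k over every query in qrels.

    runs:  {query_id: [doc_id, ...]} in ranked order
    qrels: {query_id: {doc_id: relevance_label}}
    """
    n = len(qrels)
    mrr = sum(reciprocal_rank(runs.get(q, []), rels, k) for q, rels in qrels.items())
    gain = sum(ndcg(runs.get(q, []), rels, k) for q, rels in qrels.items())
    return {"mrr": mrr / n, "ndcg": gain / n}
```

Running `evaluate` on the BM25-only run, the previous reranker, and the new reranker with the same `qrels` gives directly comparable numbers. Queries missing from a run score 0 rather than being skipped, so a reranker cannot look better by failing on hard queries.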
Watch out for
- Unjudged documents treated as irrelevant: a reranker that surfaces relevant but unlabeled documents gets no credit for them, which skews NDCG downward
- Overfitting to MS MARCO's passage length distribution
- Latency drift when the reranker model is swapped under load
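One way to spot the unjudged-document problem is to measure how much of each top-k list has a judgment at all. The sketch below assumes qrels are stored as `{query_id: {doc_id: label}}` with judged-irrelevant documents present under label 0; the function names and the 0.8 threshold are illustrative choices, not a standard.

```python
def judged_at_k(ranked_ids, judged_ids, k=10):
    """Fraction of the top-k documents that have any relevance judgment."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in judged_ids) / len(top)


def low_coverage_queries(runs, qrels, k=10, threshold=0.8):
    """Return (query_id, judged@k) pairs below the threshold, worst first."""
    flagged = []
    for query_id, ranked_ids in runs.items():
        coverage = judged_at_k(ranked_ids, qrels.get(query_id, {}), k)
        if coverage < threshold:
            flagged.append((query_id, coverage))
    return sorted(flagged, key=lambda item: item[1])
```

If a new reranker has noticeably lower judged@10 than the baseline, part of any NDCG drop may come from missing labels rather than worse ranking.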
Pro tip: Log every reranker score alongside the final metric run. When MRR regresses, you can diff score distributions to find which query clusters broke.
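A minimal sketch of that logging and diffing, assuming scores are kept as `{query_id: {doc_id: score}}` and that you maintain your own mapping from query ID to cluster label (`cluster_of` here is hypothetical). The CSV layout is likewise just one possible choice.

```python
import csv
import statistics
from collections import defaultdict


def log_scores(path, run_id, scored):
    """Write one CSV row per (query, document) score for a single eval run."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["run_id", "query_id", "doc_id", "score"])
        for query_id, docs in scored.items():
            for doc_id, score in docs.items():
                writer.writerow([run_id, query_id, doc_id, score])


def load_scores(path):
    """Read a score log back into {query_id: {doc_id: score}}."""
    scored = defaultdict(dict)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            scored[row["query_id"]][row["doc_id"]] = float(row["score"])
    return dict(scored)


def mean_shift_by_cluster(old, new, cluster_of):
    """Mean score change per cluster, largest absolute shift first.

    Only (query, document) pairs present in both runs are compared.
    """
    deltas = defaultdict(list)
    for query_id, docs in new.items():
        old_docs = old.get(query_id, {})
        for doc_id, score in docs.items():
            if doc_id in old_docs:
                cluster = cluster_of.get(query_id, "unknown")
                deltas[cluster].append(score - old_docs[doc_id])
    shifts = {cluster: statistics.mean(values) for cluster, values in deltas.items()}
    return sorted(shifts.items(), key=lambda item: abs(item[1]), reverse=True)
```

Raw score differences are only comparable when both model versions emit scores on the same scale; if they do not, compare ranks or normalized scores instead.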