Recipe

Reranker eval harness

Build a repeatable offline evaluation loop for your reranker so you can measure MRR and NDCG before shipping a new model to production.

Ingredients

A labeled query-document relevance dataset (MS MARCO or in-house)
Your reranker model served behind a local HTTP endpoint
A Python script that issues requests and scores results
pytrec_eval or a hand-rolled MRR/NDCG calculator
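
If you go the hand-rolled route, a calculator can be quite small. The sketch below is one possible version, not a drop-in replacement for pytrec_eval. It assumes graded integer judgments where 0 means not relevant, uses linear gain with a log2 discount, and counts unjudged documents as gain 0. The function names are illustrative.

```python
import math
from typing import Callable, Dict, List

Judgments = Dict[str, int]  # doc id -> graded relevance (0 = not relevant)


def reciprocal_rank(ranked: List[str], judgments: Judgments, k: int = 10) -> float:
    """Return 1 / rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if judgments.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains))


def ndcg(ranked: List[str], judgments: Judgments, k: int = 10) -> float:
    """NDCG@k. Assumption: documents without a judgment contribute gain 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_over_queries(
    metric: Callable[[List[str], Judgments, int], float],
    run: Dict[str, List[str]],
    qrels: Dict[str, Judgments],
    k: int = 10,
) -> float:
    """Average a per-query metric over every query in qrels.

    Queries that are missing from the run score 0 rather than being skipped.
    """
    if not qrels:
        return 0.0
    total = sum(metric(run.get(qid, []), judgments, k) for qid, judgments in qrels.items())
    return total / len(qrels)
```

Calling mean_over_queries(reciprocal_rank, run, qrels) gives MRR@10, and passing ndcg instead gives NDCG@10. Some NDCG variants use an exponential gain (2^rel - 1), so check which one you are comparing against before mixing numbers from different tools.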

Steps

Freeze a test set. Pull 500–1000 queries with at least 10 judged documents each. Never train on these queries.
Retrieve candidates. Use BM25 or your first-stage retriever to fetch the top 100 documents per query. This is the pool your reranker will reorder.
Rerank. Send each query-document pair to your reranker endpoint. Collect the relevance scores.
Sort and truncate. Sort each query's documents by reranker score, highest first, and keep the top k (k=10 is standard).
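Evaluate. Feed the ranked lists and qrels into pytrec_eval. Track MRR@10 and NDCG@10 as your primary metrics. In pytrec_eval these correspond to recip_rank, which has no cutoff of its own and so must be computed on the list already truncated to 10, and ndcg_cut_10.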
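Compare baselines. Run the same eval on BM25-only and on your previous reranker version, using the same frozen test set and candidate pool. As a rule of thumb, a 2% NDCG@10 lift is meaningful; when you report it, say whether the figure is relative or absolute.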
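
The freeze, rerank, truncate, and compare steps above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not a full harness: score_fn stands in for the HTTP call to your reranker endpoint, the seed and default sizes are arbitrary, and all function names are hypothetical.

```python
import random
from typing import Callable, Dict, List

Qrels = Dict[str, Dict[str, int]]   # query id -> {doc id: relevance grade}
Run = Dict[str, Dict[str, float]]   # query id -> {doc id: reranker score}


def freeze_test_set(qrels: Qrels, n_queries: int = 500,
                    min_judged: int = 10, seed: int = 13) -> List[str]:
    """Pick a reproducible sample of queries with enough judged documents."""
    eligible = sorted(qid for qid, docs in qrels.items() if len(docs) >= min_judged)
    rng = random.Random(seed)  # fixed seed so the same set comes back every time
    return sorted(rng.sample(eligible, min(n_queries, len(eligible))))


def rerank(queries: Dict[str, str],
           candidates: Dict[str, Dict[str, str]],
           score_fn: Callable[[str, str], float],
           k: int = 10) -> Run:
    """Score every (query, document) pair, sort by score, keep the top k.

    score_fn is a placeholder for the call to the reranker endpoint.
    Ties keep the first-stage order because Python's sort is stable.
    """
    run: Run = {}
    for qid, query_text in queries.items():
        scored = [(doc_id, score_fn(query_text, doc_text))
                  for doc_id, doc_text in candidates[qid].items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        run[qid] = dict(scored[:k])
    return run


def compare(new: Dict[str, float], old: Dict[str, float]) -> Dict[str, float]:
    """Summarize per-query metric differences between two systems."""
    shared = sorted(set(new) & set(old))
    if not shared:
        raise ValueError("the two systems share no evaluated queries")
    deltas = [new[qid] - old[qid] for qid in shared]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }
```

The run returned by rerank maps query id to document id to score, which is the nested-dict shape pytrec_eval takes for a run. Looking at wins and losses next to the mean delta helps show whether a lift is spread across queries or driven by a handful of them.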

Watch out for

Unjudged documents treated as irrelevant, which skews NDCG downward
Overfitting to MS MARCO's passage length distribution
Latency drift when the reranker model is swapped under load
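
For the unjudged-documents pitfall, one way to see how exposed a run is would be to report judged@k (the share of the top k that has any judgment at all) next to NDCG. The sketch below assumes ranked lists of document ids per query. The 0.8 threshold is an arbitrary illustration, not a recommended value.

```python
from typing import Dict, List


def judged_at_k(ranked: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """Fraction of the top-k documents that carry a judgment, relevant or not."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(doc_id in judgments for doc_id in top) / len(top)


def low_coverage_queries(run: Dict[str, List[str]],
                         qrels: Dict[str, Dict[str, int]],
                         k: int = 10,
                         threshold: float = 0.8) -> Dict[str, float]:
    """Return queries whose judged@k is below the threshold (hypothetical default)."""
    coverage = {qid: judged_at_k(ranked, qrels.get(qid, {}), k)
                for qid, ranked in run.items()}
    return {qid: value for qid, value in coverage.items() if value < threshold}
```

If a new reranker scores lower on NDCG but also has noticeably lower judged@k, part of the drop may come from missing labels rather than worse ranking.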

Pro tip: Log every reranker score alongside the final metric run. When MRR regresses, you can diff score distributions to find which query clusters broke.
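
A minimal sketch of that logging and diffing, assuming one JSON line per scored document and a per-query mean as a crude summary of the score distribution. The record fields and function names are illustrative, and a real diff might compare quantiles or group queries into clusters first.

```python
import json
import statistics
from typing import Dict, IO, Iterable, List, Tuple

Run = Dict[str, Dict[str, float]]  # query id -> {doc id: reranker score}


def log_scores(run: Run, run_id: str, out: IO[str]) -> None:
    """Write one JSON line per scored document so runs can be diffed later."""
    for qid in sorted(run):
        for doc_id, score in run[qid].items():
            record = {"run_id": run_id, "query_id": qid, "doc_id": doc_id, "score": score}
            out.write(json.dumps(record) + "\n")


def load_scores(lines: Iterable[str]) -> Run:
    """Rebuild the nested score dict from logged JSON lines."""
    run: Run = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            run.setdefault(record["query_id"], {})[record["doc_id"]] = record["score"]
    return run


def score_shift(old: Run, new: Run) -> List[Tuple[str, float]]:
    """Per-query change in mean score, largest absolute shift first."""
    shifts = []
    for qid in sorted(set(old) & set(new)):
        if old[qid] and new[qid]:
            delta = (statistics.mean(list(new[qid].values()))
                     - statistics.mean(list(old[qid].values())))
            shifts.append((qid, delta))
    return sorted(shifts, key=lambda item: abs(item[1]), reverse=True)
```

The queries at the top of score_shift are a reasonable place to start looking when MRR regresses, since they are where the two model versions disagree most.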