Recipe

Reranker eval harness

Build a repeatable offline evaluation loop for your reranker so you can measure MRR and NDCG before shipping a new model to production.

Ingredients

  • A labeled query-document relevance dataset (MS MARCO or in-house)
  • Your reranker model served behind a local HTTP endpoint
  • A Python script that issues requests and scores results
  • pytrec_eval or a hand-rolled MRR/NDCG calculator

Steps

  1. Freeze a test set. Pull 500–1000 queries with at least 10 judged documents each. Never train on these queries.
  2. Retrieve candidates. Use BM25 or your first-stage retriever to fetch the top-100 documents per query. This is the pool your reranker will reorder.
  3. Rerank. Send each query-document pair to your reranker endpoint. Collect the relevance scores.
  4. Sort and truncate. Sort each query's documents by reranker score in descending order and keep the top k (k=10 is standard).
  5. Evaluate. Feed the ranked lists and qrels into pytrec_eval. Track MRR@10 and NDCG@10 as your primary metrics.
  6. Compare baselines. Run the same eval, on the same test set and candidate pools, for BM25-only and for your previous reranker version. A 2% NDCG lift is meaningful.
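
Steps 3 to 5 can be sketched in plain Python as follows. This is a minimal hand-rolled calculator, not the pytrec_eval API: the names rerank, mrr_at_k, ndcg_at_k and the score_fn callable (a stand-in for the HTTP call to your reranker endpoint) are assumptions made for illustration. NDCG here uses linear gain with a log2 discount; other formulations use 2^rel - 1 as the gain, so compare numbers only within one implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Qrels = Dict[str, Dict[str, int]]  # query_id -> {doc_id: graded relevance}
Run = Dict[str, List[str]]         # query_id -> doc_ids in ranked order


def rerank(query: str, candidates: List[Tuple[str, str]],
           score_fn: Callable[[str, str], float], k: int = 10) -> List[str]:
    """Score each (doc_id, text) candidate, sort descending, keep the top k.

    score_fn is hypothetical: wrap your own endpoint call in it.
    """
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    # Sort by score descending; break ties on doc_id so runs are reproducible.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def mrr_at_k(run: Run, qrels: Qrels, k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document in the top k."""
    total = 0.0
    for qid, ranked in run.items():
        judged = qrels.get(qid, {})
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if judged.get(doc_id, 0) > 0:  # unjudged counts as irrelevant
                total += 1.0 / rank
                break
    return total / len(run) if run else 0.0


def ndcg_at_k(run: Run, qrels: Qrels, k: int = 10) -> float:
    """Mean NDCG@k with linear gain and a log2 rank discount."""
    total = 0.0
    for qid, ranked in run.items():
        judged = qrels.get(qid, {})
        dcg = sum(judged.get(doc_id, 0) / math.log2(i + 2)
                  for i, doc_id in enumerate(ranked[:k]))
        ideal = sorted(judged.values(), reverse=True)[:k]
        idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
        total += dcg / idcg if idcg > 0 else 0.0
    return total / len(run) if run else 0.0
```

To use it, build a run by calling rerank once per query over that query's candidate pool, then pass the run and your qrels to mrr_at_k and ndcg_at_k. Running the same two functions over the BM25-only ordering gives the baseline for step 6.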

Watch out for

  • Unjudged documents treated as irrelevant: a reranker that surfaces relevant but unlabeled documents gets penalized, which skews NDCG downward
  • Overfitting to MS MARCO's passage length distribution
  • Latency drift when the reranker model is swapped under load
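
One way to check how exposed a run is to the unjudged-document problem is to measure how much of each top-k list actually has a judgment. The sketch below is illustrative only: judged_at_k and condense are hypothetical helper names, and the run and qrels shapes (query_id to ranked doc_ids, query_id to doc_id-to-relevance) are assumptions. Dropping unjudged documents before scoring is sometimes called evaluating on condensed lists; it changes what the metric means, so report it alongside the standard numbers rather than in place of them.

```python
from typing import Dict, List

Qrels = Dict[str, Dict[str, int]]  # query_id -> {doc_id: graded relevance}
Run = Dict[str, List[str]]         # query_id -> doc_ids in ranked order


def judged_at_k(run: Run, qrels: Qrels, k: int = 10) -> float:
    """Mean fraction of each query's top-k that has any relevance judgment."""
    fractions = []
    for qid, ranked in run.items():
        top = ranked[:k]
        if not top:
            continue  # skip queries with no results
        judged = qrels.get(qid, {})
        fractions.append(sum(1 for d in top if d in judged) / len(top))
    return sum(fractions) / len(fractions) if fractions else 0.0


def condense(run: Run, qrels: Qrels) -> Run:
    """Drop unjudged documents from each ranked list, preserving order.

    Apply this to the full reranked list, before truncating to top-k.
    """
    return {qid: [d for d in ranked if d in qrels.get(qid, {})]
            for qid, ranked in run.items()}
```

A low judged_at_k for the new model relative to the baseline suggests that an NDCG gap between them may partly reflect missing labels instead of worse ranking.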

Pro tip: Log every reranker score alongside the final metric run. When MRR regresses, you can diff score distributions to find which query clusters broke.
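
A rough sketch of that diff, assuming you have logged scores per query for two runs and have some query-to-cluster mapping (by intent, topic, or length bucket, for example). The function names and the clusters dict are hypothetical, and comparing mean scores is only one simple choice; a fuller comparison might look at quantiles or histograms.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Scores = Dict[str, List[float]]  # query_id -> logged reranker scores


def mean_score_by_cluster(scores: Scores,
                          clusters: Dict[str, str]) -> Dict[str, float]:
    """Pool logged scores by cluster label and average them."""
    grouped = defaultdict(list)
    for qid, qscores in scores.items():
        grouped[clusters.get(qid, "unclustered")].extend(qscores)
    return {label: mean(vals) for label, vals in grouped.items() if vals}


def score_shift(old: Scores, new: Scores,
                clusters: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (cluster, new_mean - old_mean), largest absolute shift first."""
    old_means = mean_score_by_cluster(old, clusters)
    new_means = mean_score_by_cluster(new, clusters)
    shared = old_means.keys() & new_means.keys()
    shifts = [(label, new_means[label] - old_means[label]) for label in shared]
    return sorted(shifts, key=lambda item: abs(item[1]), reverse=True)
```

The clusters at the top of the returned list are the first places to look when MRR regresses between two model versions.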