Recipe: RAG eval-set generator

Generate ground-truth evaluation sets from your own documents so you can measure retrieval quality before shipping.

Ingredients

  • A corpus of 50–200 documents in plain text or Markdown
  • An LLM that supports structured output (for example GPT-4o or Claude 3.5 Sonnet)
  • Your chunking pipeline, already configured
  • A vector store with the corpus indexed

Steps

  1. Sample chunks. Randomly select 30–50 chunks from your indexed corpus. Prefer chunks with high information density and skip boilerplate such as headers and footers.
  2. Generate questions. For each chunk, prompt the LLM: “Write one specific question that can only be answered using the text below.” Also ask for the exact span of the chunk that answers it. Collect the (question, source_chunk_id) pairs together with their answer spans.
  3. Generate distractors. For each question, ask the LLM to produce 3 plausible but incorrect answers drawn from unrelated chunks, and record the IDs of those chunks. This creates a multiple-choice set.
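  4. Validate retrieval. Run each question through your retriever. Flag any question whose source chunk does not appear in the top-5 results and review it by hand. Rephrase or discard questions that are ambiguous or answerable from several chunks. Keep well-formed questions that the retriever simply misses: dropping them would leave only questions your retriever already answers and inflate every metric.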
  5. Assemble the eval set. Store each entry as JSON: question, correct chunk ID, distractor chunk IDs, and the expected answer span.
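
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `ask_llm` and `retrieve` are hypothetical callables standing in for your LLM client and your retriever, `min_chars` is a crude stand-in for "information density", and `ANSWER_PROMPT` is invented wording for obtaining the answer span. For brevity the sketch picks distractor chunks at random and does not generate the distractor answer text, and it records the top-5 check as a flag so that a person can decide what to do with each miss.

```python
import json
import random
from dataclasses import asdict, dataclass
from typing import Callable

QUESTION_PROMPT = (
    "Write one specific question that can only be answered "
    "using the text below.\n\n"
)
# Hypothetical wording: the recipe does not specify a prompt for the answer span.
ANSWER_PROMPT = "Quote the exact span of the text below that answers: "


@dataclass
class EvalEntry:
    question: str
    source_chunk_id: str
    distractor_chunk_ids: list[str]
    answer_span: str
    retrieved_in_top_k: bool


def build_eval_set(
    chunks: dict[str, str],                      # chunk ID -> chunk text
    ask_llm: Callable[[str], str],               # hypothetical: prompt -> completion
    retrieve: Callable[[str, int], list[str]],   # hypothetical: (query, k) -> ranked chunk IDs
    n_samples: int = 40,
    n_distractors: int = 3,
    top_k: int = 5,
    min_chars: int = 200,
    seed: int = 0,
) -> list[EvalEntry]:
    rng = random.Random(seed)

    # Step 1: sample chunks, using length as a rough proxy for density (assumption).
    candidates = [cid for cid, text in chunks.items() if len(text) >= min_chars]
    sampled = rng.sample(candidates, min(n_samples, len(candidates)))

    entries = []
    for cid in sampled:
        text = chunks[cid]

        # Step 2: one question per chunk, plus the span that answers it.
        question = ask_llm(QUESTION_PROMPT + text).strip()
        answer_span = ask_llm(ANSWER_PROMPT + question + "\n\n" + text).strip()

        # Step 3 (simplified): choose unrelated chunks to draw distractors from.
        others = [other for other in chunks if other != cid]
        distractors = rng.sample(others, min(n_distractors, len(others)))

        # Step 4: flag, rather than drop, questions whose source chunk is missed.
        hit = cid in retrieve(question, top_k)

        # Step 5: assemble the entry.
        entries.append(EvalEntry(question, cid, distractors, answer_span, hit))
    return entries
```

The result serializes directly with `json.dumps([asdict(e) for e in entries], indent=2)`. Fixing `seed` keeps the sample reproducible, so the same eval set can be regenerated after the corpus is re-indexed.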

Metrics to track

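  • Recall@5: fraction of questions whose source chunk appears in the top 5 results
  • MRR (mean reciprocal rank): average of 1/rank of the source chunk, counting 0 when it is not retrieved
  • NDCG@10: rank-discounted gain over the top 10 results, normalized by the gain of an ideal ranking
  • Hit rate: fraction of questions with at least one relevant chunk in the top k; with one source chunk per question this equals Recall@k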

Pitfalls

  • Questions that can be answered without retrieval (common-sense leakage)
  • Overfitting to chunk boundaries: questions written from a single chunk favor the chunking that produced it, so vary chunk sizes in your sample
  • Using the same LLM for generation and evaluation, which can inflate scores
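
One rough way to catch the first pitfall is a closed-book check: ask a model each question with no retrieved context and flag the question if the expected answer span shows up in the reply. The sketch below assumes entries are dicts with `question` and `answer_span` keys, as in the JSON layout from the steps, and that `answer_closed_book` is a hypothetical callable wrapping your LLM. Substring matching is a deliberately crude heuristic and will miss paraphrased answers, so treat the output as candidates for review rather than a verdict.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def find_leaky_questions(
    entries: list[dict[str, str]],
    answer_closed_book: Callable[[str], str],  # hypothetical: question -> answer, no context
) -> list[str]:
    leaky = []
    for entry in entries:
        expected = normalize(entry["answer_span"])
        guess = normalize(answer_closed_book(entry["question"]))
        # Flag when the model reproduces the expected span without retrieval.
        if expected and expected in guess:
            leaky.append(entry["question"])
    return leaky
```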