Recipe: RAG eval-set generator
Generate ground-truth evaluation sets from your own documents so you can measure retrieval quality before shipping.
Ingredients
- A corpus of 50–200 documents in plain text or Markdown
- An LLM with structured output (e.g., GPT-4o or Claude 3.5 Sonnet)
- Your chunking pipeline already configured
- A vector store with the corpus indexed
Steps
- Sample chunks. Randomly select 30–50 chunks from your indexed corpus. Prefer chunks with high information density — skip boilerplate headers and footers.
- Generate questions. For each chunk, prompt the LLM: “Write one specific question that can only be answered using the text below.” Also ask for the answer span it relied on, since the final eval set stores it. Collect the (question, answer span, source_chunk_id) records.
- Generate distractors. For each question, ask the LLM to produce 3 plausible but incorrect answers drawn from unrelated chunks, and record the ID of the chunk behind each one. This creates a multiple-choice set.
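- Validate retrieval. Run each question through your retriever and flag any question whose source chunk does not appear in the top-5 results. Review the flagged questions by hand: discard or rephrase the ambiguous ones, but keep the well-formed ones. Those are genuine retrieval misses, and dropping them all would push Recall@5 to 100% by construction.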
- Assemble the eval set. Store each entry as JSON: question, correct chunk ID, distractor chunk IDs, and the expected answer span.
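The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `generate_qa` and `retrieve` are hypothetical callables standing in for your LLM call and your retriever, the word-count floor is only a crude proxy for information density, and distractor chunks are picked at random here as a stand-in for the LLM-generated distractors described in the steps. The field names are assumptions as well.

```python
import json
import random
from typing import Callable, Dict, List, Tuple


def sample_chunk_ids(chunks: Dict[str, str], n: int, min_words: int = 20,
                     seed: int = 0) -> List[str]:
    """Pick up to n chunk IDs at random, skipping very short chunks.

    The word-count floor is a crude stand-in for "information density".
    """
    candidates = sorted(cid for cid, text in chunks.items()
                        if len(text.split()) >= min_words)
    return random.Random(seed).sample(candidates, min(n, len(candidates)))


def build_eval_set(
    chunks: Dict[str, str],
    generate_qa: Callable[[str], Tuple[str, str]],  # hypothetical: text -> (question, answer_span)
    retrieve: Callable[[str, int], List[str]],      # hypothetical: (question, k) -> ranked chunk IDs
    n: int = 40,
    k: int = 5,
    n_distractors: int = 3,
    min_words: int = 20,
    seed: int = 0,
) -> List[dict]:
    """Build eval entries and flag those whose source chunk is missing from the top k."""
    rng = random.Random(seed)
    entries = []
    for cid in sample_chunk_ids(chunks, n, min_words=min_words, seed=seed):
        question, answer_span = generate_qa(chunks[cid])
        # Assumption: random unrelated chunks stand in for LLM-written distractors.
        others = sorted(c for c in chunks if c != cid)
        distractors = rng.sample(others, min(n_distractors, len(others)))
        top_k = retrieve(question, k)
        entries.append({
            "question": question,
            "source_chunk_id": cid,
            "distractor_chunk_ids": distractors,
            "answer_span": answer_span,
            "retrieved_in_top_k": cid in top_k,  # False means: review this entry
        })
    return entries


def save_eval_set(entries: List[dict], path: str) -> None:
    """Write the eval set as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
```

Fixing the seed keeps the sample reproducible, so a later run against a changed retriever is scored on the same questions.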
Metrics to track
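- Recall@5: share of questions whose source chunk appears in the top 5 results
- MRR: mean of 1/rank of the source chunk across questions (0 when it is not retrieved)
- NDCG@10: rank-discounted gain over the top 10 results, normalized by the ideal ranking
- Hit rate: share of questions with at least one relevant chunk in the top k; with a single source chunk per question this equals Recall@k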
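A minimal sketch of how these metrics could be computed, assuming each question has exactly one relevant chunk (its source chunk) and that relevance is binary. Under that assumption the ideal DCG is 1, so NDCG reduces to 1/log2(rank + 1), and hit rate at k coincides with Recall@k. The function names and the `(source_chunk_id, ranked_ids)` input shape are illustrative choices, not part of the recipe.

```python
import math
from typing import Dict, List, Tuple


def recall_at_k(ranked: List[str], relevant: str, k: int) -> float:
    """1.0 if the single relevant chunk is in the top k, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0


def reciprocal_rank(ranked: List[str], relevant: str) -> float:
    """1/rank of the relevant chunk (1-indexed), or 0.0 if it is absent."""
    for rank, cid in enumerate(ranked, start=1):
        if cid == relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: str, k: int) -> float:
    """Binary-relevance NDCG with one relevant item, so the ideal DCG is 1."""
    for rank, cid in enumerate(ranked[:k], start=1):
        if cid == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def summarize(results: List[Tuple[str, List[str]]]) -> Dict[str, float]:
    """Average the per-question scores.

    results: one (source_chunk_id, ranked_retrieved_ids) pair per question.
    """
    if not results:
        raise ValueError("results must contain at least one question")
    n = len(results)
    recall5 = sum(recall_at_k(r, rel, 5) for rel, r in results) / n
    return {
        "recall@5": recall5,
        "mrr": sum(reciprocal_rank(r, rel) for rel, r in results) / n,
        "ndcg@10": sum(ndcg_at_k(r, rel, 10) for rel, r in results) / n,
        "hit_rate@5": recall5,  # identical to recall@5 with one relevant chunk
    }
```

The ranked list passed in should hold at least 10 IDs per question, otherwise NDCG@10 is effectively computed at a smaller cutoff.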
Pitfalls
- Questions that can be answered without retrieval (common-sense leakage)
- Overfitting to chunk boundaries — vary chunk sizes in your sample
- Using the same LLM for generation and evaluation inflates scores
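One way to screen for the first pitfall is a closed-book check: ask a model each question with no retrieved context and flag the question if the expected answer span shows up anyway. The sketch below is only illustrative. `answer_closed_book` is a hypothetical callable wrapping whatever LLM you use, the `question` and `answer_span` keys are assumed field names, and the normalized substring match is a deliberately crude comparison that will miss paraphrased answers.

```python
import re
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def find_leaky_questions(
    entries: List[Dict[str, str]],
    answer_closed_book: Callable[[str], str],  # hypothetical: question -> answer, no context given
) -> List[str]:
    """Return questions whose expected answer appears in the no-context answer."""
    leaky = []
    for entry in entries:
        expected = normalize(entry["answer_span"])
        guess = normalize(answer_closed_book(entry["question"]))
        # Crude check: the expected span occurs verbatim in the closed-book answer.
        if expected and expected in guess:
            leaky.append(entry["question"])
    return leaky
```

Questions returned by this check are candidates for rewording or removal, since a correct answer to them says little about retrieval quality.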