Recipe: RAG eval-set generator

Generate ground-truth evaluation sets from your own documents so you can measure retrieval quality before shipping.

Steps

Sample chunks. Randomly select 30–50 chunks from your indexed corpus. Prefer chunks with high information density — skip boilerplate headers and footers.
Generate questions. For each chunk, prompt the LLM: “Write one specific question that can only be answered using the text below.” Collect the (question, source_chunk_id) pairs.
Generate distractors. For each question, ask the LLM to produce three plausible but incorrect answers drawn from unrelated chunks, and record which chunk each distractor came from. This yields a multiple-choice set.
Validate retrieval. Run each question through your retriever. Flag any question whose source chunk does not appear in the top-5 results. Discard or rephrase those.
Assemble the eval set. Store each entry as JSON: question, correct chunk ID, distractor chunk IDs, and the expected answer span.
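The sampling, question-generation, and validation steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `ask_llm` and `retrieve` are hypothetical callables standing in for whatever LLM client and retriever you use, and the field name `source_chunk_id` and the fixed seed are assumptions. Distractors and the expected answer span are left out to keep the sketch short.

```python
import json
import random
from typing import Callable

QUESTION_PROMPT = (
    "Write one specific question that can only be answered "
    "using the text below.\n\n{text}"
)


def build_eval_set(
    chunks: dict[str, str],                # chunk ID -> chunk text
    ask_llm: Callable[[str], str],         # hypothetical: prompt -> completion
    retrieve: Callable[[str], list[str]],  # hypothetical: question -> ranked chunk IDs
    sample_size: int = 40,
    top_k: int = 5,
    seed: int = 0,
) -> tuple[list[dict], list[dict]]:
    """Return (kept, flagged) entries for a random sample of chunks."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled_ids = rng.sample(sorted(chunks), min(sample_size, len(chunks)))

    kept, flagged = [], []
    for chunk_id in sampled_ids:
        prompt = QUESTION_PROMPT.format(text=chunks[chunk_id])
        question = ask_llm(prompt).strip()
        entry = {"question": question, "source_chunk_id": chunk_id}
        # Validation: the source chunk must appear in the top-k results.
        if chunk_id in retrieve(question)[:top_k]:
            kept.append(entry)
        else:
            flagged.append(entry)  # candidates to discard or rephrase
    return kept, flagged


def to_json(entries: list[dict]) -> str:
    """Serialize eval-set entries as a JSON array."""
    return json.dumps(entries, indent=2, ensure_ascii=False)
```

Returning the flagged entries separately, instead of dropping them, makes it possible to review them by hand before deciding whether to discard or rephrase.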

Recall@5: —
MRR: —
NDCG@10: —
Hit rate: —
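One way to fill in the four metrics above is sketched below, under stated assumptions: relevance is binary, each run is a `(ranked_ids, relevant_ids)` pair with a non-empty relevant set and no duplicate IDs in the ranking, and hit rate uses the same cutoff of 5 as recall, since the source gives no cutoff for it. The function names are illustrative, not taken from any library. When each question has exactly one source chunk, Recall@5 and hit rate at 5 come out identical.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant chunk appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


def score(runs: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    """Average each metric over all (ranked_ids, relevant_ids) runs."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, rel, 5) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "ndcg@10": sum(ndcg_at_k(r, rel, 10) for r, rel in runs) / n,
        "hit_rate@5": sum(hit_at_k(r, rel, 5) for r, rel in runs) / n,
    }
```

To use it, call `score` with one pair per eval question: the chunk IDs your retriever returned, in rank order, and a set containing that question's source chunk ID.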