Recipe

Context compression for RAG

Shrink retrieved chunks before they hit the LLM — keep signal, drop noise.

Ingredients

  • Embedding model with strong retrieval recall
  • Summarization head (small, fast — T5 or Phi-3-mini)
  • Chunk-level metadata (source, position, token count)
  • Token budget per query (default 2k–4k)

Steps

  1. 1.Retrieve top-k chunks via vector similarity. Keep scores.
  2. 2.Sort by relevance. Drop chunks below similarity threshold.
  3. 3.Run each chunk through the summarization head. Output 1–2 sentences per chunk.
  4. 4.Concatenate compressed chunks. Truncate to token budget if needed.
  5. 5.Inject compressed context into the LLM prompt. Include source citations.

Why it works

Lower latency

Fewer tokens into the LLM means faster generation.

Higher precision

Compression strips filler — the model sees only what matters.

Cheaper inference

Token costs drop 40–60% with aggressive summarization.

More chunks fit

Compress 20 chunks into the space of 5 — broader coverage.