Context compression for RAG
Shrink retrieved chunks before they hit the LLM — keep signal, drop noise.
Ingredients
- Embedding model with strong retrieval recall
- Summarization head (small and fast, e.g. T5 or Phi-3-mini)
- Chunk-level metadata (source, position, token count)
- Token budget per query (default 2k–4k tokens)
Steps
1. Retrieve top-k chunks via vector similarity. Keep scores.
2. Sort by relevance. Drop chunks below similarity threshold.
3. Run each chunk through the summarization head. Output 1–2 sentences per chunk.
4. Concatenate compressed chunks. Truncate to token budget if needed.
5. Inject compressed context into the LLM prompt. Include source citations.
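The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `Chunk`, `retrieve`, `compress`, and `build_prompt` are hypothetical names, embeddings are assumed to be precomputed, `count_tokens` uses a whitespace word count in place of a real tokenizer, and `first_sentences` is a stand-in for the summarization head. The 0.5 threshold and k=20 are placeholder values.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Chunk:
    text: str
    source: str                 # citation label (file name, URL, ...)
    embedding: Sequence[float]  # assumed precomputed by the embedding model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def first_sentences(text: str, n: int = 2) -> str:
    # Placeholder for the summarization head: keep the first n sentences.
    parts = [p.strip().rstrip(".") for p in text.split(". ") if p.strip()]
    return ". ".join(parts[:n]) + "." if parts else ""


def retrieve(
    query_vec: Sequence[float], chunks: List[Chunk], k: int = 20
) -> List[Tuple[float, Chunk]]:
    # Step 1: score every chunk against the query, keep the top k with scores.
    scored = [(cosine(query_vec, c.embedding), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def compress(
    scored: List[Tuple[float, Chunk]],
    threshold: float = 0.5,
    budget: int = 2000,
    summarize: Callable[[str], str] = first_sentences,
) -> str:
    # Step 2: sort by relevance and drop chunks below the threshold.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    lines: List[str] = []
    used = 0
    for score, chunk in ranked:
        if score < threshold:
            break  # the list is sorted, so the rest are below threshold too
        # Step 3: compress the chunk; the source label is kept for step 5.
        line = f"[{chunk.source}] {summarize(chunk.text)}"
        cost = count_tokens(line)
        # Step 4: stop before the token budget would be exceeded.
        if used + cost > budget:
            break
        lines.append(line)
        used += cost
    return "\n".join(lines)


def build_prompt(question: str, context: str) -> str:
    # Step 5: inject the compressed context; citations ride along as labels.
    return (
        "Answer using only the context. Cite sources by their bracketed label.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

One design choice here: truncation drops whole lowest-ranked chunks instead of cutting the concatenated text mid-sentence, so every surviving summary keeps its citation label. Swapping `first_sentences` for a call to a real summarization model only requires passing a different `summarize` callable.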
Why it works
Lower latency
Fewer input tokens mean less prompt processing, so the first token arrives sooner, as long as the summarization pass takes less time than it saves.
Higher precision
Compression strips filler — the model sees only what matters.
Cheaper inference
Input token costs can drop 40–60% with aggressive summarization.
More chunks fit
Compress 20 chunks into the space of 5 — broader coverage.
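As a rough check on these numbers, the sketch below works through the budget arithmetic. The 2,000-token budget is the low end of the default; the 400-token raw chunk and 100-token compressed chunk are sizes assumed for illustration, not figures from this recipe, and both function names are hypothetical.

```python
def chunks_that_fit(budget_tokens: int, tokens_per_chunk: int) -> int:
    """Whole chunks that fit in the budget at a given average chunk size."""
    return budget_tokens // tokens_per_chunk


def input_token_savings(raw_tokens: int, compressed_tokens: int) -> float:
    """Fraction of input tokens removed by compression."""
    return 1 - compressed_tokens / raw_tokens


BUDGET = 2000            # low end of the 2k–4k default
RAW_CHUNK = 400          # assumed average raw chunk size, in tokens
COMPRESSED_CHUNK = 100   # assumed size after a 1–2 sentence summary

raw_fit = chunks_that_fit(BUDGET, RAW_CHUNK)                # 5 chunks
compressed_fit = chunks_that_fit(BUDGET, COMPRESSED_CHUNK)  # 20 chunks
halved = input_token_savings(RAW_CHUNK, RAW_CHUNK // 2)     # 0.5
```

Under these assumptions, a 4x compression ratio is what turns 5 chunks into 20, while halving each chunk gives a 50% input-token saving, inside the 40–60% range quoted above. The two benefits trade off: if the freed budget is spent on more chunks, the prompt stays near the budget and the gain is coverage, not cost.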
See also: Hybrid search recipe