Why map-reduce?
Large language models have fixed context windows. When a document exceeds that window (think legal contracts, research papers, or multi-hour transcripts), you cannot simply paste the whole thing and ask for a summary. Depending on the provider, the request is rejected outright or the input is truncated, and a summary built from truncated input silently omits whatever was cut. Map-reduce solves this by splitting the work into two phases: a parallel map step that summarizes each chunk independently, followed by a reduce step that combines those partial summaries into a final coherent result.
Phase 1 — Map
Split the source document into overlapping chunks. Overlap is critical: without it, key sentences that span a chunk boundary get orphaned and lost. A good default is 20% overlap with chunk sizes of 2000–4000 tokens depending on your model.
Chunking parameters
- Chunk size: 3000 tokens
- Overlap: 600 tokens (20%)
- Split on: paragraph boundaries, falling back to sentence boundaries
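The parameters above can be turned into a small chunker. The following is a minimal sketch, not a definitive implementation: `count_tokens` is a stand-in that counts whitespace-separated words (swap in your model's tokenizer), and the function names `split_units` and `chunk_text` are illustrative. Overlap is carried forward in whole paragraphs or sentences, so the actual overlap is at most the configured value.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def split_units(text: str, max_tokens: int) -> list[str]:
    """Split on paragraph boundaries, falling back to sentences for oversized paragraphs."""
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            units.append(para)
        else:
            # Naive sentence split; a single sentence longer than max_tokens
            # is kept whole and will produce an oversized chunk.
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", para) if s)
    return units


def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 600) -> list[str]:
    """Greedily pack units into chunks, repeating trailing units as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for unit in split_units(text, chunk_size):
        n = count_tokens(unit)
        if current and current_tokens + n > chunk_size:
            chunks.append("\n\n".join(current))
            # Carry whole trailing units forward, up to `overlap` tokens.
            tail: list[str] = []
            tail_tokens = 0
            for prev in reversed(current):
                t = count_tokens(prev)
                if tail_tokens + t > overlap:
                    break
                tail.insert(0, prev)
                tail_tokens += t
            # Drop overlap from the front if the next unit would not fit.
            while tail and tail_tokens + n > chunk_size:
                tail_tokens -= count_tokens(tail.pop(0))
            current, current_tokens = tail, tail_tokens
        current.append(unit)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Calling `chunk_text(document_text)` uses the defaults listed above. Smaller values such as `chunk_text(document_text, chunk_size=300, overlap=100)` are handy for checking the overlap behaviour on a short test document.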
Each chunk is sent to the model with an identical system prompt: "Summarize the following excerpt. Preserve all named entities, numbers, dates, and technical terms verbatim. Output only the summary with no preamble."
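Run all map calls in parallel. Because the calls are independent, wall-clock time for the phase is close to that of the slowest single call rather than the sum of all calls, provided your rate limits allow that much concurrency. With 10 chunks and a fast inference endpoint, the entire map phase can complete in under 3 seconds. Each chunk returns a 150–300 token summary.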
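One way to run the map calls concurrently is a thread pool, which suits I/O-bound API requests. This is a sketch under assumptions: `call_model(system_prompt, user_text)` is a hypothetical callable wrapping whatever inference client you use, and `max_workers=8` is an arbitrary default that should be tuned to your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

MAP_PROMPT = (
    "Summarize the following excerpt. Preserve all named entities, numbers, "
    "dates, and technical terms verbatim. Output only the summary with no preamble."
)


def map_phase(
    chunks: list[str],
    call_model: Callable[[str, str], str],  # hypothetical: (system_prompt, text) -> summary
    max_workers: int = 8,
) -> list[str]:
    """Summarize every chunk concurrently and return summaries in document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which
        # call finishes first, so document order is preserved.
        return list(pool.map(lambda chunk: call_model(MAP_PROMPT, chunk), chunks))
```

If any call raises, the exception surfaces when its result is collected, so production code would typically wrap `call_model` with retries before handing it to `map_phase`.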
Phase 2 — Reduce
Concatenate all chunk summaries in document order. If the combined length still exceeds your context window, recurse: treat the concatenated summaries as a new document and map-reduce again. Otherwise, send the full concatenation to the model with a final reduce prompt.
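The branch described here (reduce directly, or recurse when the summaries are still too long) can be sketched as follows. The names are illustrative assumptions: `reduce_call` stands for a single model call using your reduce prompt, `recurse` stands for another full map-reduce round, and `count_tokens` is a whitespace-based stand-in for a real tokenizer.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def reduce_phase(
    summaries: list[str],
    context_window: int,
    reduce_call: Callable[[str], str],  # hypothetical: combined summaries -> final summary
    recurse: Callable[[str], str],      # hypothetical: runs map-reduce again on the text
) -> str:
    """Combine chunk summaries, recursing if they still exceed the context window."""
    # Summaries are expected in document order; join them with blank lines.
    combined = "\n\n".join(summaries)
    # A real budget should also reserve room for the prompt and the output.
    if count_tokens(combined) > context_window:
        return recurse(combined)
    return reduce_call(combined)
```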
Reduce prompt
"You are given a series of partial summaries from a longer document, in order. Synthesize them into a single coherent summary. Eliminate redundancy. Preserve all facts, figures, and named entities. Structure the output with a brief overview followed by key points."
Edge cases
- Short documents: If the entire document fits in the context window, skip the map phase and summarize directly. No need to pay the latency tax.
- Recursive depth: Cap recursion at 3 levels. If you still cannot fit the reduce input after 3 rounds, the document is too large for this strategy — consider extractive summarization or a different approach.
- Non-English text: Chunking on paragraph boundaries works universally, but your prompts should match the document language for best results.
- Structured data: Tables and lists degrade under naive chunking. Pre-process by extracting tables into a separate pass, summarizing each table independently, then injecting those summaries into the reduce phase.
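The short-document shortcut and the recursion cap above fit naturally into a small driver. This is a sketch only: the three callables are hypothetical placeholders for your own direct-summary call, one full map round (chunk, summarize, concatenate), and the final reduce call, and `DocumentTooLargeError` is an illustrative exception name.

```python
from typing import Callable


class DocumentTooLargeError(Exception):
    """Raised when the summaries still do not fit after the allowed rounds."""


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def summarize_document(
    text: str,
    context_window: int,
    summarize_direct: Callable[[str], str],  # hypothetical: one call on the whole text
    map_round: Callable[[str], str],         # hypothetical: chunk + map + concatenate
    reduce_summaries: Callable[[str], str],  # hypothetical: final reduce call
    max_depth: int = 3,
) -> str:
    # Short documents: skip the map phase entirely.
    if count_tokens(text) <= context_window:
        return summarize_direct(text)
    # Each iteration is one level of recursion, capped at max_depth.
    for _ in range(max_depth):
        text = map_round(text)
        if count_tokens(text) <= context_window:
            return reduce_summaries(text)
    raise DocumentTooLargeError(
        f"summaries still exceed {context_window} tokens after {max_depth} rounds"
    )
```

Raising an exception at the cap, rather than returning a partial result, lets the caller fall back to extractive summarization or another strategy explicitly.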
Full pipeline
1. Load document → extract plain text
2. Count tokens → if ≤ context_window:
   summarize directly → return
3. Split into overlapping chunks (3000 tokens, 20% overlap)
4. Map: summarize each chunk in parallel
5. Concatenate summaries in order
6. If concatenated length > context_window:
   recurse from step 2 with concatenated summaries as input
7. Reduce: synthesize final summary
8. Return
Next steps
This recipe pairs well with our chunking strategies guide for tuning overlap and split boundaries. For production pipelines handling thousands of documents per hour, see the batch processing recipe.