LLM Batching Strategies
Maximize throughput and minimize cost by structuring prompts into efficient batches. This recipe covers chunking, concurrency windows, and backpressure patterns for production LLM pipelines.
Quick Reference
- 01 Fixed-size chunking with overlap for long documents
- 02 Adaptive batching based on token budget estimation
- 03 Concurrency-limited promise pools with retry
- 04 Streaming merge for ordered output reconstruction
Fixed-Size Chunking
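Split input into equal-sized segments with configurable overlap. Sizing each chunk below the model's context window prevents overflow, and the overlap repeats the end of each chunk at the start of the next, which helps preserve meaning across chunk boundaries.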
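The recipe does not show how `splitWithOverlap` is implemented, so the following is only a minimal sketch of one way such a helper could work. It assumes that `chunkSize` and `overlap` are measured in characters; the real helper may well count tokens instead.

```javascript
// Hypothetical implementation: sizes are counted in characters here,
// which is an assumption; the original helper may count tokens instead.
function splitWithOverlap(text, { chunkSize, overlap }) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (!Number.isInteger(overlap) || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be an integer in [0, chunkSize)");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk reaches the end, so no trailing chunk is pure overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Under these assumptions, splitting `"abcdefghij"` with `chunkSize: 4` and `overlap: 1` gives `abcd`, `defg`, `ghij`: each chunk starts with the last character of the previous one.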
const chunks = splitWithOverlap(text, {
  chunkSize: 2048,
  overlap: 256
});
Token Budget Estimation
Use a fast tokenizer that matches the target model (tiktoken or equivalent) to estimate token counts before dispatch. Group prompts so that each batch stays under the provider's per-request limit while using as much of that budget as possible.
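As a rough sketch of the grouping step, the following packs prompts greedily, in order, into batches that fit a token budget. The names `estimateTokens` and `batchByTokenBudget` are hypothetical, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer, not a provider-accurate count.

```javascript
// Rough stand-in estimator (assumption): about 4 characters per token.
// Swap in a real tokenizer for the target model for accurate counts.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Greedily pack prompts, in order, into batches that stay within maxTokens.
function batchByTokenBudget(prompts, maxTokens, estimate = estimateTokens) {
  const batches = [];
  let current = [];
  let used = 0;
  for (const prompt of prompts) {
    const cost = estimate(prompt);
    if (cost > maxTokens) {
      // A single oversized prompt can never fit; it needs chunking first.
      throw new RangeError(`prompt needs ~${cost} tokens, budget is ${maxTokens}`);
    }
    if (used + cost > maxTokens && current.length > 0) {
      batches.push(current); // close the full batch and start a new one
      current = [];
      used = 0;
    }
    current.push(prompt);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

In practice `maxTokens` would likely be set somewhat below the provider's limit, both to absorb estimation error and to leave room for the response. Greedy in-order packing is simple and keeps prompts in their original order, at the cost of occasionally leaving some budget unused.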
Concurrency Pool
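Limit in-flight requests with a semaphore pattern. Combine this with exponential backoff and jitter so that retries after rate-limit errors spread out instead of arriving in synchronized bursts. A pool size of 8–16 is a reasonable starting point for many providers at standard tier limits; tune it against your provider's documented rate limits.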
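One possible shape for this is sketched below. Instead of an explicit semaphore, it uses a fixed number of workers pulling from a shared task list, which caps the in-flight count in the same way. The names `withRetry` and `runPool` and the default delays are illustrative assumptions, and the sketch treats every error as retryable, which real code should not do.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry fn with exponential backoff and full jitter.
// Assumption: every thrown error is retryable; real code would inspect the
// error (for example an HTTP 429 status) before retrying.
async function withRetry(fn, { retries = 5, baseMs = 500, maxMs = 30_000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      const cap = Math.min(maxMs, baseMs * 2 ** attempt);
      await sleep(Math.random() * cap); // full jitter: uniform in [0, cap)
    }
  }
}

// Run tasks (functions returning promises) with at most `limit` in flight.
// Results come back in the same order as the input tasks.
async function runPool(tasks, limit, retryOptions) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const index = next++; // no race: this runs synchronously between awaits
      results[index] = await withRetry(tasks[index], retryOptions);
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

Because each result is stored at its task's index, the output order matches the input order even when requests finish out of order. If a task exhausts its retries, `runPool` rejects with that error while tasks already in flight run to completion.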
Pro tip: Always implement idempotency keys when batching writes. If a batch partially fails, you can safely retry without duplicate side effects.
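To make the tip concrete, here is a minimal sketch that derives a deterministic key per item and skips items that an earlier attempt already wrote. The `writeBatch` and `sendItem` names and the in-memory `completed` set are hypothetical stand-ins; a real system would typically persist seen keys on the receiving side. Hashing `JSON.stringify` output also assumes that items serialize with a stable property order.

```javascript
const { createHash } = require("node:crypto");

// Derive a deterministic key from the item's contents, so a retry of the
// same item always produces the same key.
function idempotencyKey(item) {
  return createHash("sha256").update(JSON.stringify(item)).digest("hex");
}

// In-memory record of keys that were written successfully (assumption:
// single process; a real service would store these durably).
const completed = new Set();

// Write each item at most once. Returns the items that failed, so the
// caller can retry the whole batch or only the failures.
async function writeBatch(items, sendItem) {
  const failed = [];
  for (const item of items) {
    const key = idempotencyKey(item);
    if (completed.has(key)) continue; // already written by an earlier attempt
    try {
      await sendItem(item, key); // pass the key so the receiver can dedupe too
      completed.add(key);
    } catch {
      failed.push(item);
    }
  }
  return failed;
}
```

Retrying the full batch after a partial failure then resends only the items that did not complete.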