
LLM Batching Strategies

Maximize throughput and minimize cost by structuring prompts into efficient batches. This recipe covers chunking, concurrency windows, and backpressure patterns for production LLM pipelines.

Quick Reference

  • 01 Fixed-size chunking with overlap for long documents
  • 02 Adaptive batching based on token budget estimation
  • 03 Concurrency-limited promise pools with retry
  • 04 Streaming merge for ordered output reconstruction

Fixed-Size Chunking

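Split input into equal-sized segments with configurable overlap. Keeping each chunk smaller than the model's context window prevents overflow, and the overlap repeats the tail of each chunk at the start of the next, which helps preserve context across chunk boundaries.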

const chunks = splitWithOverlap(text, {
  chunkSize: 2048,
  overlap: 256
});
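
The recipe does not show what splitWithOverlap does internally, so here is a minimal sketch of one way it could work. It assumes chunkSize and overlap are measured in characters (the recipe does not state the unit); a token-based version would slice an array of token IDs in the same way. The SplitOptions type is a name introduced for this illustration.

```typescript
// Hypothetical sketch of splitWithOverlap. Assumption: sizes are in characters.
interface SplitOptions {
  chunkSize: number; // maximum length of each chunk
  overlap: number; // characters shared by consecutive chunks
}

function splitWithOverlap(text: string, { chunkSize, overlap }: SplitOptions): string[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be in [0, chunkSize)");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk reaches the end, so no chunk is a pure repeat of overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With chunkSize 4 and overlap 1, the string "abcdefghij" yields "abcd", "defg", "ghij": each chunk starts with the last character of the previous one.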

Token Budget Estimation

Use a fast tokenizer (tiktoken or equivalent) to estimate token counts before dispatch. Group prompts so each batch stays under the provider's per-request limit while maximizing utilization.

Safe batch ceiling: 4K tokens
Target utilization: 85%
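
One way to apply these figures is a greedy packer that fills each batch up to ceiling × utilization and then starts a new one. The sketch below makes several assumptions: "4K" is read as 4096 tokens, the 85% target is treated as a hard per-batch budget, and roughTokenCount is a crude stand-in (about 4 characters per token) rather than a real tokenizer. In practice you would pass a counter backed by tiktoken or an equivalent library. All names here are illustrative.

```typescript
type TokenCounter = (text: string) => number;

// Crude stand-in: ~4 characters per token. An assumption for illustration only;
// replace with a real tokenizer for production use.
const roughTokenCount: TokenCounter = (text) => Math.ceil(text.length / 4);

function packByTokenBudget(
  prompts: string[],
  ceiling = 4096, // assumed meaning of "4K"
  utilization = 0.85,
  countTokens: TokenCounter = roughTokenCount,
): string[][] {
  const budget = Math.floor(ceiling * utilization);
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const prompt of prompts) {
    const cost = countTokens(prompt);
    if (cost > budget) {
      // A single oversized prompt cannot fit in any batch; chunk it first.
      throw new RangeError(`prompt needs ${cost} tokens, budget is ${budget}`);
    }
    if (used + cost > budget && current.length > 0) {
      batches.push(current); // close the full batch and start a new one
      current = [];
      used = 0;
    }
    current.push(prompt);
    used += cost;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Greedy packing keeps prompts in their original order, which makes it easier to map results back to inputs. It does not always produce the fewest batches; a bin-packing heuristic could pack tighter at the cost of reordering.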

Concurrency Pool

Limit in-flight requests with a semaphore pattern. Combine it with exponential backoff and jitter to handle rate limits gracefully. A pool size of 8–16 is a reasonable starting point for most providers under standard tier limits; tune it against the rate limits your account actually has.
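
A minimal sketch of this pattern follows. It uses a fixed set of workers pulling from a shared queue, which bounds concurrency the same way a semaphore would, and retries with exponential backoff and full jitter. The names runPool and withRetry, the default of 5 attempts, and the 250 ms base delay are all assumptions for illustration. For brevity the sketch retries on any error; a production version would usually retry only on rate-limit or transient failures.

```typescript
type Task<T> = () => Promise<T>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a task with exponential backoff and full jitter:
// wait a random time in [0, baseDelayMs * 2^(attempt - 1)) before each retry.
async function withRetry<T>(task: Task<T>, maxAttempts = 5, baseDelayMs = 250): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // out of attempts: surface the error
      const cap = baseDelayMs * 2 ** (attempt - 1);
      await sleep(Math.random() * cap);
    }
  }
}

// Runs tasks with at most `limit` in flight. Results keep the input order.
async function runPool<T>(
  tasks: Task<T>[],
  limit = 8,
  maxAttempts = 5,
  baseDelayMs = 250,
): Promise<T[]> {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const results = new Array<T>(tasks.length);
  let next = 0; // index of the next unclaimed task
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claim a task; safe because JS runs this synchronously
      results[index] = await withRetry(tasks[index], maxAttempts, baseDelayMs);
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```

Because each result is written to the slot matching its input index, the output order matches the input order even when tasks finish out of order. If any task exhausts its retries, runPool rejects with that task's error.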

Pro tip: Always implement idempotency keys when batching writes. If a batch partially fails, you can safely retry without duplicate side effects.
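
To illustrate why this makes retries safe, here is a small sketch with an in-memory receiver that deduplicates writes by key. Both IdempotentStore and idempotencyKey are hypothetical names for this example, and the key format is an assumption. Real providers define their own header or field for idempotency keys, so check the provider's documentation for the actual mechanism.

```typescript
// Hypothetical receiver that deduplicates writes by idempotency key.
class IdempotentStore {
  private readonly seen = new Map<string, string>();
  public writes = 0; // number of writes that actually took effect

  write(key: string, payload: string): string {
    const previous = this.seen.get(key);
    if (previous !== undefined) return previous; // replayed key: no second side effect
    this.writes++;
    const receipt = `stored:${payload}`;
    this.seen.set(key, receipt);
    return receipt;
  }
}

// Derive the key from stable identifiers so every retry of the same item reuses it.
// A key generated fresh on each attempt would defeat the purpose.
function idempotencyKey(batchId: string, itemIndex: number): string {
  return `${batchId}:${itemIndex}`;
}

// Sending the same batch twice (for example, after a partial failure)
// results in only one effective write per item.
const store = new IdempotentStore();
const items = ["alpha", "beta", "gamma"];
for (let attempt = 0; attempt < 2; attempt++) {
  items.forEach((payload, i) => store.write(idempotencyKey("batch-1", i), payload));
}
```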