Reducing latency
Practical techniques to cut response times and deliver faster completions across every region.
Choose the closest region
Route requests to the inference endpoint nearest your users. Prefix your API host with the region code: us-east, eu-west, or ap-southeast. Changing the hostname alone often shaves 40–120 ms off every round trip.
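As a rough illustration, the sketch below builds a region-prefixed host and picks the region with the lowest measured round-trip time. The base host api.example.com, the dot separator between region code and host, and the sample timings are all assumptions made for this example; substitute the values from your own account.

```python
# Hypothetical placeholder host, not a real endpoint.
BASE_HOST = "api.example.com"
# Region codes as listed in the text above.
REGIONS = ("us-east", "eu-west", "ap-southeast")


def regional_host(region: str, base_host: str = BASE_HOST) -> str:
    """Return the API host prefixed with a region code (dot separator assumed)."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    return f"{region}.{base_host}"


def closest_region(rtt_ms: dict[str, float]) -> str:
    """Pick the region with the lowest measured round-trip time."""
    return min(rtt_ms, key=rtt_ms.get)


# Made-up round-trip times from a single client location, for illustration.
measured = {"us-east": 142.0, "eu-west": 31.5, "ap-southeast": 230.0}
host = regional_host(closest_region(measured))
print(host)  # eu-west.api.example.com
```

Measuring from where your users actually are matters more than where your servers are, so collect the timings client-side if you can.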
Pick a smaller model
Larger models produce richer output but add latency. For latency-sensitive workloads such as chatbots, autocomplete, and live transcription, switch to a compact variant. On these tasks the quality gap is often narrower than benchmarks suggest, and time-to-first-token drops by 30–60%.
Enable streaming
Set stream: true in your request body. Tokens then arrive incrementally, so the client no longer waits for the full response. Total generation time is unchanged, but perceived latency drops to roughly the time-to-first-token because the UI begins rendering immediately.
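The following sketch shows why streaming changes perceived latency: it consumes tokens as they arrive and records both the time to the first token and the total time. Only the stream flag comes from the text above. The prompt field name is hypothetical, and fake_stream is a stand-in for a real streamed HTTP response so the example runs without a network connection.

```python
import time
from typing import Iterable, Iterator


def build_request(prompt: str) -> dict:
    """Request body with streaming enabled ('prompt' is a hypothetical field)."""
    return {"prompt": prompt, "stream": True}


def render_stream(tokens: Iterable[str]) -> tuple[str, float, float]:
    """Consume tokens as they arrive.

    Returns (full text, seconds until first token, total seconds).
    """
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # a real UI would render the token here
    total = time.perf_counter() - start
    if first_token_at is None:  # empty stream: no token ever arrived
        first_token_at = total
    return "".join(parts), first_token_at, total


def fake_stream(words: list[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streamed response: yields one token per delay."""
    for word in words:
        time.sleep(delay_s)
        yield word


text, ttft, total = render_stream(fake_stream(["Hello", ", ", "world", "!"]))
print(text)
print(f"first token after {ttft:.3f}s, full response after {total:.3f}s")
```

The user starts reading at the first-token mark, while a non-streaming client would show nothing until the total time has elapsed.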
Stop early with max_tokens
Cap output length via max_tokens. If your use case only needs a short answer — classification, entity extraction, yes/no — set a tight limit. The model stops generating sooner, and you avoid paying for tokens nobody reads.
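One way to apply this is to keep a per-task cap and attach it to every request. The sketch below assumes a request body with prompt and max_tokens fields; the task names and the specific limits are illustrative guesses, not recommended values, so tune them against your own outputs.

```python
# Hypothetical per-task caps; adjust after inspecting real outputs.
MAX_TOKENS_BY_TASK = {
    "yes_no": 3,
    "classification": 8,
    "entity_extraction": 64,
}
# Assumed fallback for open-ended tasks.
DEFAULT_MAX_TOKENS = 512


def build_request(prompt: str, task: str) -> dict:
    """Return a request body whose max_tokens matches the task type."""
    cap = MAX_TOKENS_BY_TASK.get(task, DEFAULT_MAX_TOKENS)
    return {"prompt": prompt, "max_tokens": cap}


body = build_request("Is this review positive? Answer yes or no.", "yes_no")
print(body["max_tokens"])  # 3
```

A cap that is too tight truncates the answer mid-sentence, so leave a little headroom above the longest output you expect.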
Combine techniques
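The fastest path: closest region + compact model + streaming + a sensible max_tokens. The gains stack because each lever targets a different part of the request: network round trip, time-to-first-token, perceived wait, and generation length. Measure with the dashboard latency panel and iterate.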
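If you also log latencies on the client, a small summary helper makes before-and-after comparisons easier while you iterate. This is a minimal sketch using nearest-rank percentiles. The sample numbers are invented for illustration and are not measurements from any real deployment.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


def summarize(samples: list[float]) -> dict[str, float]:
    """Median and tail latency for one configuration."""
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


# Invented latencies in milliseconds, for illustration only.
baseline = [820, 790, 905, 1100, 760, 840, 990, 810, 870, 1500]
tuned = [310, 295, 340, 420, 280, 305, 390, 300, 330, 610]

for name, samples in (("baseline", baseline), ("tuned", tuned)):
    stats = summarize(samples)
    print(f"{name}: p50={stats['p50']} ms, p95={stats['p95']} ms")
```

Change one lever at a time and compare p95 as well as p50, since tail latency is usually what users notice.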
Need help tuning your deployment? Browse the full docs or reach out to support.