Batch API design
Design batch endpoints that process many inferences in a single round trip without sacrificing error isolation, observability, or downstream backpressure. This recipe covers the contract, idempotency keys, and partial-failure semantics Meridian recommends for production workloads.
1. Shape the request envelope
A batch request is an array of independent units of work, each carrying its own custom_id so callers can correlate results without preserving array order. Keep the envelope flat and avoid nesting batches inside batches.
POST /v1/batches
{
  "model": "azure/model-router",
  "items": [
    { "custom_id": "row-1", "input": "Summarize: ..." },
    { "custom_id": "row-2", "input": "Translate: ..." }
  ],
  "idempotency_key": "job-2026-06-27-a1b2"
}
2. Isolate failures per item
A single malformed prompt should never poison an entire batch. Return a top-level 2xx with a per-item status array so the caller can retry only the rows that failed. Treat rate-limit and transient gateway errors as retryable; treat schema errors as terminal.
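On the caller side, the retryable/terminal split described above might look like the following minimal sketch. The per-item result shape (custom_id, status, error.code) and the error code strings are assumptions made for illustration; the recipe does not specify them, so substitute whatever your API actually returns.

```python
from dataclasses import dataclass, field

# Hypothetical error codes, assumed for this sketch only.
RETRYABLE_CODES = {"rate_limited", "gateway_timeout", "gateway_unavailable"}


@dataclass
class Triage:
    succeeded: list = field(default_factory=list)  # custom_ids that completed
    retry: list = field(default_factory=list)      # custom_ids worth resubmitting
    terminal: list = field(default_factory=list)   # custom_ids that fail as sent


def triage_results(results):
    """Sort per-item results into succeeded, retryable, and terminal buckets."""
    out = Triage()
    for item in results:
        cid = item["custom_id"]
        if item["status"] == "ok":
            out.succeeded.append(cid)
        elif item.get("error", {}).get("code") in RETRYABLE_CODES:
            out.retry.append(cid)
        else:
            # Schema errors and any unrecognized failure are treated as terminal.
            out.terminal.append(cid)
    return out


# Assumed shape of the per-item status array in a 2xx batch response.
sample = [
    {"custom_id": "row-1", "status": "ok"},
    {"custom_id": "row-2", "status": "error", "error": {"code": "rate_limited"}},
    {"custom_id": "row-3", "status": "error", "error": {"code": "schema_error"}},
]
result = triage_results(sample)
print(result.retry)     # custom_ids to resubmit in a follow-up batch
print(result.terminal)  # custom_ids to report back to the caller
```

Defaulting unknown error codes to terminal is a conservative choice: it avoids retry loops on failures the client does not understand, at the cost of occasionally giving up on an item that would have succeeded.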
3. Stream results when latency matters
For batches over a few hundred items, switch from a blocking response to a job handle plus an SSE stream of completions. This keeps memory bounded on both sides and lets you surface progress to end users. Persist the job manifest so resumption survives client reconnects.
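As a rough illustration of the client half of this pattern, the sketch below parses an SSE stream line by line and records each completion by custom_id, so a reconnecting client knows which items it already holds. The payload shape and the use of event ids for resumption are assumptions; the parser handles only the data and id fields and is not a complete SSE implementation.

```python
import json


def iter_sse_events(lines):
    """Yield (last_event_id, data) pairs from an iterable of SSE text lines."""
    event_id, data_parts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield event_id, "\n".join(data_parts)
            data_parts = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if name == "data":
                data_parts.append(value)
            elif name == "id":
                event_id = value  # persists until a later event replaces it


def consume(lines, done=None):
    """Record completions by custom_id; return them with the last event id seen."""
    done = {} if done is None else done
    last_id = None
    for event_id, data in iter_sse_events(lines):
        payload = json.loads(data)  # assumed: one JSON object per event
        done[payload["custom_id"]] = payload
        last_id = event_id
    return done, last_id


# Simulated stream; in practice these lines come from the HTTP response body.
stream = [
    "id: 1\n",
    'data: {"custom_id": "row-1", "status": "ok"}\n',
    "\n",
    ": keep-alive\n",
    "id: 2\n",
    'data: {"custom_id": "row-2", "status": "ok"}\n',
    "\n",
]
done, last_id = consume(stream)
print(sorted(done), last_id)
```

Because results are processed one event at a time, client memory grows only with what it chooses to retain. On reconnect, a client could send last_id in the standard Last-Event-ID request header; whether the server replays from that point depends on it persisting the job manifest, as recommended above.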