Batch Inference
Run thousands of requests concurrently
Maximize throughput by parallelizing inference calls. Use asyncio + httpx in Python or Promise.all in TypeScript to saturate your rate limit without overwhelming the API.
Python — asyncio + httpx
import asyncio
import httpx
API_KEY = "nmb_..."
URL = "https://api.getnimbus.net/v1/infer"
PAYLOADS = [{"prompt": f"batch-{i}"} for i in range(500)]
async def infer(client: httpx.AsyncClient, payload: dict) -> dict:
resp = await client.post(
URL,
json=payload,
headers={"Authorization": f"Bearer {API_KEY}"},
timeout=30.0,
)
resp.raise_for_status()
return resp.json()
async def main():
limits = httpx.Limits(max_connections=50, max_keepalive_connections=20)
async with httpx.AsyncClient(limits=limits) as client:
tasks = [infer(client, p) for p in PAYLOADS]
results = await asyncio.gather(*tasks, return_exceptions=True)
for i, r in enumerate(results):
if isinstance(r, Exception):
print(f"[{i}] FAIL: {r}")
asyncio.run(main())Tune max_connections to match your tier’s concurrency limit. Use return_exceptions=True so one failure doesn’t cancel the whole batch.
TypeScript — Promise.all
const API_KEY = "nmb_...";
const URL = "https://api.getnimbus.net/v1/infer";
const payloads = Array.from({ length: 500 }, (_, i) => ({
prompt: `batch-${i}`,
}));
async function infer(payload: { prompt: string }) {
const res = await fetch(URL, {
method: "POST",
headers: {
Authorization: `Bearer ${API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify(payload),
signal: AbortSignal.timeout(30_000),
});
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
}
const results = await Promise.allSettled(
payloads.map((p) => infer(p))
);
results.forEach((r, i) => {
if (r.status === "rejected") console.error(`[${i}] FAIL`, r.reason);
});Promise.allSettled prevents a single rejection from aborting the entire batch. Use AbortSignal.timeout to cap hanging requests.
Rate limits & retry strategy
Watch the headers
Every response includes X-RateLimit-Remaining and Retry-After. When remaining hits zero, pause the entire batch for the indicated duration before resuming.
Exponential backoff
For transient 429 or 5xx errors, retry with jitter: wait min(2^n + rand(0,1), 60) seconds. Cap at 3 retries per payload. This keeps the batch moving without hammering the API.
Concurrency ceiling
Free tier: 10 concurrent requests. Pro: 50. Enterprise: 200. Exceeding your ceiling triggers 429s immediately. Use a semaphore (Python asyncio.Semaphore or a p-limit library in TS) to cap in-flight requests at your tier’s limit regardless of batch size.