Batch Inference

Run thousands of requests concurrently

Maximize throughput by parallelizing inference calls. Use asyncio + httpx in Python or Promise.allSettled in TypeScript to run requests concurrently, and cap in-flight requests so you use your full rate limit without overwhelming the API.

Python: asyncio + httpx

import asyncio
import httpx

API_KEY = "nmb_..."
URL = "https://api.getnimbus.net/v1/infer"
PAYLOADS = [{"prompt": f"batch-{i}"} for i in range(500)]

async def infer(client: httpx.AsyncClient, payload: dict) -> dict:
    resp = await client.post(
        URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()

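async def main():
    # The connection pool caps how many requests are in flight at once.
    limits = httpx.Limits(max_connections=50, max_keepalive_connections=20)
    async with httpx.AsyncClient(limits=limits) as client:
        tasks = [infer(client, p) for p in PAYLOADS]
        # return_exceptions=True returns each failure in place of its result.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        failures = 0
        for i, r in enumerate(results):
            if isinstance(r, Exception):
                failures += 1
                print(f"[{i}] FAIL: {r}")
        print(f"{len(results) - failures}/{len(results)} succeeded")

if __name__ == "__main__":
    asyncio.run(main())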

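Tune max_connections to match your tier’s concurrency limit. Use return_exceptions=True so one failure doesn’t discard the rest of the batch: each exception is returned in place of its result instead of being raised out of gather. Note that timeout=30.0 also covers the wait for a free connection from the pool, so requests queued behind slow ones can fail with httpx.PoolTimeout.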
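Since gather returns results in the same order as the tasks it was given, each result can be paired back with its payload. The following is a minimal sketch of that pairing; split_results is a hypothetical helper name, not part of any Nimbus SDK.

```python
from typing import Any, List, Sequence, Tuple


def split_results(
    payloads: Sequence[dict], results: Sequence[Any]
) -> Tuple[List[Tuple[dict, Any]], List[Tuple[dict, BaseException]]]:
    """Pair each payload with its outcome and separate successes from failures.

    Hypothetical helper: assumes `results` came from
    asyncio.gather(..., return_exceptions=True) over the same payloads,
    so both sequences share the same order.
    """
    succeeded: List[Tuple[dict, Any]] = []
    failed: List[Tuple[dict, BaseException]] = []
    for payload, result in zip(payloads, results):
        # BaseException also catches asyncio.CancelledError, which is not
        # a subclass of Exception on Python 3.8+.
        if isinstance(result, BaseException):
            failed.append((payload, result))
        else:
            succeeded.append((payload, result))
    return succeeded, failed
```

The failed list keeps the original payloads, so it can be fed directly into a retry pass.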

TypeScript: Promise.allSettled

const API_KEY = "nmb_...";
// Named API_URL so it does not shadow the global URL class.
const API_URL = "https://api.getnimbus.net/v1/infer";

const payloads = Array.from({ length: 500 }, (_, i) => ({
  prompt: `batch-${i}`,
}));

// Send one payload; rejects on a non-2xx status or after a 30-second timeout.
async function infer(payload: { prompt: string }) {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(30_000),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

const results = await Promise.allSettled(
  payloads.map((p) => infer(p))
);

results.forEach((r, i) => {
  if (r.status === "rejected") console.error(`[${i}] FAIL`, r.reason);
});

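Unlike Promise.all, Promise.allSettled waits for every request and reports each outcome, so a single rejection doesn’t discard the rest of the batch. Use AbortSignal.timeout to cap hanging requests. This snippet starts all 500 requests at once; in practice, cap in-flight requests at your tier’s concurrency limit.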

Rate limits & retry strategy

Watch the headers

Every response includes X-RateLimit-Remaining and Retry-After. When X-RateLimit-Remaining reaches zero, pause the entire batch for the duration given in Retry-After before resuming.
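One way to turn those two headers into a pause decision is sketched below. It assumes Retry-After carries a plain number of seconds (the header can also be an HTTP date in general, which this sketch does not parse); pause_seconds and its default fallback are hypothetical.

```python
from typing import Mapping


def pause_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Return how long to pause the batch, or 0.0 if requests may continue.

    Hypothetical helper. Header lookup is case-sensitive for a plain dict;
    httpx's resp.headers is case-insensitive and can be passed directly.
    """
    try:
        # A missing header is treated as "quota still available".
        if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
            return 0.0
    except ValueError:
        # Unparseable quota value: do not pause on bad data.
        return 0.0

    try:
        # Assumption: Retry-After is a number of seconds.
        return max(float(headers.get("Retry-After")), 0.0)
    except (TypeError, ValueError):
        # Header missing or not numeric (e.g. an HTTP date): use the fallback.
        return default
```

To pause the whole batch rather than a single task, the returned value could drive a shared asyncio.Event that every worker awaits before sending its next request.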

Exponential backoff

For transient 429 or 5xx errors, retry with exponential backoff and jitter: wait min(2^n + rand(0,1), 60) seconds, where n is the retry count for that payload. Cap at 3 retries per payload. This keeps the batch moving without hammering the API.
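The backoff formula can be sketched as follows. This assumes n starts at 0 for the first retry; TransientError, with_retries, and the injectable sleep parameter are hypothetical names introduced for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 3    # cap on retries per payload, from the text above
MAX_DELAY = 60.0   # upper bound on a single wait, in seconds


class TransientError(Exception):
    """Hypothetical marker for a retryable failure (429 or 5xx)."""


def backoff_delay(attempt: int) -> float:
    """min(2^n + rand(0, 1), 60) seconds, with n = attempt (assumed 0-based)."""
    return min(2 ** attempt + random.uniform(0.0, 1.0), MAX_DELAY)


async def with_retries(
    call: Callable[[], Awaitable[T]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Run `call`, retrying up to MAX_RETRIES times on TransientError."""
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus the retries
        try:
            return await call()
        except TransientError:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted: surface the last error
            await sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

With httpx, the request wrapper would raise TransientError when resp.status_code is 429 or at least 500, and let other errors propagate without a retry. With only 3 retries the longest single wait is about 5 seconds, so the 60-second cap matters only if the retry limit is raised.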

Concurrency ceiling

Free tier: 10 concurrent requests. Pro: 50. Enterprise: 200. Exceeding your ceiling triggers 429s immediately. Use a semaphore (asyncio.Semaphore in Python, or a library such as p-limit in TypeScript) to cap in-flight requests at your tier’s limit regardless of batch size.
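A minimal sketch of the semaphore approach in Python follows. run_bounded and its worker parameter are hypothetical names; the default limit of 10 is the free-tier ceiling quoted above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List

TIER_LIMIT = 10  # free-tier ceiling; use 50 for Pro or 200 for Enterprise


async def run_bounded(
    payloads: Iterable[dict],
    worker: Callable[[dict], Awaitable[Any]],
    limit: int = TIER_LIMIT,
) -> List[Any]:
    """Run `worker` over every payload with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(payload: dict) -> Any:
        # Tasks beyond the limit wait here until a slot frees up.
        async with sem:
            return await worker(payload)

    # Results keep the order of `payloads`; failures are returned in place.
    return await asyncio.gather(
        *(guarded(p) for p in payloads), return_exceptions=True
    )
```

With the httpx example earlier on this page, the worker could be lambda p: infer(client, p). Because the semaphore is acquired before the request starts, the 30-second request timeout does not count time spent waiting for a slot.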