
Queue system design

How Meridian processes recipe generation requests at scale with fairness, retries, and observability.

Overview

Every recipe generation request enters a distributed queue backed by Upstash QStash. This decouples the API layer from GPU inference workers, absorbs traffic spikes, and guarantees at-least-once delivery.

Queue topology

Client → API (Vercel Edge)
   → QStash (persist + dedupe)
     → Worker pool (GPU inference)
       → Upstash KV (result cache)
         → Client poll / webhook

The edge API enqueues a message and returns a job ID immediately. QStash delivers each job to a worker, which runs the model and writes the result to KV with a 24-hour TTL. Clients poll the status endpoint or receive a webhook callback.
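The enqueue-and-poll contract can be sketched in memory. This is a minimal illustration, not Meridian's implementation: `JobPipeline`, its methods, and the `job-N` ID format are hypothetical, a plain array stands in for QStash, a `Map` stands in for Upstash KV, and webhook callbacks are omitted.

```typescript
// Hypothetical in-memory stand-in for the enqueue → worker → KV → poll flow.
// It does not use the QStash or Upstash SDKs; all names are illustrative.

type JobStatus =
  | { state: "pending" } // queued or running
  | { state: "done"; result: string }
  | { state: "expired" };

const RESULT_TTL_MS = 24 * 60 * 60 * 1000; // results expire after 24 hours

class JobPipeline {
  private pending: { id: string; prompt: string }[] = [];
  private results = new Map<string, { result: string; expiresAtMs: number }>();
  private issued = new Set<string>();
  private counter = 0;

  // Edge API: persist the message and return a job ID without waiting for inference.
  enqueue(prompt: string): string {
    const id = `job-${++this.counter}`;
    this.pending.push({ id, prompt });
    this.issued.add(id);
    return id;
  }

  // Stands in for QStash delivering the oldest message to a worker.
  deliverNext(): { id: string; prompt: string } | undefined {
    return this.pending.shift();
  }

  // Worker: write the finished result to the cache with a 24-hour TTL.
  complete(id: string, result: string, nowMs: number): void {
    this.results.set(id, { result, expiresAtMs: nowMs + RESULT_TTL_MS });
  }

  // Status endpoint: what a polling client sees at time `nowMs`.
  poll(id: string, nowMs: number): JobStatus | undefined {
    if (!this.issued.has(id)) return undefined; // unknown job ID
    const entry = this.results.get(id);
    if (entry === undefined) return { state: "pending" };
    if (nowMs >= entry.expiresAtMs) return { state: "expired" };
    return { state: "done", result: entry.result };
  }
}
```

Returning the ID before inference starts is what lets the edge API respond quickly regardless of GPU load; the client pays for that with polling or a callback.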

Fair scheduling

Each user has a token bucket that enforces a per-minute request limit: 3 requests per minute (RPM) on the Free tier and 30 RPM on the Pro tier. A request over the limit receives HTTP 429 with a Retry-After header. Jobs admitted within the limit are processed in FIFO order.
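A per-user token bucket for the two tiers might look like the sketch below. It assumes the bucket holds up to one minute's quota (3 or 30 tokens) and refills continuously at that rate; `RateLimiter`, `admit`, and the 202 status for admitted requests are illustrative assumptions. On a distributed edge runtime the bucket state would need to live in shared storage rather than process memory.

```typescript
// Sketch of a per-user token bucket. Assumptions: capacity equals the
// per-minute limit, and tokens refill continuously at that rate.

type Tier = "free" | "pro";
const REQUESTS_PER_MINUTE: Record<Tier, number> = { free: 3, pro: 30 };

interface Bucket {
  tokens: number;
  updatedMs: number;
}

type Admission =
  | { status: 202 } // admitted: the job is enqueued
  | { status: 429; retryAfterSeconds: number };

class RateLimiter {
  private buckets = new Map<string, Bucket>();

  admit(userId: string, tier: Tier, nowMs: number): Admission {
    const capacity = REQUESTS_PER_MINUTE[tier];
    const bucket = this.buckets.get(userId) ?? { tokens: capacity, updatedMs: nowMs };

    // Refill for the time elapsed since the last request, capped at capacity.
    const refill = ((nowMs - bucket.updatedMs) * capacity) / 60_000;
    bucket.tokens = Math.min(capacity, bucket.tokens + refill);
    bucket.updatedMs = nowMs;
    this.buckets.set(userId, bucket);

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { status: 202 };
    }

    // Time until one whole token is available, rounded up for the Retry-After header.
    const waitMs = ((1 - bucket.tokens) * 60_000) / capacity;
    return { status: 429, retryAfterSeconds: Math.ceil(waitMs / 1000) };
  }
}
```

On the free tier, three requests at the same instant are admitted and a fourth is told to retry after 20 seconds, the time one token takes to refill at 3 tokens per minute.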

Each worker processes one job at a time to avoid GPU memory contention. Throughput scales horizontally by adding workers; QStash distributes messages across them round-robin.
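The effect of single-job workers under round-robin distribution can be modeled with a small simulation. It assumes job durations are known up front and that delivery is strictly round-robin; `simulateRoundRobin` and `makespanMs` are hypothetical helpers, not part of QStash.

```typescript
// Simulation of round-robin delivery to workers that each process one job at a time.
// Assumption: job durations are known up front; this models throughput, not a real scheduler.

interface SimJob {
  id: string;
  durationMs: number;
}

interface WorkerLoad {
  jobIds: string[];
  busyUntilMs: number;
}

function simulateRoundRobin(jobs: SimJob[], workerCount: number): WorkerLoad[] {
  if (workerCount < 1) throw new Error("workerCount must be at least 1");
  const workers = Array.from({ length: workerCount }, (): WorkerLoad => ({ jobIds: [], busyUntilMs: 0 }));
  jobs.forEach((job, i) => {
    const worker = workers[i % workerCount]!; // round-robin: job i goes to worker i mod N
    // One job at a time: each job starts when the worker's previous job ends.
    worker.jobIds.push(job.id);
    worker.busyUntilMs += job.durationMs;
  });
  return workers;
}

// Time until every job has finished (the makespan).
function makespanMs(workers: WorkerLoad[]): number {
  return Math.max(0, ...workers.map((w) => w.busyUntilMs));
}
```

Four 10-second jobs take 40 seconds on one worker and 20 seconds on two, which is the gain horizontal scaling buys when each worker holds only one job in GPU memory.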

Retry policy

  • Up to 3 retries after the initial attempt, with exponential backoff (factor 4): 1s, 4s, 16s.
  • Each delay is randomized by ±25% (jitter) so that jobs that failed together do not retry in lockstep and cause a thundering herd.
  • Permanent failures (invalid prompt, banned content) go to the dead-letter queue and are not retried.
  • Transient failures (GPU OOM, timeout) trigger retry with backoff.
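The retry rules can be expressed as a small decision function. This sketch reads the three listed delays (1s, 4s, 16s) as three retries after the initial attempt, and assumes that a transient failure which exhausts its retries is dead-lettered, which the list does not state. The error codes in `classify` are hypothetical.

```typescript
// Sketch of the retry policy. Assumptions: the three listed delays mean three
// retries after the first attempt, and exhausted transient failures are dead-lettered.

type FailureKind = "permanent" | "transient";

type Decision = { action: "retry"; delayMs: number } | { action: "dead-letter" };

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1_000;
const BACKOFF_FACTOR = 4; // 1s, 4s, 16s
const JITTER = 0.25; // ±25%

// Hypothetical error codes; unknown codes are treated as transient (an assumption).
const PERMANENT_CODES = new Set(["INVALID_PROMPT", "BANNED_CONTENT"]);

function classify(errorCode: string): FailureKind {
  return PERMANENT_CODES.has(errorCode) ? "permanent" : "transient";
}

// Nominal delay before retry n (1-based): 1000, 4000, 16000 ms.
function nominalDelayMs(retry: number): number {
  return BASE_DELAY_MS * BACKOFF_FACTOR ** (retry - 1);
}

// Scale the nominal delay by a random factor in [0.75, 1.25); `rand` returns [0, 1).
function jitteredDelayMs(retry: number, rand: () => number = Math.random): number {
  const factor = 1 - JITTER + 2 * JITTER * rand();
  return Math.round(nominalDelayMs(retry) * factor);
}

// Decide what happens after a failed attempt; `retriesSoFar` excludes the first attempt.
function onFailure(errorCode: string, retriesSoFar: number, rand: () => number = Math.random): Decision {
  if (classify(errorCode) === "permanent" || retriesSoFar >= MAX_RETRIES) {
    return { action: "dead-letter" };
  }
  return { action: "retry", delayMs: jitteredDelayMs(retriesSoFar + 1, rand) };
}
```

Passing `rand` explicitly keeps the jitter testable; in production the default `Math.random` spreads retries of simultaneously failed jobs across a ±25% band.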

Idempotency

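Every request carries an idempotency key (UUID v7) that the client generates once and reuses on any resubmission of the same request. QStash deduplicates messages with the same key within a 1-hour window, and a duplicate submission for a job that has already completed returns the cached result from KV immediately.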
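Key-based deduplication in front of the result cache can be sketched as follows. The class is an in-memory stand-in for QStash and KV, not their APIs; keying cached results by idempotency key, and answering an in-progress duplicate with the existing job ID, are assumptions the text does not spell out.

```typescript
// In-memory sketch of key-based deduplication in front of a result cache.
// Not the QStash or KV API; keying results by idempotency key is an assumption.

const DEDUPE_WINDOW_MS = 60 * 60 * 1000; // 1 hour
const CACHE_TTL_MS = 24 * 60 * 60 * 1000; // 24 hours

type SubmitOutcome =
  | { kind: "enqueued"; jobId: string }
  | { kind: "duplicate"; jobId: string } // assumption: in-progress duplicates get the existing job ID
  | { kind: "cached"; result: string };

class IdempotentSubmitter {
  private firstSeen = new Map<string, { jobId: string; atMs: number }>();
  private results = new Map<string, { result: string; expiresAtMs: number }>();
  private counter = 0;

  submit(idempotencyKey: string, nowMs: number): SubmitOutcome {
    // A completed, unexpired result is returned immediately.
    const cached = this.results.get(idempotencyKey);
    if (cached !== undefined && nowMs < cached.expiresAtMs) {
      return { kind: "cached", result: cached.result };
    }

    // Inside the dedupe window, a repeated key is not enqueued again.
    const prior = this.firstSeen.get(idempotencyKey);
    if (prior !== undefined && nowMs - prior.atMs < DEDUPE_WINDOW_MS) {
      return { kind: "duplicate", jobId: prior.jobId };
    }

    const jobId = `job-${++this.counter}`;
    this.firstSeen.set(idempotencyKey, { jobId, atMs: nowMs });
    return { kind: "enqueued", jobId };
  }

  // A worker finished the job for this key: cache its result.
  complete(idempotencyKey: string, result: string, nowMs: number): void {
    this.results.set(idempotencyKey, { result, expiresAtMs: nowMs + CACHE_TTL_MS });
  }
}
```

Because the result cache (24 hours) outlives the dedupe window (1 hour), a resubmission two hours later still gets the cached result, while a job that never completed is enqueued again once the window has passed.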

Observability

Queue depth, processing latency (p50/p95/p99), and error rates are exported as Prometheus metrics from each worker. A Grafana dashboard surfaces queue health. Alerts fire on p99 latency exceeding 30s or dead-letter queue growth.
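The latency percentiles and alert conditions can be illustrated with a nearest-rank percentile over a window of samples. This is only a sketch: in Prometheus, latency is more commonly exported as a histogram and queried with `histogram_quantile`, and treating dead-letter queue growth as an increase between two snapshots is an assumption. `percentile`, `firingAlerts`, and `QueueSnapshot` are hypothetical names.

```typescript
// Sketch: nearest-rank percentiles plus the two alert conditions from the text.
// Names and the snapshot-comparison rule for dead-letter growth are assumptions.

function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest rank: the smallest sample with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1]!;
}

interface QueueSnapshot {
  latenciesMs: number[]; // processing latencies observed in the window
  deadLetterDepth: number; // messages currently in the dead-letter queue
}

const P99_ALERT_THRESHOLD_MS = 30_000;

function firingAlerts(previous: QueueSnapshot, current: QueueSnapshot): string[] {
  const fired: string[] = [];
  if (current.latenciesMs.length > 0 && percentile(current.latenciesMs, 99) > P99_ALERT_THRESHOLD_MS) {
    fired.push("p99-latency");
  }
  if (current.deadLetterDepth > previous.deadLetterDepth) {
    fired.push("dead-letter-growth");
  }
  return fired;
}
```

Alerting on p99 rather than the mean means that two slow jobs out of a hundred are enough to fire, which is the tail behavior a GPU queue tends to hide in averages.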

Circuit breaker

If the worker pool's error rate exceeds 20% over a 60-second window, the breaker opens: the API layer stops enqueuing new jobs and returns HTTP 503. After 30 seconds the breaker half-opens and allows one probe request. If the probe succeeds, the breaker closes and normal traffic resumes; if it fails, the breaker re-opens.
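The breaker's three states can be sketched as a small class. It assumes the API layer learns each job's outcome (for example from worker callbacks) and that results arriving while the breaker is open, or from jobs other than the probe while it is half-open, are ignored; `CircuitBreaker` and its methods are hypothetical.

```typescript
// Sketch of the circuit breaker: opens when the error rate over the last 60 s
// exceeds 20%, half-opens after 30 s, and admits a single probe request.
// Assumption: the API layer is told each job's outcome via record().

type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private outcomes: { atMs: number; ok: boolean }[] = [];
  private openedAtMs = 0;
  private probeInFlight = false;
  private readonly windowMs: number;
  private readonly maxErrorRate: number;
  private readonly cooldownMs: number;

  constructor(windowMs = 60_000, maxErrorRate = 0.2, cooldownMs = 30_000) {
    this.windowMs = windowMs;
    this.maxErrorRate = maxErrorRate;
    this.cooldownMs = cooldownMs;
  }

  currentState(): BreakerState {
    return this.state;
  }

  // Called before enqueuing a job; false means the API should return HTTP 503.
  allowRequest(nowMs: number): boolean {
    if (this.state === "open" && nowMs - this.openedAtMs >= this.cooldownMs) {
      this.state = "half-open";
      this.probeInFlight = false;
    }
    if (this.state === "closed") return true;
    if (this.state === "half-open" && !this.probeInFlight) {
      this.probeInFlight = true; // this request is the single probe
      return true;
    }
    return false;
  }

  // Report a finished job's outcome.
  record(ok: boolean, nowMs: number): void {
    if (this.state === "half-open") {
      if (!this.probeInFlight) return; // not the probe's result; ignored (assumption)
      this.probeInFlight = false;
      if (ok) {
        this.state = "closed"; // probe succeeded: reset
        this.outcomes = [];
      } else {
        this.trip(nowMs); // probe failed: re-open
      }
      return;
    }
    if (this.state === "open") return; // late results while open are ignored (assumption)

    // Closed: keep only outcomes inside the sliding window, then check the error rate.
    this.outcomes.push({ atMs: nowMs, ok });
    this.outcomes = this.outcomes.filter((o) => o.atMs > nowMs - this.windowMs);
    const errors = this.outcomes.filter((o) => !o.ok).length;
    if (errors / this.outcomes.length > this.maxErrorRate) this.trip(nowMs);
  }

  private trip(nowMs: number): void {
    this.state = "open";
    this.openedAtMs = nowMs;
    this.outcomes = [];
  }
}
```

As written, a single failure in an otherwise empty window already exceeds 20% and trips the breaker; many breaker implementations also require a minimum number of requests in the window before evaluating the error rate.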
