Recipe

ML model serving design

A production blueprint for deploying inference endpoints with latency budgets, GPU scheduling, and canary rollouts.

⏱

Targets P99 latency under 120 ms with Triton Inference Server and ONNX Runtime, backed by a pre-warmed model cache, pinned GPU memory, and batching over dynamic shapes.
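As a rough illustration of what checking the 120 ms P99 budget could look like, the sketch below computes a nearest-rank percentile over a window of latency samples. The function names and the sample data are hypothetical; a real deployment would more likely read this figure from its metrics backend.

```python
P99_BUDGET_MS = 120.0  # latency budget quoted in the text


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it. `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, budget_ms=P99_BUDGET_MS):
    """True when the P99 of the window is at or under the budget."""
    return percentile(samples_ms, 99) <= budget_ms


# Hypothetical window: 98 fast requests, one slow, one very slow.
latencies_ms = [40.0] * 98 + [110.0, 300.0]
p99 = percentile(latencies_ms, 99)
ok = within_budget(latencies_ms)
```

With 100 samples, the single 300 ms outlier sits above the 99th rank, so this window still meets the budget; a second outlier of that size would break it.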

⚙

CUDA MPS lets multiple processes share a GPU concurrently, model ensembles run as DAGs, and each tenant SLA tier gets its own priority queue.
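The per-tier priority queue can be sketched with a binary heap. This is a minimal illustration, not the scheduler itself: the tier names (`gold`, `silver`, `bronze`) and the `TieredQueue` class are assumptions made for the example.

```python
import heapq
import itertools

# Hypothetical SLA tiers; a lower number is served first.
TIER_PRIORITY = {"gold": 0, "silver": 1, "bronze": 2}


class TieredQueue:
    """Strict-priority queue that stays FIFO within each tier."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal-priority requests keep
        # their arrival order and payloads are never compared.
        self._seq = itertools.count()

    def put(self, tier, request):
        heapq.heappush(self._heap, (TIER_PRIORITY[tier], next(self._seq), request))

    def get(self):
        _, _, request = heapq.heappop(self._heap)
        return request

    def __len__(self):
        return len(self._heap)


q = TieredQueue()
q.put("bronze", "req-a")
q.put("gold", "req-b")
q.put("silver", "req-c")
q.put("gold", "req-d")
served = [q.get() for _ in range(len(q))]
```

Strict priority like this can starve lower tiers under sustained load, so a production scheduler would probably add aging or weighted sharing on top.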

🔄

Traffic shifts in stages, 5% → 50% → 100%, with automated rollback when drift is detected or the error rate spikes.
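The staged rollout can be read as a small state machine. The sketch below is one possible reading: the stage percentages come from the text, while the `next_weight` function and the 1% error-rate threshold are assumptions for illustration.

```python
STAGES = [5, 50, 100]  # percent of traffic on the new version, from the text


def next_weight(current, error_rate, drift_detected, max_error_rate=0.01):
    """Return the next canary traffic percentage; 0 means roll back.

    `max_error_rate` is a hypothetical threshold, not a documented value.
    """
    if drift_detected or error_rate > max_error_rate:
        return 0  # automated rollback: send all traffic to the old version
    if current not in STAGES:
        return STAGES[0]  # rollout not started yet (or just rolled back)
    idx = STAGES.index(current)
    # Advance one stage, holding at 100% once fully rolled out.
    return STAGES[min(idx + 1, len(STAGES) - 1)]


healthy_path = []
weight = 0
for _ in range(4):
    weight = next_weight(weight, error_rate=0.001, drift_detected=False)
    healthy_path.append(weight)
```

A healthy rollout walks through each stage and then holds at 100%, while any drift signal or error spike returns 0 regardless of the current stage.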

Architecture overview

Ingress — Envoy sidecar terminates TLS, enforces rate limits, and routes gRPC prediction requests to the model server pool.
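Rate limiting at the ingress is configured in the proxy rather than written by hand, but the underlying idea is often a token bucket. The sketch below shows that mechanism only; the `TokenBucket` class, its parameters, and the injected clock are assumptions for illustration and do not reflect the proxy's actual configuration.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens/second."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst passes
        self.last = now

    def allow(self, now):
        """Return True and spend one token if the request may proceed.

        `now` is passed in (seconds) so the behavior is deterministic.
        """
        if now > self.last:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_s=2, capacity=2)
decisions = [
    bucket.allow(0.0),  # burst token 1
    bucket.allow(0.0),  # burst token 2
    bucket.allow(0.0),  # bucket empty
    bucket.allow(0.5),  # half a second refills one token
    bucket.allow(0.5),  # empty again
]
```

Requests that return False would typically be rejected with a rate-limit status before they reach the model server pool.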

Model repo — Versioned artifacts in S3 with signed manifests. Triton polls every 30s and loads new versions without downtime.
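The text does not say how manifests are signed. As a stand-in, the sketch below assumes a manifest that maps file names to SHA-256 digests and is authenticated with an HMAC over its canonical JSON; a real setup might well use asymmetric signatures instead. All names here (`sign_manifest`, `verify_artifact`, the manifest layout, the key) are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest, key):
    """HMAC-SHA256 over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_artifact(manifest, signature, key, name, data):
    """Check the manifest signature first, then the artifact's digest."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False  # manifest was altered or signed with another key
    expected = manifest["files"].get(name)
    if expected is None:
        return False  # artifact is not listed in the manifest
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(expected, actual)


key = b"example-shared-key"        # hypothetical key for the example
model_bytes = b"fake-onnx-bytes"   # placeholder artifact contents
manifest = {
    "version": "3",
    "files": {"model.onnx": hashlib.sha256(model_bytes).hexdigest()},
}
signature = sign_manifest(manifest, key)

good = verify_artifact(manifest, signature, key, "model.onnx", model_bytes)
tampered = verify_artifact(manifest, signature, key, "model.onnx", b"other-bytes")
```

Running a check like this before a new version is loaded would keep a corrupted or unsigned artifact from ever being served.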

Observability — Prometheus metrics on inference latency, queue depth, and GPU utilization. OpenTelemetry traces span ingress to tensor output.

Meridian · v2.4