Recipe
ML model serving design
A production blueprint for deploying inference endpoints with latency budgets, GPU scheduling, and canary rollouts.
Latency budget
P99 under 120ms with Triton Inference Server and ONNX Runtime. Pre-warmed model cache, pinned GPU memory, and batched dynamic shapes.
GPU scheduling
CUDA MPS for concurrency, model ensemble DAGs, and priority queues per tenant SLA tier.
Canary rollout
Traffic splitting 5% → 50% → 100% with automated rollback on drift detection or error rate spike.
Architecture overview
Ingress — Envoy sidecar terminates TLS, enforces rate limits, and routes gRPC prediction requests to the model server pool.
Model repo — Versioned artifacts in S3 with signed manifests. Triton polls every 30s and loads new versions without downtime.
Observability — Prometheus metrics on inference latency, queue depth, and GPU utilization. OpenTelemetry traces span ingress to tensor output.