Recipe

ML model serving design

A production blueprint for deploying inference endpoints with latency budgets, GPU scheduling, and canary rollouts.

Latency budget

P99 under 120ms with Triton Inference Server and ONNX Runtime. Pre-warmed model cache, pinned GPU memory, and batched dynamic shapes.

GPU scheduling

CUDA MPS for concurrency, model ensemble DAGs, and priority queues per tenant SLA tier.

🔄

Canary rollout

Traffic splitting 5% → 50% → 100% with automated rollback on drift detection or error rate spike.

Architecture overview

IngressEnvoy sidecar terminates TLS, enforces rate limits, and routes gRPC prediction requests to the model server pool.

Model repoVersioned artifacts in S3 with signed manifests. Triton polls every 30s and loads new versions without downtime.

ObservabilityPrometheus metrics on inference latency, queue depth, and GPU utilization. OpenTelemetry traces span ingress to tensor output.

← Back to docs|Meridian · v2.4