Recipe: Model fallback chain design
When a primary inference provider fails, degrade gracefully through a ranked chain of alternatives without dropping the user session.
Chain topology
Define an ordered list of providers. Each entry carries a weight, a timeout, and a circuit-breaker threshold. The orchestrator walks the list on every request, skipping entries whose breaker is open.
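The walk described above can be sketched as follows. This is a minimal illustration, not the router's actual code: the `Provider` fields, the `call_with_fallback` helper, and the `send` callback are hypothetical names, and the breaker threshold is assumed to count consecutive failures, which the recipe does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Provider:
    name: str
    weight: float             # carried per the recipe; not used in this sketch
    timeout_s: float
    breaker_threshold: int    # assumption: consecutive failures before opening
    consecutive_failures: int = 0
    breaker_open: bool = False


def call_with_fallback(
    chain: List[Provider], send: Callable[[Provider], Any]
) -> Tuple[str, Any]:
    """Walk the chain in order, skipping providers whose breaker is open."""
    last_error = None
    for provider in chain:
        if provider.breaker_open:
            continue
        try:
            # `send` is expected to enforce provider.timeout_s itself.
            result = send(provider)
        except Exception as exc:
            last_error = exc
            provider.consecutive_failures += 1
            if provider.consecutive_failures >= provider.breaker_threshold:
                provider.breaker_open = True
            continue
        provider.consecutive_failures = 0
        return provider.name, result
    raise RuntimeError("all providers failed or have open breakers") from last_error
```

Returning the provider name alongside the result keeps the information needed for the response's provider field available to the caller.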
Health signals
Track latency p99, error rate over a 60-second window, and token quota remaining. A provider is marked unhealthy when error rate exceeds 15% or p99 crosses the configured ceiling.
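One way to compute these signals is a rolling window of per-request samples. The sketch below is illustrative only: `HealthWindow` and its method names are hypothetical, the p99 uses the nearest-rank method over the same 60-second window (the recipe does not say which window p99 uses), and token quota is left out because the recipe gives no threshold for it.

```python
import math
import time
from collections import deque


class HealthWindow:
    def __init__(self, p99_ceiling_ms, window_s=60.0, max_error_rate=0.15,
                 clock=time.monotonic):
        self.p99_ceiling_ms = p99_ceiling_ms
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.clock = clock            # injectable so the window can be tested
        self.samples = deque()        # (timestamp, latency_ms, ok)

    def record(self, latency_ms, ok):
        now = self.clock()
        self.samples.append((now, latency_ms, ok))
        self._evict(now)

    def _evict(self, now):
        # Drop samples older than the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def error_rate(self):
        self._evict(self.clock())
        if not self.samples:
            return 0.0
        errors = sum(1 for _, _, ok in self.samples if not ok)
        return errors / len(self.samples)

    def p99_ms(self):
        self._evict(self.clock())
        if not self.samples:
            return 0.0
        latencies = sorted(lat for _, lat, _ in self.samples)
        rank = math.ceil(0.99 * len(latencies))   # nearest-rank percentile
        return latencies[rank - 1]

    def is_unhealthy(self):
        return (self.error_rate() > self.max_error_rate
                or self.p99_ms() > self.p99_ceiling_ms)
```

An empty window reports the provider as healthy here; whether a provider with no recent traffic should count as healthy is a design choice the recipe leaves open.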
Recovery probe
Every 30 seconds, send a lightweight canary request to each unhealthy provider. If two consecutive probes succeed, close the breaker and restore the provider to the active chain.
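The two-consecutive-successes rule can be sketched as a small streak counter. `RecoveryProbe` and `observe` are hypothetical names; the scheduler that fires the canary every 30 seconds is assumed to live elsewhere and to call `observe` with each probe result.

```python
PROBE_INTERVAL_S = 30   # from the recipe; used by the (not shown) scheduler


class RecoveryProbe:
    def __init__(self, required_successes=2):
        self.required = required_successes
        self.streak = {}   # provider name -> consecutive successful probes

    def observe(self, provider_name, probe_ok):
        """Record one probe result; return True when the breaker should close."""
        if not probe_ok:
            # Any failed probe resets the streak.
            self.streak[provider_name] = 0
            return False
        self.streak[provider_name] = self.streak.get(provider_name, 0) + 1
        if self.streak[provider_name] >= self.required:
            del self.streak[provider_name]
            return True
        return False
```

Resetting the streak on any failure means a flapping provider stays out of the chain until it passes two probes in a row.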
Client contract
The API response includes a provider field so the frontend can surface which model served the request. No retry logic lives in the browser; failover is handled entirely at the edge.
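As an illustration of the contract, a response payload might be shaped like this. Only the `provider` field comes from the recipe; `InferenceResponse`, `output`, and `to_json` are hypothetical names chosen for the sketch.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InferenceResponse:
    provider: str   # which provider served the request (from the recipe)
    output: str     # hypothetical field name for the model output


def to_json(resp: InferenceResponse) -> str:
    """Serialize the response as the edge might return it to the frontend."""
    return json.dumps(asdict(resp))
```

The frontend only reads `provider` for display; it does not use it to decide whether to retry.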
This pattern is deployed in Meridian's inference router. The chain currently spans three providers with a median failover time under 120 ms.