Recipe: Model fallback chain design

When a primary inference provider fails, degrade gracefully through a ranked chain of alternatives without dropping the user session.

Chain topology

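Define an ordered list of providers, ranked from most to least preferred. Each entry carries a weight, a timeout, and a circuit-breaker threshold. The orchestrator walks the list in order on every request, skipping entries whose breaker is open, and falls through to the next entry when a call fails or times out.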
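The walk described above can be sketched as follows. This is a minimal illustration, not the production router: the names ProviderEntry, candidates, and route are hypothetical, and reading the breaker threshold as an error-rate fraction is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, List, Tuple


@dataclass
class ProviderEntry:
    name: str                  # hypothetical provider identifier
    weight: float              # carried per the recipe; unused in this sketch
    timeout_s: float           # per-request timeout the caller should honor
    breaker_threshold: float   # assumed to be an error-rate fraction, e.g. 0.15
    breaker_open: bool = False


def candidates(chain: List[ProviderEntry]) -> Iterator[ProviderEntry]:
    """Yield providers in rank order, skipping entries whose breaker is open."""
    for entry in chain:
        if not entry.breaker_open:
            yield entry


def route(chain: List[ProviderEntry],
          call: Callable[[ProviderEntry], Any]) -> Tuple[str, Any]:
    """Try each eligible provider in order; return (provider name, result).

    `call` is expected to enforce entry.timeout_s itself and raise on failure.
    """
    last_error = None
    for entry in candidates(chain):
        try:
            return entry.name, call(entry)
        except Exception as exc:  # any failure falls through to the next entry
            last_error = exc
    raise RuntimeError("all providers failed or are unavailable") from last_error
```

Returning the provider name alongside the result keeps the information needed to tell the client which provider served the request.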

Health signals

Track latency p99, error rate over a 60-second window, and token quota remaining. A provider is marked unhealthy when error rate exceeds 15% or p99 crosses the configured ceiling.
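One way to evaluate the unhealthy rule is a sliding window of recent samples. The sketch below assumes p99 is computed over the same 60-second window as the error rate and uses the nearest-rank method; the class name HealthWindow and its methods are hypothetical. Token quota is left out because the recipe gives no threshold for it.

```python
from collections import deque
from typing import Deque, Tuple

WINDOW_S = 60.0          # window length from the recipe
MAX_ERROR_RATE = 0.15    # unhealthy when the error rate exceeds 15%


class HealthWindow:
    def __init__(self, p99_ceiling_ms: float):
        self.p99_ceiling_ms = p99_ceiling_ms
        # Each sample is (timestamp_s, latency_ms, ok).
        self.samples: Deque[Tuple[float, float, bool]] = deque()

    def record(self, now: float, latency_ms: float, ok: bool) -> None:
        self.samples.append((now, latency_ms, ok))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop samples that are 60 seconds old or older.
        while self.samples and self.samples[0][0] <= now - WINDOW_S:
            self.samples.popleft()

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self.samples:
            return 0.0
        errors = sum(1 for _, _, ok in self.samples if not ok)
        return errors / len(self.samples)

    def p99_ms(self, now: float) -> float:
        self._evict(now)
        if not self.samples:
            return 0.0
        latencies = sorted(lat for _, lat, _ in self.samples)
        n = len(latencies)
        rank = (99 * n + 99) // 100   # integer form of ceil(0.99 * n)
        return latencies[rank - 1]

    def is_unhealthy(self, now: float) -> bool:
        return (self.error_rate(now) > MAX_ERROR_RATE
                or self.p99_ms(now) > self.p99_ceiling_ms)
```

An empty window is treated as healthy here; whether a provider with no recent traffic should count as healthy is a design choice the recipe leaves open.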

Recovery probe

Every 30 seconds, send a lightweight canary request to each unhealthy provider. If two consecutive probes succeed, close the breaker and restore the provider to the active chain.
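The probe logic reduces to a small state machine per unhealthy provider. The following is a sketch under stated assumptions: BreakerState, maybe_probe, and the send_canary callback are hypothetical names, and a canary that raises is counted as a failed probe.

```python
from dataclasses import dataclass
from typing import Callable

PROBE_INTERVAL_S = 30.0   # probe cadence from the recipe
REQUIRED_SUCCESSES = 2    # consecutive successes needed to close the breaker


@dataclass
class BreakerState:
    open: bool = True
    consecutive_ok: int = 0
    last_probe_at: float = float("-inf")   # so the first probe is always due


def maybe_probe(state: BreakerState, now: float,
                send_canary: Callable[[], bool]) -> bool:
    """Send a canary if one is due; return True if this call closed the breaker."""
    if not state.open or now - state.last_probe_at < PROBE_INTERVAL_S:
        return False
    state.last_probe_at = now
    try:
        ok = bool(send_canary())
    except Exception:
        ok = False   # assumption: an exception counts as a failed probe
    # A single failure resets the streak, so successes must be consecutive.
    state.consecutive_ok = state.consecutive_ok + 1 if ok else 0
    if state.consecutive_ok >= REQUIRED_SUCCESSES:
        state.open = False
        state.consecutive_ok = 0
        return True
    return False
```

Passing the clock value in as `now` keeps the function deterministic, which makes the 30-second cadence easy to test without sleeping.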

Client contract

The API response includes a provider field so the frontend can surface which model served the request. No retry logic lives in the browser; the edge handles all retries and failover.
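As an illustration of the contract, a response envelope might look like the sketch below. Only the provider field comes from the recipe; the output field, the InferenceResponse name, and the JSON encoding are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InferenceResponse:
    provider: str   # which provider served the request (from the recipe)
    output: str     # hypothetical payload field, for illustration only


def to_json(resp: InferenceResponse) -> str:
    """Serialize the envelope the edge would return to the frontend."""
    return json.dumps(asdict(resp))
```

The frontend reads the provider field for display only; it never uses it to decide whether to retry.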

This pattern is deployed in Meridian's inference router. The chain currently spans three providers with a median failover time under 120 ms.