Inference

Recipe Inference Optimization

How Meridian reduces token waste and speeds up recipe generation through structured prompt routing and cached inference paths.

Prompt Routing

Meridian classifies each incoming request into one of three tiers (simple substitution, partial rewrite, or full generation) before the model sees a single token. Lightweight classifiers run on CPU in under 12 ms, so trivial queries never trigger GPU spin-up.
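A minimal sketch of the three-tier routing step is shown here. The page does not say how the classifiers work, so the keyword cues, the `Tier` enum, and `classify_request` are hypothetical stand-ins for whatever lightweight model runs in production; only the three tier names come from the text above.

```python
from enum import Enum


class Tier(Enum):
    SUBSTITUTION = "simple_substitution"
    PARTIAL_REWRITE = "partial_rewrite"
    FULL_GENERATION = "full_generation"


# Hypothetical keyword cues (assumption). A production classifier would
# more likely be a small trained model than a keyword list.
SUBSTITUTION_CUES = ("substitute", "replace", "instead of", "swap")
REWRITE_CUES = ("make it", "adapt", "convert", "without", "scale")


def classify_request(text: str) -> Tier:
    """Route a request to the cheapest tier whose cues match."""
    lowered = text.lower()
    if any(cue in lowered for cue in SUBSTITUTION_CUES):
        return Tier.SUBSTITUTION
    if any(cue in lowered for cue in REWRITE_CUES):
        return Tier.PARTIAL_REWRITE
    # Nothing matched: fall through to the most expensive tier.
    return Tier.FULL_GENERATION
```

Checking the cheapest tier first means an ambiguous request is only escalated when no cheaper cue matches, which fits the stated goal of keeping trivial queries off the GPU.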

Semantic Cache

Embedding-based deduplication catches near-duplicate requests. When the cosine similarity between a request's embedding and a cached entry's embedding exceeds 0.92, Meridian serves the stored response directly. Cache entries expire after 72 hours or on ingredient-seasonality drift.
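The lookup described above could be sketched as follows. The 0.92 threshold and the 72-hour expiry come from the text; `SemanticCache`, `CacheEntry`, and the linear scan are illustrative assumptions, and the seasonality-drift expiry is not modeled because the page does not define it.

```python
import math
import time
from dataclasses import dataclass
from typing import List, Optional

SIMILARITY_THRESHOLD = 0.92   # from the text
TTL_SECONDS = 72 * 3600       # 72-hour expiry, from the text


@dataclass
class CacheEntry:
    embedding: List[float]
    response: str
    created_at: float


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache; a real system would use a vector index."""

    def __init__(self) -> None:
        self._entries: List[CacheEntry] = []

    def put(self, embedding: List[float], response: str,
            now: Optional[float] = None) -> None:
        created = time.time() if now is None else now
        self._entries.append(CacheEntry(embedding, response, created))

    def lookup(self, embedding: List[float],
               now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Drop entries older than the TTL before searching.
        self._entries = [e for e in self._entries
                         if now - e.created_at < TTL_SECONDS]
        best, best_sim = None, SIMILARITY_THRESHOLD
        for entry in self._entries:
            sim = cosine_similarity(embedding, entry.embedding)
            if sim > best_sim:  # strictly above the threshold
                best, best_sim = entry, sim
        return best.response if best is not None else None
```

A lookup returns the stored response for the most similar unexpired entry, or `None` on a miss, in which case the request would continue to the model.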

Token Budgeting

Each inference tier carries a hard token cap: 512 tokens for substitutions, 1024 for partial rewrites, and 2048 for full generations. If the model reaches its cap before finishing, the system truncates the generation and stitches the partial output together with a deterministic fallback template.
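As a rough illustration of cap enforcement, the sketch below consumes a token stream and stops at the tier's cap. The caps are those listed above; the tier keys, `generate_with_budget`, and the wording of `FALLBACK_TEMPLATE` are hypothetical, since the page does not show the actual template or stitching logic.

```python
from typing import Iterable, Tuple

# Caps from the text; the string keys are assumed names for the tiers.
TOKEN_CAPS = {
    "simple_substitution": 512,
    "partial_rewrite": 1024,
    "full_generation": 2048,
}

# Hypothetical fallback text (assumption); the real template is not documented.
FALLBACK_TEMPLATE = "\n[Output truncated at the token cap.]"


def generate_with_budget(token_stream: Iterable[str],
                         tier: str) -> Tuple[str, bool]:
    """Collect tokens up to the tier cap; append the fallback if cut short."""
    cap = TOKEN_CAPS[tier]
    tokens = []
    truncated = False
    for token in token_stream:
        if len(tokens) >= cap:
            # The model produced more than the cap allows: stop here.
            truncated = True
            break
        tokens.append(token)
    text = "".join(tokens)
    if truncated:
        text += FALLBACK_TEMPLATE
    return text, truncated
```

A stream that ends exactly at the cap is not treated as truncated, so the fallback is appended only when the model actually had more to say.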

Batched Decoding

When more than 3 requests are queued within a 200 ms window, Meridian merges them into a single forward pass. Reusing the KV cache for their shared prefix cuts per-request latency by up to 40% under load.
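The batching trigger can be sketched as a pure planning function. The 200 ms window and the depth threshold of 3 come from the text; `plan_batches` and its windowing scheme (a window anchored at the first waiting request) are assumptions, and KV-cache prefix sharing is not modeled here.

```python
from typing import List, Tuple

BATCH_WINDOW_S = 0.200   # 200 ms window, from the text
MIN_QUEUE_DEPTH = 3      # batching triggers when depth exceeds this


def plan_batches(arrivals: List[Tuple[str, float]]) -> List[List[str]]:
    """Group requests into forward passes.

    `arrivals` is a list of (request_id, arrival_time_seconds) sorted by
    time. Each inner list in the result is one forward pass.
    """
    batches: List[List[str]] = []
    i = 0
    while i < len(arrivals):
        # Assumed scheme: the window opens at the first waiting request.
        window_end = arrivals[i][1] + BATCH_WINDOW_S
        j = i
        while j < len(arrivals) and arrivals[j][1] < window_end:
            j += 1
        window = [request_id for request_id, _ in arrivals[i:j]]
        if len(window) > MIN_QUEUE_DEPTH:
            batches.append(window)                      # one merged pass
        else:
            batches.extend([rid] for rid in window)     # run individually
        i = j
    return batches
```

For example, four requests arriving within 150 ms of each other would be merged into one pass, while a fifth arriving a second later would run on its own.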

Latency SLA: p95 under 800 ms for cached hits and under 2.4 s for full generations, measured from API gateway ingress to the first response byte.
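To make the p95 targets concrete, here is one way to check a set of latency samples against them. The thresholds come from the SLA above; the nearest-rank percentile method, the path names, and the function names are assumptions, since the page does not say how the percentile is computed.

```python
from typing import List

# Thresholds from the SLA, in milliseconds; the keys are assumed names.
SLA_P95_MS = {"cached_hit": 800.0, "full_generation": 2400.0}


def p95(samples_ms: List[float]) -> float:
    """95th percentile using the nearest-rank method (an assumed choice)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_sla(samples_ms: List[float], path: str) -> bool:
    """True if the p95 latency is strictly under the SLA for this path."""
    return p95(samples_ms) < SLA_P95_MS[path]
```

With 20 samples the nearest-rank p95 is the 19th smallest value, so a single slow outlier does not break the SLA but two do.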