Recipe Inference Optimization
How Meridian reduces token waste and speeds up recipe generation through structured prompt routing and cached inference paths.
Prompt Routing
Meridian classifies incoming requests into one of three tiers — simple substitution, partial rewrite, or full generation — before the model ever sees a token. Lightweight classifiers run on-CPU in under 12ms, avoiding GPU spin-up for trivial queries.
Semantic Cache
Embedding-based deduplication catches near-duplicate requests. When cosine similarity exceeds 0.92 against a cached result, Meridian serves the stored response directly. Cache entries expire after 72 hours or on ingredient-seasonality drift.
Token Budgeting
Each inference tier carries a hard token cap. Substitutions get 512 tokens, partial rewrites 1024, full generations 2048. The system truncates mid-generation if the model overshoots, then stitches the partial output with a deterministic fallback template.
Batched Decoding
When queue depth exceeds 3 requests within a 200ms window, Meridian merges them into a single forward pass. Shared prefix KV-cache reuse cuts per-request latency by up to 40% under load.
Latency SLA: p95 under 800ms for cached hits, under 2.4s for full generations. Measured from API gateway ingress to first byte.