
Speculative RAG

Speculative RAG pairs a fast draft model with a slower verifier to cut p50 latency on retrieval-augmented generation by 40-60% without losing answer quality. This recipe walks through wiring it up against the Meridian gateway in under twenty minutes.

01. Pick your draft + verifier pair

The draft model should be cheap and fast: think a 7-13B parameter open-weight model or a gpt-4o-mini-class endpoint. The verifier is the model you would have used anyway. Meridian routes both calls through one endpoint, and the azure/model-router alias is available when you want adaptive model selection.

02. Draft N candidate answers in parallel

Fan out three to five draft completions at a high temperature. Each draft sees a different subset of the retrieved chunks, and that diversity is what gives the verifier something to choose between. The added cost stays small because the draft model is roughly ten times cheaper per token than your production model: at that ratio, four drafts cost about 0.4 times what a single production completion of the same length would.
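One way to give each draft its own slice of the retrieved context is to deal the chunks out round-robin. The sketch below is a minimal illustration, not part of the Meridian API: splitChunks, buildDraftMessages, and the prompt wording are all hypothetical names and choices made for this example.

```javascript
// Hypothetical helper: deal retrieved chunks round-robin into n subsets,
// one subset per draft. Chunk i lands in subset i % n.
function splitChunks(chunks, n) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  const subsets = Array.from({ length: n }, () => []);
  chunks.forEach((chunk, i) => subsets[i % n].push(chunk));
  return subsets;
}

// Hypothetical helper: build one chat message list per subset.
// The system prompt wording is an assumption, not taken from the recipe.
function buildDraftMessages(question, subsets) {
  return subsets.map((subset) => [
    { role: "system", content: "Answer using only the context provided." },
    {
      role: "user",
      content: `Context:\n${subset.join("\n---\n")}\n\nQuestion: ${question}`,
    },
  ]);
}
```

Each message list returned by buildDraftMessages can then be passed as the messages field of one draft request, so the drafts differ in context as well as in sampling.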

03. Verify and return the winner

Score each draft with the verifier in a single batched call, then return the highest-scoring draft to the user. Tail latency drops because most of the wall-clock time goes to the parallel draft phase rather than to sequential reasoning over the full context.
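If the verifier is prompted to reply with numeric scores rather than with the winning text, the selection step can run locally. The sketch below assumes the reply is a JSON array with one score per draft, in order; that reply format, and the names parseScores and pickWinner, are assumptions for illustration rather than anything the recipe or the gateway prescribes.

```javascript
// Hypothetical helper: parse a verifier reply that is assumed to be a JSON
// array of numbers such as "[0.2, 0.9, 0.5]", one score per draft.
function parseScores(reply, expectedCount) {
  const scores = JSON.parse(reply);
  if (!Array.isArray(scores) || scores.length !== expectedCount) {
    throw new Error(`expected ${expectedCount} scores`);
  }
  if (!scores.every((s) => typeof s === "number" && Number.isFinite(s))) {
    throw new Error("scores must be finite numbers");
  }
  return scores;
}

// Return the draft text with the highest score.
// Ties go to the earliest draft because the comparison is strict.
function pickWinner(draftTexts, scores) {
  if (draftTexts.length === 0 || draftTexts.length !== scores.length) {
    throw new Error("need one score per draft and at least one draft");
  }
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return draftTexts[best];
}
```

Validating the reply before trusting it matters here, since a model can return malformed JSON or the wrong number of scores; failing loudly lets the caller fall back to, say, the first draft.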

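import OpenAI from "openai";

// Point the stock OpenAI SDK at the Meridian gateway.
const client = new OpenAI({
  baseURL: "https://llm.getnimbus.net/v1",
  apiKey: process.env.MERIDIAN_KEY,
});

// messages is the chat prompt with the retrieved context already included.
// scoreMessages is a helper you supply: it turns the drafts into the
// verifier prompt and is not defined in this recipe.
async function speculativeAnswer(messages, scoreMessages) {
  // Step 02: four drafts in parallel on the cheap model. For brevity every
  // draft gets the same messages here; pass a different chunk subset to
  // each call to follow step 02 fully.
  const drafts = await Promise.all(
    Array.from({ length: 4 }, () =>
      client.chat.completions.create({
        model: "azure/gpt-4o-mini",
        messages,
        temperature: 0.9,
      })
    )
  );

  // Step 03: one verifier call that sees all drafts. This assumes the
  // verifier prompt asks for the winning draft's text as the reply.
  const verified = await client.chat.completions.create({
    model: "azure/model-router",
    messages: scoreMessages(drafts),
  });

  return verified.choices[0].message.content;
}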
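
The snippet above calls scoreMessages(drafts) without defining it. A minimal sketch of what such a helper might look like follows; the prompt wording and the optional question parameter are assumptions made for this example, and the helper only reads choices[0].message.content from each draft response.

```javascript
// Hypothetical implementation of the scoreMessages helper used above.
// drafts: chat completion responses from the draft model.
// question: optional original user question (an added, assumed parameter).
function scoreMessages(drafts, question) {
  // Number the drafts so the verifier can compare them side by side.
  const numbered = drafts
    .map((d, i) => `Draft ${i + 1}:\n${d.choices[0].message.content ?? ""}`)
    .join("\n\n");
  const task = question ? `Question: ${question}\n\n` : "";
  return [
    {
      role: "system",
      content:
        "You are a verifier. Judge each draft for accuracy and grounding, " +
        "then reply with the text of the best draft, unchanged.",
    },
    { role: "user", content: `${task}${numbered}` },
  ];
}
```

Asking for the winning text keeps the calling code to a single return statement, at the price of the verifier re-emitting the answer; asking for scores or an index instead would shorten the verifier's output but needs a lookup step afterwards.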