Self-RAG Pattern
Self-RAG teaches a model to decide when it needs to retrieve, then critique its own grounded answer before returning. On Meridian you wire this with two cheap router calls around a single retrieval step, and the whole loop stays under a second on warm cache.
1. Decide whether to retrieve
Start with a draft completion from the cheapest model that still handles your domain. Pass the draft to a small judge model and ask a single yes/no question: does this answer need external sources? Most short conversational turns return no and exit early, which keeps your retrieval bill flat.
2. Retrieve and ground
When the judge says retrieval is needed, call meridian.search with a small topK and re-prompt the router model with the documents inlined. Keep the prompt tight: a header, the sources, then the question. The router picks a stronger model automatically when token count rises.
3. Critique and fall back
Self-RAG fails quietly without a critique step. Score the grounded answer for factual support on a 1-5 scale. If the score lands at 4 or higher, ship it. Otherwise return the original draft and log the miss to /api/meridian/audit so you can tune the retriever later.
Reference implementation
// Self-RAG loop pseudocode
async function selfRag(query: string) {
const draft = await llm.complete({
model: 'azure/model-router',
prompt: query,
});
const needsRetrieval = await llm.complete({
model: 'azure/gpt-4o-mini',
prompt: `Does this answer need sources? ${draft}`,
});
if (needsRetrieval.startsWith('yes')) {
const docs = await meridian.search(query, { topK: 5 });
const grounded = await llm.complete({
model: 'azure/model-router',
prompt: `Sources:\n${docs}\n\nQuestion: ${query}`,
});
const critique = await llm.complete({
model: 'azure/gpt-4o',
prompt: `Rate factual support 1-5: ${grounded}`,
});
return critique.score >= 4 ? grounded : draft;
}
return draft;
}