Grounding strategy for hallucinations
Hallucinations happen when a model has no anchor in verifiable context. This recipe walks through a three-stage grounding pipeline that reduces fabricated facts by routing every claim through retrieval, citation, and self-check before it reaches the user.
1. Retrieve before you reason
Treat the model as a writer, not an oracle. Pull the top-k chunks from your vector store, rerank them with a cross-encoder, and inject them into the prompt before any reasoning step fires. If retrieval returns nothing relevant, prefer a refusal over a guess.
- Embed the user query with the same embedding model used for the corpus
- Rerank the top 50 candidates down to the top 5 with bge-reranker
- Drop chunks below a similarity floor of 0.35
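The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Chunk`, `retrieve`, and the `rerank_score` callable are hypothetical names, the in-memory list stands in for a real vector store, and the sketch assumes the 0.35 floor applies to cosine similarity from the vector search (the floor could equally be applied to the reranker score).

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    chunk_id: str
    text: str
    vector: Sequence[float]  # embedding produced by the same model as the query


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    query_text: str,
    corpus: List[Chunk],
    rerank_score: Callable[[str, str], float],  # stand-in for a cross-encoder
    top_k: int = 50,
    keep: int = 5,
    floor: float = 0.35,
) -> List[Chunk]:
    # Vector search: score every chunk and keep the top_k most similar.
    scored = sorted(
        ((cosine(query_vec, c.vector), c) for c in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )[:top_k]
    # Assumption: the similarity floor is applied to the cosine score.
    candidates = [c for score, c in scored if score >= floor]
    # Rerank the survivors with the (hypothetical) cross-encoder scorer.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query_text, c.text), reverse=True
    )
    return reranked[:keep]
```

An empty result is the signal to refuse rather than guess, so the caller would check for `[]` before building the prompt.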
2. Force inline citations
Require the model to emit a citation token after every factual sentence. During post-processing, reject any response that contains an uncited claim. This converts hallucination from a soft style problem into a hard schema violation.
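One way the post-processing check might look is sketched below. It assumes citations follow the `[doc_id:chunk_id]` shape, that sentences can be split naively on terminal punctuation (abbreviations such as "e.g." would confuse it), and that the model refuses with a fixed string. `REFUSAL`, `uncited_sentences`, and `validate` are hypothetical names introduced for illustration.

```python
import re
from typing import List

CITATION = re.compile(r"\[[\w-]+:[\w-]+\]")
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")
# Hypothetical refusal string; a refusal carries no claims, so it needs no citation.
REFUSAL = "I cannot answer that from the provided documents."


def uncited_sentences(answer: str) -> List[str]:
    """Return every sentence that is not directly followed by a citation token."""
    if answer.strip() == REFUSAL:
        return []
    segments = CITATION.split(answer)  # text between consecutive citations
    uncited: List[str] = []
    for i, segment in enumerate(segments):
        sentences = [
            s for s in SENTENCE_BREAK.split(segment.strip()) if re.search(r"\w", s)
        ]
        is_last = i == len(segments) - 1
        # Each segment except the last ends at a citation, so only its final
        # sentence is covered; anything after the last citation is uncovered.
        uncited.extend(sentences if is_last else sentences[:-1])
    return uncited


def validate(answer: str) -> str:
    """Raise if any sentence lacks a citation; otherwise pass the answer through."""
    missing = uncited_sentences(answer)
    if missing:
        raise ValueError(f"uncited claims: {missing}")
    return answer
```

This sketch is deliberately strict: it treats every sentence as a factual claim, so a response with conversational filler would also be rejected.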
POST /v1/chat/completions
{
  "model": "azure/model-router",
  "messages": [
    {"role": "system", "content": "Cite every fact as [doc_id:chunk_id]. Refuse if unsupported."},
    {"role": "user", "content": "What changed in v2.4?"}
  ],
  "tools": [{"type": "retrieval", "namespace": "changelog"}]
}
3. Self-check with a verifier pass
Run a second, cheaper model that sees only the retrieved chunks and the draft answer. Ask it to flag any sentence whose support is missing or contradicted by those chunks. Strip flagged sentences before returning the answer. The verifier pass adds latency but cuts fabrication rates by 60 to 80 percent in our internal benchmarks.
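The stripping step could be sketched as below. The verifier is passed in as a callable so the sketch runs without a model; in practice it would wrap a call to the second model. `strip_unsupported` and `toy_verifier` are hypothetical names, and the toy verifier is only a substring check used to make the example runnable, not a substitute for a real verifier.

```python
import re
from typing import Callable, List, Sequence

# A verifier takes one draft sentence plus the retrieved chunks and
# returns True when the sentence is supported by them.
Verifier = Callable[[str, Sequence[str]], bool]


def strip_unsupported(draft: str, chunks: Sequence[str], is_supported: Verifier) -> str:
    """Keep only the draft sentences that the verifier accepts."""
    sentences: List[str] = [
        s for s in re.split(r"(?<=[.!?])\s+", draft.strip()) if s
    ]
    kept = [s for s in sentences if is_supported(s, chunks)]
    return " ".join(kept)


def toy_verifier(sentence: str, chunks: Sequence[str]) -> bool:
    # Stand-in for the model call: a sentence counts as supported only if
    # its text appears verbatim (case-insensitive) in some retrieved chunk.
    needle = sentence.rstrip(".!?").lower()
    return any(needle in chunk.lower() for chunk in chunks)


chunks = ["v2.4 raised the default timeout to 30s"]
draft = "v2.4 raised the default timeout to 30s. It also removed the cache."
print(strip_unsupported(draft, chunks, toy_verifier))
```

If every sentence is stripped the function returns an empty string, which the caller can treat the same way as an empty retrieval result and refuse.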