Recipe

RAG Pipeline Patterns

Production retrieval-augmented generation architectures — chunking strategies, embedding models, vector stores, and re-ranking.

Chunking Strategy

Use recursive character splitting with 512-token chunks and a 64-token overlap. Split on semantic boundaries (paragraph breaks, section headers) first, and fall back to fixed windows only when a section still exceeds the chunk size. Store chunk metadata (source doc, position index) alongside each vector.
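
A minimal sketch of this chunking approach, under a few assumptions: whitespace-separated words stand in for real tokenizer tokens, blank lines and newlines stand in for paragraph and header boundaries, and the names Chunk, split_recursive, and chunk_document are hypothetical helpers rather than any library's API.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    text: str
    source: str    # source document identifier
    position: int  # index of the chunk within its document


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def split_recursive(text: str, max_tokens: int,
                    separators: Sequence[str] = ("\n\n", "\n")) -> List[str]:
    """Split on the coarsest boundary first, recursing into oversized pieces."""
    if count_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces: List[str] = []
            for part in text.split(sep):
                pieces.extend(split_recursive(part, max_tokens, separators[i + 1:]))
            return pieces
    # No semantic boundary left: fall back to fixed windows.
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def chunk_document(text: str, source: str,
                   max_tokens: int = 512, overlap: int = 64) -> List[Chunk]:
    """Pack pieces into chunks of at most max_tokens, carrying an overlap tail forward."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    # Pieces are capped at max_tokens - overlap so a piece plus the tail always fits.
    pieces = split_recursive(text, max_tokens - overlap)
    chunks: List[Chunk] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(Chunk(" ".join(current), source, len(chunks)))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(Chunk(" ".join(current), source, len(chunks)))
    return chunks
```

Joining words with single spaces discards the original whitespace, which is acceptable for a sketch; a production splitter would likely preserve the source text and count tokens with the embedding model's own tokenizer.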

Embedding Model

Use text-embedding-3-small for cost efficiency; it returns 1536-dimensional vectors by default. Embed in batches of 100. Cache embeddings in your vector store and never re-embed unchanged documents.
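
One way to combine batching and caching is sketched below. It assumes a caller-supplied embed_batch function (hypothetical, wrapping whatever embedding client you use) and a dict-like cache keyed by a SHA-256 hash of the chunk text; in practice the cache would be backed by the vector store itself.

```python
import hashlib
from typing import Callable, Dict, List, MutableMapping, Sequence

Vector = List[float]


def content_key(text: str) -> str:
    # Identical text always hashes to the same key, so unchanged chunks hit the cache.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts: Sequence[str],
                     embed_batch: Callable[[List[str]], List[Vector]],
                     cache: MutableMapping[str, Vector],
                     batch_size: int = 100) -> List[Vector]:
    """Embed only uncached texts, in batches, and return vectors in input order."""
    keys = [content_key(t) for t in texts]

    # Collect each uncached text once, even if it appears several times.
    pending: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache:
            pending.setdefault(key, text)

    items = list(pending.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector

    return [cache[key] for key in keys]
```

Keying on a content hash rather than a document ID means an edited document only pays for the chunks whose text actually changed.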

Vector Store

Use Pinecone, or pgvector with an HNSW index, and rank by cosine similarity. Track document freshness in chunk metadata and filter on it at query time: stale chunks are soft-deleted, then re-indexed on the next sync cycle.
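
The freshness filter can be illustrated with a small in-memory stand-in for the vector store. This is only a sketch: the Record fields, the version-number notion of staleness, and the brute-force scan are assumptions for illustration, whereas a real store would apply the filter inside its index query.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    chunk_id: str
    source: str         # document the chunk came from
    version: int        # document version the chunk was built from
    vector: List[float]
    deleted: bool = False  # soft-delete flag; the row stays until re-indexing


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def soft_delete_stale(records: List[Record], latest: Dict[str, int]) -> int:
    """Flag chunks built from an outdated document version; return the count flagged."""
    flagged = 0
    for record in records:
        if not record.deleted and record.version < latest.get(record.source, record.version):
            record.deleted = True
            flagged += 1
    return flagged


def search(records: List[Record], query: List[float], k: int) -> List[Record]:
    """Rank live (not soft-deleted) chunks by cosine similarity to the query."""
    live = [r for r in records if not r.deleted]
    return sorted(live, key=lambda r: cosine(query, r.vector), reverse=True)[:k]
```

Soft-deleting rather than removing rows keeps queries correct immediately while leaving the sync job free to rebuild the affected documents later.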

Retrieval + Re-rank

Fetch the top 20 candidates from vector search, then re-rank them with a cross-encoder (bge-reranker-v2-m3). Keep the top 5 for the LLM context window. Because the cross-encoder scores each query-chunk pair jointly, this two-stage pipeline recovers relevant chunks that embedding similarity alone ranks too low.
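
The two-stage flow might look like the sketch below. Both vector_search and score_pairs are hypothetical callables supplied by the caller: the first would query the vector store, and the second would wrap a cross-encoder such as bge-reranker-v2-m3 and return one relevance score per (query, chunk) pair.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_and_rerank(
    query: str,
    vector_search: Callable[[str, int], List[str]],
    score_pairs: Callable[[Sequence[Tuple[str, str]]], List[float]],
    fetch_k: int = 20,
    keep_k: int = 5,
) -> List[str]:
    """Fetch fetch_k candidates by embedding similarity, then keep the keep_k best re-ranked."""
    candidates = vector_search(query, fetch_k)
    if not candidates:
        return []

    # Score every (query, chunk) pair in a single call so the model can batch them.
    scores = score_pairs([(query, chunk) for chunk in candidates])

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep_k]]
```

Passing the scorer in as a function keeps the pipeline testable with a toy scorer and makes it easy to swap re-ranking models later.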

Context Assembly

Format retrieved chunks with source attribution headers. Truncate the assembled context to the model's context limit minus 1024 tokens, which are reserved for the system prompt and user query. Inject a citation instruction so the LLM references sources inline.
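
A rough sketch of context assembly follows. The header format, the wording of CITATION_INSTRUCTION, and the word-count tokenizer are all assumptions for illustration, and this version truncates by dropping whole chunks that no longer fit rather than cutting a chunk mid-text.

```python
from typing import List, Tuple

# Hypothetical wording; adapt it to your prompt style.
CITATION_INSTRUCTION = "Cite sources inline using the bracketed numbers, e.g. [1]."


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def assemble_context(chunks: List[Tuple[str, str]],
                     context_limit: int,
                     reserved: int = 1024) -> str:
    """Build the context from (source, text) pairs, best-ranked first, within the budget."""
    # Reserve room for the system prompt and user query, plus the instruction itself.
    budget = context_limit - reserved - count_tokens(CITATION_INSTRUCTION)

    blocks: List[str] = []
    for number, (source, text) in enumerate(chunks, start=1):
        block = f"[{number}] Source: {source}\n{text}"
        cost = count_tokens(block)
        if cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        blocks.append(block)
        budget -= cost

    return "\n\n".join([CITATION_INSTRUCTION] + blocks)
```

Numbering each block gives the model a compact handle to cite, and the header keeps the mapping from number to source document visible in the prompt.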

Pro tip: Run an offline eval harness — generate a QA dataset from your docs, measure recall@5 and answer faithfulness. Tune chunk size and overlap against those metrics, not intuition.
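
The retrieval half of such a harness can be sketched as follows, assuming a QA dataset of (question, relevant chunk IDs) pairs and a hypothetical retrieve function that returns ranked chunk IDs. Answer faithfulness needs a separate judge and is not covered here.

```python
from typing import Callable, List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(dataset: List[Tuple[str, Set[str]]],
                     retrieve: Callable[[str], Sequence[str]],
                     k: int = 5) -> float:
    """Average recall@k over every (question, relevant_ids) pair in the dataset."""
    if not dataset:
        return 0.0
    scores = [recall_at_k(retrieve(question), relevant, k) for question, relevant in dataset]
    return sum(scores) / len(scores)
```

Re-running mean_recall_at_k after each change to chunk size or overlap gives a single number to compare configurations against.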