RAG Pipeline Patterns
Production retrieval-augmented generation architectures — chunking strategies, embedding models, vector stores, and re-ranking.
Chunking Strategy
Split recursively into 512-token chunks with a 64-token overlap. Split on semantic boundaries first (paragraph breaks, section headers) and fall back to fixed windows only when a section is still too long. Store chunk metadata (source doc, position index) alongside the vectors.
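A minimal sketch of this splitting scheme, under stated assumptions: whitespace-separated words stand in for model tokens, and the names `count_tokens`, `split_recursive`, and `chunk_document` are illustrative rather than taken from any library. Pieces are cut to `max_tokens - overlap` so that each chunk still fits the limit once the overlap is carried in.

```python
from typing import Dict, List, Sequence


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_recursive(text: str, max_tokens: int,
                    separators: Sequence[str] = ("\n\n", "\n", ". ")) -> List[str]:
    """Split on the coarsest boundary that applies; fall back to fixed windows."""
    if count_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    for i, sep in enumerate(separators):
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            pieces: List[str] = []
            for part in parts:
                # Oversized parts are retried with the finer separators only.
                pieces.extend(split_recursive(part, max_tokens, separators[i + 1:]))
            return pieces
    # No boundary left to split on: cut fixed-size windows.
    words = text.split()
    return [" ".join(words[j:j + max_tokens]) for j in range(0, len(words), max_tokens)]


def chunk_document(text: str, source: str, max_tokens: int = 512,
                   overlap: int = 64) -> List[Dict]:
    """Pack pieces into chunks of at most max_tokens, carrying an overlap tail."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    pieces = split_recursive(text, max_tokens - overlap)
    windows: List[List[str]] = []
    current: List[str] = []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            windows.append(current)
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else []
        current = current + words
    if current:
        windows.append(current)
    # Metadata stored with each chunk: source doc and position index.
    return [{"text": " ".join(w), "source": source, "position": i}
            for i, w in enumerate(windows)]
```

Calling `chunk_document(doc_text, source="handbook.md")` returns a list of dicts ready to embed. This sketch joins pieces with single spaces, so the original paragraph breaks and the separators themselves are not preserved in the chunk text.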
Embedding Model
Use text-embedding-3-small for cost efficiency; it returns 1536-dimensional vectors by default. Embed in batches of 100. Cache embeddings in your vector store, for example keyed by a hash of the chunk text, so unchanged documents are never re-embedded.
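One way to combine batching and caching, sketched under assumptions: `embed_batch` is a hypothetical callable that wraps whatever embedding API you use and returns one vector per input text, and a plain dict stands in for the lookup your vector store would provide. Neither name comes from a real client library.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def content_key(text: str) -> str:
    # A content hash changes exactly when the chunk text changes.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(texts: Sequence[str],
                     embed_batch: Callable[[List[str]], List[Vector]],
                     cache: Dict[str, Vector],
                     batch_size: int = 100) -> List[Vector]:
    """Embed only texts missing from the cache, in groups of batch_size."""
    keys = [content_key(t) for t in texts]
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text  # duplicates within one call are embedded once
    items = list(missing.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        vectors = embed_batch([text for _, text in batch])
        for (key, _), vector in zip(batch, vectors):
            cache[key] = vector
    # Return vectors in the same order as the input texts.
    return [cache[key] for key in keys]
```

On a re-sync, passing the same `cache` means only new or edited chunks reach `embed_batch`; everything else is served from the stored vectors.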
Vector Store
Use Pinecone (managed indexing) or pgvector with an HNSW index. Use cosine similarity. Filter on freshness metadata at query time: stale chunks are soft-deleted, then re-indexed on the next sync cycle.
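To make the scoring and the freshness filter concrete, here is a small in-memory sketch. It is a brute-force stand-in, not how Pinecone or pgvector work internally: a real store answers the same query through its approximate index. The record fields (`deleted`, `doc_version`) and function names are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def soft_delete_stale(records: List[Dict], latest_versions: Dict[str, int]) -> int:
    """Flag chunks whose source document has a newer version; return the count."""
    flagged = 0
    for record in records:
        latest = latest_versions.get(record["source"], record["doc_version"])
        if not record["deleted"] and record["doc_version"] < latest:
            record["deleted"] = True  # soft delete: the row stays until re-indexed
            flagged += 1
    return flagged


def search(query_vector: Sequence[float], records: List[Dict],
           top_k: int = 20) -> List[Dict]:
    """Rank live (not soft-deleted) records by cosine similarity to the query."""
    live = [r for r in records if not r["deleted"]]
    ranked = sorted(live,
                    key=lambda r: cosine_similarity(query_vector, r["vector"]),
                    reverse=True)
    return ranked[:top_k]
```

Keeping the soft-deleted rows around until the sync job replaces them means a failed re-index can be rolled back by clearing the flag.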
Retrieval + Re-rank
Fetch the top 20 candidates from vector search, then re-rank them with a cross-encoder (bge-reranker-v2-m3) and keep the top 5 for the LLM context window. The cross-encoder scores each query-chunk pair jointly, so this two-stage pipeline corrects ranking mistakes that pure embedding similarity makes within the candidate set.
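The two stages can be sketched as a single function with both models passed in as callables. This is an outline under assumptions: `vector_search` and `rerank_scores` are hypothetical hooks, and in a real pipeline the second one would wrap a cross-encoder such as bge-reranker-v2-m3 scoring (query, chunk) pairs in one batch.

```python
from typing import Callable, List, Sequence


def retrieve_and_rerank(query: str,
                        vector_search: Callable[[str, int], List[str]],
                        rerank_scores: Callable[[str, Sequence[str]], List[float]],
                        fetch_k: int = 20,
                        keep_k: int = 5) -> List[str]:
    """Stage 1: cheap, wide recall. Stage 2: expensive, precise re-ordering."""
    candidates = vector_search(query, fetch_k)
    if not candidates:
        return []
    scores = rerank_scores(query, candidates)
    if len(scores) != len(candidates):
        raise ValueError("re-ranker must return one score per candidate")
    # Sort by score only; ties keep their vector-search order (stable sort).
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep_k]]
```

Re-ranking can only reorder what the first stage returned, so `fetch_k` bounds what the pipeline can recover: a relevant chunk outside the top 20 never reaches the cross-encoder.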
Context Assembly
Format retrieved chunks with source attribution headers. Truncate the assembled context to the model's context limit minus 1024 tokens, which are reserved for the system prompt and user query. Inject a citation instruction so the LLM references sources inline.
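A rough sketch of budgeted assembly, with assumptions called out: token counts are approximated by whitespace-separated words (swap in your model's tokenizer), the header format and the wording of `CITATION_INSTRUCTION` are illustrative, and chunks that do not fit are dropped whole rather than cut mid-text.

```python
from typing import Callable, Dict, List

# Illustrative wording; adapt to your prompt style.
CITATION_INSTRUCTION = "Cite sources inline using the bracketed numbers, e.g. [1]."


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def assemble_context(chunks: List[Dict], context_limit: int, reserved: int = 1024,
                     count_tokens: Callable[[str], int] = approx_tokens) -> str:
    """Build the context block from ranked chunks without exceeding the budget."""
    budget = context_limit - reserved
    used = count_tokens(CITATION_INSTRUCTION)
    if budget < used:
        raise ValueError("context_limit leaves no room for retrieved context")
    parts = [CITATION_INSTRUCTION]
    for number, chunk in enumerate(chunks, start=1):
        # Source attribution header followed by the chunk text.
        block = (f"[{number}] Source: {chunk['source']} "
                 f"(chunk {chunk['position']})\n{chunk['text']}")
        cost = count_tokens(block)
        if used + cost > budget:
            break  # chunks are ranked, so stop at the first one that does not fit
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

Stopping at the first chunk that overflows keeps the highest-ranked material intact and keeps the citation numbers contiguous.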
Pro tip: Run an offline eval harness — generate a QA dataset from your docs, measure recall@5 and answer faithfulness. Tune chunk size and overlap against those metrics, not intuition.
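The retrieval half of such a harness is small enough to sketch. Assumptions: each dataset example is a dict with a `question` and the `relevant_ids` of the chunks that answer it, and `retrieve` is a hypothetical callable returning ranked chunk ids. Answer faithfulness is not covered here, since it needs a judge model or human labels.

```python
from typing import Callable, Dict, List, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str],
                k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(dataset: List[Dict],
                     retrieve: Callable[[str], List[str]],
                     k: int = 5) -> float:
    """Average recall@k over a QA dataset."""
    if not dataset:
        return 0.0
    scores = [recall_at_k(retrieve(example["question"]), example["relevant_ids"], k)
              for example in dataset]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` once per chunk size and overlap setting gives a direct comparison between configurations on the same questions.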