RAG (retrieval-augmented generation) recipe
Build a production RAG pipeline in five steps. Chunk your documents, embed them into vector space, store the vectors, retrieve the top-k matches at query time, and feed them into an LLM prompt. No magic — just Python and a vector database.
Chunk your text
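Split documents into overlapping chunks. Overlap preserves context that would otherwise be cut off at chunk boundaries. Choose the chunk size based on your embedding model's token limit; 512 tokens is a safe default for most models. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so the 512 in the code below is a character count unless you supply a token-based length function.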
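To make the overlap idea concrete, here is a minimal sliding-window sketch in plain Python. It is an illustration only, not how RecursiveCharacterTextSplitter works internally (that class splits on separators); the function name chunk_with_overlap and the use of whitespace-separated words as stand-in tokens are assumptions made for the example.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace-separated words stand in for real tokens here (an assumption).
words = "one two three four five six seven eight nine ten".split()
demo_chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
```

Each window starts chunk_size - overlap tokens after the previous one, so the last `overlap` tokens of one chunk reappear at the start of the next.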
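# Newer LangChain releases also expose this class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# `documents` is assumed to be a list of LangChain Document objects loaded earlier.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=64,    # characters shared between neighboring chunks
    separators=["\n\n", "\n", " "]  # try paragraph, then line, then word breaks
)
chunks = splitter.split_documents(documents)
Embed each chunk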
Convert each chunk into a dense vector using an embedding model. OpenAI's text-embedding-3-small is inexpensive and works well for general-purpose retrieval. For offline use, reach for a sentence-transformers model from Hugging Face.
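With many chunks, a single embedding request can exceed the provider's per-request limits, so it is common to embed in batches. The sketch below shows one way to do that; the helper names batched and embed_in_batches, the batch size of 100, and the fake_embed stand-in are all assumptions for illustration, not part of any library.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # hypothetical default; tune to your provider's limits
) -> List[List[float]]:
    """Embed texts batch by batch, keeping the output aligned with the input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors than texts")
        vectors.extend(batch_vectors)
    return vectors


# Stand-in embedder for demonstration only: maps each text to a one-element vector.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

In a real pipeline, embed_batch would wrap the actual embedding call, for example a small function that passes the batch as input to client.embeddings.create and returns the embeddings from response.data.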
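from openai import OpenAI
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk.page_content for chunk in chunks]
)
# Sort by index so that vectors[i] lines up with chunks[i]
vectors = [d.embedding for d in sorted(response.data, key=lambda d: d.index)]
Store vectors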
Persist vectors in a vector database. Pinecone is managed and fast. For local dev, Chroma or Qdrant in Docker work great. Store metadata alongside each vector so you can surface source links.
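One design choice worth considering when storing metadata is a deterministic ID per chunk, so that re-ingesting the same document overwrites existing records instead of duplicating them. The sketch below is an assumption-laden illustration: make_record, the `position` metadata field, and the docs/chunking.md path are hypothetical, and the id/values/metadata dictionary shape mirrors the record format that stores such as Pinecone accept.

```python
import hashlib


def make_record(text, source, position, vector):
    """Build one upsert record whose ID depends only on source and position."""
    # Hash the source and position so the same chunk slot always maps to the same ID.
    key = f"{source}:{position}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:16]
    return {
        "id": f"chunk-{digest}",
        "values": vector,
        # Keep the text and source with the vector so answers can link back to them.
        "metadata": {"text": text, "source": source, "position": position},
    }


# Hypothetical example values, for illustration only.
record = make_record("Overlap preserves context.", "docs/chunking.md", 0, [0.1, 0.2, 0.3])
```

Because the ID ignores the chunk text, editing a document and re-running ingestion replaces the record at the same position rather than leaving a stale copy behind.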
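import pinecone
pc = pinecone.Pinecone(api_key="...")
# The index must already exist, with a dimension matching the embedding model
# (1536 for text-embedding-3-small by default).
index = pc.Index("meridian-docs")
# Upsert vectors with metadata; for large corpora, send the records in batches
index.upsert(vectors=[
    (
        f"chunk-{i}",
        vectors[i],
        {"text": chunks[i].page_content, "source": chunks[i].metadata["source"]}
    )
    for i in range(len(vectors))
])
Retrieve top-k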
At query time, embed the user's question with the same model and run a similarity search. The metric is fixed when the index is created; cosine similarity is the default. Return the top 3–5 chunks: enough context without drowning the prompt.
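For intuition about what the similarity search computes, here is a brute-force version in plain Python. It is a sketch for small toy data only; real vector databases use approximate indexes rather than scanning every vector, and the names cosine_similarity, top_k, and toy_index are assumptions for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors, for illustration only.
toy_index = {
    "chunk-0": [1.0, 0.0],
    "chunk-1": [0.0, 1.0],
    "chunk-2": [1.0, 1.0],
}
best = top_k([1.0, 0.0], toy_index, k=2)
```

Here chunk-0 points in the same direction as the query and ranks first, chunk-2 is at 45 degrees and ranks second, and chunk-1 is orthogonal and is left out.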
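# `user_question` is assumed to hold the user's query string.
query_vec = client.embeddings.create(
    model="text-embedding-3-small",  # must match the model used for the chunks
    input=[user_question]
).data[0].embedding
results = index.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True  # return the stored text and source with each match
)
context = "\n\n".join(
    [m["metadata"]["text"] for m in results["matches"]]
)
Build the prompt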
Inject the retrieved context into a system prompt. Tell the model to answer only from the provided sources. This grounds the response in the retrieved text and reduces hallucination, though it does not eliminate it.
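Since each stored vector carries a source in its metadata, the context can label every chunk so the model (and the reader) can trace an answer back to where it came from. The following is a minimal sketch under assumptions: build_context and the max_chars budget are hypothetical, and the match shape (a "metadata" dictionary with "text" and "source") follows the records used in this recipe.

```python
def build_context(matches, max_chars=6000):
    """Join retrieved chunks into one context string with numbered source labels."""
    blocks = []
    used = 0
    for number, match in enumerate(matches, start=1):
        meta = match["metadata"]
        block = f"[{number}] (source: {meta['source']})\n{meta['text']}"
        if used + len(block) > max_chars:
            break  # stop before the context outgrows the budget
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)


# Hand-written matches standing in for a real query result.
demo_matches = [
    {"metadata": {"text": "Overlap preserves context.", "source": "a.md"}},
    {"metadata": {"text": "Cosine similarity is the default.", "source": "b.md"}},
]
demo_context = build_context(demo_matches)
```

The character budget is a rough guard against oversized prompts; a token-based budget would be more precise but needs a tokenizer.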
system_prompt = f"""
You are a helpful assistant. Answer the user's question
using ONLY the context below. If the answer is not in the
context, say 'I don't have enough information.'
Context:
{context}
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question}
    ]
)
answer = response.choices[0].message.content