RAG (retrieval-augmented generation) recipe

Build a production RAG pipeline in five steps: chunk your documents, embed the chunks as vectors, store the vectors, retrieve the top-k matches at query time, and feed them into an LLM prompt. No magic, just Python and a vector database.

Chunk your text

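Split documents into overlapping chunks. Overlap preserves context that would otherwise be cut off at chunk boundaries. Choose a chunk size that fits within your embedding model's input limit; 512 is a safe starting point for most models. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, not tokens, so the value below means 512 characters.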

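from langchain.text_splitter import RecursiveCharacterTextSplitter

# `documents` is assumed to be a list of LangChain Document objects loaded earlier.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters by default, not tokens
    chunk_overlap=64,   # characters shared between neighbouring chunks
    # Try paragraphs first, then lines, then words, then single characters
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(documents)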
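
To make the overlap idea concrete, here is a minimal sketch of a fixed-size sliding window in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class also respects separators); the function name `chunk_with_overlap` and the sample string are assumptions made for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Illustrative input: each chunk repeats the last two characters of the previous one.
pieces = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The window advances by `chunk_size - overlap`, so a larger overlap means more chunks and more stored vectors for the same text.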

Embed each chunk

Convert each chunk into a dense vector using an embedding model. OpenAI's text-embedding-3-small is a low-cost, general-purpose choice. For offline use, reach for a sentence-transformers model from Hugging Face.

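from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk.page_content for chunk in chunks],
)
# Collect one embedding per input chunk
vectors = [d.embedding for d in response.data]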
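
Sending every chunk in one request may run into provider limits on request size once the corpus grows. The sketch below shows one way to embed in batches. The helper `embed_in_batches`, the stand-in `fake_embed`, and the batch size of 100 are all assumptions for illustration, not part of any library.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 100,  # hypothetical default; tune to your provider's limits
) -> List[Vector]:
    """Call `embed_batch` on slices of `texts` and concatenate the results in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors than inputs")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch: Sequence[str]) -> List[Vector]:
    """Stand-in embedder for the demo: a 1-dimensional vector holding the text length."""
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(demo_vectors)  # [[1.0], [2.0], [3.0]]
```

To use it with the client above, pass a function that wraps `client.embeddings.create(model="text-embedding-3-small", input=list(batch))` and returns `[d.embedding for d in response.data]`.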

Store vectors

Persist vectors in a vector database. Pinecone is a managed service. For local development, Chroma or Qdrant (run in Docker) work well. Store metadata alongside each vector so you can surface source links in answers.

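from pinecone import Pinecone

pc = Pinecone(api_key="...")
# The index must already exist, with a dimension matching the embedding model
index = pc.Index("meridian-docs")

# Upsert (id, vector, metadata) tuples; metadata carries the chunk text and its source
index.upsert(vectors=[
    (
        f"chunk-{i}",
        vector,
        {"text": chunk.page_content, "source": chunk.metadata["source"]},
    )
    for i, (chunk, vector) in enumerate(zip(chunks, vectors))
])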
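
Before upserting, it can help to assemble the records in one place and check that texts, sources, and vectors line up. The sketch below builds records in the dictionary form (`id`, `values`, `metadata`) that Pinecone's upsert also accepts. The helper name `build_records` and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def build_records(
    texts: Sequence[str],
    sources: Sequence[str],
    vectors: Sequence[List[float]],
    prefix: str = "chunk",
) -> List[Dict]:
    """Pair each vector with its text and source, using a stable, index-based id."""
    if not (len(texts) == len(sources) == len(vectors)):
        raise ValueError("texts, sources, and vectors must have the same length")

    return [
        {
            "id": f"{prefix}-{i}",
            "values": vector,
            "metadata": {"text": text, "source": source},
        }
        for i, (text, source, vector) in enumerate(zip(texts, sources, vectors))
    ]


# Made-up sample data, only to show the record shape.
records = build_records(
    texts=["first chunk", "second chunk"],
    sources=["a.md", "b.md"],
    vectors=[[0.1, 0.2], [0.3, 0.4]],
)
print(records[1]["id"])  # chunk-1
```

For a large corpus, the resulting list can be passed to `index.upsert(vectors=...)` in slices rather than in a single call.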

Retrieve top-k

At query time, embed the user's question with the same model you used for the chunks and run a similarity search. The metric is configured on the index; cosine similarity is the usual default. Return the top 3 to 5 chunks: enough context without drowning the prompt.

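# `user_question` is assumed to hold the raw question string from the user
query_vec = client.embeddings.create(
    model="text-embedding-3-small",  # must match the model used for the chunks
    input=[user_question]
).data[0].embedding

results = index.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True  # needed to get the stored chunk text back
)

# Join the retrieved chunk texts into a single context string
context = "\n\n".join(
    m["metadata"]["text"] for m in results["matches"]
)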
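
The vector database performs the similarity search for you, but the idea behind it is small enough to sketch. The brute-force version below scores every stored vector against the query with cosine similarity and keeps the best k. The function names and the toy 2-dimensional vectors are assumptions for illustration; a real index uses approximate search to avoid scanning everything.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]


# Toy vectors: index 0 points the same way as the query, index 1 is orthogonal to it.
stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.0], stored, k=2)
print([i for i, _ in best])  # [0, 2]
```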

Build the prompt

Inject the retrieved context into a system prompt. Tell the model to answer only from the provided sources. This grounds the response in the retrieved text and reduces, though does not eliminate, hallucination.

system_prompt = f"""
You are a helpful assistant. Answer the user's question
using ONLY the context below. If the answer is not in the
context, say 'I don't have enough information.'

Context:
{context}
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question}
    ]
)

answer = response.choices[0].message.content
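
Since each stored chunk carries a `source` in its metadata, one possible variation is to number the retrieved chunks so the model can cite them. The sketch below assumes matches shaped like the query results above (a `metadata` dict with `text` and `source`). The helper name `build_system_prompt`, the citation instruction, and the sample matches are assumptions for illustration.

```python
from typing import Dict, List


def build_system_prompt(matches: List[Dict]) -> str:
    """Format retrieved chunks as numbered sources inside a grounding prompt."""
    sources = []
    for number, match in enumerate(matches, start=1):
        meta = match["metadata"]
        sources.append(f"[{number}] {meta['source']}\n{meta['text']}")
    context = "\n\n".join(sources)

    return (
        "You are a helpful assistant. Answer the user's question\n"
        "using ONLY the numbered sources below and cite them like [1].\n"
        "If the answer is not in the sources, say "
        "'I don't have enough information.'\n\n"
        f"Sources:\n{context}\n"
    )


# Made-up matches, only to show the resulting layout.
demo_matches = [
    {"metadata": {"text": "Chunks overlap by 64 characters.", "source": "chunking.md"}},
    {"metadata": {"text": "Queries return the top 5 matches.", "source": "retrieval.md"}},
]
demo_prompt = build_system_prompt(demo_matches)
print(demo_prompt)
```

The returned string can be used as the `system` message in the chat completion call above, in place of `system_prompt`.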
