RAG with Postgres + pgvector
Build a retrieval-augmented generation pipeline entirely inside PostgreSQL. No Pinecone, no Weaviate — just the database you already run. Embed, index, and search vectors with cosine similarity, then feed the top-k chunks into your LLM.
1. Enable the Extension
Activate pgvector and trigram matching in your Postgres instance. Requires superuser or a cloud provider that bundles the extension (Supabase, Neon, Tembo).
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;2. Create the Documents Table
1536 matches OpenAI text-embedding-ada-002. Swap to 768 for text-embedding-3-small or 3072 for text-embedding-3-large. The IVFFlat index trades a tiny bit of recall for orders-of-magnitude speed on million-row tables.
CREATE TABLE documents (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
content TEXT NOT NULL,
metadata JSONB DEFAULT '{}',
embedding VECTOR(1536) NOT NULL
);
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);3. Embed & Insert
Chunk your source text into ~512-token segments with 10% overlap. Store each chunk as a row. The JSONB metadata column can hold source URL, page number, or timestamp for later filtering.
import { OpenAI } from 'openai';
const openai = new OpenAI();
async function ingest(raw: string) {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: raw,
});
await sql`
INSERT INTO documents (content, embedding)
VALUES (${raw}, ${JSON.stringify(data[0].embedding)}::vector)
`;
}4. Cosine Similarity Search
The <=> operator computes cosine distance. Subtract from 1 to get similarity (0–1). Typical threshold is 0.78; discard results below it. Combine with a metadata WHERE clause to scope searches to a single knowledge base.
async function search(query: string, k = 5) {
const { data } = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
});
const results = await sql`
SELECT
id,
content,
metadata,
1 - (embedding <=> ${JSON.stringify(data[0].embedding)}::vector) AS similarity
FROM documents
ORDER BY embedding <=> ${JSON.stringify(data[0].embedding)}::vector
LIMIT ${k}
`;
return results;
}5. Assemble the RAG Prompt
Feed the assembled prompt to GPT-4o or Claude. Keep the system message strict — no hallucination. For production, add a token counter and truncate context to fit the model window minus 1024 tokens for the response.
function buildPrompt(query: string, docs: { content: string }[]) {
const context = docs.map((d, i) => `[doc ${i + 1}] ${d.content}`).join('\n\n');
return `You are a helpful assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't know."
Context:
${context}
Question: ${query}`;
}Production Tuning
Index Selection
IVFFlat is fast to build and works well up to ~2M rows. For larger tables, switch to HNSW (USING hnsw) — slower builds but sub-millisecond recall at 10M+ rows. Set ef_search to 100+ for production accuracy.
Chunking Strategy
Use recursive text splitters with a 512-token chunk size and 64-token overlap. For codebases, split on function boundaries. Store parent document IDs so you can retrieve surrounding context when a chunk scores high.
Connection Pooling
Embedding calls are I/O-bound. Use PgBouncer in transaction mode and set a pool size of 20–40. Batch inserts in groups of 50 rows per statement to reduce round-trips.
Hybrid Search
Combine vector similarity with full-text search for keyword-heavy queries. Add a tsvector column and rank results with a weighted sum: 0.7 * vec_score + 0.3 * ts_rank. Reciprocal rank fusion works best.