RAG with Postgres + pgvector

Build a retrieval-augmented generation pipeline entirely inside PostgreSQL. No Pinecone, no Weaviate — just the database you already run. Embed, index, and search vectors with cosine similarity, then feed the top-k chunks into your LLM.

pgvector 0.7+OpenAI embeddingsIVFFlat indexCosine distance

1. Enable the Extension

Activate pgvector and trigram matching in your Postgres instance. Requires superuser or a cloud provider that bundles the extension (Supabase, Neon, Tembo).

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

2. Create the Documents Table

1536 matches OpenAI text-embedding-ada-002. Swap to 768 for text-embedding-3-small or 3072 for text-embedding-3-large. The IVFFlat index trades a tiny bit of recall for orders-of-magnitude speed on million-row tables.

CREATE TABLE documents (
  id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  content   TEXT NOT NULL,
  metadata  JSONB DEFAULT '{}',
  embedding VECTOR(1536) NOT NULL
);

CREATE INDEX ON documents
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

3. Embed & Insert

Chunk your source text into ~512-token segments with 10% overlap. Store each chunk as a row. The JSONB metadata column can hold source URL, page number, or timestamp for later filtering.

import { OpenAI } from 'openai';

const openai = new OpenAI();

async function ingest(raw: string) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: raw,
  });

  await sql`
    INSERT INTO documents (content, embedding)
    VALUES (${raw}, ${JSON.stringify(data[0].embedding)}::vector)
  `;
}

4. Cosine Similarity Search

The <=> operator computes cosine distance. Subtract from 1 to get similarity (0–1). Typical threshold is 0.78; discard results below it. Combine with a metadata WHERE clause to scope searches to a single knowledge base.

async function search(query: string, k = 5) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  });

  const results = await sql`
    SELECT
      id,
      content,
      metadata,
      1 - (embedding <=> ${JSON.stringify(data[0].embedding)}::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> ${JSON.stringify(data[0].embedding)}::vector
    LIMIT ${k}
  `;

  return results;
}

5. Assemble the RAG Prompt

Feed the assembled prompt to GPT-4o or Claude. Keep the system message strict — no hallucination. For production, add a token counter and truncate context to fit the model window minus 1024 tokens for the response.

function buildPrompt(query: string, docs: { content: string }[]) {
  const context = docs.map((d, i) => `[doc ${i + 1}] ${d.content}`).join('\n\n');

  return `You are a helpful assistant. Answer using ONLY the context below.
If the answer is not in the context, say "I don't know."

Context:
${context}

Question: ${query}`;
}

Production Tuning

Index Selection

IVFFlat is fast to build and works well up to ~2M rows. For larger tables, switch to HNSW (USING hnsw) — slower builds but sub-millisecond recall at 10M+ rows. Set ef_search to 100+ for production accuracy.

Chunking Strategy

Use recursive text splitters with a 512-token chunk size and 64-token overlap. For codebases, split on function boundaries. Store parent document IDs so you can retrieve surrounding context when a chunk scores high.

Connection Pooling

Embedding calls are I/O-bound. Use PgBouncer in transaction mode and set a pool size of 20–40. Batch inserts in groups of 50 rows per statement to reduce round-trips.

Hybrid Search

Combine vector similarity with full-text search for keyword-heavy queries. Add a tsvector column and rank results with a weighted sum: 0.7 * vec_score + 0.3 * ts_rank. Reciprocal rank fusion works best.