Recipe

Record deduplication via embeddings

Detect near-duplicate rows in large datasets by comparing vector embeddings instead of exact string matching.

Problem

CRM imports, log ingestion, and user signups often produce records that differ by whitespace, casing, or typos. Exact deduplication misses these.

Approach

  1. Generate an embedding vector for each record.
  2. Index vectors with an approximate nearest-neighbor store.
  3. Query each vector against the index; flag pairs whose cosine similarity meets or exceeds the threshold.
  4. Merge or discard flagged duplicates.
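
The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's implementation: it assumes the embeddings are already computed (step 1), swaps the ANN index for a brute-force pairwise scan so that it runs without a database, and groups flagged pairs with union-find so each group can be merged in step 4. The names find_duplicate_pairs, cluster_pairs, and toy_embeddings are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(embeddings, threshold=0.92):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    Brute force is O(n^2); it stands in for the ANN lookup in step 2-3.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs


def cluster_pairs(pairs):
    """Group flagged pairs into duplicate clusters using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for id_a, id_b, _ in pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Toy 3-dimensional vectors for illustration only; real embeddings are much larger.
toy_embeddings = {
    "r1": [1.0, 0.0, 0.0],
    "r2": [0.9, 0.1, 0.0],
    "r3": [0.0, 1.0, 0.0],
}
```

Calling cluster_pairs(find_duplicate_pairs(toy_embeddings)) groups r1 with r2 and leaves r3 alone. Which record in a cluster survives the merge is a policy decision the sketch does not make.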

Stack

  text-embedding-3-small   OpenAI embeddings
  pgvector                 Postgres ANN index
  cosine_distance          Similarity metric
  0.92                     Threshold
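
One way these pieces might fit together is sketched below. pgvector's <=> operator returns cosine distance (1 minus cosine similarity), so a similarity threshold of 0.92 corresponds to a distance of at most 0.08. The table and column names (records, id, body, embedding), the index name, the lists = 100 setting, and the helper functions are all assumptions made for illustration; the SQL is only defined here, not executed.

```python
# Sketch only: schema names and index parameters below are assumptions.
EMBEDDING_DIM = 1536  # default output size of text-embedding-3-small

SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS records (
    id bigserial PRIMARY KEY,
    body text NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
CREATE INDEX IF NOT EXISTS records_embedding_idx
    ON records USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
"""

# Nearest neighbors of one record by cosine distance, excluding the record itself.
# Placeholders use the %(name)s style accepted by psycopg.
NEIGHBOR_SQL = """
SELECT id, embedding <=> %(query)s::vector AS distance
FROM records
WHERE id <> %(id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def similarity_to_distance(similarity):
    """Convert a cosine-similarity threshold to a cosine-distance threshold."""
    return 1.0 - similarity


def to_vector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def neighbor_params(record_id, vector, k=10):
    """Build the parameter dict for NEIGHBOR_SQL."""
    return {"id": record_id, "query": to_vector_literal(vector), "k": k}


def is_duplicate(distance, similarity_threshold=0.92):
    """True when a returned cosine distance falls within the similarity threshold."""
    return distance <= similarity_to_distance(similarity_threshold)
```

With a Postgres driver such as psycopg, NEIGHBOR_SQL and neighbor_params(...) would be passed to cursor.execute, and each returned distance checked with is_duplicate.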

Trade-offs

  • Embedding cost scales linearly with record count.
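  • ANN indexes trade recall for speed; in pgvector, raise ivfflat.probes (or hnsw.ef_search for HNSW indexes) to recover recall at the cost of latency.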
  • Threshold requires calibration per domain.
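
Threshold calibration can be approached by sweeping candidate values over a hand-labeled sample of pairs and comparing precision and recall. The sketch below assumes such a sample exists as (similarity, is_duplicate) tuples; the sample values, the candidate list, and the choice of F1 as the selection criterion are illustrative assumptions, not measurements.

```python
def precision_recall(labeled_pairs, threshold):
    """Precision and recall when pairs with similarity >= threshold are flagged."""
    tp = sum(1 for sim, dup in labeled_pairs if sim >= threshold and dup)
    fp = sum(1 for sim, dup in labeled_pairs if sim >= threshold and not dup)
    fn = sum(1 for sim, dup in labeled_pairs if sim < threshold and dup)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def best_threshold(labeled_pairs, candidates):
    """Pick the candidate threshold with the highest F1 score."""
    def f1(threshold):
        p, r = precision_recall(labeled_pairs, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0

    return max(candidates, key=f1)


# Made-up labeled sample: (cosine similarity, reviewer says "duplicate").
labeled_sample = [
    (0.97, True),
    (0.94, True),
    (0.93, False),
    (0.90, True),
    (0.85, False),
    (0.80, False),
]
candidate_thresholds = [0.85, 0.90, 0.92, 0.95]
```

If wrongly merging distinct records is costlier than missing a duplicate, selecting on precision at a minimum acceptable recall may suit the domain better than F1.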

A full implementation with batch embedding and incremental indexing is available in the Meridian pipeline reference.