Record deduplication via embeddings
Detect near-duplicate rows in large datasets by comparing vector embeddings instead of exact string matching.
Problem
CRM imports, log ingestion, and user signups often produce records that differ by whitespace, casing, or typos. Exact deduplication misses these.
Approach
- Generate an embedding vector for each record.
- Index vectors with an approximate nearest-neighbor store.
- Query each vector against the index; flag pairs below a cosine-similarity threshold.
- Merge or discard flagged duplicates.
Stack
text-embedding-3-small
OpenAI embeddings
pgvector
Postgres ANN index
cosine_distance
Similarity metric
0.92
Threshold
Trade-offs
- Embedding cost scales linearly with record count.
- ANN indices trade recall for speed; tune probes.
- Threshold requires calibration per domain.
Full implementation with batch embedding and incremental indexing available in the Meridian pipeline reference.