Recipe

Topic clustering with embeddings

Group unstructured text into coherent topics using embedding vectors and density-based clustering.

Overview

This recipe walks through converting a corpus of documents into embedding vectors, reducing dimensionality, and applying HDBSCAN to discover latent topic clusters without predefining the number of topics.

Ingredients

Raw text corpus (support tickets, reviews, transcripts)
Embedding model (for example, text-embedding-3-small or a locally hosted model)
UMAP for dimensionality reduction
HDBSCAN for density-based clustering
c-TF-IDF for topic label extraction

Steps

Embed documents. Pass each document through your embedding model to produce a fixed-length vector.
Reduce dimensions. Use UMAP to project high-dimensional embeddings into a lower-dimensional space while preserving local structure.
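Cluster. Run HDBSCAN on the reduced vectors. Tune min_cluster_size (the smallest grouping treated as a cluster) and min_samples (how conservative the clustering is; higher values mark more points as noise) to control granularity.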
Label topics. Apply c-TF-IDF within each cluster to extract the most representative terms as topic labels.
Validate. Spot-check cluster coherence by reading a sample of documents from each cluster, and adjust parameters iteratively.
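
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the umap-learn and hdbscan Python packages, takes precomputed embeddings as input (step 1 depends on your model), and implements c-TF-IDF in one common formulation, the term's frequency within a cluster times log(1 + A / f_t), where A is the average word count per cluster and f_t is the term's count across all clusters. The function names, the simple regex tokenizer, and every parameter value are illustrative choices rather than recommendations.

```python
import math
import re
from collections import Counter, defaultdict


def reduce_and_cluster(embeddings, n_neighbors=15, n_components=5,
                       min_cluster_size=10, min_samples=5, seed=42):
    """Steps 2-3: project embeddings with UMAP, then cluster with HDBSCAN.

    `embeddings` is an array-like of shape (n_docs, dim) from step 1.
    Returns one integer label per document; -1 marks noise.
    All defaults are assumed starting points; tune them for your corpus.
    """
    # Imported lazily so the labeling helper below runs without these packages.
    import umap      # pip install umap-learn
    import hdbscan   # pip install hdbscan

    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=n_components,
                        min_dist=0.0, metric="cosine", random_state=seed)
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                min_samples=min_samples, metric="euclidean")
    return clusterer.fit_predict(reduced)


def tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs, no stop-word removal."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ctfidf_labels(docs, labels, top_n=3):
    """Step 4: rank terms per cluster with a class-based TF-IDF score."""
    # Merge each cluster's documents into one bag of words; skip noise (-1).
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label != -1:
            term_counts[label].update(tokenize(doc))
    if not term_counts:
        return {}

    # Term counts across all clusters, and average words per cluster (A).
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(term_counts)

    result = {}
    for label, counts in term_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (count / n_words) * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result[label] = ranked[:top_n]
    return result
```

Calling ctfidf_labels(docs, reduce_and_cluster(embeddings)) returns a dict mapping each cluster ID to its top terms, which gives a starting point for the spot-check in the final step. Documents labeled -1 are left out of the term statistics.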

Pro tip

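If HDBSCAN labels too many documents as noise, lower min_cluster_size or min_samples to make the clustering less conservative. The UMAP n_neighbors parameter also shapes the result: smaller values preserve finer local structure, while larger values emphasize global structure and tend to produce broader clusters.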
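
One way to make this tuning less ad hoc is to measure the share of documents labeled as noise across a small parameter grid. The sketch below is illustrative only: noise_fraction, sweep, and hdbscan_cluster are hypothetical helper names, the grid values are arbitrary, and the hdbscan package is assumed for the clustering call.

```python
from itertools import product


def noise_fraction(labels):
    """Share of documents that HDBSCAN marked as noise (label -1)."""
    labels = [int(label) for label in labels]
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def sweep(reduced, cluster_fn,
          min_cluster_sizes=(5, 10, 20), min_samples_values=(1, 5, 10)):
    """Try each parameter pair and record cluster count and noise share.

    `cluster_fn(reduced, min_cluster_size, min_samples)` must return one
    label per row of `reduced`. Grid values are assumptions, not advice.
    """
    rows = []
    for mcs, ms in product(min_cluster_sizes, min_samples_values):
        labels = [int(label) for label in cluster_fn(reduced, mcs, ms)]
        n_clusters = len(set(labels) - {-1})
        rows.append((mcs, ms, n_clusters, noise_fraction(labels)))
    return rows


def hdbscan_cluster(reduced, min_cluster_size, min_samples):
    """Adapter for the hdbscan package (pip install hdbscan)."""
    import hdbscan  # imported lazily; only needed when this adapter is used

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                min_samples=min_samples)
    return clusterer.fit_predict(reduced)
```

Running sweep(reduced, hdbscan_cluster) yields one (min_cluster_size, min_samples, n_clusters, noise_fraction) tuple per setting, so you can pick a configuration that balances noise against cluster count before spot-checking coherence.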