Recipe
Topic clustering with embeddings
Group unstructured text into coherent topics using embedding vectors and density-based clustering.
Overview
This recipe walks through converting a corpus of documents into embedding vectors, reducing dimensionality, and applying HDBSCAN to discover latent topic clusters without predefining the number of topics.
Ingredients
- Raw text corpus (support tickets, reviews, transcripts)
- Embedding model (for example text-embedding-3-small, or a locally hosted model)
- UMAP for dimensionality reduction
- HDBSCAN for density-based clustering
- c-TF-IDF for topic label extraction
Steps
- Embed documents. Pass each document through your embedding model to produce a fixed-length vector.
- Reduce dimensions. Use UMAP to project high-dimensional embeddings into a lower-dimensional space while preserving local structure.
- Cluster. Run HDBSCAN on the reduced vectors. Tune min_cluster_size (the smallest group that counts as a cluster) and min_samples (how conservatively points are marked as noise) to control granularity.
- Label topics. Apply c-TF-IDF within each cluster to extract the most representative terms as topic labels.
- Validate. Spot-check cluster coherence and adjust parameters iteratively.
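The Label topics step can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes one common c-TF-IDF formulation: term frequency within a cluster, multiplied by log(1 + A / f_t), where A is the average word count per cluster and f_t is the term's count across all clusters. It also assumes a deliberately naive tokenizer and cluster labels in which -1 marks noise, as HDBSCAN produces. The names tokenize and ctfidf_labels and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep runs of letters (a naive tokenizer, assumed for the sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def ctfidf_labels(docs, labels, top_n=3):
    """Return {cluster_id: [top terms]} using a class-based TF-IDF weighting."""
    # Merge all documents of a cluster into one bag of words.
    # Documents labelled -1 (noise) are skipped.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label != -1:
            class_counts[label].update(tokenize(doc))
    if not class_counts:
        return {}

    # f_t: how often each term occurs across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    top_terms = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top_terms[label] = ranked[:top_n]
    return top_terms


# Hypothetical sample data: two clusters plus one noise document.
docs = [
    "refund for my order not received",
    "refund for the order is late",
    "app crashes for my login",
    "the app crashes on login",
    "hello there",
]
labels = [0, 0, 1, 1, -1]
topic_terms = ctfidf_labels(docs, labels, top_n=2)
```

Words shared by both clusters, such as "for" and "my", score lower than cluster-specific words because their corpus-wide count f_t is higher. On a real corpus you would likely also remove stop words before scoring.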
Pro tip
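If HDBSCAN labels too many documents as noise, lower min_cluster_size or min_samples so that smaller, sparser groups can still form clusters. Increasing the UMAP n_neighbors parameter may also help, because it emphasizes broader structure over fine local detail.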
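To see how much noise each setting produces, a small sweep helper can be useful. The sketch below is library-agnostic and makes some assumptions: the names noise_fraction and sweep_min_cluster_size are hypothetical, and cluster_fn stands for any callable you supply that takes the vectors and a min_cluster_size value and returns one label per document, with -1 meaning noise.

```python
def noise_fraction(labels):
    """Share of points labelled -1, the label HDBSCAN uses for noise."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def sweep_min_cluster_size(vectors, sizes, cluster_fn):
    """Call cluster_fn(vectors, size) for each size.

    Returns a list of (size, number_of_clusters, noise_fraction) tuples.
    cluster_fn is supplied by the caller (an assumption of this sketch).
    """
    report = []
    for size in sizes:
        labels = list(cluster_fn(vectors, size))
        n_clusters = len({label for label in labels if label != -1})
        report.append((size, n_clusters, noise_fraction(labels)))
    return report
```

With the hdbscan package installed, cluster_fn could be something like `lambda X, size: hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(X)`. Compare the noise share across sizes, then spot-check the resulting clusters before settling on a value.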