Recipe

Topic clustering with embeddings

Group unstructured text into coherent topics using embedding vectors and density-based clustering.

Overview

This recipe walks through converting a corpus of documents into embedding vectors, reducing dimensionality, and applying HDBSCAN to discover latent topic clusters without predefining the number of topics.
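The reduce-and-cluster core of this pipeline can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: it assumes the umap-learn and hdbscan Python packages, assumes the embeddings have already been computed as a 2-D array (one row per document), and uses a hypothetical function name, cluster_embeddings, with illustrative default values that are not recommendations from this recipe.

```python
def cluster_embeddings(embeddings, n_neighbors=15, n_components=5,
                       min_cluster_size=10, min_samples=None, random_state=42):
    """Project embeddings with UMAP, then cluster the projection with HDBSCAN.

    Returns (reduced, labels). A label of -1 means HDBSCAN treated the
    document as noise. All default values here are illustrative assumptions.
    """
    # Third-party dependencies (assumed: umap-learn and hdbscan), imported at
    # the point of use so the requirement is explicit.
    import umap
    import hdbscan

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,    # larger values emphasize global structure
        n_components=n_components,  # target dimensionality for clustering
        min_dist=0.0,               # allow tight packing, which suits clustering
        metric="cosine",            # a common choice for text embeddings
        random_state=random_state,  # fixed seed for repeatable projections
    )
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,    # None falls back to min_cluster_size
        metric="euclidean",
    )
    labels = clusterer.fit_predict(reduced)
    return reduced, labels
```

Calling cluster_embeddings(vectors) on an array of document embeddings returns the projected points and one integer label per document. The number of clusters is not an input; it falls out of the density structure that HDBSCAN finds.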

Ingredients

  • Raw text corpus (support tickets, reviews, transcripts)
  • Embedding model (text-embedding-3-small or a local model)
  • UMAP for dimensionality reduction
  • HDBSCAN for density-based clustering
  • c-TF-IDF for topic label extraction

Steps

  1. Embed documents. Pass each document through your embedding model to produce a fixed-length vector.
  2. Reduce dimensions. Use UMAP to project high-dimensional embeddings into a lower-dimensional space while preserving local structure.
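  3. Cluster. Run HDBSCAN on the reduced vectors. Tune min_cluster_size (the smallest group that counts as a cluster) and min_samples (how conservative the clustering is; higher values mark more points as noise) to control granularity.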
  4. Label topics. Apply c-TF-IDF within each cluster to extract the most representative terms as topic labels.
  5. Validate. Spot-check cluster coherence and adjust parameters iteratively.
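The labeling step (step 4) can be illustrated with a small pure-Python sketch. It assumes the c-TF-IDF formulation popularized by BERTopic: all documents in a cluster are merged into one bag of words, and each term is scored as its normalized frequency within the cluster times log(1 + A / f_t), where A is the average number of words per cluster and f_t is the term's frequency across all clusters. The names tokenize and ctfidf_labels are hypothetical, and the regex tokenizer is a stand-in for a real one; there is no stop-word removal here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphabetic runs; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def ctfidf_labels(docs, labels, top_n=3):
    """Return {cluster_id: [top terms]} using class-based TF-IDF."""
    # Merge every cluster's documents into one bag of words; skip noise (-1).
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        if label == -1:
            continue
        term_counts[label].update(tokenize(doc))
    if not term_counts:
        return {}

    # f_t: frequency of each term across all clusters.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(term_counts)

    result = {}
    for label, counts in term_counts.items():
        cluster_size = sum(counts.values())
        scores = {
            term: (count / cluster_size)
            * math.log(1 + avg_words / total_counts[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[label] = ranked[:top_n]
    return result
```

For example, with docs = ["refund refund payment", "payment refund card", "login password reset", "password login error"] and labels = [0, 0, 1, 1], ctfidf_labels(docs, labels, top_n=2) returns {0: ["refund", "payment"], 1: ["login", "password"]}. On a real corpus, filtering stop words before scoring usually gives more readable labels.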

Pro tip

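If HDBSCAN labels too many documents as noise, lower min_cluster_size or min_samples so that smaller, sparser groups can form clusters. Increasing the UMAP n_neighbors parameter can also help: it emphasizes broader structure over fine local detail, which tends to pull scattered points into larger groups.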
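To judge whether "too many" documents are noise, it helps to measure the noise share after each parameter change. The helper below is a small sketch with a hypothetical name, noise_report; it relies only on the convention that HDBSCAN assigns the label -1 to noise points.

```python
from collections import Counter


def noise_report(labels):
    """Summarize cluster labels: noise share, cluster count, cluster sizes."""
    labels = [int(label) for label in labels]
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 is the noise label
    total = len(labels)
    return {
        "noise_fraction": n_noise / total if total else 0.0,
        "n_clusters": len(counts),
        "cluster_sizes": dict(sorted(counts.items())),
    }
```

Running this on the labels from each parameter setting and comparing the reports side by side makes the trade-off visible: a lower noise fraction is only an improvement if the resulting clusters still look coherent when spot-checked.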