
Pull-quote extractor from long content

Extract the most quotable sentences from articles, transcripts, or long-form text using a sliding-window salience scorer. Outputs ranked pull-quotes ready for social cards or editorial highlights.

Ingredients

  • Raw text input (article body, transcript, or markdown)
  • Sentence boundary detector (regex or spaCy)
  • Salience model: TF-IDF, YAKE, or LLM prompt
  • Sliding window (k=3 sentences) with overlap
  • Deduplication pass (cosine similarity threshold)
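
As a rough illustration of the regex option for the sentence boundary detector, the sketch below splits text and records character offsets for each sentence. The pattern and the name `split_sentences` are assumptions made for this example, not part of the recipe. It treats `.`, `!`, or `?` followed by whitespace as a boundary, so abbreviations such as "Dr. Smith" will be split incorrectly. A spaCy pipeline would handle those cases better.

```python
import re

# Assumption: a sentence ends at ., !, or ? followed by whitespace or end of text.
# Trailing text without closing punctuation is kept as a final sentence.
_SENTENCE_RE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def split_sentences(text):
    """Return (sentence, start_char, end_char) tuples; end_char is exclusive."""
    return [(m.group(), m.start(), m.end()) for m in _SENTENCE_RE.finditer(text)]


if __name__ == "__main__":
    for sentence, start, end in split_sentences("Hello world. How are you? Fine"):
        print(start, end, repr(sentence))
```

Keeping the offsets at this stage means later steps can report positions without searching the source text again.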

Method

  1. Split input into sentences. Filter fragments shorter than 40 characters.
  2. Score each sentence with a keyword-extraction model. Normalize scores to [0,1].
  3. Slide a window of 3 sentences across the scored list. Sum window scores; keep the top-N windows.
  4. From each winning window, extract the highest-scoring sentence as the pull-quote candidate.
  5. Deduplicate candidates using cosine similarity on sentence embeddings. Keep only diverse quotes.
  6. Return ranked list with position offsets for inline annotation.
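
The steps above could be sketched roughly as follows, using only the Python standard library. Several pieces are stand-ins chosen for this example rather than anything the recipe prescribes: the salience score is a simple TF-IDF-style average, the cosine similarity is computed on bag-of-words counts instead of sentence embeddings, and the 0.8 similarity threshold, the `top_n` default, and all function names are hypothetical. The input is assumed to be a list of `(text, start_char, end_char)` tuples from an earlier sentence-splitting step.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())


def score_sentences(sentences):
    """Stand-in salience: mean TF-IDF-style token weight, min-max scaled to [0, 1]."""
    docs = [Counter(tokenize(s)) for s in sentences]
    if not docs:
        return []
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # sentences containing each token
    raw = []
    for d in docs:
        total = sum(d.values())
        weight = sum(c * math.log((1 + n) / (1 + df[t])) for t, c in d.items())
        raw.append(weight / total if total else 0.0)
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_pull_quotes(sentences, k=3, top_n=3, min_chars=40, max_sim=0.8):
    # Step 1: drop fragments shorter than min_chars.
    kept = [s for s in sentences if len(s[0]) >= min_chars]
    if not kept:
        return []
    # Step 2: score each sentence, normalized to [0, 1].
    scores = score_sentences([s[0] for s in kept])
    # Step 3: sum scores over overlapping windows of k sentences; keep the top_n.
    windows = [(sum(scores[i:i + k]), i) for i in range(max(1, len(kept) - k + 1))]
    windows.sort(key=lambda w: w[0], reverse=True)
    # Step 4: take the best sentence of each winning window (windows may agree).
    candidates = []
    for _, i in windows[:top_n]:
        best = max(range(i, min(i + k, len(kept))), key=lambda j: scores[j])
        if best not in candidates:
            candidates.append(best)
    candidates.sort(key=lambda j: scores[j], reverse=True)
    # Step 5: skip candidates too similar to an already accepted quote.
    quotes, accepted = [], []
    for j in candidates:
        vec = Counter(tokenize(kept[j][0]))
        if all(cosine(vec, other) < max_sim for other in accepted):
            accepted.append(vec)
            text, start, end = kept[j]
            # Step 6: emit text, score, and offsets, ranked by score.
            quotes.append({"text": text, "score": round(scores[j], 2),
                           "start_char": start, "end_char": end})
    return quotes
```

Because the windows overlap, several winning windows often nominate the same sentence, so the result can contain fewer than `top_n` quotes. Swapping in YAKE, an LLM prompt, or real sentence embeddings would only change `score_sentences` and `cosine`.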

Output schema

[
  {
    "text": "The most salient sentence...",
    "score": 0.87,
    "start_char": 142,
    "end_char": 219
  }
]
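
One possible way to consume this schema for inline annotation is sketched below. It assumes `end_char` is exclusive (so `source[start_char:end_char]` equals `text`) and that spans do not overlap; the schema itself does not state either. The `PullQuote` type, the `annotate` helper, and the `[[`/`]]` markers are hypothetical names used only for illustration.

```python
from typing import List, TypedDict


class PullQuote(TypedDict):
    text: str
    score: float
    start_char: int
    end_char: int  # assumed exclusive


def annotate(source: str, quotes: List[PullQuote],
             open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap each quoted span in markers, checking offsets against the source."""
    out = source
    # Work from the last span to the first so earlier offsets stay valid.
    for q in sorted(quotes, key=lambda q: q["start_char"], reverse=True):
        start, end = q["start_char"], q["end_char"]
        if source[start:end] != q["text"]:
            raise ValueError(f"offsets {start}:{end} do not match quote text")
        out = out[:start] + open_mark + out[start:end] + close_mark + out[end:]
    return out
```

Checking the slice against `text` before annotating catches offsets that were computed on a different version of the source, for example before markdown was stripped.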