
Pull-quote extractor from long content

Extract the most quotable sentences from articles, transcripts, or long-form text using a sliding-window salience scorer. Outputs ranked pull-quotes ready for social cards or editorial highlights.

Ingredients

Raw text input (article body, transcript, or markdown)
Sentence boundary detector (regex or spaCy)
Salience model: TF-IDF, YAKE, or LLM prompt
Sliding window (k=3 sentences) with overlap
Deduplication pass (cosine similarity threshold)

Method

1. Split the input into sentences. Filter out fragments shorter than 40 characters.
2. Score each sentence with the salience model (TF-IDF, YAKE, or an LLM prompt). Normalize scores to [0,1].
3. Slide a window of 3 sentences across the scored list. Sum the scores in each window; keep the top-N windows.
4. From each winning window, extract the highest-scoring sentence as the pull-quote candidate.
5. Deduplicate candidates using cosine similarity on sentence embeddings. Drop any candidate whose similarity to an already-kept quote exceeds the threshold.
6. Return the ranked list with character offsets (start_char, end_char) for inline annotation.
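
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the recipe's reference implementation. A naive regex stands in for the sentence boundary detector. A crude term-frequency score stands in for TF-IDF, YAKE, or an LLM prompt. Bag-of-words vectors stand in for sentence embeddings. The function names, the stopword list, and the `top_n=5` and `max_sim=0.8` defaults are all hypothetical.

```python
import math
import re
from collections import Counter

# Naive boundary regex (assumption): abbreviations and decimals will mis-split.
SENTENCE_RE = re.compile(r"[^.!?]+(?:[.!?]+|$)")
WORD_RE = re.compile(r"[a-z']+")
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "by", "was"}


def split_sentences(text, min_chars=40):
    """Step 1: return (sentence, start_char, end_char), dropping short fragments."""
    spans = []
    for m in SENTENCE_RE.finditer(text):
        raw = m.group()
        sentence = raw.strip()
        if len(sentence) < min_chars:
            continue
        start = m.start() + len(raw) - len(raw.lstrip())
        spans.append((sentence, start, start + len(sentence)))
    return spans


def tokens(sentence):
    """Lowercase word tokens with a tiny illustrative stopword list removed."""
    return [w for w in WORD_RE.findall(sentence.lower()) if w not in STOPWORDS]


def score_sentences(sentences):
    """Step 2 stand-in: mean term frequency per sentence, min-max scaled to [0, 1]."""
    toks = [tokens(s) for s in sentences]
    tf = Counter(w for t in toks for w in t)
    raw = [sum(tf[w] for w in t) / len(t) if t else 0.0 for t in toks]
    if not raw:
        return []
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [1.0] * len(raw)
    return [(r - lo) / (hi - lo) for r in raw]


def top_windows(scores, k=3, top_n=5):
    """Step 3: start indices of the top_n windows, ranked by summed score."""
    n_windows = max(1, len(scores) - k + 1) if scores else 0
    sums = [(sum(scores[i:i + k]), i) for i in range(n_windows)]
    sums.sort(key=lambda pair: -pair[0])
    return [i for _, i in sums[:top_n]]


def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_pull_quotes(text, k=3, top_n=5, max_sim=0.8, min_chars=40):
    spans = split_sentences(text, min_chars)
    scores = score_sentences([s for s, _, _ in spans])
    # Step 4: best sentence per winning window; overlapping windows often
    # agree, so a set collapses repeated picks of the same sentence.
    picked = {
        max(range(start, min(start + k, len(scores))), key=scores.__getitem__)
        for start in top_windows(scores, k, top_n)
    }
    quotes, kept = [], []
    # Step 5: greedy dedup, visiting candidates from highest score to lowest.
    for i in sorted(picked, key=lambda i: -scores[i]):
        vec = Counter(tokens(spans[i][0]))
        if any(cosine(vec, other) > max_sim for other in kept):
            continue
        kept.append(vec)
        sentence, start, end = spans[i]
        # Step 6: emit the record with offsets into the original text.
        quotes.append({"text": sentence, "score": round(scores[i], 2),
                       "start_char": start, "end_char": end})
    return quotes
```

Because windows overlap, neighbouring windows frequently nominate the same sentence, which is why the sketch collapses picks by index before running the similarity pass. Swapping in a real salience model should only require replacing `score_sentences`, as long as it still returns one value in [0,1] per sentence.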

Output schema

[
  {
    "text": "The most salient sentence...",
    "score": 0.87,
    "start_char": 142,
    "end_char": 219
  }
]
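
One way to consume this output for inline annotation is sketched below. It assumes `start_char` and `end_char` are half-open offsets into the original input string, which the schema does not state explicitly. The `annotate` helper and the `<mark>` tags are hypothetical choices for illustration.

```python
import json


def annotate(text, quotes, open_tag="<mark>", close_tag="</mark>"):
    """Wrap each quote span in tags, assuming half-open [start_char, end_char)."""
    out = text
    # Work right to left so inserted tags never shift offsets not yet applied.
    for q in sorted(quotes, key=lambda q: q["start_char"], reverse=True):
        start, end = q["start_char"], q["end_char"]
        # Guard against offsets that drifted from the text they claim to cover.
        if text[start:end] != q["text"]:
            raise ValueError(f"offsets do not match text for quote at {start}")
        out = out[:start] + open_tag + out[start:end] + close_tag + out[end:]
    return out


body = "Intro. Quotable line here. Outro."
payload = '[{"text": "Quotable line here.", "score": 0.87, "start_char": 7, "end_char": 26}]'
print(annotate(body, json.loads(payload)))
# prints: Intro. <mark>Quotable line here.</mark> Outro.
```

The sketch assumes quote spans do not overlap, which holds when each quote is a distinct whole sentence.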
