Pull-quote extractor from long content
Extract the most quotable sentences from articles, transcripts, or long-form text using a sliding-window salience scorer. Outputs ranked pull-quotes ready for social cards or editorial highlights.
Ingredients
- Raw text input (article body, transcript, or markdown)
- Sentence boundary detector (regex or spaCy)
- Salience model: TF-IDF, YAKE, or LLM prompt
- Sliding window (k=3 sentences) with overlap
- Deduplication pass (cosine similarity threshold)
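The first two ingredients can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the regex is a rough boundary heuristic standing in for spaCy, the scorer is a simple TF-IDF variant that treats each sentence as its own document, and the names `split_sentences` and `tfidf_scores` are hypothetical.

```python
import math
import re
from collections import Counter

# Rough heuristic: a sentence starts at a non-space character and ends at
# punctuation followed by whitespace, or at the end of the text.
_SENTENCE_RE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def split_sentences(text, min_chars=40):
    """Return (sentence, start_char, end_char) tuples, dropping short fragments."""
    spans = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group()
        if len(sentence) >= min_chars:
            spans.append((sentence, match.start(), match.end()))
    return spans


def tfidf_scores(sentences):
    """Score sentences by length-normalized TF-IDF, min-max scaled to [0, 1]."""
    docs = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    raw = []
    for doc in docs:
        tf = Counter(doc)
        raw.append(sum(
            count / len(doc) * (math.log((1 + n) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        ))
    if not raw:
        return []
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # All sentences tie; assumption: report 0.0 rather than divide by zero.
        return [0.0] * n
    return [(x - lo) / (hi - lo) for x in raw]
```

Keeping the character offsets alongside each sentence at split time avoids having to search for the quote in the source text later.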
Method
1. Split the input into sentences and discard fragments shorter than 40 characters.
2. Score each sentence with the chosen salience model (TF-IDF, YAKE, or an LLM prompt). Normalize scores to [0,1].
3. Slide a window of k=3 sentences across the scored list, sum the scores inside each window, and keep the top-N windows by sum.
4. From each winning window, take the highest-scoring sentence as the pull-quote candidate. Overlapping windows can nominate the same sentence; count it once.
5. Deduplicate candidates using cosine similarity on sentence embeddings: drop any candidate whose similarity to a higher-ranked one exceeds the threshold, so that only diverse quotes remain.
6. Return the ranked list with character offsets (start_char, end_char) for inline annotation.
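Steps 3 to 5 can be sketched as follows, assuming the sentences are already scored. The function names are hypothetical, the window advances one sentence at a time (an assumption about the overlap), and a bag-of-words cosine stands in for real sentence embeddings; the 0.8 threshold is likewise only an illustrative default.

```python
import math
import re
from collections import Counter


def top_windows(scores, k=3, top_n=5):
    """Start indices of the top_n windows of k consecutive scores, by summed score."""
    if not scores:
        return []
    if len(scores) < k:
        return [0]  # input shorter than one window: treat it as a single window
    sums = [(sum(scores[i:i + k]), i) for i in range(len(scores) - k + 1)]
    sums.sort(key=lambda pair: (-pair[0], pair[1]))
    return [start for _, start in sums[:top_n]]


def pick_candidates(scores, k=3, top_n=5):
    """Best sentence index from each winning window, without repeats, ranked by score."""
    picked = []
    for start in top_windows(scores, k, top_n):
        window = range(start, min(start + k, len(scores)))
        best = max(window, key=lambda i: scores[i])
        if best not in picked:  # overlapping windows often pick the same sentence
            picked.append(best)
    return sorted(picked, key=lambda i: -scores[i])


def bow_cosine(a, b):
    """Cosine similarity on word counts; a stand-in for embedding similarity."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def dedupe(sentences, ranked, threshold=0.8):
    """Keep a candidate only if it is not too similar to any higher-ranked keeper."""
    kept = []
    for i in ranked:
        if all(bow_cosine(sentences[i], sentences[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Because adjacent windows share sentences, a single strong sentence can win several windows, so the number of distinct candidates is often smaller than N.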
Output schema
[
  {
    "text": "The most salient sentence...",
    "score": 0.87,
    "start_char": 142,
    "end_char": 219
  }
]
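One way to build records in this shape is sketched below. It assumes that `end_char` is exclusive, so that `source[start_char:end_char]` reproduces the quote (the schema does not say either way), and that scores are rounded to two decimals as in the sample. `PullQuote` and `to_records` are hypothetical names.

```python
import json
from typing import TypedDict


class PullQuote(TypedDict):
    text: str
    score: float
    start_char: int
    end_char: int  # assumed exclusive, like a Python slice bound


def to_records(source, candidates):
    """Build ranked records from (start_char, end_char, score) tuples."""
    records = [
        PullQuote(text=source[start:end], score=round(score, 2),
                  start_char=start, end_char=end)
        for start, end, score in candidates
    ]
    # Highest score first, matching the "ranked list" in the method.
    records.sort(key=lambda r: r["score"], reverse=True)
    return records


def to_json(source, candidates):
    """Serialize the ranked records as indented JSON."""
    return json.dumps(to_records(source, candidates), indent=2)
```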