Recipe: PDF table + form extractor
Extract structured data from fillable PDFs — tables, form fields, and embedded metadata — then pipe the output into a downstream pipeline or database.
Ingredients
- pdfplumber or equivalent table-aware PDF parser
- pypdf for AcroForm field enumeration
- Output adapter (CSV, JSON, or direct DB insert)
- Error boundary for malformed PDFs
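One way to picture the error boundary is a thin wrapper that runs an extractor and converts any failure into a result object, so a single malformed PDF does not abort a batch. This is a minimal sketch, not a prescribed design: `ExtractionResult` and `with_error_boundary` are hypothetical names, and catching every `Exception` is an assumption that suits batch jobs but may be too broad elsewhere.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ExtractionResult:
    """Outcome of one extraction attempt (hypothetical container)."""
    path: str
    data: Optional[Any] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def with_error_boundary(path: str, extractor: Callable[[str], Any]) -> ExtractionResult:
    """Run `extractor(path)` and capture any failure instead of propagating it."""
    try:
        return ExtractionResult(path=path, data=extractor(path))
    except Exception as exc:  # deliberately broad: parser errors vary by library
        return ExtractionResult(path=path, error=f"{type(exc).__name__}: {exc}")
```

A caller can then loop over many files, keep the results where `ok` is true, and log the `error` string for the rest.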
Steps
1. Open the PDF and enumerate all AcroForm fields. Build a field-name → value map.
2. Detect table regions on each page using ruling-line or whitespace-gap analysis.
3. Extract cell text, merge multi-line rows, and validate column counts per row.
4. Combine form data and table data into a unified structured record.
5. Serialize to the target format and flush to the configured sink.
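The steps above can be sketched roughly as follows, assuming pypdf and pdfplumber are installed. All function names are hypothetical. Two choices are assumptions rather than part of the recipe: table detection uses pdfplumber's ruling-line strategy only, and a row counts as a continuation of the previous one when its first cell is empty. The two PDF libraries are imported inside the functions that need them so the pure helpers can be used on their own.

```python
import json
from typing import Dict, List, Optional

Row = List[Optional[str]]


def read_form_fields(path: str) -> Dict[str, Optional[str]]:
    """Step 1: map AcroForm field names to their values (needs pypdf)."""
    from pypdf import PdfReader

    # get_fields() returns None when the document has no form.
    fields = PdfReader(path).get_fields() or {}
    out: Dict[str, Optional[str]] = {}
    for name, field in fields.items():
        value = field.get("/V")  # /V holds the field's current value
        out[name] = None if value is None else str(value)
    return out


def extract_raw_tables(path: str) -> List[List[Row]]:
    """Steps 2-3: find ruled tables on each page and pull out cell text (needs pdfplumber)."""
    import pdfplumber

    settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    tables: List[List[Row]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables(table_settings=settings))
    return tables


def merge_multiline_rows(rows: List[Row]) -> List[List[str]]:
    """Step 3: fold continuation rows (empty first cell, same width) into the row above."""
    merged: List[List[str]] = []
    for row in rows:
        cells = [(c or "").strip() for c in row]
        is_continuation = (
            bool(merged) and bool(cells) and not cells[0] and len(cells) == len(merged[-1])
        )
        if is_continuation:
            merged[-1] = [" ".join(p for p in (a, b) if p) for a, b in zip(merged[-1], cells)]
        else:
            merged.append(cells)
    return merged


def validate_column_counts(rows: List[List[str]]) -> None:
    """Step 3: raise ValueError if any row's width differs from the first row's."""
    if not rows:
        return
    expected = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(f"row {i} has {len(row)} cells, expected {expected}")


def build_record(fields: Dict[str, Optional[str]], tables: List[List[List[str]]]) -> dict:
    """Step 4: combine form data and table data into one record."""
    return {"form": fields, "tables": tables}


def extract_to_json(path: str) -> str:
    """Steps 1-5: return JSON text ready to hand to the configured sink."""
    tables = []
    for raw in extract_raw_tables(path):
        rows = merge_multiline_rows(raw)
        validate_column_counts(rows)
        tables.append(rows)
    return json.dumps(build_record(read_form_fields(path), tables), ensure_ascii=False)
```

For PDFs whose tables have no ruling lines, pdfplumber's `"text"` strategy is the whitespace-based alternative to try in `settings`.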
Notes
- Scanned PDFs require an OCR pre-pass; this recipe assumes text-native PDFs.
- Handle XFA forms separately: they store data in XML streams, not AcroForm dicts.
- Large tables benefit from streaming row-by-row to avoid memory pressure.
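Two of these notes lend themselves to a short sketch, again with hypothetical function names. The XFA check relies on the PDF convention that XFA data is referenced from an `/XFA` entry in the catalog's `/AcroForm` dictionary. The streaming part assumes a CSV sink and writes each row as soon as it is produced, so only one page's tables are held in memory at a time.

```python
import csv
from typing import Iterable, Iterator, List, Mapping, Optional, TextIO

Row = List[Optional[str]]


def catalog_has_xfa(catalog: Mapping) -> bool:
    """True when the document catalog's /AcroForm dictionary has an /XFA entry."""
    return "/AcroForm" in catalog and "/XFA" in catalog["/AcroForm"]


def iter_table_rows(path: str) -> Iterator[Row]:
    """Yield table rows one page at a time instead of collecting them all (needs pdfplumber)."""
    import pdfplumber  # imported lazily so the helpers around it work without it

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                yield from table


def stream_rows_to_csv(rows: Iterable[Row], sink: TextIO) -> int:
    """Write rows to `sink` as they arrive and return how many were written."""
    writer = csv.writer(sink)
    count = 0
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
        count += 1
    return count
```

With pypdf, the catalog can be obtained as `PdfReader(path).trailer["/Root"]` and passed to `catalog_has_xfa`. When the check returns true, the file needs the separate XFA handling rather than the AcroForm path.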