Recipe: PDF table + form extractor

Extract structured data from fillable PDFs (tables, form fields, and embedded metadata), then pass the output to a downstream pipeline or database.

Ingredients

pdfplumber or equivalent table-aware PDF parser
pypdf for AcroForm field enumeration
Output adapter (CSV, JSON, or direct DB insert)
Error boundary for malformed PDFs
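
As a rough illustration of how these ingredients might fit together, the sketch below wraps a pypdf field reader and a pdfplumber table reader in a simple error boundary. The names ExtractionResult, read_form_fields, read_tables, and with_error_boundary are hypothetical, and catching every exception is an assumption made for brevity; a production version would likely narrow it to the parser errors it expects.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractionResult:
    """Outcome of one extraction attempt: either data or an error message."""
    path: str
    data: Optional[dict] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def read_form_fields(path: str) -> dict:
    """Map AcroForm field names to their values using pypdf."""
    from pypdf import PdfReader  # imported lazily: only needed when called

    reader = PdfReader(path)
    fields = reader.get_fields() or {}  # get_fields() returns None if no form exists
    return {name: f.value for name, f in fields.items()}


def read_tables(path: str) -> list:
    """Collect every table on every page, using ruling lines as cell borders."""
    import pdfplumber  # imported lazily: only needed when called

    settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables(table_settings=settings))
    return tables


def with_error_boundary(path: str, extractor: Callable[[str], dict]) -> ExtractionResult:
    """Run an extractor and convert any failure into a result object."""
    try:
        return ExtractionResult(path, data=extractor(path))
    except Exception as exc:  # broad on purpose: one bad PDF should not stop a batch
        return ExtractionResult(path, error=f"{type(exc).__name__}: {exc}")
```

A caller might run with_error_boundary(path, lambda p: {"fields": read_form_fields(p), "tables": read_tables(p)}) for each file and log the results whose ok flag is false.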

Steps

1. Open the PDF and enumerate all AcroForm fields. Build a field-name → value map.
2. Detect table regions on each page using ruling-line or whitespace gap analysis.
3. Extract cell text, merge multi-line rows, and validate column counts per row.
4. Combine form data and table data into a unified structured record.
5. Serialize to the target format and flush to the configured sink.
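
Steps 3 to 5 can be sketched with the standard library alone. The version below assumes a table arrives as a list of rows of optional strings (the shape pdfplumber's extract_tables returns), that the first row is the header, and that a row with an empty first cell continues the row above it. That last rule is only a heuristic chosen for illustration, and all function names here are hypothetical.

```python
import json
from typing import List, Optional, TextIO, Tuple

RawRow = List[Optional[str]]


def merge_multiline_rows(rows: List[RawRow]) -> List[List[str]]:
    """Fold continuation rows (empty first cell, same width) into the row above."""
    merged: List[List[str]] = []
    for row in rows:
        cells = [(c or "").strip() for c in row]
        is_continuation = (
            merged and cells and cells[0] == "" and len(cells) == len(merged[-1])
        )
        if is_continuation:
            prev = merged[-1]
            merged[-1] = [" ".join(p for p in (a, b) if p) for a, b in zip(prev, cells)]
        else:
            merged.append(cells)
    return merged


def validate_column_counts(
    rows: List[List[str]], expected: int
) -> Tuple[List[List[str]], List[List[str]]]:
    """Split rows into (valid, rejected) by comparing width to the header width."""
    valid = [r for r in rows if len(r) == expected]
    rejected = [r for r in rows if len(r) != expected]
    return valid, rejected


def build_record(fields: dict, raw_table: List[RawRow]) -> dict:
    """Combine form fields and one table into a single structured record."""
    rows = merge_multiline_rows(raw_table)
    if not rows:
        return {"fields": dict(fields), "rows": [], "rejected": []}
    header, body = rows[0], rows[1:]
    valid, rejected = validate_column_counts(body, len(header))
    return {
        "fields": dict(fields),
        "rows": [dict(zip(header, r)) for r in valid],
        "rejected": rejected,  # kept for inspection rather than silently dropped
    }


def serialize(record: dict) -> str:
    """Serialize a record as one line of JSON."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def write_record(record: dict, sink: TextIO) -> None:
    """Write one JSON line to a file-like sink and flush it."""
    sink.write(serialize(record) + "\n")
    sink.flush()
```

Keeping rejected rows in the record, instead of raising on the first mismatch, is one possible design choice; it lets a downstream consumer decide whether a ragged row is fatal.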

Notes

• Scanned PDFs require an OCR pre-pass; this recipe assumes text-native PDFs.
• Handle XFA forms separately: they store data in XML streams, not AcroForm field dictionaries.
• Large tables benefit from streaming row-by-row to avoid memory pressure.
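
The streaming note can be illustrated with a generator that hands rows to a CSV writer one at a time, so no full table is held in memory. This is a minimal sketch under the assumption that the sink is CSV and that each page contributes a list of rows; iter_table_rows and stream_to_csv are hypothetical names.

```python
import csv
from typing import Iterable, Iterator, List, Optional, TextIO


def iter_table_rows(
    pages: Iterable[List[List[Optional[str]]]],
) -> Iterator[List[str]]:
    """Yield cleaned rows one at a time from an iterable of per-page row lists."""
    for page_rows in pages:
        for row in page_rows:
            # Missing cells come through as None; normalize them to empty strings.
            yield [(cell or "").strip() for cell in row]


def stream_to_csv(rows: Iterable[List[str]], sink: TextIO) -> int:
    """Write rows to the sink as they arrive and return how many were written."""
    writer = csv.writer(sink)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

With pdfplumber, the pages argument could itself be a generator such as (page.extract_table() or [] for page in pdf.pages), which keeps only one page's rows alive at a time.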