Recipe: PDF table + form extractor

Extract structured data from fillable PDFs (tables, form fields, and embedded metadata), then pass the output to a downstream pipeline or database.

Ingredients

  • pdfplumber or equivalent table-aware PDF parser
  • pypdf for AcroForm field enumeration
  • Output adapter (CSV, JSON, or direct DB insert)
  • Error boundary for malformed PDFs
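
One way to picture the error boundary is a wrapper that turns any per-file failure into a result object, so a single malformed PDF does not abort a batch. This is only a minimal sketch: `ExtractionResult` and `with_error_boundary` are hypothetical names introduced for illustration, and `extract` stands for whatever per-file extraction function you write.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ExtractionResult:
    """Outcome for one PDF: a record on success, an error message on failure.

    Hypothetical type, introduced for this sketch only.
    """
    path: str
    record: Optional[dict] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def with_error_boundary(
    path: str, extract: Callable[[str], dict]
) -> ExtractionResult:
    """Run `extract` on one file and capture any failure instead of raising.

    Catching Exception broadly is deliberate: PDF parsers raise many
    different error types on malformed input, and the batch should continue.
    """
    try:
        return ExtractionResult(path=path, record=extract(path))
    except Exception as exc:  # boundary: record the failure and move on
        return ExtractionResult(path=path, error=f"{type(exc).__name__}: {exc}")
```

A batch runner can then collect the results, write the successful records to the sink, and log the failed paths for manual review.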

Steps

  1. Open the PDF and enumerate all AcroForm fields. Build a field-name → value map.
  2. Detect table regions on each page using ruling-line or whitespace gap analysis.
  3. Extract cell text, merge multi-line rows, and validate column counts per row.
  4. Combine form data and table data into a unified structured record.
  5. Serialize to the target format and flush to the configured sink.
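
The steps above can be sketched roughly as follows, assuming pypdf and pdfplumber are installed. The helper names (`merge_multiline_rows`, `split_valid_rows`, `extract_pdf`) and the record layout are hypothetical. The merge rule is also an assumption: a row whose first cell is empty is treated as a continuation of the row above, which will not fit every table layout.

```python
import json
from typing import Any, Optional


def merge_multiline_rows(rows: list[list[Optional[str]]]) -> list[list[str]]:
    """Step 3a: fold continuation rows into the row above.

    Assumption: a row with an empty first cell (and the same width as the
    previous row) continues the previous row.
    """
    merged: list[list[str]] = []
    for raw in rows:
        cells = [(c or "").strip() for c in raw]
        is_continuation = (
            bool(merged) and bool(cells)
            and not cells[0] and len(cells) == len(merged[-1])
        )
        if is_continuation:
            merged[-1] = [
                " ".join(part for part in (a, b) if part)
                for a, b in zip(merged[-1], cells)
            ]
        else:
            merged.append(cells)
    return merged


def split_valid_rows(rows: list[list[str]], expected: int):
    """Step 3b: separate rows with the expected column count from the rest."""
    good = [r for r in rows if len(r) == expected]
    bad = [r for r in rows if len(r) != expected]
    return good, bad


def extract_pdf(path: str) -> dict[str, Any]:
    """Steps 1 to 4 for a single text-native PDF."""
    # Imported lazily so the helpers above work without the parsers installed.
    import pdfplumber
    from pypdf import PdfReader

    # Step 1: field-name -> value map; get_fields() returns None without a form.
    fields = PdfReader(path).get_fields() or {}
    form = {name: field.get("/V") for name, field in fields.items()}

    # Step 2: "lines" follows ruling lines; "text" uses whitespace alignment.
    settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    tables = []
    with pdfplumber.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for raw in page.extract_tables(settings):
                rows = merge_multiline_rows(raw)
                if not rows:
                    continue
                # The first row is taken as the header that fixes the width.
                good, bad = split_valid_rows(rows, expected=len(rows[0]))
                tables.append({"page": page_no, "rows": good, "rejected": bad})

    # Step 4: one unified record per document.
    return {"source": path, "form": form, "tables": tables}


def to_json(record: dict[str, Any]) -> str:
    """Step 5: serialize; default=str covers non-JSON-native field values."""
    return json.dumps(record, default=str)
```

Keeping rejected rows in the record, rather than dropping them, leaves a trail for checking tables where the continuation heuristic or the detected column count was wrong.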

Notes

  • Scanned PDFs require an OCR pre-pass; this recipe assumes text-native PDFs.
  • Handle XFA forms separately: they store data in XML streams, not AcroForm dicts.
  • Large tables benefit from streaming row-by-row to avoid memory pressure.
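
Two of these notes lend themselves to a short sketch: checking for an XFA entry before relying on AcroForm fields, and writing table rows to a CSV sink one at a time. The function names are hypothetical. `has_xfa` expects the document catalog, which with pypdf would be `PdfReader(path).trailer["/Root"]`. Note that `iter_table_rows` still holds one page's tables in memory at a time, so this is page-level rather than true row-level streaming from the parser.

```python
import csv
from typing import Any, Iterable, Iterator, Mapping, TextIO


def has_xfa(catalog: Mapping[str, Any]) -> bool:
    """Return True if the catalog's /AcroForm dictionary has an /XFA entry.

    Works on any mapping shaped like a PDF document catalog.
    """
    if "/AcroForm" not in catalog:
        return False
    return "/XFA" in catalog["/AcroForm"]


def iter_table_rows(path: str) -> Iterator[list[str]]:
    """Yield table rows page by page instead of building one large list."""
    import pdfplumber  # lazy import: only needed when a real PDF is read

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    # pdfplumber reports empty cells as None
                    yield [(cell or "") for cell in row]


def stream_rows_to_csv(rows: Iterable[list[str]], out: TextIO) -> int:
    """Write rows to `out` as they arrive and return how many were written."""
    writer = csv.writer(out)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

A typical call would open the target file with `newline=""` and pass `iter_table_rows(path)` straight into `stream_rows_to_csv`, so no full table is ever materialized in the caller.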