PDF Batch Processing
Contracts, reports, invoices: hundreds of PDFs handled with one command
The PDF nightmare
Your boss says, "Compile the dates and amounts from these 200 contract PDFs into one spreadsheet." You open the first file, find the date, copy it, switch to Excel, and paste. Then you open the second one... After half a day you have finished only 40 files, with 160 still to go, and you start questioning your life choices.
Tables in PDFs are worse. They look neat, but copy one and everything runs together: columns misaligned, numbers mixed with text, formatting gone. You spend more time fixing the formatting than you would typing the data in by hand.
Then there are scanned files. A customer sends a batch of scanned PDF invoices whose text cannot be selected, let alone searched. All you can do is stare at the screen and type the figures in digit by digit. When you finish, you discover you misread the amount on the third invoice and have to check everything again.
OpenClaw can solve three PDF headaches:
1. Batch information extraction: tell it "extract the date, amount, and Party A from these contracts", and it scans through hundreds of PDFs automatically and outputs the results directly as a table.
2. Table recognition: tables in PDFs are recognized and converted to Excel with columns aligned, numbers stored as numbers, and text stored as text, so no manual format fixing is needed.
3. OCR: scanned files can be processed too. Once the text is recognized, it can be searched, extracted, and translated.
Extracting information from 200 contracts? It used to take three days. Now it takes one command, and the job is done by the time you come back from a coffee.
3 PDF processing prompts, ready to copy and use
Information extraction, OCR conversion, batch merging: the most common PDF operations are all covered.
Extract information from the 50 PDF contracts in this folder.
Fields to extract:
1. Contract number
2. Signing date
3. Contract amount (with currency)
4. Party A name
5. Party B name
6. Contract period (start and end dates)
7. Payment terms (if any)
Output format:
- Generate a table with one contract per row
- If a field cannot be found in a contract, mark it "not found"
- End with a summary: total number of contracts, total amount, earliest and latest signing dates
Note: some contracts are scanned files (image-only PDFs); run OCR on them first, then extract.
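To make the requested output format concrete, here is a minimal Python sketch of the final step only: turning already-extracted records into table rows, filling gaps with "not found", and building the summary. It is an illustration under assumptions, not what OpenClaw runs internally. The field names in FIELDS, the helper names (to_row, summarize, to_csv), and the idea that amounts arrive as decimal strings and dates as YYYY-MM-DD are all hypothetical choices made for this example.

```python
import csv
import io
from datetime import date
from decimal import Decimal

# Hypothetical column names mirroring the seven fields in the prompt
# (amount and currency are split into two columns for easier summing).
FIELDS = ["contract_number", "sign_date", "amount", "currency",
          "party_a", "party_b", "period", "payment_terms"]

NOT_FOUND = "not found"


def to_row(record):
    """Return one table row, marking every missing or empty field."""
    row = {}
    for field in FIELDS:
        value = record.get(field)
        row[field] = NOT_FOUND if value in (None, "") else value
    return row


def summarize(rows):
    """Count contracts, total the amounts per currency, find the date range."""
    dates = [date.fromisoformat(r["sign_date"])
             for r in rows if r["sign_date"] != NOT_FOUND]
    totals = {}
    for r in rows:
        if r["amount"] != NOT_FOUND:
            currency = r["currency"]
            totals[currency] = totals.get(currency, Decimal("0")) + Decimal(r["amount"])
    return {
        "total_contracts": len(rows),
        "total_amount_by_currency": totals,
        "earliest_sign_date": min(dates).isoformat() if dates else NOT_FOUND,
        "latest_sign_date": max(dates).isoformat() if dates else NOT_FOUND,
    }


def to_csv(rows):
    """Serialize the rows as CSV text, one contract per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Totals are kept per currency because adding amounts in different currencies would not give a meaningful "total amount"; if all contracts share one currency, the dictionary simply has a single entry.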
Recognize the table in this scanned PDF and convert it to Excel.
Requirements:
1. Use OCR to recognize all text and numbers in the table
2. Preserve the original row and column structure
3. Store numeric columns as numbers (not text)
4. Use a uniform YYYY-MM-DD format for date columns
5. Keep any merged cells as they are
6. Mark any uncertain recognition results with [?]
PDF file: [upload file]
Output: Excel format, with the first row as the header.
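Requirements 3, 4, and 6 amount to a small clean-up pass over the OCR output. The sketch below shows one way that pass could look in Python, assuming the OCR step has already produced rows of raw strings and that you know which columns are numbers and which are dates. The function names, the DATE_FORMATS list, and the column-type labels are assumptions for illustration; the actual OCR and Excel export are left out.

```python
import re
from datetime import datetime

# Assumed input date layouts; extend this tuple for your own documents.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")


def normalize_number(text):
    """Return a float for a clean numeric cell, else the text tagged [?]."""
    cleaned = text.strip().replace(",", "").replace(" ", "")
    if re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return float(cleaned)
    return f"{text.strip()} [?]"


def normalize_date(text):
    """Return the date as YYYY-MM-DD, else the text tagged [?]."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return f"{text.strip()} [?]"


def normalize_row(row, column_types):
    """Apply the matching normalizer to each cell; plain text is only trimmed."""
    result = []
    for cell, kind in zip(row, column_types):
        if kind == "number":
            result.append(normalize_number(cell))
        elif kind == "date":
            result.append(normalize_date(cell))
        else:
            result.append(cell.strip())
    return result
```

For example, normalize_row([" Widget ", "1,234.50", "2024/3/5"], ["text", "number", "date"]) yields a trimmed name, a real number, and an ISO date, while a cell such as "12O0" (letter O misread for zero) comes back tagged with [?] instead of being silently guessed.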
Batch-merge these PDF files according to the following requirements:
1. Sort order: ascending by the numeric part of the file name
Example: report_01.pdf → report_02.pdf → report_10.pdf
(Note: sort numerically, not alphabetically, so 10 comes after 2)
2. Post-merge processing:
- Add a page number at the bottom right of every page (format: Page X / Total Y)
- Generate a table of contents page at the beginning
- The table of contents lists each original file name and its starting page number
3. Output:
- Merged PDF file
- A log file recording which files were merged, their order, and each file's page count
Provide a Python script that implements this (using PyPDF2 or reportlab).
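The two parts of this prompt that are easiest to get wrong are the numeric sort and the starting page numbers in the table of contents. The Python sketch below covers only that planning logic, using the standard library; reading and writing the PDFs themselves (for instance with PyPDF2 and reportlab, as the prompt suggests) is deliberately left out. The names numeric_key, plan_merge, and footer are hypothetical, and the sketch assumes a one-page table of contents that counts toward the page numbering.

```python
import re


def numeric_key(filename):
    """Sort key: the first run of digits as an integer, then the name itself."""
    match = re.search(r"\d+", filename)
    number = int(match.group()) if match else float("inf")  # no digits: sort last
    return (number, filename)


def plan_merge(page_counts, toc_pages=1):
    """Work out merge order and starting pages.

    page_counts maps file name to page count. Returns a list of
    (file name, starting page, page count) tuples plus the total page count.
    """
    entries = []
    next_page = toc_pages + 1  # first content page follows the TOC
    for name in sorted(page_counts, key=numeric_key):
        entries.append((name, next_page, page_counts[name]))
        next_page += page_counts[name]
    return entries, next_page - 1


def footer(page, total):
    """Footer text in the format requested by the prompt."""
    return f"Page {page} / Total {total}"
```

The entries list doubles as the content of both the table of contents and the log file, since it already records which files are merged, in what order, and how many pages each one has.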
PDF processing: OpenClaw vs. Adobe Acrobat
OpenClaw
- Batch information extraction is its strength: hundreds of PDFs processed with one command
- Extraction rules are fully customizable, so you can extract whatever fields you want
- It can generate automation scripts, so the same task can be rerun with one click next time
- OCR, information extraction, and format conversion are handled in one place
Adobe Acrobat
- Powerful PDF editing: text, images, and layout can all be changed
- Very high OCR accuracy, especially for English documents
- Batch processing is available but complex to operate; you need to learn Action Wizard
- Annual subscription at a price that is not cheap; limited information extraction capability