xavctn / img2table

img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing
MIT License
571 stars 76 forks

Suggestion: get raw OCR text for non-table content #216

Open gonarguello opened 2 months ago

gonarguello commented 2 months ago

It is often really useful to have all the text in the document that does not belong to a table, so it can be processed further. Maybe, in the same way that the lib extracts 'title', it could also extract 'footer'. Or it could just put all the OCR text that is not part of a table in another attribute, accessible through the 'table' object.

Example: when processing an invoice, the 'invoice items' would come back in a 'table' and everything else in 'title' and 'footer' objects, so that important fields such as date, number, account numbers, etc. can be processed further (manually).
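
To make the idea concrete, here is a rough sketch of the kind of split I have in mind. It does not use img2table's API at all: `Box`, `Word` and `split_non_table_text` are hypothetical names made up for illustration, and it assumes the OCR step yields words with pixel bounding boxes and that each detected table has a bounding box too. A word counts as table content when its center falls inside a table box; of the remaining words, those above the topmost table go to a 'title' string and everything else goes to 'footer'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical helper type)."""
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self):
        return (self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2

    def contains_center_of(self, other: "Box") -> bool:
        cx, cy = other.center()
        return self.x1 <= cx <= self.x2 and self.y1 <= cy <= self.y2


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box (hypothetical helper type)."""
    text: str
    box: Box


def split_non_table_text(words, table_boxes):
    """Return (title, footer) built from the words that lie outside every table."""
    # Keep only words whose center is not inside any table box.
    outside = [
        w for w in words
        if not any(t.contains_center_of(w.box) for t in table_boxes)
    ]
    # Rough reading order: top to bottom, then left to right.
    outside.sort(key=lambda w: (w.box.y1, w.box.x1))

    # With no tables at all, every word ends up in the title.
    top_of_tables = min((t.y1 for t in table_boxes), default=float("inf"))

    title = [w.text for w in outside if w.box.center()[1] < top_of_tables]
    footer = [w.text for w in outside if w.box.center()[1] >= top_of_tables]
    return " ".join(title), " ".join(footer)


if __name__ == "__main__":
    table = Box(0, 100, 300, 200)
    words = [
        Word("Invoice", Box(10, 10, 80, 30)),
        Word("#123", Box(90, 10, 140, 30)),
        Word("Item", Box(10, 110, 60, 130)),   # inside the table
        Word("Total:", Box(10, 220, 70, 240)),
        Word("42", Box(80, 220, 100, 240)),
    ]
    print(split_non_table_text(words, [table]))
```

The center-point test is just one simple choice; an overlap-ratio threshold might cope better with words that straddle a table border. Sorting by the top-left corner is also only an approximation of reading order.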