xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

Exotic sheet format impact over table/cell recognition? #152

Open mpsbrazil opened 8 months ago

mpsbrazil commented 8 months ago

Hi, Xavier. Thank you for your library. I'm using it to scrape data from public documents, with good results. Although the single-digit issue from the OCR engine continues to bother us, I'm facing another challenge: recognizing a cell even if the OCR doesn't find any text in it.

Take a look at these pictures. First, the original PDF page.

image

Note that the table on this page is 6 columns by 2 rows, and this was the extracted table in the Excel output (a perfect match, even though TesseractOCR didn't recognize the single-digit numbers in cells B3, B4, E3, E4, F3 and F4).

image

Now the second page of the PDF. It keeps the 6-column layout, and the number of rows may vary depending on the row height.

image

Now, note that the resulting sheet in the Excel file doesn't match the table's 6 columns, and that's the problem.

image

Could you please confirm this isn't a bug?

xavctn commented 8 months ago

Hello, this is not really supposed to happen. Can you run the extraction without any OCR and check the number of columns in your table (using the extract_tables method)? If it is simpler for you, you can just send me the document.

What I suspect is that, since the OCR detects no content in those columns, they are being dropped when the table content is populated.
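
For reference, the check without OCR could look roughly like the sketch below. It assumes that `extract_tables` returns a dict mapping page index to a list of extracted tables, and that each table's `content` attribute maps a row index to the list of cells in that row; `column_counts` and `report_table_shapes` are hypothetical helper names used only for illustration.

```python
def column_counts(content):
    """Map each row index to the number of cells found in that row."""
    return {row_idx: len(cells) for row_idx, cells in content.items()}


def report_table_shapes(pdf_path, pages=None):
    """Print per-row cell counts for every table detected without OCR."""
    # Imported here so column_counts can be used without img2table installed.
    from img2table.document import PDF

    # pages is a list of 0-indexed page numbers (None means all pages).
    pdf = PDF(pdf_path, pages=pages)

    # ocr=None: only the table structure is detected, no text is read.
    tables_by_page = pdf.extract_tables(ocr=None)

    for page_idx, tables in tables_by_page.items():
        for table_idx, table in enumerate(tables):
            # Assumption: table.content maps row index -> list of cells.
            print(page_idx, table_idx, column_counts(table.content))
```

Calling `report_table_shapes("document.pdf", pages=[0, 1])` (the file name is a placeholder) would print one line per detected table. If every row on the second page shows 6 cells here but the Excel output has fewer columns, that would point to the columns being dropped when the content is populated rather than during detection.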