Exotic sheet format impact over table/cell recognition?

xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing

MIT License

505 stars 73 forks

Hi, Xavier. Thank you for your library. I'm using it to scrape data from public documents, with good results. Although the single-digit issue from the OCR engine continues to bother us, I'm facing another challenge: recognizing a cell even when the OCR doesn't find the text in it.

Take a look at these pictures. First, the original PDF page.

Note that the table on this page is 6 columns by 2 rows, and this was the extracted table in the Excel output (a perfect match, even though TesseractOCR didn't recognize the single-digit numbers inside cells B3, B4, E3, E4, F3 and F4).

Now the second page of the PDF. It keeps the 6-column layout, and the number of rows may vary depending on the row height.

Now, note that the resulting sheet in the Excel file doesn't match the table's 6 columns, and that's the problem.

Could you please confirm this isn't a bug?

Exotic sheet format impact over table/cell recognition? #152