xavctn / img2table

img2table is a Python library for table identification and extraction from PDFs and images, based on OpenCV image processing
MIT License

Shuffled text in native PDF #221

Open JbIPS opened 1 month ago

JbIPS commented 1 month ago

Hi,

I'm extracting data from a PDF with native text, and some rows of the table come out with their content shuffled, as you can see in this live example or here: image vs image

I'm using Tesseract as the OCR engine, but if I understand correctly, it should not be used at all since the text is native. I have also seen this behavior with some (but not all) bold text; I don't know if it's related.

Is there a workaround? Or have I perhaps misused some parameters in my configuration?

Thank you

TianqiWang1 commented 6 days ago

I encountered a similar issue when extracting tables from a PDF: the order of some words is reversed. Have you figured this out?

JbIPS commented 6 days ago

I didn't. I tried a workaround with pattern matching, because my use case only needs to know whether a certain kind of substring exists, but that is harder when the words are in reverse order.
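
For illustration, here is a minimal sketch of what an order-insensitive version of that check could look like. It is not part of img2table; the helper names `tokens` and `contains_words_any_order` are hypothetical, and it assumes the shuffled cells still contain the right words, only in the wrong order.

```python
import re
from collections import Counter


def tokens(text):
    """Return lowercase word tokens, ignoring punctuation and extra whitespace."""
    return re.findall(r"\w+", text.lower())


def contains_words_any_order(cell_text, phrase):
    """Return True if every word of `phrase` occurs in `cell_text`, in any order.

    Hypothetical helper: a repeated word in `phrase` must appear at least
    as many times in `cell_text`.
    """
    have = Counter(tokens(cell_text))
    need = Counter(tokens(phrase))
    return all(have[word] >= count for word, count in need.items())


# A cell whose words came out reversed still matches the expected phrase.
print(contains_words_any_order("due amount Total", "total amount due"))
# A cell that is missing one of the words does not match.
print(contains_words_any_order("Total amount", "total amount due"))
```

The trade-off is that word order is discarded entirely, so a cell containing the same words in a genuinely different phrase would also match. That may be acceptable when the goal is only to detect whether a certain kind of content is present.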

Are you using Tesseract too? I don't think it's related, but maybe I'm wrong and it's the source of the issue.

TianqiWang1 commented 6 days ago

Yes, I'm passing in Tesseract too. But as in your case, I assumed native text extraction was actually being used.