Open JbIPS opened 1 month ago
I ran into a similar issue when extracting a table from a PDF: some words come out in reversed order. Have you figured this out?
I didn't. I tried a workaround with pattern matching, because my use case only needs to know whether a certain kind of substring exists, but that gets harder when the words are in reverse order.
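For anyone attempting the same kind of workaround, here is a minimal sketch of an order-insensitive check, assuming the extracted cell text is available as a plain string. It is not tied to any particular extraction library; the helper names `words` and `contains_words` are hypothetical and chosen only for illustration. Instead of looking for an exact substring, it checks that every word of the expected phrase appears somewhere in the text, so reversed or shuffled word order no longer breaks the match.

```python
import re
from collections import Counter


def words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def contains_words(haystack: str, needle: str) -> bool:
    """Return True if every word of `needle` occurs in `haystack`, in any order.

    Repeated words in `needle` must appear at least as many times in `haystack`.
    """
    # Counter subtraction keeps only positive counts, i.e. the words still missing.
    missing = words(needle) - words(haystack)
    return not missing
```

The trade-off is that this also accepts texts where the words are present but unrelated to each other, so it is best suited to short strings such as a single table cell or row.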
Are you using Tesseract too? I don't think it's related, but maybe I'm wrong and it's the source of the issue.
Yes, I'm passing in Tesseract too. But as in your case, I assumed native text extraction was actually being used.
Hi,
I'm extracting data from a PDF with native text, and some rows of the table have their content shuffled, as you can see in this live example or here: vs
I'm using Tesseract as OCR, but if I understood correctly, it should not be used since the text is native. I also saw this behavior with some (but not all) bold text; I don't know if it's related.
Is there a workaround? Maybe I misused some params in my configuration?
Thank you