hifiveszu closed this issue 7 months ago
Hi,
I published a new release of the library where I have made some updates/adjustments to the algorithm. In your first file, table detection should be better.
In terms of speed, I made some optimizations to the code, which should now run 2 to 4x faster. For large PDFs, the issue may also be (apart from the OpenCV processing time) that I am caching the images corresponding to every page. This might have some unwanted side effects on large PDFs; I will have to check.
Thanks @xavctn 😊 I tried the new release and found that header parsing has improved significantly, and the speed has also increased a lot. 👍
I have one more small issue: on page 13 of the file I've uploaded, img2table recognized the table on page 12, but it couldn't extract the text, possibly due to a certain degree of text tilt or rotation. 😂
That's logical. In order to detect columns, the algorithm looks for vertical whitespace in the image (i.e., vertical spans where no text is present). In this case, because the titles are tilted, they do not correspond (in the whitespace sense) to the table below and are thus not detected as part of the table.
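To make the whitespace idea concrete, here is a minimal sketch in plain Python. It is not img2table's actual implementation: the function name `vertical_whitespace_gaps`, the `(x1, y1, x2, y2)` box format, and the `min_gap` threshold are all assumptions made for illustration. It projects word bounding boxes onto the x-axis and reports the horizontal ranges that no box covers, which are the candidate column separators.

```python
def vertical_whitespace_gaps(word_boxes, min_gap=5):
    """Return (start, end) x-ranges that no word box covers.

    word_boxes: list of (x1, y1, x2, y2) tuples (hypothetical format).
    min_gap: minimum gap width, in pixels, to count as a separator.
    """
    # Project every box onto the x-axis and sort by left edge.
    intervals = sorted((x1, x2) for x1, _, x2, _ in word_boxes)
    gaps = []
    if not intervals:
        return gaps

    # Sweep left to right, tracking the right-most covered x so far.
    covered_until = intervals[0][1]
    for start, end in intervals[1:]:
        if start - covered_until >= min_gap:
            gaps.append((covered_until, start))
        covered_until = max(covered_until, end)
    return gaps


# Two columns of upright text leave one clear vertical gap.
upright = [(10, 0, 50, 10), (12, 20, 48, 30),
           (100, 0, 140, 10), (105, 20, 150, 30)]
print(vertical_whitespace_gaps(upright))

# A wide box straddling the gap (as a tilted title's box might) hides it.
with_tilted_title = upright + [(40, -30, 110, -5)]
print(vertical_whitespace_gaps(with_tilted_title))
```

In this toy setting, a bounding box that overlaps the gap between two columns is enough to make the separator disappear, which is one way tilted text can fail to line up with the whitespace structure of the table beneath it.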
Thanks for your answer. I understand what you mean! I checked the word coordinates and found that they are within the table area, but they are missing from the extracted table. I'll need to find another way to fill them in. 😂 Thank you once again for your generous explanation. 👍 @xavctn
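For anyone attempting the same workaround, the coordinate check can be sketched as below. This is only an illustration under assumed data shapes: the function name `words_in_table`, the `(text, (x1, y1, x2, y2))` word format, and the table bounding box tuple are hypothetical and do not come from img2table's API. A word counts as inside the table when the center of its box falls within the table's box.

```python
def words_in_table(words, table_bbox):
    """Return the texts of words whose box center lies inside table_bbox.

    words: list of (text, (x1, y1, x2, y2)) pairs (hypothetical format).
    table_bbox: (x1, y1, x2, y2) of the detected table (hypothetical format).
    """
    tx1, ty1, tx2, ty2 = table_bbox
    inside = []
    for text, (x1, y1, x2, y2) in words:
        # Use the box center so words overlapping the border are handled
        # consistently.
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            inside.append(text)
    return inside


words = [
    ("Model", (20, 15, 60, 30)),      # tilted header, inside the table area
    ("7B", (25, 60, 40, 70)),         # body cell
    ("Caption", (20, 210, 90, 225)),  # below the table
]
print(words_in_table(words, (10, 10, 200, 200)))
```

Comparing this list against the text already present in the extracted cells would show which words still need to be filled in by hand.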
Hello @xavctn, I tried the new release, version 2.1.5. The table recognition is not so good on page 70. 😂
Hello, I do not think this is related to the latest release; it is more likely a latent issue in the algorithm. I will keep it in mind in order to improve the segmentation of content into rows, but I cannot promise anything, as I do not currently have a clear idea of how to implement a better solution.
@xavctn Thanks!
Hi, I made some modifications that should solve the issue; they will be included in the next release 👍
Hello,
I encountered the following issues while using img2table to parse PDF tables:
Result file: 2309.10305.md
Source PDF: 2309.10305.pdf