xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License
468 stars 69 forks

extract tables #133

Closed hifiveszu closed 7 months ago

hifiveszu commented 8 months ago

Hello,

I encountered the following issues while using img2table to parse PDF tables:

  1. Table structure recognition is not accurate, and the table headers are not identified. Is there any way I can adjust this?

    from img2table.document import PDF

    # pdf_path is the path to the source PDF
    doc = PDF(pdf_path)
    tables = doc.extract_tables(borderless_tables=True, min_confidence=70)

Result file: 2309.10305.md. Source PDF: 2309.10305.pdf

  2. When the PDF has a large number of pages (around 300), parsing takes significantly longer, since each page has to be processed with OpenCV. Do you have any suggestions to speed up parsing? (I have considered using multiprocessing or similar methods.)

xavctn commented 8 months ago

Hi,

I published a new release of the library where I have made some updates/adjustments to the algorithm. In your first file, table detection should be better.

In terms of speed, I made some optimizations to the code, which should now run 2 to 4x faster. For large PDFs, the issue may also be (apart from the OpenCV processing time) that I am caching the images corresponding to every page. This might have unwanted side effects on large PDFs; I will have to check.
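As a rough illustration of the multiprocessing idea raised in the question, the sketch below splits the page indexes into chunks and processes each chunk in a separate worker, so that each `PDF` object only holds the images for its own pages. It assumes the `pages` argument of `PDF` and the `.df` attribute of extracted tables behave as described in the library's documentation; `chunk_pages`, `extract_chunk`, `extract_in_parallel`, and the `n_pages` argument are hypothetical helpers, not part of img2table, and whether this actually helps with the caching behaviour has not been measured.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def chunk_pages(n_pages: int, chunk_size: int) -> List[List[int]]:
    """Split page indexes 0..n_pages-1 into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    return [
        list(range(start, min(start + chunk_size, n_pages)))
        for start in range(0, n_pages, chunk_size)
    ]


def extract_chunk(job: Tuple[str, List[int]]):
    """Worker: extract tables for one chunk of pages in a child process."""
    pdf_path, pages = job
    # Imported lazily so the chunking helper can be used without img2table.
    from img2table.document import PDF

    # Assumption: `pages` restricts processing to the given page indexes.
    doc = PDF(pdf_path, pages=pages)
    tables = doc.extract_tables(borderless_tables=True, min_confidence=70)
    # Return DataFrames so that only plain, picklable data crosses processes.
    # The dictionary keys are kept exactly as the library returns them.
    return pages, {key: [t.df for t in tbls] for key, tbls in tables.items()}


def extract_in_parallel(pdf_path: str, n_pages: int,
                        chunk_size: int = 25, workers: int = 4):
    """Run extract_chunk over all chunks; returns one result per chunk."""
    jobs = [(pdf_path, pages) for pages in chunk_pages(n_pages, chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_chunk, jobs))
```

On platforms that spawn worker processes, `extract_in_parallel` should be called from under an `if __name__ == "__main__":` guard. Results are kept per chunk rather than merged, because how the returned page keys relate to the `pages` list should be checked against the installed version.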

hifiveszu commented 8 months ago

Thanks @xavctn 😊 I tried the new release and found that header parsing has improved significantly, and the speed has also increased a lot. 👍

I have one more small issue. On page 13 of the file I've uploaded, img2table recognized the table on page 12, but it couldn't extract the text, possibly because the text is tilted or rotated to some degree. 😂 [image]

xavctn commented 8 months ago

That's logical. In order to detect columns, the algorithm looks for vertical whitespace in the image (i.e. vertical spans where no text is present). In this case, because the titles are tilted, they do not line up (in the whitespace sense) with the table below and are thus not detected as part of the table.
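The idea can be illustrated with a simplified sketch. This is not img2table's actual implementation; it only assumes that each word is reduced to a horizontal extent `(x1, x2)` and looks for x-ranges that no word covers. The function name `vertical_whitespace` and the sample coordinates are made up for the example.

```python
from typing import List, Tuple


def vertical_whitespace(spans: List[Tuple[int, int]], width: int,
                        min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return x-ranges [start, end) within 0..width covered by no word span."""
    covered = [False] * width
    for x1, x2 in spans:
        for x in range(max(0, x1), min(width, x2)):
            covered[x] = True

    gaps = []
    start = None
    for x, is_covered in enumerate(covered):
        if not is_covered and start is None:
            start = x  # a whitespace run begins
        elif is_covered and start is not None:
            if x - start >= min_gap:
                gaps.append((start, x))
            start = None
    # Close a run that extends to the right edge.
    if start is not None and width - start >= min_gap:
        gaps.append((start, width))
    return gaps


# Three upright columns of body text leave two clear vertical gaps.
body = [(0, 10), (20, 30), (40, 50)]
print(vertical_whitespace(body, width=50))

# A tilted title has a wide horizontal extent that covers both gaps,
# so the title row shares no whitespace with the columns below it.
tilted_title = [(5, 45)]
print(vertical_whitespace(body + tilted_title, width=50))
```

In this toy case the first call finds the two gaps between the columns, while the second finds none once the wide extent of the tilted title is included, which is why such a title cannot be aligned with the columns underneath.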

hifiveszu commented 8 months ago

Thanks for your answer. I understand what you mean! I checked the word coordinates and found that they are within the table area, but they are missing from the extracted table. I'll need to find another way to fill them in. 😂 Thank you once again for your generous explanation. 👍 @xavctn
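One possible way to do that fill-in, sketched under assumptions: if word boxes are available from some other source (a PDF text layer or OCR output), the words whose centre falls inside the table's bounding box can be collected and compared with the extracted cells. The `Word` tuple layout, the `words_in_bbox` helper, and the sample coordinates below are hypothetical and not part of img2table.

```python
from typing import List, Tuple

# (text, x1, y1, x2, y2) in the same coordinate system as the table bbox
Word = Tuple[str, float, float, float, float]
BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def words_in_bbox(words: List[Word], bbox: BBox) -> List[Word]:
    """Keep the words whose centre point lies inside the bounding box."""
    bx1, by1, bx2, by2 = bbox
    kept = []
    for word in words:
        _, x1, y1, x2, y2 = word
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if bx1 <= cx <= bx2 and by1 <= cy <= by2:
            kept.append(word)
    return kept


# Hypothetical data: one tilted header word inside the table area,
# and one footer word well below it.
words = [("Model", 12, 5, 40, 15), ("Footer", 12, 300, 60, 310)]
table_bbox = (0, 0, 100, 100)
print([w[0] for w in words_in_bbox(words, table_bbox)])
```

Using the centre point rather than full containment is a design choice that tolerates rotated words whose boxes stick out slightly past the table border.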

hifiveszu commented 8 months ago

Hello @xavctn, I tried the new release, version 2.1.5. The table recognition is not so good on page 70. 😂 [image] [image]

2307.09288.pdf

xavctn commented 7 months ago

Hello, I do not think that this is related to the latest release; it is more likely a latent issue in the algorithm. I will keep this in mind in order to improve the segmentation of content into rows, but I cannot promise anything as I do not have, as of now, a clear idea of how to implement a better solution.

hifiveszu commented 7 months ago

@xavctn Thanks!

xavctn commented 7 months ago

Hi, I made some modifications that should solve the issue. They will be included in the next release. 👍