xavctn / img2table

img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing
MIT License

Table definition problem #166

Status: Closed (closed by eetap 9 months ago)

eetap commented 9 months ago

Hello, great library. In some PDF examples, the cv2 library does not detect the tables quite correctly. For example, on the second page of the first PDF file, the first three tables are merged into one, although they are visually separated. Can you tell me what might cause this, and how the detection of these tables could be improved? I experimented a little with the median_line_sep parameter of the TableImage class and found that with median_line_sep = 25 the tables are read correctly. Could the incorrect table detection be due to an error in how this parameter is computed?

1_tab.docx 1_exp.pdf

I also found that if a column in a table is filled entirely with the symbol "-", that column is not included in the table during parsing; I am attaching an example. On the second page of the PDF, in the very first table, the second column (containing only "-") was merged with the next column, and the last column was not included in the table content at all. I could not determine a clear reason for this.

2_exp.pdf 2_table.docx

The screenshots are in the Word files because I could not attach them as plain images. I used PaddleOCR("ru") for recognition.

xavctn commented 9 months ago

Hello,

For your first point, the median_line_sep parameter is computed automatically from image characteristics and is used in other parts of the code, so I would not advise modifying it. This can be solved in a more general manner on my end with some adjustments to the image processing that is performed.

Second point: this is a known issue on my end, as I check that there is "content" in at least one cell of the column. However, my text detection method currently fails to identify dashes as text.
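To illustrate the kind of check described above, here is a minimal sketch, not the library's actual code. It assumes cells and detected text regions are plain `(x1, y1, x2, y2)` boxes; the names `overlaps` and `column_has_content` are hypothetical. A column whose only glyphs are dashes ends up with no detected text box and is therefore dropped.

```python
from typing import List, Tuple

# Hypothetical box representation: (x1, y1, x2, y2) in pixels
Box = Tuple[int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    """Return True when two boxes share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def column_has_content(cells: List[Box], text_boxes: List[Box]) -> bool:
    """A column counts as non-empty if any detected text box overlaps any of its cells."""
    return any(overlaps(cell, box) for cell in cells for box in text_boxes)


# Two columns of two cells each
col_a = [(0, 0, 50, 20), (0, 20, 50, 40)]
col_b = [(50, 0, 100, 20), (50, 20, 100, 40)]

# Text detection found a word in column A; the dashes in column B were missed
detected_text = [(5, 5, 40, 15)]

columns = (("A", col_a), ("B", col_b))
kept = [name for name, cells in columns if column_has_content(cells, detected_text)]
print(kept)  # ['A']
```

Under these assumptions, the fix has to happen in text detection: once a dash produces a text box, the same column check keeps the column.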

In conclusion, I am pretty confident that I will fix the first issue, but I am unsure about the second one.

eetap commented 9 months ago

Thanks for the answer. Regarding the second point, can you tell me in which module this check occurs? I would also like to try to find a solution to this problem.

xavctn commented 9 months ago

This is done here. Basically, it performs a connected components analysis in order to isolate text from other artefacts (lines, images...) and to create contours that should represent all the text present within the image.

In general, dashes are excluded either because of their aspect ratio or because their area is too small.
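The following is a rough, dependency-free sketch of that idea, not the code used in img2table. It labels 4-connected foreground regions in a tiny binary image and then filters them by area and aspect ratio. The function names and the thresholds `min_area` and `max_aspect` are made up for illustration; the library's real values and logic may differ.

```python
from collections import deque


def connected_components(grid):
    """Return the bounding box and pixel count of each 4-connected '#' region."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] != "#" or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == "#" and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append({"x": c0, "y": r0, "w": c1 - c0 + 1, "h": r1 - r0 + 1, "area": area})
    return comps


def looks_like_text(comp, min_area=6, max_aspect=3.0):
    """Hypothetical filter: reject tiny or very elongated components."""
    aspect = max(comp["w"], comp["h"]) / min(comp["w"], comp["h"])
    return comp["area"] >= min_area and aspect <= max_aspect


# A ring-shaped glyph (like an "o") followed by a dash
image = [
    "..........",
    ".###......",
    ".#.#.####.",
    ".###......",
    "..........",
]

comps = connected_components(image)
for comp in comps:
    print(comp, "text" if looks_like_text(comp) else "rejected")
```

Here the dash is 4 pixels wide and 1 pixel high, so it fails both the area and the aspect-ratio test, while the ring-shaped glyph passes. With OpenCV, the labelling step would typically be done with `cv2.connectedComponentsWithStats` rather than a hand-written flood fill.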

xavctn commented 9 months ago

I forgot to mention it, but the method used is inspired by the principle described in this paper: https://www.mdpi.com/2079-9292/9/1/55

xavctn commented 9 months ago

Hi, a new release has been published that fixes those issues.