xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

Extracted table is returning contents for empty cells #146

Closed: xansrnitu closed this issue 9 months ago

xansrnitu commented 11 months ago

Hi, I am having an issue: when a row has only its first column (cell) populated and all the remaining columns of that row are empty, the first cell's value is returned for all the empty cells as well. Here is my code:

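# Assumed imports (not in the original post); Image2 is taken to be an alias for img2table's Image
from img2table.document import Image as Image2
from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"])
img = Image2(src=img_path)  # img_path: path to the source image
tables = img.extract_tables(ocr=ocr, implicit_rows=True)
table = tables.pop()
table.df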

Is it possible to avoid this behaviour? I want the empty cells to come back empty (no value).

I have attached both the source image and a screenshot of the output. [Attachments: issue, table1]

Thank you for this useful library!

xavctn commented 10 months ago

Hello, this is most likely because the vertical lines in the second row were not detected.
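One way to picture this is the toy sketch below. It is not img2table's actual code, and the `Cell` class and `to_grid` helper are hypothetical names made up for the illustration. It assumes that when the separators inside a row are missed, the row is treated as a single cell spanning every column, and that flattening such a cell into a grid copies its text into each column it covers.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    row: int
    col_start: int  # first column covered (inclusive)
    col_end: int    # last column covered (inclusive)
    text: str


def to_grid(cells, n_rows, n_cols):
    """Flatten cells into a row-major grid, repeating the text of spanning cells."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        for col in range(cell.col_start, cell.col_end + 1):
            grid[cell.row][col] = cell.text
    return grid


# Row 0: every separator was found, so it holds three single-column cells.
# Row 1: no separator was found, so it holds one cell spanning all three columns.
cells = [
    Cell(0, 0, 0, "Item"), Cell(0, 1, 1, "Qty"), Cell(0, 2, 2, "Price"),
    Cell(1, 0, 2, "Subtotal"),
]
grid = to_grid(cells, n_rows=2, n_cols=3)
```

In this sketch the second row of `grid` carries the same text in all three columns, which matches the symptom described in the report.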

Are you able to provide the original document so that I can take a look at what is going wrong?

xansrnitu commented 10 months ago

Hi @xavctn, sure, here it is: source.pdf. Thanks for having a look!

xavctn commented 9 months ago

Hello, I have created a new release that fixes the issue; it should be good now.
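If upgrading is not an option, a post-processing step along the following lines might serve as a stopgap. This is only a sketch: `blank_repeated_first_value` is a hypothetical helper, not part of img2table, and it assumes that a row whose cells all equal a non-empty first cell is an artifact of the repetition. A row that legitimately repeats one value in every column would be blanked too.

```python
def blank_repeated_first_value(rows, empty=""):
    """Blank every cell after the first in rows where all cells repeat the first one."""
    cleaned = []
    for row in rows:
        first = row[0] if row else None
        is_repeat = (
            len(row) > 1
            and first not in (None, "")
            and all(cell == first for cell in row[1:])
        )
        if is_repeat:
            # Keep the first cell, replace the repeated copies with the empty marker.
            cleaned.append([first] + [empty] * (len(row) - 1))
        else:
            cleaned.append(list(row))
    return cleaned


rows = [
    ["Item", "Qty", "Price"],
    ["Subtotal", "Subtotal", "Subtotal"],
]
cleaned = blank_repeated_first_value(rows)
```

For a pandas DataFrame such as `table.df`, the rows could be obtained with `table.df.values.tolist()` before calling the helper.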