xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

Table not supported? #163

Closed YAmikep closed 9 months ago

YAmikep commented 9 months ago

Does it support borderless tables on a black background, like the one in the image below?

I am trying to extract the table from some bank statements to help with tax prep, but it returns nothing. Am I doing something wrong, or does it just not support these types of tables? If it is not supported, any advice on how to transform this image first to make it work? Thanks.

[image: table]

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
src = "table.png"
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=True,
                                      min_confidence=50)

In [2]: extracted_tables
Out[2]: []

Versions

img2table==1.2.8
opencv-python==4.9.0.80
pytesseract==0.3.10

❯ tesseract --version
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

xavctn commented 9 months ago

Hello. The analysis of documents with a dark background is not supported in the current release. However, I have worked on it, and support will be available in the next release, which is coming soon.

xavctn commented 9 months ago

[image]

I just published the new release. It seems to work OK now, but I had to use PaddleOCR because Tesseract wasn't reading the text properly. You might have to use the library only to get the table location/cells (i.e., providing no OCR to the function) and match them with results from a better OCR engine.

YAmikep commented 9 months ago

Nice! 💪

> You might have to use the library only to get the table location/cells (i.e., providing no OCR to the function) and match them with results from a better OCR engine

How would you do that? I did not look at the source code, but it sounds like img2table does not use the OCR to detect the table, is that correct?

When passing the ocr parameter, doesn't it use the OCR on every detected cell to resolve the content? Isn't that the same as using the OCR after using img2table?

xavctn commented 9 months ago

> I did not look at the source code, but it sounds like img2table does not use the OCR to detect the table, is that correct?

I might not have been super clear. Basically, there are two steps:

  1. Tables are detected from the document using computer vision methods. The coordinates of each individual cell are accessible as referenced here. The OCR is never used for table detection.
  2. If some tables have been found, the OCR is applied to the whole image (just grayscaled), and word bounding boxes are matched with cell coordinates in order to fill the table content. Rows and columns where no content has been found are deleted from the resulting table.
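
To make step 2 more concrete, here is a minimal sketch of matching word boxes to cell boxes. It is not img2table's actual implementation: the names `overlap_ratio`, `fill_cells` and `drop_empty`, the `(row, col)` keys and the 0.5 overlap threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
Key = Tuple[int, int]            # (row, col), a hypothetical cell index


def overlap_ratio(word: Box, cell: Box) -> float:
    """Fraction of the word box's area that lies inside the cell box."""
    wx1, wy1, wx2, wy2 = word
    cx1, cy1, cx2, cy2 = cell
    inter_w = max(0, min(wx2, cx2) - max(wx1, cx1))
    inter_h = max(0, min(wy2, cy2) - max(wy1, cy1))
    word_area = max(1, (wx2 - wx1) * (wy2 - wy1))
    return (inter_w * inter_h) / word_area


def fill_cells(cells: Dict[Key, Box],
               words: List[Tuple[str, Box]],
               min_overlap: float = 0.5) -> Dict[Key, str]:
    """Assign each OCR word to the cell that contains most of it."""
    content: Dict[Key, List[str]] = {key: [] for key in cells}
    for text, box in words:
        best_key, best = None, min_overlap
        for key, cell in cells.items():
            ratio = overlap_ratio(box, cell)
            if ratio >= best:
                best_key, best = key, ratio
        # Words that do not overlap any cell enough are ignored.
        if best_key is not None:
            content[best_key].append(text)
    # Words are assumed to arrive in reading order from the OCR.
    return {key: " ".join(parts) for key, parts in content.items()}


def drop_empty(content: Dict[Key, str]) -> List[List[str]]:
    """Keep only rows and columns that hold at least one non-empty cell."""
    rows = sorted({r for (r, _), text in content.items() if text})
    cols = sorted({c for (_, c), text in content.items() if text})
    return [[content.get((r, c), "") for c in cols] for r in rows]
```

Because the OCR output is only matched against coordinates, any engine that returns word boxes in the same pixel space as the cells could be plugged in at this stage.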

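What I was saying is that you can run the detection step on its own by providing no OCR to the function, which gives you the table and cell coordinates, and then match those coordinates with the output of whatever OCR you run yourself.

This enables you to use another OCR solution, or to apply image processing tailored to your type of documents before passing the image to the OCR, in order to get better results.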
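
As a rough illustration of that workflow, the sketch below walks the cells of a table extracted with `ocr=None` and hands each cell box to an OCR callback of your choice. The helper names `read_table_cells` and `demo` are hypothetical. The sketch assumes each extracted table exposes `content` as a dict of rows holding cells with a `bbox` (`x1`, `y1`, `x2`, `y2`), and that inverting the image helps Tesseract on a dark background, which is worth checking on your own documents.

```python
from typing import Callable, List


def read_table_cells(table,
                     ocr_cell: Callable[[int, int, int, int], str]) -> List[List[str]]:
    """Call ocr_cell(x1, y1, x2, y2) for every cell of one extracted table."""
    rows = []
    for _, row_cells in table.content.items():
        rows.append([
            ocr_cell(c.bbox.x1, c.bbox.y1, c.bbox.x2, c.bbox.y2).strip()
            for c in row_cells
        ])
    return rows


def demo(path: str = "table.png") -> List[List[List[str]]]:
    """Example wiring with OpenCV and pytesseract (not executed on import)."""
    import cv2
    import pytesseract
    from img2table.document import Image

    # Step 1: detection only, no OCR involved.
    doc = Image(path)
    tables = doc.extract_tables(ocr=None,
                                implicit_rows=True,
                                borderless_tables=True)

    # Custom preprocessing: grayscale, then invert light-on-dark text.
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    inverted = cv2.bitwise_not(gray)

    def ocr_cell(x1: int, y1: int, x2: int, y2: int) -> str:
        crop = inverted[y1:y2, x1:x2]
        # --psm 7 tells Tesseract to treat the crop as a single text line.
        return pytesseract.image_to_string(crop, config="--psm 7")

    # Step 2: run your own OCR on each detected cell.
    return [read_table_cells(t, ocr_cell) for t in tables]
```

Running OCR cell by cell is slower than one pass over the whole image, but it keeps the matching trivial, and the callback can be swapped for any other engine.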

Don't know if it's clear ^^

YAmikep commented 9 months ago

I see. Thanks, that makes sense.