xavctn / img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
MIT License

Error while trying to convert image to dataframe #21

Closed 004Ajay closed 1 year ago

004Ajay commented 1 year ago

I tried to extract a table from an image into a dataframe, but an error pops up:

CalledProcessError: Command 'tesseract C:\Users\ASUS\AppData\Local\Temp\tmpk2lb97t2.jpg stdout --psm 11 -l eng hocr' returned non-zero exit status 1.

I searched for solutions but nothing worked. Can anyone please help me solve this?

Code used

image used

xavctn commented 1 year ago

Hi,

I tried to reproduce your error but without success. Your error is raised when the Tesseract CLI is called with subprocess. I suspect that it might be linked to your Tesseract version/installation as I have only tested the code with Tesseract 5.

In order to get a better understanding of what is going on, can you :

As a reference for this kind of error, I have found this documentation; I will see whether it is helpful.
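A `CalledProcessError` only reports the exit status, so the actual Tesseract message is lost. One way to see it is to rerun the same command by hand and capture stderr. The following is a minimal sketch, not part of img2table: `run_and_capture` and the `table.jpg` path are hypothetical names used for illustration, and the arguments simply mirror the command shown in the traceback above.

```python
import subprocess


def run_and_capture(cmd):
    """Run a command and return (exit_code, stdout, stderr) without raising."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, errors="replace")
    except FileNotFoundError as exc:
        # The executable itself could not be found (e.g. not on PATH).
        return 127, "", str(exc)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    image_path = "table.jpg"  # hypothetical path, replace with your own image
    code, out, err = run_and_capture(
        ["tesseract", image_path, "stdout", "--psm", "11", "-l", "eng", "hocr"]
    )
    print("exit status:", code)
    print("stderr:", err.strip())
```

The stderr text usually points at the cause, for instance a missing language file or an unreadable image.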

xavctn commented 1 year ago

FYI: on this kind of image, you should try to use Paddle instead of Tesseract. Tesseract fails to recognize the text, while I manage to get the proper output dataframe with Paddle.

xavctn commented 1 year ago

I did some more testing yesterday on Windows using Tesseract 4 and 5 and got no error with either version.

004Ajay commented 1 year ago

FYI: on this kind of image, you should try to use Paddle instead of Tesseract. Tesseract fails to recognize the text, while I manage to get the proper output dataframe with Paddle.

I tried using Paddle, but it also shows an error:

ImportError: cannot import name 'PaddleOCR' from 'img2table.ocr' (C:\Users\ASUS\AppData\Local\Programs\Python\Python310\lib\site-packages\img2table\ocr\__init__.py)

Did you get the output dataframe from my image or from another image? I planned to use pytesseract because I read online that it can be trained with custom images and that the resulting model (traineddata) can then be used for OCR.

004Ajay commented 1 year ago

I did some more testing yesterday on Windows using Tesseract 4 and 5 and got no error with either version.

I have been working on a problem for a month now, and one of the images is this one. It is an image of a student's marksheet. I need to automate the process of digitizing these marks into CSV format for the college's documentation. The code has to work on a PDF containing 60+ images like the one above, and I only need to extract the handwritten marks and add them to a CSV file.

For this purpose I need table recognition, or more precisely cell-wise recognition: I need to go through the cells column by column, since marks may be given in different places in a box (3a, 7c, etc.), and run OCR on each cell to get the digit from the handwritten number. Table recognition worked, but I still need to go through each cell and do OCR. Do you have any idea about custom training of Paddle or Tesseract? Or can you help me solve this problem?

xavctn commented 1 year ago

As indicated in the README, run pip install img2table[paddle] if you want to use PaddleOCR.
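Before retrying, it can help to confirm that the optional dependencies are importable at all. Below is a minimal sketch under stated assumptions: `missing_modules` and `extract_tables_with_paddle` are hypothetical helpers, the module names `paddle` and `paddleocr` are my assumption about what the `paddle` extra installs, and the `Image` / `PaddleOCR` / `extract_tables` calls follow the usage shown in the img2table README as I understand it, which may differ between versions.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def extract_tables_with_paddle(image_path):
    """Sketch of PaddleOCR-based extraction (assumed API, check the README)."""
    # Imported lazily so the availability check above works without them.
    from img2table.document import Image
    from img2table.ocr import PaddleOCR

    ocr = PaddleOCR(lang="en")
    doc = Image(src=image_path)
    # Each extracted table is assumed to expose its content as a dataframe via .df
    return [table.df for table in doc.extract_tables(ocr=ocr)]


if __name__ == "__main__":
    # Assumed module names for the optional dependencies.
    print("missing:", missing_modules(["img2table", "paddle", "paddleocr"]))
```

If the list is not empty, the `ImportError` above is most likely an installation issue rather than a bug in the calling code.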

As for Tesseract, pytesseract is just a Pythonic command-line wrapper for Tesseract and won't solve your issue. Download the relevant trained data models here and set the TESSDATA_PREFIX environment variable to point to the directory where the traineddata files are stored. However, Tesseract is not able to properly detect handwritten text, so I would advise you against using it.
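As an illustration of the TESSDATA_PREFIX step, here is a small sketch that checks a directory for traineddata files before pointing the variable at it. The helper names `available_languages` and `use_tessdata` are hypothetical, and the sketch assumes the variable should point at the directory that directly contains the `.traineddata` files, as described above; older Tesseract releases may expect a different layout.

```python
import os
from pathlib import Path


def available_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def use_tessdata(tessdata_dir, lang="eng"):
    """Set TESSDATA_PREFIX for this process if the requested language is present."""
    langs = available_languages(tessdata_dir)
    if lang not in langs:
        raise FileNotFoundError(
            f"{lang}.traineddata not found in {tessdata_dir}; found: {langs}"
        )
    # Only affects the current process and the subprocesses it starts.
    os.environ["TESSDATA_PREFIX"] = str(Path(tessdata_dir).resolve())
    return langs
```

Calling `use_tessdata` before running the OCR step makes a missing language file fail early with a readable message instead of a non-zero exit status from the CLI.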

With Paddle, that's what I get for your original image.

res

Regarding this issue, I will implement checks for the validity of the Tesseract installation when instantiating the TesseractOCR class.
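What such a check might look like is sketched below. This is only an illustration and not the implementation that ended up in img2table: `tesseract_status` is a hypothetical helper that looks for the executable on PATH and runs `tesseract --version`.

```python
import shutil
import subprocess


def tesseract_status(executable="tesseract"):
    """Return (ok, message) describing whether the Tesseract CLI can be run."""
    path = shutil.which(executable)
    if path is None:
        return False, f"'{executable}' was not found on PATH"
    try:
        proc = subprocess.run(
            [path, "--version"],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run {path}: {exc}"
    if proc.returncode != 0:
        return False, f"{path} --version exited with status {proc.returncode}"
    # The version banner may be written to stdout or stderr depending on the build.
    banner = (proc.stdout or proc.stderr).strip().splitlines()
    return True, banner[0] if banner else path


if __name__ == "__main__":
    ok, message = tesseract_status()
    print("Tesseract usable:", ok, "|", message)
```

Running the check once at construction time would turn an opaque `CalledProcessError` during extraction into an immediate, descriptive message.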