Closed 004Ajay closed 1 year ago
Hi,
I tried to reproduce your error but without success. The error is raised when the Tesseract CLI is called through subprocess. I suspect it is linked to your Tesseract version or installation, as I have only tested the code with Tesseract 5.
To get a better understanding of what is going on, can you:
- run tesseract path_to_your_local_file stdout --psm 11 -l eng hocr in your terminal and check whether it is processed properly
- run tesseract --version and tesseract --list-langs
and report the output in this issue?
As a reference for this kind of error, I have found this documentation; I will see if it is helpful.
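The CalledProcessError in the report only shows the exit status, not what Tesseract printed. A minimal sketch for running the commands above from Python and keeping stderr could look like the following. The helper name run_and_report is hypothetical and is not part of img2table; it only uses the standard subprocess module.

```python
import subprocess


def run_and_report(cmd):
    """Run a command and return exit code, stdout and stderr without raising."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable could not be found (not installed or not on PATH).
        return {"found": False, "returncode": None, "stdout": "", "stderr": ""}
    return {
        "found": True,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # The two diagnostic commands requested in the comment above.
    for cmd in (["tesseract", "--version"], ["tesseract", "--list-langs"]):
        print(" ".join(cmd), "->", run_and_report(cmd))
```

A non-zero returncode together with the stderr text is usually the most useful thing to paste into the issue.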
FYI: on this kind of image, you should try Paddle instead of Tesseract. Tesseract fails to recognize the text, while I get the proper output dataframe with Paddle.
I did some more testing yesterday on Windows using Tesseract 4 and 5 and got no error with either version.
I tried using Paddle, but it also shows an error:
ImportError: cannot import name 'PaddleOCR' from 'img2table.ocr' (C:\Users\ASUS\AppData\Local\Programs\Python\Python310\lib\site-packages\img2table\ocr\__init__.py)
Did you get the output dataframe from my image or from another image? I planned to use pytesseract because I read online that Tesseract can be trained on custom images and the resulting model (traineddata) used for OCR.
I have been working on this problem for a month now, and one of the images is this one. It is an image of a student's marksheet; I need to automate digitizing these marks into CSV format for the college's records. The code has to work on a PDF containing 60+ images like the one above, and I only need to extract the handwritten marks and add them to a CSV file. For this I need table recognition, or more precisely cell-wise recognition: I need to go through the cells column by column, since marks may be written in different places within a box (3a, 7c, etc.), and run OCR on each cell to get the digit from the handwritten number. Table recognition worked, but I still need to go through each cell and run OCR. Do you have any idea about custom training of Paddle or Tesseract, or can you help me solve this problem?
As indicated in the README, run pip install img2table[paddle] if you want to use PaddleOCR.
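To check whether the optional dependency is actually present before importing the wrapper, a small sketch along these lines may help. It assumes the extra installs a package importable as paddleocr and that the ImportError above comes from that package being absent; both are assumptions, not something confirmed in this thread. The helper name has_module is hypothetical.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("paddleocr"):
        print("paddleocr found; the Paddle extra appears to be installed.")
    else:
        print("paddleocr not found; try: pip install img2table[paddle]")
```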
As for Tesseract, pytesseract is just a Pythonic command-line wrapper for Tesseract and won't solve your issue.
Download the relevant trained data models here and set the TESSDATA_PREFIX environment variable to the directory where the traineddata files are stored.
However, Tesseract is not able to properly detect handwritten text, so I would advise against using it.
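For the TESSDATA_PREFIX step, a minimal sketch of setting the variable from Python for the current process (and any Tesseract subprocess it starts) might look like this. The function names are hypothetical, and the directory layout (one <lang>.traineddata file per language directly in the folder) is assumed from the description above.

```python
import os
from pathlib import Path


def available_languages(tessdata_dir):
    """List language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def point_tesseract_at(tessdata_dir, lang="eng"):
    """Set TESSDATA_PREFIX for this process if the model for `lang` is present."""
    if lang not in available_languages(tessdata_dir):
        raise FileNotFoundError(f"{lang}.traineddata not found in {tessdata_dir}")
    # Child processes started afterwards inherit this environment variable.
    os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
```

Failing early when the model file is missing gives a clearer message than a non-zero exit status from the CLI.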
With Paddle, that's what I get for your original image.
On this issue, I will implement checks for the validity of the Tesseract installation when instantiating the TesseractOCR class.
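As an illustration only, a pre-flight check of this kind could be as simple as looking the executable up on PATH. This is a sketch and not the check that img2table implements; the function name check_tesseract is hypothetical.

```python
import shutil


def check_tesseract(executable="tesseract"):
    """Return the resolved path of the executable, or raise if it cannot be found."""
    path = shutil.which(executable)
    if path is None:
        raise EnvironmentError(
            f"'{executable}' was not found on PATH; check the Tesseract installation."
        )
    return path
```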
I tried to extract a table from an image into a dataframe, but an error pops up:
CalledProcessError: Command 'tesseract C:\Users\ASUS\AppData\Local\Temp\tmpk2lb97t2.jpg stdout --psm 11 -l eng hocr' returned non-zero exit status 1.
I searched for solutions but nothing worked. Can anyone please help solve this?
Code used
image used