zelon88 / HRConvert2

A self-hosted, drag-and-drop & nosql file conversion server & share tool that supports 445 file formats in 13 languages.
https://github.com/zelon88/HRConvert2
GNU General Public License v3.0

Can't OCR a pdf file #73

Open bit-man opened 5 months ago

bit-man commented 5 months ago

Uploading a PDF file and trying to OCR it (method: simple, format: txt) by pressing the Convert into Document button opens a new tab with the error Not Found, and no file is downloaded.


The Docker console shows this error:

172.17.0.1 - - [01/May/2024:20:26:37 +0000] "GET /HRProprietary/HRConvert2/DATA/856ca1146d63/7f10275ffce6/m1m2.txt HTTP/1.1" 404 489 "http://localhost:8080/HRProprietary/HRConvert2/convertCore.php?showFiles=1&gui=Default&language=en&color=blue" "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"

Tailing the txt log in the Logs folder shows:

Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Initiating Converter.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/b0806464b510/m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:36 pm, 856ca1146d63/b0806464b510: Verified file /DATA/HRConvert2/856ca1146d63/b0806464b510/m1m2.txt.
ERROR!!! May 1, 2024, 8:36 pm, HRConvert2-22, 856ca1146d63/b0806464b510: OCR Operation Failed!

bit-man commented 5 months ago

I tried to follow the code in convertCore.php, and the failing code seems to be at if (!in_array(strtolower($oldExtension), $pdf1array)). For my PDF this condition evaluates to false, so no conversion is attempted. That makes no sense to me, because according to the comment on the preceding line this is supposed to be the code that converts a PDF to a document.
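To make the behaviour of that guard concrete, here is a minimal sketch in Python rather than the project's PHP. It is an illustration only: the contents of the extension list and the function name `enters_conversion_branch` are assumptions, not taken from convertCore.php.

```python
# Hypothetical stand-in for $pdf1array; the real contents are not shown in this thread.
PDF_EXTENSIONS = ["pdf"]


def enters_conversion_branch(old_extension, pdf_extensions, negate=True):
    """Mimic `!in_array(strtolower($oldExtension), $pdf1array)`.

    With negate=True the branch runs only when the extension is NOT in the
    list. With negate=False it runs only when the extension IS in the list.
    """
    is_listed = old_extension.lower() in pdf_extensions
    return (not is_listed) if negate else is_listed


# A PDF upload skips the branch while the negation is in place...
print(enters_conversion_branch("PDF", PDF_EXTENSIONS))
# ...and enters it once the negation is removed.
print(enters_conversion_branch("PDF", PDF_EXTENSIONS, negate=False))
```

If the real list does contain "pdf", this matches what I observed: the branch is skipped for a PDF while the negation is in place.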

I removed the negation and a file is now downloaded, but it is empty :cry:. Still not working. The log output follows:

Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Initiating Converter.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: User selected to perform OCR on file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copying file m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Copied file m1m2.pdf.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Verified file /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR intermediate operation using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Converted file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.jpg to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Performing OCR final using method 0.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Renamed file /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.pdf to /var/www/html/HRProprietary/HRConvert2/DATA/856ca1146d63/1029442e5485/m1m2.txt.
Op-Act, May 1, 2024, 8:43 pm, 856ca1146d63/1029442e5485: Created a file at /DATA/HRConvert2/856ca1146d63/1029442e5485/m1m2.txt.

No time today to follow up. I will try over the weekend or some other time. Happy if anyone else can continue from here. I added this change to https://github.com/bit-man/HRConvert2 in case anyone wants to try a fix.

zelon88 commented 4 months ago

Sorry for the delayed response. Can you try the following?

sudo leafpad /etc/ImageMagick-6/policy.xml

Find and edit the following line.....

<policy domain="coder" rights="none" pattern="PDF" />

.....To.....

<policy domain="coder" rights="read|write" pattern="PDF" />

And let me know the result.
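If it helps to check the policy without opening an editor, the following is a rough Python sketch that reads a policy document and reports the rights granted to the PDF coder. The helper names `pdf_coder_rights` and `pdf_read_write_allowed` and the sample string are made up for illustration. It assumes the pattern attribute is either a single name or a brace-wrapped, comma-separated list, and it treats a missing rule as "not blocked", which I believe matches ImageMagick's default, but treat that as an assumption.

```python
import xml.etree.ElementTree as ET

# Sample policy text for illustration; a real file lives at a path such as
# /etc/ImageMagick-6/policy.xml and contains many more entries.
SAMPLE_POLICY = """<policymap>
  <policy domain="coder" rights="none" pattern="PDF" />
</policymap>"""


def pdf_coder_rights(policy_xml):
    """Return the rights string of the last coder policy matching PDF, or None."""
    root = ET.fromstring(policy_xml)
    rights = None
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # Assumed pattern forms: "PDF" or "{PS,PDF,XPS}".
        pattern = policy.get("pattern", "")
        names = [name.strip().upper() for name in pattern.strip("{}").split(",")]
        if "PDF" in names:
            rights = policy.get("rights")
    return rights


def pdf_read_write_allowed(policy_xml):
    """True when no PDF coder rule exists or it grants both read and write."""
    rights = pdf_coder_rights(policy_xml)
    if rights is None:
        return True  # assumption: no explicit rule means the coder is not blocked
    granted = {part.strip().lower() for part in rights.split("|")}
    return {"read", "write"} <= granted


print(pdf_coder_rights(SAMPLE_POLICY))
print(pdf_read_write_allowed(SAMPLE_POLICY))
```

To check a real installation, read the file contents (for example with `open("/etc/ImageMagick-6/policy.xml").read()`) and pass them to the same functions.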

zelon88 commented 3 months ago

I'm not satisfied with the OCR performance on PDF files lately myself. I've known for some time that the OCR functions need to be refactored. I'm sure this is mentioned in CHANGELOG.txt several times.

Look for a refactor of the OCR related functions hopefully before v3.4 comes out. This is some of the oldest code left in the codebase today. Most of it pre-dates the v2.7 Valkyre -> Diablo re-write.