yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Getting mangled characters / mixed words #351

Closed MJCune1 closed 3 years ago

MJCune1 commented 3 years ago

Hi,

I’m trying to extract text with the latest version of the gem (2.4.2), but I’m getting mangled characters and mixed-up words. As a workaround I re-exported the file to PDF, and the re-exported copy is read correctly by the gem. It would be great to avoid this re-export step for users.

Result from the original file:

[2] pry(main)> reader.pages.first.text
=> "0     8.224          60.576    537.500     RODRIGUEZ CR. Z9.99NA -9L JHON\n\n"

If I export the file to PDF again, I get this result:

[4] pry(main)> reader.pages.first.text
=> "99.999.999-9 RODRIGUEZ PEREZ RONAL JHON      537.500        60.576        8.224"

I got a hint from the general information the OS shows for each PDF: the encoding software for the original file is iText 4.2.0 by 1T3XT, and for the re-exported one it is macOS Version 11.3.1 (Build 20E241) Quartz PDFContext. I’m not sure yet whether that is related, because I’ve checked the encoding of the text extracted from both files and it’s UTF-8 in both cases.
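
A note on the UTF-8 check: it probably can't tell the two files apart. To my understanding the gem converts all extracted text to UTF-8, so the encoding of the returned string says nothing about how the glyphs were stored or ordered inside the PDF. The sketch below uses only the two strings pasted above; `encoding_summary` is a hypothetical helper for illustration, not part of pdf-reader.

```ruby
# The two strings reported in this issue, copied from the pry output above.
mangled  = "0     8.224          60.576    537.500     RODRIGUEZ CR. Z9.99NA -9L JHON\n\n"
reexport = "99.999.999-9 RODRIGUEZ PEREZ RONAL JHON      537.500        60.576        8.224"

# Hypothetical helper (not a pdf-reader API): report the Ruby-level
# encoding of a string and whether its bytes are valid in that encoding.
def encoding_summary(text)
  { encoding: text.encoding.name, valid: text.valid_encoding? }
end

mangled_summary  = encoding_summary(mangled)
reexport_summary = encoding_summary(reexport)

puts mangled_summary
puts reexport_summary

# Both strings are valid UTF-8, so this check cannot distinguish the
# mangled extraction from the correct one.
puts mangled_summary == reexport_summary
```

Both summaries come out identical, which suggests the mangling happens earlier, while the page content is being decoded and laid out, and not in the final string encoding.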

yob commented 3 years ago

Hi @MJCune1,

Unfortunately debugging the issue requires a copy of the PDF. Are you able to share it, possibly via email to me directly if it's sensitive? My personal email can be found on my website, via the URL on my GitHub profile.

MJCune1 commented 3 years ago

Thanks! I’ve just sent a copy to your email.

yob commented 3 years ago

Thanks @MJCune1. You're in luck - the sample PDF you provided looks like it's parsed correctly by the fix in #350 that I merged just a few days ago. Spooky timing :ghost:

I haven't published a release with that fix yet, but I hope to do so soon. Are you able to load the gem via git in the meantime with this in your Gemfile?

gem 'pdf-reader', github: 'yob/pdf-reader'

MJCune1 commented 3 years ago

Thanks!