zealot128 closed this 2 years ago
My goal is for only two exceptions to appear when opening/reading a PDF: PDF::Reader::MalformedPDFError and PDF::Reader::UnsupportedFeatureError.
Anything else (like FloatDomainError and ArgumentError) is a bug that I'd like to fix.
I'm reluctant to add Ghostscript as a dependency (even if optional). If you're up for reporting the bugs, I'm keen to address them; otherwise, I encourage anyone who finds it useful to reprocess through Ghostscript (or another tool) in their own code before passing the file to pdf-reader.
Totally fair! I only wanted to show my solution to this problem in case other users run into the same issues. We have probably processed about 50,000 documents, most of them PDFs, and this gem plus the Ghostscript script above has let us analyze most of them successfully with only Ruby, so we no longer have to shell out to a Java jar library.
Hello, thank you for this great library! We are using it to extract text from applicants' documents for an ATS (applicant tracking system).
So far, about 5-10% of documents produce errors when parsing. You might be surprised what kinds of PDFs are produced by different software vendors all over the world...
Common Errors are, e.g.:
Most of the time this can be fixed by reprocessing the file with Ghostscript:
e.g.:
This practice has been very successful and might be helpful for other people using this library. Maybe it is worth adding to the README (or providing a self-repair flag in the future).
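For anyone who wants to try the reprocess-then-retry idea, here is a minimal sketch. It assumes Ghostscript is installed and on the PATH as `gs`, and that the pdf-reader gem is available. The helper names (`ghostscript_command`, `repair_pdf`, `extract_text`, `extract_text_with_repair`) are hypothetical, and this is not necessarily the script originally posted in this thread. It tries pdf-reader first and only shells out to Ghostscript when parsing raises.

```ruby
require "open3"
require "tempfile"

# Hypothetical helper: builds the Ghostscript argv that rewrites a PDF
# through the pdfwrite device. Assumes the binary is named "gs".
def ghostscript_command(input_path, output_path)
  ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
   "-sDEVICE=pdfwrite", "-sOutputFile=#{output_path}", input_path]
end

# Hypothetical helper: runs Ghostscript and raises if it exits non-zero.
def repair_pdf(input_path, output_path)
  _stdout, stderr, status = Open3.capture3(*ghostscript_command(input_path, output_path))
  raise "Ghostscript failed: #{stderr}" unless status.success?
  output_path
end

# Extracts the text of every page with pdf-reader.
def extract_text(path)
  require "pdf/reader" # loaded lazily so the other helpers work without the gem
  PDF::Reader.new(path).pages.map(&:text).join("\n")
end

# Tries pdf-reader on the original file; on any StandardError (which covers
# MalformedPDFError as well as stray errors such as ArgumentError), rewrites
# the file with Ghostscript into a temp file and retries once.
def extract_text_with_repair(path, extract: method(:extract_text), repair: method(:repair_pdf))
  extract.call(path)
rescue StandardError
  Tempfile.create(["repaired", ".pdf"]) do |tmp|
    tmp.close # Ghostscript writes to the path itself
    repair.call(path, tmp.path)
    extract.call(tmp.path)
  end
end
```

Usage would be something like `text = extract_text_with_repair("resume.pdf")`. If the second attempt also fails, the error propagates, which is probably a good point to log the file and report the bug upstream.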