Proposal/Info on PDF errors -> Reprocess with Ghostscript (Info/Doc)

zealot128 commented 3 years ago

Hello, thank you for this great library! We are using it to extract text from a bunch of applicants' documents for an ATS (applicant tracking system).

So far, about 5-10% of documents produce errors when parsing. You might be surprised what kinds of PDFs are produced by different software vendors all over the world...

Common Errors are, e.g.:

FloatDomainError, PDF::Reader::MalformedPDFError, ArgumentError

Most of the time, this can be fixed by reprocessing the file with Ghostscript, e.g.:

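require 'open3'
require 'tempfile'
require 'pdf-reader'

# Try pdf-reader first; on a known parse error, repair with Ghostscript and retry.
def read(pdf)
  analyze_with_pdf_reader(pdf)
rescue PDF::Reader::MalformedPDFError, FloatDomainError, ArgumentError => exception
  reprocess_with_ghostscript(pdf, exception) do |tf|
    analyze_with_pdf_reader(tf)
  end
end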

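# Reprocess the file with Ghostscript; if that fails, raise the original exception.
def reprocess_with_ghostscript(pdf, original_exception)
  tf = Tempfile.new(['repair', '.pdf'])
  tf.binmode
  # Pass the arguments as a list so that paths containing spaces or shell
  # metacharacters are not interpreted by a shell.
  _stdout, _stderr, status = Open3.capture3(
    'gs', '-o', tf.path, '-sDEVICE=pdfwrite', '-dPDFSETTINGS=/prepress', pdf.path
  )
  if status.success? && File.size(tf.path) > 0
    tf.rewind
    yield(tf)
  else
    raise original_exception
  end
rescue PDF::Reader::MalformedPDFError => e
  # Raised by the retry inside the block: check for a macOS metadata file.
  if e.message == "PDF does not contain EOF marker" && File.binread(pdf.path).include?('Mac OS')
    raise StandardError.new("Is a MacOS Meta File")
  else
    raise e
  end
ensure
  tf&.close! # close and delete the temporary file
end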

def analyze_with_pdf_reader(pdf)
  reader = PDF::Reader.new(pdf)
  page_content = reader.pages.each_with_index.map do |page, index|
    page.text + "\n\n#<--PAGE BREAK #{index + 1}\n\n"
  rescue FloatDomainError, PDF::Reader::MalformedPDFError
    # If an individual page raises, just skip it; usually it is only scanned content.
    nil
  end
  # Drop skipped pages and strip NUL characters from the extracted text.
  page_content.compact.join("\n").gsub("\u0000", '')
end

This practice has been very successful for us and might be helpful for other people using this library. Maybe it is worth adding to the README (or providing a self-repair flag in the future).
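
When shelling out to gs, uploaded file names can contain spaces or shell metacharacters, so it may be safer to hand the command over as an argument list instead of a single string. A minimal sketch, assuming the same Ghostscript flags as in the snippet above; `ghostscript_repair_command` is a hypothetical helper name, not part of pdf-reader or Ghostscript:

```ruby
# Hypothetical helper: build the Ghostscript repair command as an argument list,
# so no shell is involved and each path stays a single argument.
def ghostscript_repair_command(input_path, output_path)
  [
    'gs',
    '-o', output_path,          # output file; -o also implies -dBATCH -dNOPAUSE
    '-sDEVICE=pdfwrite',        # rewrite the document as a fresh PDF
    '-dPDFSETTINGS=/prepress',  # same preset as used in this thread
    input_path
  ]
end

cmd = ghostscript_repair_command('/tmp/my cv (final).pdf', '/tmp/repaired.pdf')
# Open3.capture3(*cmd) would then run gs without going through a shell.
```

Building the list separately from running it also makes the command easy to log or unit-test without Ghostscript installed.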

yob commented 2 years ago

My goal is for only two exceptions to appear when opening/reading a PDF: PDF::Reader::MalformedPDFError and PDF::Reader::UnsupportedFeatureError.

Anything else (like FloatDomainError and ArgumentError) is a bug that I'd like to fix.

I'm reluctant to add Ghostscript as a dependency (even if optional). If you're up for reporting the bugs, I'm keen to address them; otherwise, I encourage anyone who finds it useful to reprocess through Ghostscript (or another tool) in their own code before passing to pdf-reader.
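
To decide which failures are worth reporting, a caller could separate the two expected exception classes from everything else. A rough sketch under stated assumptions: `EXPECTED_PDF_ERRORS` and `triage_pdf_error` are hypothetical names, and classes are compared by name only so that the example runs without the gem loaded (real code would more likely rescue the classes directly):

```ruby
# Names of the two exception classes named above as the only expected ones.
EXPECTED_PDF_ERRORS = %w[
  PDF::Reader::MalformedPDFError
  PDF::Reader::UnsupportedFeatureError
].freeze

# Hypothetical helper: returns :expected for the two classes above (or their
# subclasses) and :possible_bug for anything else, such as FloatDomainError.
def triage_pdf_error(error)
  ancestor_names = error.class.ancestors.map(&:name)
  (ancestor_names & EXPECTED_PDF_ERRORS).empty? ? :possible_bug : :expected
end

triage_pdf_error(ArgumentError.new('bad value')) # => :possible_bug
```

Anything triaged as `:possible_bug` is a candidate for a bug report, ideally together with the file that triggered it.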

zealot128 commented 2 years ago

Totally fair! I only wanted to show my solution in case any other user has the same problems. We have probably processed about 50,000 documents, most of them PDFs, and this gem plus the Ghostscript script above has helped us analyze most of them successfully with only Ruby, so we no longer have to shell out to a Java JAR library.

yob / pdf-reader #339