yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Unable to extract text from pdf/a (with flat decode) #329

Open celsowm opened 4 years ago

celsowm commented 4 years ago

Hi! I tried to extract text from this PDF with the following code:


require 'rubygems'
require 'pdf/reader'

filename = "pdfa.pdf"

PDF::Reader.open(filename) do |reader|
  reader.pages.each do |page|
    puts page.text
  end
end

But the result was something like:

                
    
                                                                                                                 
                                                                                  
                
  

                                                                                                                                  
                      

Is there any way to extract text from it?

yob commented 4 years ago

I get the same results when trying to extract text using pdf-reader.

I also tried extracting text with pdftotext (which uses libpoppler) and Firefox (which uses pdf.js). Neither of them worked.

I haven't checked the PDF contents in detail, but if poppler and pdf.js also have trouble, then I suspect it's a broken file.
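
A rough way to check whether a page contains any extractable text at all is to look for text-showing operators (Tj, TJ, ' and ") in its content stream. The sketch below is only a crude heuristic, not a diagnosis of this particular file: `text_operators?` and `report_pages` are hypothetical helper names that are not part of pdf-reader, and text drawn inside form XObjects would not show up in `page.raw_content`.

```ruby
# Hypothetical helper, not part of pdf-reader.
# Text-showing operators always follow a string, hex string, or array
# operand, so the operand ends with ")", ">" or "]".
TEXT_SHOW_OPERATORS = /[)>\]]\s*(?:Tj|TJ|'|")(?=\s|\z)/

# Returns true if the raw content stream appears to show any text.
def text_operators?(raw_content)
  raw_content.match?(TEXT_SHOW_OPERATORS)
end

# Hypothetical helper: prints, for each page, the font resource names
# and whether the content stream seems to contain text operators.
def report_pages(filename)
  require 'pdf/reader' # loaded lazily so text_operators? works without the gem
  PDF::Reader.open(filename) do |reader|
    reader.pages.each do |page|
      puts "page #{page.number}: fonts=#{page.fonts.keys.inspect}, " \
           "text operators=#{text_operators?(page.raw_content)}"
    end
  end
end
```

Calling `report_pages("pdfa.pdf")` would print one line per page. If a page lists no fonts and no text operators, its text may be drawn as images or vector outlines, in which case only OCR could recover it.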