yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

page_count undefined method `[]' for nil:NilClass #76

Open sandsfish opened 11 years ago

sandsfish commented 11 years ago

In v1.3, v1.2, v1.0, when I run the code to iterate through all pages:

pdf_textfile = File.open('aero_text.txt', 'w')
reader.pages.each do |page|
  pdf_textfile << page.text
end
pdf_textfile.close

I get the output:

gems/pdf-reader-1.0.0/lib/pdf/reader.rb:138:in `page_count': undefined method `[]' for nil:NilClass (NoMethodError)
    from /Users/sands/.rvm/gems/ruby-1.9.2-p320@newdev/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:224:in `pages'
    from pdf2text.rb:11:in `<main>'

This suggests that the pages hash is nil for some reason in page_count, in reader.rb:

def page_count
  pages = @objects.deref(root[:Pages])
  @page_count ||= pages[:Count]
end

The reader initializes on the PDF file correctly, because I can call reader.pdf_version and it reports back fine. But getting to the page level (on OS X 10.8.2) simply doesn't work for this PDF, and the error message gives no clue as to why.
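To illustrate the failure mode described above, here is a minimal, self-contained sketch. FakeObjects, SketchReader and MissingPageTreeError are hypothetical stand-ins, not PDF::Reader classes. The sketch only assumes that deref can return nil when the page tree object cannot be resolved, and shows how a guard could turn the bare NoMethodError into a more descriptive one.

```ruby
# Hypothetical stand-in for the object store: deref returns nil for
# any reference it cannot resolve (an assumption made for this sketch).
class FakeObjects
  def initialize(table)
    @table = table
  end

  def deref(ref)
    @table[ref]
  end
end

# Hypothetical error class, used only in this sketch.
class MissingPageTreeError < StandardError; end

class SketchReader
  def initialize(objects, root)
    @objects = objects
    @root = root
  end

  # Same shape as the method quoted above: raises NoMethodError when
  # deref returns nil, because nil does not respond to [].
  def page_count
    pages = @objects.deref(@root[:Pages])
    @page_count ||= pages[:Count]
  end

  # Guarded variant: fails with a message that names the missing object.
  def checked_page_count
    pages = @objects.deref(@root[:Pages])
    unless pages.respond_to?(:[])
      raise MissingPageTreeError, "could not resolve /Pages (#{@root[:Pages].inspect})"
    end
    @page_count ||= pages[:Count]
  end
end

good   = SketchReader.new(FakeObjects.new({ :ref1 => { :Count => 3 } }), { :Pages => :ref1 })
broken = SketchReader.new(FakeObjects.new({}), { :Pages => :ref1 })

puts good.page_count # prints 3

begin
  broken.page_count
rescue NoMethodError => e
  puts "unguarded: #{e.class}"
end

begin
  broken.checked_page_count
rescue MissingPageTreeError => e
  puts "guarded: #{e.message}"
end
```

Whether the real object store returns nil here, and why, is exactly what the report cannot tell from the message alone; the guard only makes that gap visible.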

Cheers,

Sands Fish

yob commented 11 years ago

Thanks for the report.

To understand the cause I'd really have to see the problem PDF. Are you able to share it with me via email (james@yob.I'd.au)?

sandsfish commented 11 years ago

James, does your email address have a single-quote character in it? Gmail doesn't like it. Will send the PDF once I can.

-S

yob commented 11 years ago

Damn you autocorrect. My address is james@yob.id.au

yob commented 11 years ago

Thanks for the file. If we can discover the underlying issue I'll manually create a new file for a test case and delete your sample.

When I use the pdf_text binary to try and trigger the same issue you're getting, I see a different exception.

⚡ pdf_text foo.pdf
/home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/filter/flate.rb:34:in `rescue in filter': Error occured while inflating a compressed stream (Zlib::DataError: invalid distance too far back) (PDF::Reader::MalformedPDFError)
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/filter/flate.rb:17:in `filter'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/stream.rb:63:in `block in unfiltered_data'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/stream.rb:62:in `each'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/stream.rb:62:in `each_with_index'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/stream.rb:62:in `unfiltered_data'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/object_stream.rb:11:in `initialize'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/object_hash.rb:86:in `new'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/object_hash.rb:86:in `[]'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader/object_hash.rb:97:in `object'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader.rb:138:in `page_count'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/lib/pdf/reader.rb:225:in `pages'
    from /home/jh/.gem/ruby/1.9.1/gems/pdf-reader-1.3.0/bin/pdf_text:11:in `<top (required)>'
    from /home/jh/.gem/ruby/1.9.1/bin/pdf_text:23:in `load'
    from /home/jh/.gem/ruby/1.9.1/bin/pdf_text:23:in `<main>'

Do you get anything like that or always the nil exception? What version of ruby are you running?
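For readers unfamiliar with this class of failure: the trace above shows a Zlib::DataError being re-raised as PDF::Reader::MalformedPDFError. Below is a minimal sketch of that wrap-and-re-raise pattern using only Ruby's standard zlib library. MalformedStreamError and inflate_stream are hypothetical names chosen for illustration; this is not the library's flate.rb.

```ruby
require 'zlib'

# Hypothetical error class standing in for a "malformed input" error.
class MalformedStreamError < StandardError; end

# Inflate a zlib-compressed string, converting low-level zlib failures
# into one application-level error that keeps the original class and message.
def inflate_stream(data)
  Zlib::Inflate.inflate(data)
rescue Zlib::Error => e
  raise MalformedStreamError, "could not inflate stream (#{e.class}: #{e.message})"
end

intact  = Zlib::Deflate.deflate("BT /F1 12 Tf (Hello) Tj ET")
damaged = "this is not a zlib stream"

puts inflate_stream(intact)

begin
  inflate_stream(damaged)
rescue MalformedStreamError => e
  puts e.message
end
```

Keeping the underlying class and message inside the wrapper is what makes a report like the one above useful: the reader can see that the stream data itself was rejected by zlib.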

sandsfish commented 11 years ago

James, always this one. Version info below...

sands$ ruby pdf2text.rb aeronautics-gravity-reducing-propulsion.pdf
PDF Version : 1.6
/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:138:in `page_count': undefined method `[]' for nil:NilClass (NoMethodError)
    from /Users/sands/.rvm/gems/ruby-1.9.2-p320@newdev/gems/pdf-reader-1.0.0/lib/pdf/reader.rb:224:in `pages'
    from pdf2text.rb:11:in `<main>'

sands$ ruby -v
ruby 1.9.2p320 (2012-04-20 revision 35421) [x86_64-darwin12.2.0]

sands$ gem list | grep pdf
pdf-reader (1.0.0)
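Note that the session above reports pdf-reader 1.0.0, while the maintainer's trace comes from pdf-reader 1.3.0. As a hedged sketch, the same environment details can be collected from inside Ruby, which confirms what the script actually loaded rather than what is installed. environment_report is a hypothetical helper name; Gem.loaded_specs is part of RubyGems and only lists gems that have already been activated.

```ruby
require 'rubygems'

# Collect the interpreter and gem versions relevant to a bug report.
# The :version entry is nil if the named gem has not been activated
# in this process (for example, because it was never required).
def environment_report(gem_name)
  spec = Gem.loaded_specs[gem_name]
  {
    :ruby     => "#{RUBY_VERSION}p#{RUBY_PATCHLEVEL}",
    :platform => RUBY_PLATFORM,
    :gem      => gem_name,
    :version  => spec && spec.version.to_s
  }
end

environment_report('pdf-reader').each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

Calling this at the end of pdf2text.rb, after require 'pdf-reader', would show which gem version the failing run used.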


yob commented 11 years ago

Can you paste the contents of pdf2text.rb?

sandsfish commented 11 years ago

Note that I'm not clear on how to access the page content for the aggregation I'm attempting, but it errors out before it gets there, so it's moot for now.

require 'pdf-reader'

reader = PDF::Reader.new("aeronautics-gravity-reducing-propulsion.pdf")

puts "PDF Version : #{reader.pdf_version}"

pdf_textfile = File.open('aero_text.txt', 'w')

reader.pages.each do |page|
    pdf_textfile << page.text   # or page.raw_content ?
end

pdf_textfile.close
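A possible variant of the script above, offered as a sketch rather than a fix: it wraps the extraction in a helper that reports the exception class, message and the first few backtrace frames, and it is meant to be used with the block form of File.open so the output file is closed even when reader.pages raises. dump_text is a hypothetical helper name; it only assumes the reader responds to pages and each page to text, as in the script above.

```ruby
# Write the text of every page to io. Returns true on success; on any
# error, prints a short diagnostic to stderr and returns false.
# reader is anything that responds to #pages, each page to #text.
def dump_text(reader, io)
  reader.pages.each { |page| io << page.text }
  true
rescue StandardError => e
  warn "text extraction failed: #{e.class}: #{e.message}"
  warn e.backtrace.first(5).join("\n") if e.backtrace
  false
end

# Intended use, assuming the pdf-reader gem is installed:
#
#   require 'pdf-reader'
#   reader = PDF::Reader.new("aeronautics-gravity-reducing-propulsion.pdf")
#   File.open('aero_text.txt', 'w') do |file|
#     exit 1 unless dump_text(reader, file)
#   end
```

This does not address the underlying nil; it only makes the failure easier to report.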


yob commented 11 years ago

Unfortunately I can't reproduce this error on my system, so I can't fix it. I'll leave the ticket open in case I have a flash of inspiration.

sorry!

sandsfish commented 11 years ago

Ah, that's too bad. Maybe I can find another system to attempt it on and rule out a part of the stack that might be at fault.