yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Circular references in the Page Tree cause PDF::Reader to crash with `SystemStackError` #530

Open tomascco opened 9 months ago

tomascco commented 9 months ago

Attachment: Pages-tree-refs.pdf (source)

Running the following script against the attached PDF raises the following error:

require "bundler/inline"

gemfile do
  source "https://rubygems.org" # explicit source so Bundler can resolve the gem
  gem "pdf-reader"
end

PDF::Reader.new("Pages-tree-refs.pdf").pages
# /usr/local/bundle/gems/pdf-reader-2.12.0/lib/pdf/reader/reference.rb:65:in `hash': stack level too deep (SystemStackError)

This is caused by a circular reference among the Page Tree objects:

% ...
1 0 obj
  << /Type /Catalog
     /Pages 2 0 R
  >>
endobj

2 0 obj
  << /Type /Pages
     /Kids [6 0 R 3 0 R]
     /Count 2
     /MediaBox [0 0 595 842]
  >>
endobj

3 0 obj
  << /Type /Pages
     /Kids [4 0 R]
     /Count 1
     /MediaBox [0 0 595 842]
  >>
endobj

4 0 obj
  << /Type /Pages
     /Kids [5 0 R]
     /Count 1
     /MediaBox [0 0 595 842]
  >>
endobj

5 0 obj
  << /Type /Pages
     /Kids [3 0 R]
     /Count 1
     /MediaBox [0 0 595 842]
  >>
endobj
% ...

Here we can see that 2 0 R is the root, which has two children: 6 0 R and the problematic 3 0 R:

3 0 R --> 4 0 R --> 5 0 R --> 3 0 R <-- the cycle restarts here.
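
To illustrate the kind of guard that would avoid the infinite recursion, here is a minimal sketch of a depth-first page tree walk that tracks the objects on the current path. It does not use PDF::Reader's internals: the `OBJECTS` hash, `collect_pages`, and `CircularPageTreeError` are hypothetical names made up for this example, and object 6 is assumed to be a leaf page since it is not shown in the excerpt above.

```ruby
require "set"

# Hypothetical stand-in for the PDF object table: object number => dictionary.
# Mirrors the excerpt above; object 6 is ASSUMED to be a leaf /Page.
OBJECTS = {
  2 => { Type: :Pages, Kids: [6, 3] },
  3 => { Type: :Pages, Kids: [4] },
  4 => { Type: :Pages, Kids: [5] },
  5 => { Type: :Pages, Kids: [3] }, # points back to 3, closing the cycle
  6 => { Type: :Page },
}.freeze

# Hypothetical error class, not part of PDF::Reader.
class CircularPageTreeError < StandardError; end

# Depth-first walk returning the object numbers of leaf pages.
# `path` holds the object numbers between the root and the current node;
# meeting one of them again means the tree contains a cycle.
def collect_pages(objects, ref, path = Set.new)
  if path.include?(ref)
    trail = (path.to_a + [ref]).join(" -> ")
    raise CircularPageTreeError, "circular reference in page tree: #{trail}"
  end

  node = objects.fetch(ref)
  return [ref] if node[:Type] == :Page

  path.add(ref)
  pages = node.fetch(:Kids, []).flat_map { |kid| collect_pages(objects, kid, path) }
  path.delete(ref) # leaving this branch, so siblings may legitimately reuse the slot
  pages
end
```

Calling `collect_pages(OBJECTS, 2)` raises `CircularPageTreeError` instead of overflowing the stack. Tracking only the current path (rather than every visited object) is a design choice for this sketch; whether the library should raise, skip the offending kid, or do something else is an open question.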

I would like to take a shot at solving this. May I?

Context: I've been using PDF::Reader as a dependency of a gem created for my undergraduate thesis (https://github.com/tomascco/rubrik). As part of my research, I've tested PDF::Reader against some of the PDFs in the pdf.js repository (https://github.com/mozilla/pdf.js/tree/master/test/pdfs) and found several cases like this one.

I'd also like to give some feedback as someone who has used PDF::Reader as a dependency for a higher-level PDF interface.

Would these patches and suggestions be welcome? @yob

yob commented 9 months ago

Would these patches and suggestions be welcome?

absolutely!