yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License
1.81k stars 271 forks source link

TypeOneOrThree: glyph_width: Return font missing_width if font is empty #448

Closed bcoles closed 2 years ago

bcoles commented 2 years ago

I was looking through some of the crashes identified by the fuzzer and this looked like an easy fix, but relying on your experience with PDF format to confirm this is a reasonable fix.

It looks like a nil check was overlooked in this file as the same code pattern is used correctly here:

https://github.com/yob/pdf-reader/blob/a791d4159b022cc34c21eaf1dcae7ca43bf3aa4a/lib/pdf/reader/width_calculator/true_type.rb#L33


20220417003603745582184_crash_658.pdf

crashes/20220417003603745582184_crash_658.pdf.trace:1:undefined method `<=' for nil:NilClass
crashes/20220417003603745582184_crash_658.pdf.trace-2-/var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/width_calculator/type_one_or_three.rb:26:in `glyph_width'
./tools/read-pdf.rb crashes/20220417003603745582184_crash_658.pdf

[...]

[*] Parsing PDF contents...
Traceback (most recent call last):
    20: from ./tools/read-pdf.rb:109:in `<main>'
    19: from ./tools/read-pdf.rb:59:in `read'
    18: from ./tools/read-pdf.rb:81:in `parse'
    17: from ./tools/read-pdf.rb:81:in `each'
    16: from ./tools/read-pdf.rb:82:in `block in parse'
    15: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:115:in `text'
    14: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:159:in `walk'
    13: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:254:in `content_stream'
    12: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:267:in `callback'
    11: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:267:in `each'
    10: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:268:in `block in callback'
     9: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/validating_receiver.rb:171:in `show_text'
     8: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/validating_receiver.rb:258:in `call_wrapped'
     7: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page_text_receiver.rb:83:in `show_text'
     6: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page_text_receiver.rb:129:in `internal_show_text'
     5: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page_text_receiver.rb:129:in `each_with_index'
     4: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page_text_receiver.rb:129:in `each'
     3: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page_text_receiver.rb:138:in `block in internal_show_text'
     2: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/font.rb:82:in `glyph_width_in_text_space'
     1: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/font.rb:74:in `glyph_width'
/var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/width_calculator/type_one_or_three.rb:26:in `glyph_width': undefined method `<=' for nil:NilClass (NoMethodError)

After this PR, read-pdf progresses further, parses the fonts, and crashes later in raw_content with [-] Could not parse PDF 'crashes/20220417003603745582184_crash_658.pdf': PDF is malformed.

bcoles commented 2 years ago

Thanks!

:+1:

My hope has been that gradually adopting sorbet typing will help flush this kind of silly bug out It's taking a long time for the effort to pay that dividend though.

I haven't looked into the sorbet implementation - or most changes to the code base since 2019 for that matter.

I'm sure it will be worth it in the end. As you suggest, a huge proportion of the bugs caught by the dumb fuzzer are type related.

I'm also sure there will be all sorts of weird and wonderful bugs to be discovered once robust type checking has been implemented.

without needing an infinite number of sample PDFs and test coverage.

He who willingly parses PDF files, willingly forgoes sanity.

yob commented 2 years ago

He who willingly parses PDF files, willingly forgoes sanity

Where were you 10 years ago? ;)

bcoles commented 2 years ago

He who willingly parses PDF files, willingly forgoes sanity

Where were you 10 years ago? ;)

Probably thinking "Better you than me, I'll come back and fuzz this in a few years."