Reader: LZW: create_new_string: Coerce nil string_table code to string

bcoles commented 2 years ago

I was looking through some of the crashes identified by the fuzzer and this looked like an easy fix, but relying on your experience with PDF format to confirm this is a reasonable fix.

Perhaps LZW decompression should be a fatal error. However, other PDF readers will at least render the document, even if it is clearly broken:

20220417004840082938521_crash_346.pdf

crashes/20220417004840082938521_crash_346.pdf.trace-1-undefined method `+' for nil:NilClass
crashes/20220417004840082938521_crash_346.pdf.trace:2:/var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/lzw.rb:123:in `create_new_string'

$ ./tools/read-pdf.rb crashes/20220417004840082938521_crash_346.pdf
[*] Processing 'crashes/20220417004840082938521_crash_346.pdf'
[+] Processing complete
[*] Parsing 'crashes/20220417004840082938521_crash_346.pdf'
[*] Version: 1.4
[*] Info: {:Creator=>"pdftk 1.41 - www.pdftk.com", :Producer=>"iText 2.1.7 by 1T3XT", :ModDate=>"D:20101113061857-06'00'", :CreationDate=>"D:20101113061857-06'00'"}
[*] Metadata: 
[*] Objects: <PDF::Reader::ObjectHash size: 8>
[*] Pages: 1
[*] Parsing PDF contents...
Traceback (most recent call last):
    16: from ./tools/read-pdf.rb:109:in `<main>'
    15: from ./tools/read-pdf.rb:59:in `read'
    14: from ./tools/read-pdf.rb:81:in `parse'
    13: from ./tools/read-pdf.rb:81:in `each'
    12: from ./tools/read-pdf.rb:82:in `block in parse'
    11: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:115:in `text'
    10: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:159:in `walk'
     9: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:169:in `raw_content'
     8: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:169:in `map'
     7: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/page.rb:170:in `block in raw_content'
     6: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/stream.rb:64:in `unfiltered_data'
     5: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/stream.rb:64:in `each_with_index'
     4: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/stream.rb:64:in `each'
     3: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/stream.rb:65:in `block in unfiltered_data'
     2: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/filter/lzw.rb:18:in `filter'
     1: from /var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/lzw.rb:101:in `decode'
/var/lib/gems/2.7.0/gems/pdf-reader-2.9.2/lib/pdf/reader/lzw.rb:123:in `create_new_string': undefined method `+' for nil:NilClass (NoMethodError)

After this PR the page text contains PDF data. This is in line with behaviour of other PDF readers.

$ ./tools/read-pdf.rb crashes/20220417004840082938521_crash_346.pdf
[*] Processing 'crashes/20220417004840082938521_crash_346.pdf'
[+] Processing complete
[*] Parsing 'crashes/20220417004840082938521_crash_346.pdf'
[*] Version: 1.4
[*] Info: {:Creator=>"pdftk 1.41 - www.pdftk.com", :Producer=>"iText 2.1.7 by 1T3XT", :ModDate=>"D:20101113061857-06'00'", :CreationDate=>"D:20101113061857-06'00'"}
[*] Metadata: 
[*] Objects: <PDF::Reader::ObjectHash size: 8>
[*] Pages: 1
[*] Parsing PDF contents...
[*] Page text
%PDF-1.2
%âãÏÓ

10 0 obj
<<
/OOOngth 11 0 R

/Filter /OZWDecode
>>
stream
 €▯Š€Ñ˜Ôf.▯
[*] Page fonts
{:F0=>{:FirstChar=>32, :Encoding=>:WinAnsiEncoding, :Name=>:F0, :FontDescriptor=>{:FontName=>:CourierNew, :StemV=>109, :Leading=>133, :Ascent=>833, :Flags=>34, :XHeight=>417, :AvgWidth=>600, :Descent=>-300, :ItalicAngle=>0, :StemH=>109, :MissingWidth=>595, :"MaxW^dth"=>595, :FontBBox=>[-250, -300, 714, 1000], :Type=>:FontDescriptor, :CapHeight=>833}, :BaseFont=>:CourierNew, :Subtype=>:TrueType, :LastChar=>255, :Widths=>[595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595, 595], :Type=>:Font}}
[*] Page raw_content
ET031) Tj(\321\230\324f.) Tj0 0 obj) Tj
[+] Parsing complete

yob commented 2 years ago

As a first pass at avoiding the crash, I think I'd prefer to go for strictness and raise a MalformedPDFError (see #454).

It's an interesting question though. My typical approach has been to conform to the common behaviour of other readers in most things.... other than in quietly skipping isolated invalid parts of a file that don't prevent the rest of the file being parsed. Maybe I should rethink that approach?

bcoles commented 2 years ago

It's an interesting question though. My typical approach has been to conform to the common behaviour of other readers in most things.... other than in quietly skipping isolated invalid parts of a file that don't prevent the rest of the file being parsed. Maybe I should rethink that approach?

Strictness is likely the best approach.

I didn't bother to look into the root cause of this specific issue. It depends on whether there is recoverable data that is lost/obscured.

As pdf-reader is for content parsing, more so than rendering, it is unreasonable to compare use cases with the traditional liberal parsing employed by PDF viewers which attempt to perform a best-effort render of the document no matter what.

When leveraging a PDF parsing library, I'd like the option to access malformed PDF data that has been skipped even if it isn't included by default. For example:

I don't consider invalid fonts to be fatal (#453)
Elsewhere on the issue tracker there is discussion about discarding text rendered in tiny font. That's the kind of thing that I'd like to at least know about if it exists, and have some way to access it, even if it isn't included in the page_text by default.

Admittedly this does not fit any of my current use cases.

yob / pdf-reader

Reader: LZW: create_new_string: Coerce nil string_table code to string #449