yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License
1.81k stars 271 forks source link

Crash with no implicit conversion of nil into String (TypeError) #366

Closed sebbASF closed 2 years ago

sebbASF commented 2 years ago

I have seen a crash at: https://github.com/yob/pdf-reader/blob/6cc5c6663eb83c20ce5c43c06b3db716829ba969/lib/pdf/reader/buffer.rb#L238 no implicit conversion of nil into String (TypeError)

I think this is caused by 'chr = @io.read(1)' at line 235 which can return null at EOF. Sorry, cannot share the PDF which caused this as it contains private info.

sebbASF commented 2 years ago

Looks like the code is trying to find the EI operator for the end of some inline data. The file does have an EI string, but it is not at the start of a line. It is at the end of a line and just before the Q operator which is on its own line.

yob commented 2 years ago

Interesting.

I can manually generate a file that raises that error with prawn like this:

Prawn::Document.generate("hello.pdf") do
  text "This page has an invalid inline image"
  add_content("\n")
  add_content("BI")
  add_content("/W 16")
  add_content("/H 16")
  add_content("/BPC 1")
  add_content("/IM true")
  add_content("ID \xFEEI")
end

The relevant part of the content stream is:

BI
/W 16
/H 16
/BPC 1
/IM true
ID EI
Q

It never matches the first part of this conditional (the buffer doesn't start with white space or a NULL byte): https://github.com/yob/pdf-reader/blob/ecc6ffe22bef6f155526ca57de61d33ce9c2c0a0/lib/pdf/reader/buffer.rb#L233

In your file, what are the first 2-3 bytes after ID?

sebbASF commented 2 years ago

ID is followed by LF and some other bytes: I D lf x . . . . 49 44 0A 78 C3 BA C3 8F However, I don't think that is relevant.

The problem here is that the trailing EI does not follow whitespace or NUL.

It's not clear to me from the spec whether that is a requirement. Also I think there must be some other rules at play here, because in theory the data could contain \sEI several times. The failing PDF does not, but it does contain the pair EI several times. The final EI in the PDF is followed by EOL, but again that sequence could occur naturally in the data.

Assuming that there can only be one ID in a single stream, maybe a reverse search for EI would work on more PDFs? There's unlikely to be much trailing content.

It would be helpful to have some more examples.

Note: the failing PDF has the following entries /Producer (iLovePDF) /ModDate (D:20211005201753Z)

It consists of two pages, the first is text-based, and the second is a large scanned image.

yob commented 2 years ago

Interesting.

Do you think you could use a prawn fragment like mine to construct a public-safe sample file that triggers the bug? The actual bytes in the ID stream can be garbage, we don't need the image to be something sensible for a useful parsing test case.

sebbASF commented 2 years ago

This causes a crash for me:

Prawn::Document.generate("invalidEI.pdf") do
  text "This page has an invalid inline image"
  add_content("\n")
  add_content("BI")
  add_content("/W 16")
  add_content("/H 16")
  add_content("/BPC 1")
  add_content("/IM true")
  add_content("ID")
  add_content("\xFEabcdefgIEI") # EI can occur anywhere in data
  add_content("\xFEabcdefgIEI")
  add_content("\xFEabcdefgIEI")
end
yob commented 2 years ago

It definitely doesn't seem ideal that the logic here can so easily loop endlessly to the end of the file: https://github.com/yob/pdf-reader/blob/ecc6ffe22bef6f155526ca57de61d33ce9c2c0a0/lib/pdf/reader/buffer.rb#L233

Do you have any suggestions for making that logic safer? At the absolute lease it should raise a MalformedPDFError if it gets to the end of the stream and hasn't found an EI. Even earlier than the end of the stream if possible.

However, that aside it seems like there's also cases where the content stream is valid but we're missing the EI.

It would be helpful to have some more examples

It would. However, I'd happily accept a PR that adds a text extraction integration test for the PDF generated by your above comment and gets it green. I'm always happy to expand coverage and spec compliance incrementally as more sample files and bugs are flagged, and confirming the PDF doesn't trigger an exception is a good improvement, even if there's some uncertainty over the precise parsing of the ID bytes.

sebbASF commented 2 years ago

Given that one can go back and forth at will in the source data, there is no need to store the bytes whilst looking for the EI. I see 2 options:

Once the location has been found, the data can be extracted.

Rather than cache the bytes in a temporary buffer, I think I would use a simple state machine to find EI.

Given that EI can in theory appear within the ID data, search from the very beginning (as at present) is likely to be awkward as one would have to look for the last match.

===

Also, maybe update buffer.rb so that it throws an exception when trying to read past EOF? That would stop any endless loop due to failure to check for EOF. (In this case, the nil returned at EOF causes a crash when it is read from the local buffer, but that might not always be the case)

sebbASF commented 2 years ago

I've made some progress on this, however it may be a bit more involved than I first thought. The stream may contain a lot of other commands after the ID/EI, so one cannot skip to the end as I had hoped. However that is just an optimisation.

The current code looks for [\s\0]EI. I think it should look for a line-terminator as well. This will reduce the chance of it appearing in the binary data.

However it won't solve this issue, as the EOL is missing before the EI.

So I propose to look for EI[\r\n]. If this is preceeded by [\s\0], then we are done. If not, keep looking for a 'better' EI until the end of the stream.

I am hoping that it's not possible to have more than 1 ID in a stream. If it is, then additional checks will need to be made in the (hopefully rare) case where we continue to look for a 'better' EI.

I should have a PR ready in a day or so.

yob commented 2 years ago

From my copy of the spec, there's no mention of a per-content stream limit. There is some interesting talk about whitespace though:

Inline image objects shall not be nested; that is, two BI operators shall not appear without an intervening EI to close the first object. Similarly, an ID operator shall only appear between a BI and its balancing EI. Unless the image uses ASCIIHexDecode or ASCII85Decode as one of its filters, the ID operator shall be followed by a single white-space character, and the next character shall be interpreted as the first byte of image data.

It sounds like maybe we should be using standard tokenising logic to look for the end of the data?

sebbASF commented 2 years ago

Maybe the code should look for BI ... EI and handle that as a separate object then?

However, there is still the issue of what constitutes a valid EI token. Currently it only looks for whitespace/NUL before the EI I suspect it should also check for EOL (perhaps EOF) afterwards, to reduce the chance of picking up EI in a data stream.

Can a PDF use any whitespace after EI, or must it be EOL? (Later: there are plenty of examples with xxxx\nEI Q, so in that case only a space is needed. But is a space sufficent after \<sp>EI or \<nul>EI?)

To support the iLovePDF case, the code should definitely look for EOL after EI.

sebbASF commented 2 years ago

There's another potential issue here: in other parts of the spec, CRLF is treated as a single EOL character. Elsewhere this means that both characters are dropped from the preceding token. That does not happen here currently.

sebbASF commented 2 years ago

With the latest change (7def970) to PR#379, the code:

elsif prevchr == NULL_BYTE can be replaced by else to allow any char before EI.

This fixes the crash that instigated this issue, and does not break any tests.

I'll try some more tests with our collection of PDFs over the next few days, but it is looking promising.

sebbASF commented 2 years ago

My tests did not find any issues, so I have proposed the PR as above

sebbASF commented 2 years ago

Now fixed