Closed sebbASF closed 2 years ago
Looks like the code is trying to find the EI operator for the end of some inline data. The file does have an EI string, but it is not at the start of a line. It is at the end of a line and just before the Q operator which is on its own line.
Interesting.
I can manually generate a file that raises that error with prawn like this:
Prawn::Document.generate("hello.pdf") do
text "This page has an invalid inline image"
add_content("\n")
add_content("BI")
add_content("/W 16")
add_content("/H 16")
add_content("/BPC 1")
add_content("/IM true")
add_content("ID \xFEEI")
end
The relevant part of the content stream is:
BI
/W 16
/H 16
/BPC 1
/IM true
ID EI
Q
It never matches the first part of this conditional (the buffer doesn't start with white space or a NULL byte): https://github.com/yob/pdf-reader/blob/ecc6ffe22bef6f155526ca57de61d33ce9c2c0a0/lib/pdf/reader/buffer.rb#L233
In your file, what are the first 2-3 bytes after ID
?
ID is followed by LF and some other bytes: I D lf x . . . . 49 44 0A 78 C3 BA C3 8F However, I don't think that is relevant.
The problem here is that the trailing EI does not follow whitespace or NUL.
It's not clear to me from the spec whether that is a requirement. Also I think there must be some other rules at play here, because in theory the data could contain \sEI several times. The failing PDF does not, but it does contain the pair EI several times. The final EI in the PDF is followed by EOL, but again that sequence could occur naturally in the data.
Assuming that there can only be one ID in a single stream, maybe a reverse search for EI would work on more PDFs? There's unlikely to be much trailing content.
It would be helpful to have some more examples.
Note: the failing PDF has the following entries /Producer (iLovePDF) /ModDate (D:20211005201753Z)
It consists of two pages, the first is text-based, and the second is a large scanned image.
Interesting.
Do you think you could use a prawn fragment like mine to construct a public-safe sample file that triggers the bug? The actual bytes in the ID
stream can be garbage, we don't need the image to be something sensible for a useful parsing test case.
This causes a crash for me:
Prawn::Document.generate("invalidEI.pdf") do
text "This page has an invalid inline image"
add_content("\n")
add_content("BI")
add_content("/W 16")
add_content("/H 16")
add_content("/BPC 1")
add_content("/IM true")
add_content("ID")
add_content("\xFEabcdefgIEI") # EI can occur anywhere in data
add_content("\xFEabcdefgIEI")
add_content("\xFEabcdefgIEI")
end
It definitely doesn't seem ideal that the logic here can so easily loop endlessly to the end of the file: https://github.com/yob/pdf-reader/blob/ecc6ffe22bef6f155526ca57de61d33ce9c2c0a0/lib/pdf/reader/buffer.rb#L233
Do you have any suggestions for making that logic safer? At the absolute lease it should raise a MalformedPDFError
if it gets to the end of the stream and hasn't found an EI
. Even earlier than the end of the stream if possible.
However, that aside it seems like there's also cases where the content stream is valid but we're missing the EI
.
It would be helpful to have some more examples
It would. However, I'd happily accept a PR that adds a text extraction integration test for the PDF generated by your above comment and gets it green. I'm always happy to expand coverage and spec compliance incrementally as more sample files and bugs are flagged, and confirming the PDF doesn't trigger an exception is a good improvement, even if there's some uncertainty over the precise parsing of the ID bytes.
Given that one can go back and forth at will in the source data, there is no need to store the bytes whilst looking for the EI. I see 2 options:
Once the location has been found, the data can be extracted.
Rather than cache the bytes in a temporary buffer, I think I would use a simple state machine to find EI
Given that EI
===
Also, maybe update buffer.rb so that it throws an exception when trying to read past EOF? That would stop any endless loop due to failure to check for EOF. (In this case, the nil returned at EOF causes a crash when it is read from the local buffer, but that might not always be the case)
I've made some progress on this, however it may be a bit more involved than I first thought. The stream may contain a lot of other commands after the ID/EI, so one cannot skip to the end as I had hoped. However that is just an optimisation.
The current code looks for [\s\0]EI. I think it should look for a line-terminator as well. This will reduce the chance of it appearing in the binary data.
However it won't solve this issue, as the EOL is missing before the EI.
So I propose to look for EI[\r\n]. If this is preceeded by [\s\0], then we are done. If not, keep looking for a 'better' EI until the end of the stream.
I am hoping that it's not possible to have more than 1 ID in a stream. If it is, then additional checks will need to be made in the (hopefully rare) case where we continue to look for a 'better' EI.
I should have a PR ready in a day or so.
From my copy of the spec, there's no mention of a per-content stream limit. There is some interesting talk about whitespace though:
Inline image objects shall not be nested; that is, two BI operators shall not appear without an intervening EI to close the first object. Similarly, an ID operator shall only appear between a BI and its balancing EI. Unless the image uses ASCIIHexDecode or ASCII85Decode as one of its filters, the ID operator shall be followed by a single white-space character, and the next character shall be interpreted as the first byte of image data.
It sounds like maybe we should be using standard tokenising logic to look for the end of the data?
Maybe the code should look for BI ... EI and handle that as a separate object then?
However, there is still the issue of what constitutes a valid EI token. Currently it only looks for whitespace/NUL before the EI I suspect it should also check for EOL (perhaps EOF) afterwards, to reduce the chance of picking up EI in a data stream.
Can a PDF use any whitespace after EI, or must it be EOL? (Later: there are plenty of examples with xxxx\nEI Q, so in that case only a space is needed. But is a space sufficent after \<sp>EI or \<nul>EI?)
To support the iLovePDF case, the code should definitely look for EOL after EI.
There's another potential issue here: in other parts of the spec, CRLF is treated as a single EOL character. Elsewhere this means that both characters are dropped from the preceding token. That does not happen here currently.
With the latest change (7def970) to PR#379, the code:
elsif prevchr == NULL_BYTE
can be replaced by else
to allow any char before EI.
This fixes the crash that instigated this issue, and does not break any tests.
I'll try some more tests with our collection of PDFs over the next few days, but it is looking promising.
My tests did not find any issues, so I have proposed the PR as above
Now fixed
I have seen a crash at: https://github.com/yob/pdf-reader/blob/6cc5c6663eb83c20ce5c43c06b3db716829ba969/lib/pdf/reader/buffer.rb#L238 no implicit conversion of nil into String (TypeError)
I think this is caused by 'chr = @io.read(1)' at line 235 which can return null at EOF. Sorry, cannot share the PDF which caused this as it contains private info.