Also skip single CR chars in Buffer#read

sbilharz commented 3 years ago

It is valid to delimit the stream content from the stream/endstream keywords with only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change, as shown by the test case.

yob commented 3 years ago

thanks ❤️

sbilharz commented 3 years ago

Wow, that was fast ;-)

Always a pleasure!

yob commented 3 years ago

It was an easy review with such a focused change and great spec.

sebbASF commented 3 years ago

The PDF spec does NOT allow a bare CR between the stream token and its data.

This is because the data may start with LF: how can you tell whether the LF is part of CRLF or the start of the data?

See 7.3.8.1 NOTE 2 in the PDF spec at: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

However, it looks like the order of checks in the code will ensure that a bare CR is only accepted if the next character is not LF. This is not allowed by the spec, but at least the patch should not result in losing the first byte of the data.

It might be better to make this permissive behaviour optional, so invalid PDFs can be detected.

sbilharz commented 3 years ago

Damn, you are right. I didn't check the spec, I just had a PDF from the wild that was perfectly openable by all viewers I had in my reach but would crash pdf-reader. Not sure what to do now yet.

sebbASF commented 3 years ago

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens. This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

I suggest making this permissive behaviour optional if possible.

sbilharz commented 3 years ago

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens. This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

It does not. It uses bare \rs as EOL everywhere. Unfortunately it isn't public so I can't share it. That's why I built the test file where I just wanted to quickly reproduce the error which occurred exactly when parsing a stream, because the /Length attribute ignored the LF char but pdf-reader didn't.

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

That's why I didn't even consider it to be invalid.

I suggest making this permissive behaviour optional if possible.

Possible, of course. But at a price, of course. We would need to drag this option value along all from the top to the bottom, wouldn't we? Or introduce some global state.

Do you have a concrete use case to detect invalid PDF files that are perfectly viewable by all readers?

sebbASF commented 3 years ago

Is is possible to publish the name of the software that created the PDF?

Also, if the intention is to provide a test case to mimic the private file, maybe that should use CR everywhere too. [AFAICT a bare CR is valid as an EOL everywhere -- except after stream, where it is ambiguous.]

yob commented 3 years ago

My general approach to this library has been "If acrobat can open it, then we probably should too".

Nearly every bug report is a version of "I can open this PDF in acrobat/evince/preview, but the gem raises an exception", and I can't bring myself to be pure enough to reply with "I implement the spec, so I won't fix your exception".

An opt-in strict parsing mode might be interesting? I wonder at the use case for it, other than validating a PDF against the spec.

Updating the comments in lib/pdf/reader/buffer.rb might be interesting, to clarify which behaviour is off-spec but included for compatibility.

sbilharz commented 3 years ago

Is is possible to publish the name of the software that created the PDF?

The original file seems to be created by "HP Scan". This is the /Info object from the trailer: {:Keywords=>"", :Creator=>"HP Scan", :CreationDate=>"D:20210912101258-08'00'", :ModDate=>"D:20210912101258-08'00'", :Author=>"", :Producer=>"HP Scan Extended Application", :Title=>"", :Subject=>""}

As the EOL character is always only CR I suppose this was done on a Mac (?)

Some more explanations: The file contains several scanned images that couldn't be opened by pdf-reader because of the bare CRs after the stream keywords. They would not be skipped and thus the last byte of the actual stream would not be read, which lead to the following error when the parser tried to look for the endstream keyword next: PDF::Reader::MalformedPDFError (PDF malformed, expected 'endstream' but found 0 instead) I am not sure why it says found 0 as the last byte of the stream is actually 0xD9 but I suppose that is a different issue. The /Length provided in the dictionary for the stream is set correctly (assuming the CR is skipped).

I think updating the comments is a good idea. The strict-parsing option still lacks a proper use case from my perspective. At least for the systems where I use pdf-reader, the input PDF files always come from customers who don't have a clue what the creating software does, let alone have the ability to alter it. So my goal is to just make things work for them whenever I can.

yob / pdf-reader

Also skip single CR chars in Buffer#read #364