Various fixes to string backslash parsing

yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.

MIT License

Various fixes to string backslash parsing #368

Closed sebbASF closed 2 years ago

sebbASF commented 2 years ago

\n\r (LF followed by CR) is not a valid EOL marker, so each character now produces its own output
\ followed by LF, CR, or CRLF is a line wrap and is removed
Octal escapes are validated (and their handling simplified)
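
For readers unfamiliar with these rules, here is a minimal sketch of how literal-string unescaping could look, based on my reading of the literal-string section of the PDF specification. It is not the code from this PR: `unescape_literal_string` and `SIMPLE_ESCAPES` are hypothetical names, and masking octal values to one byte is an assumption about how overflow is treated.

```ruby
require "strscan"

# Hypothetical helper, not part of pdf-reader's API.
# Single-character escapes allowed inside a PDF literal string.
SIMPLE_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f",
  "(" => "(", ")" => ")", "\\" => "\\"
}.freeze

# Decodes the body of a literal string (the bytes between the outer parentheses).
def unescape_literal_string(raw)
  scanner = StringScanner.new(raw.b) # work on raw bytes
  out = String.new(encoding: Encoding::BINARY)

  until scanner.eos?
    if scanner.scan(/\\(?:\r\n|\r|\n)/)
      # Backslash + EOL is a line wrap: drop both the backslash and the EOL.
    elsif scanner.scan(/\\([0-7]{1,3})/)
      # One to three octal digits; assumption: overflow is masked to one byte.
      out << (scanner[1].to_i(8) & 0xFF).chr
    elsif scanner.scan(/\\(.)/m)
      # Known escapes map to their byte; for anything else the backslash is ignored.
      out << SIMPLE_ESCAPES.fetch(scanner[1], scanner[1])
    elsif scanner.scan(/\\/)
      # Lone trailing backslash: ignore it.
    elsif scanner.scan(/\r\n|\r|\n/)
      # An unescaped EOL marker becomes a single LF. CRLF is tried first,
      # so the reversed pair LF CR is read as two markers and yields two LFs.
      out << "\n"
    else
      out << scanner.getch
    end
  end

  out
end
```

With this sketch, `unescape_literal_string("ab\\\ncd")` returns `"abcd"`, and `unescape_literal_string("a\n\rb")` returns `"a\n\nb"`. Scanning left to right in a single pass keeps an escaped backslash from being confused with the start of a line wrap or octal escape.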

yob commented 2 years ago

Thanks!

I note that the spec suite (including all the same files) is still green, so this seems safe overall. I'm interested in the expected impact of these changes: are there PDFs in the wild that currently would raise a parsing exception without this PR? Or maybe they parse without error, but incorrectly?

sebbASF commented 2 years ago

The changes fix output issues rather than crashes.

The initial motivation for the change was an error I noticed while parsing a private PDF. Spurious letters were being added to various lines in the output. This turned out to be due to line-wraps in literal strings. Failure to remove the wraps meant that extra bytes were left in the string, hence the spurious characters.

While researching the rules for line-wraps, I noticed some other places where the parsing did not comply with the spec.

I'll see what I can do about providing some test files.

yob commented 2 years ago

Lovely, thanks!

I merged this with #370, confirmed the specs are green (on modern rubies) and that the text extraction is improved.

The tests are failing on Ruby 2.0 and 2.1, but because of an unrelated issue. I'll merge this and #370, and then open a follow-up PR to address the unrelated issue and see what you think.