Closed sebbASF closed 2 years ago
Thanks!
I note that the spec suite (including all the same files) is still green, so this seems safe overall. I'm interested in the expected impact of these changes: are there PDFs in the wild that would currently raise a parsing exception without this PR? Or would they parse without error, but incorrectly?
The changes fix output issues rather than crashes.
The initial motivation for the change was an error I noticed while parsing a private PDF. Spurious letters were being added to various lines in the output. This turned out to be due to line-wraps in literal strings. Failure to remove the wraps meant that extra bytes were left in the string, hence the spurious characters.
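To illustrate the line-wrap rule: in a PDF literal string, a backslash immediately followed by an end-of-line marker (CR, LF, or CRLF) means the string continues on the next line, and both the backslash and the marker should be discarded. Below is a minimal sketch of that rule in isolation. The method name `strip_line_continuations` is hypothetical and this is not the code from the PR; it only shows the idea, and it deliberately leaves every other escape sequence untouched.

```ruby
# Hypothetical helper for illustration only; not pdf-reader's implementation.
# Takes the raw bytes between the parentheses of a PDF literal string and
# removes line continuations (a backslash followed by CR, LF, or CRLF).
def strip_line_continuations(raw)
  # Match every backslash escape as a unit, so that an escaped backslash
  # ("\\") followed by a real newline is not mistaken for a continuation.
  raw.gsub(/\\(\r\n|[\r\n]|.)/m) do |escape|
    following = escape[1]
    # Drop the backslash and the end-of-line marker; keep any other escape.
    following == "\r" || following == "\n" ? "" : escape
  end
end

wrapped = "These two strings \\\nare the same."
puts strip_line_continuations(wrapped)
# prints "These two strings are the same."
```

Without this step, the backslash and end-of-line bytes stay in the string, which is consistent with the extra bytes described above.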
While researching the rules for line-wraps, I noticed some other places where the parsing did not comply with the spec.
I'll see what I can do about providing some test files.
Lovely, thanks!
I merged this with #370, confirmed the specs are green (on modern rubies) and that the text extraction is improved.
The tests are failing on ruby 2.0 and 2.1, but because of an unrelated issue. I'll merge this and #370, and then open a follow-up PR to address the unrelated issue and see what you think.