yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Don't apply character/word spacing to individual glyph displacement by the TJ operator #343

Closed sbilharz closed 3 years ago

sbilharz commented 3 years ago

We extract a lot of text from PDF files, and I am currently trying to switch from pdftotext to this gem. There are cases in the wild where creators use weird (?) combinations of character/word spacing and individual glyph displacement via the TJ operator that visually cancel each other out, at least in all the viewers I have tried. They do not cancel out with the current implementation of pdf-reader, because character spacing is additionally applied to every TJ displacement, which seems to be wrong. The wording in the standard isn't explicit about this, but I think my handcrafted test case shows that a TJ displacement should be applied on its own, without the extra spacing.

Test case (image: TJ_and_char_spacing): The first line is compressed by a negative character/word spacing. The second line has this effect reversed by individual glyph displacement via TJ. The third line is normal text for reference.
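To make the arithmetic behind the test case concrete, here is a minimal sketch of the width calculation. It is not pdf-reader's actual code: `line_width`, its parameters, the fixed glyph width of 0.5, and the sample numbers are all assumptions made for illustration, and word spacing and horizontal scaling are left out. A glyph advances by its width times the font size plus the character spacing, while a number in a TJ array moves the position by `-number / 1000 * font_size`. The `apply_spacing_to_tj` flag mimics the behavior reported above, where the character spacing is added to the TJ displacement as well.

```ruby
# Illustrative sketch only; names and numbers are hypothetical,
# not taken from pdf-reader.
#
# tj_array mixes strings (glyphs to show) and numbers (displacements
# in thousandths of text space units, subtracted from the position).
def line_width(tj_array, font_size:, char_spacing:, glyph_width: 0.5,
               apply_spacing_to_tj: false)
  tj_array.sum(0.0) do |element|
    if element.is_a?(Numeric)
      # Displacement from a numeric TJ element.
      advance = (-element / 1000.0) * font_size
      # Reported behavior: character spacing is added here as well.
      advance += char_spacing if apply_spacing_to_tj
      advance
    else
      # Each glyph advances by its scaled width plus the character spacing.
      element.each_char.sum(0.0) { |_glyph| glyph_width * font_size + char_spacing }
    end
  end
end

# Character spacing of -2 compresses the text; each -200 in the array
# moves the position forward by 2 units at font size 10 to compensate.
tj_array = ["a", -200, "b", -200, "c", -200]

puts line_width(tj_array, font_size: 10, char_spacing: -2.0)
# => 15.0 (same as three unspaced glyphs of width 5)

puts line_width(tj_array, font_size: 10, char_spacing: -2.0,
                apply_spacing_to_tj: true)
# => 9.0 (the displacements are cancelled, so the text stays compressed)
```

With the spacing applied to the displacements, the glyphs end up closer together than intended, which is consistent with the missing spaces in the second line of the output on current master.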

The text output with current master is the following:

> puts reader.pages.first.text
Thebi brownfox
Thebi brownfox
The big brown fox

The text output with this PR applied is:

> puts reader.pages.first.text
Thebi brownfox
The big brown fox
The big brown fox

No other cases in the test suite are affected by this change, except for the ones I corrected because they expected the wrong behavior.

Please let me know if there is something wrong or missing in this pull request!

yob commented 3 years ago

Thanks for the clear PR description and well constructed test case :+1:

sbilharz commented 3 years ago

Thanks for merging! I am currently preparing to make a few more fixes. It's really fun to dig into the PDF internals and your code! :-)

yob commented 3 years ago

Perfect, if they're all bugfixes maybe we can batch them up into a 2.4.3 release.

sbilharz commented 3 years ago

Sounds great! Some of them might get a little bigger, but I have at least one other bugfix to make. I did a lot of monkeypatching of pdf-reader classes in my project to fix my specs, and now I have to see which changes make sense in general and which are better kept in my own PageTextReceiver. I'll try to split my changes into separate pieces, and then we can discuss each of them individually.