Closed sbilharz closed 3 years ago
Thanks for the clear PR description and well-constructed test case :+1:
Thanks for merging! I am currently preparing to make a few more fixes. It's really fun to dig into the PDF internals and your code! :-)
Perfect, if they're all bugfixes maybe we can batch them up into a 2.4.3 release.
On Fri, 29 Jan 2021, 19:08 sbilharz, notifications@github.com wrote:
> Thanks for merging! I am currently preparing to make a few more fixes. It's really fun to dig into the PDF internals and your code! :-)
Sounds great! Some of them might get a little bigger, but I have at least one other bugfix to make. I did a lot of monkeypatching of `pdf-reader` classes in my project to fix my specs, and now I have to see which changes make sense in general and which should stay in my own `PageTextReceiver`. I'll try to pick my stuff apart and then we can discuss each piece individually.
We extract a lot of text from PDF files and I am currently trying to switch from `pdftotext` to this gem. There are cases in the wild where creators use weird (?) combinations of character/word spacing and individual glyph displacement via the `TJ` operator which visually cancel each other out. At least they do so in all the viewers I have tried. They do not do so with the current implementation of `pdf-reader`. That's because here, character spacing is additionally applied to every `TJ` displacement, which seems to be wrong. The wording in the standard isn't exactly explicit about that, but I find that my handcrafted test case proves me right.
The first line is compressed by a negative character/word spacing. The second line has this effect reversed by individual glyph displacement via `TJ`. The third line is normal text for reference.
The text output with current master is the following:
The text output with this PR applied is:
No other cases in the test suite are affected by this change except for the ones I corrected since they obviously expected the wrong behavior.
Please let me know if there is something wrong or missing in this pull request!
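To make the difference between the two interpretations concrete, here is a minimal sketch. It is not pdf-reader's actual code: `tj_advance`, `GLYPH_WIDTH`, and the `spacing_on_adjustments` flag are hypothetical names made up for illustration, every glyph is assumed to have the same width, and word spacing and horizontal scaling are left out for brevity.

```ruby
# Hypothetical illustration only; none of these names exist in pdf-reader.
# Assumed width of every glyph in text space units (a real font has per-glyph widths).
GLYPH_WIDTH = 0.5

# Returns the total horizontal advance of one TJ array.
# tj_array holds strings (glyphs to show) and numbers (adjustments in
# thousandths of a text space unit, subtracted from the current position).
def tj_advance(tj_array, font_size:, char_spacing:, spacing_on_adjustments:)
  tj_array.sum(0.0) do |item|
    if item.is_a?(Numeric)
      shift = -item / 1000.0 * font_size
      # The behavior questioned in this PR: character spacing is added
      # again for every numeric adjustment.
      shift += char_spacing if spacing_on_adjustments
      shift
    else
      # Character spacing is applied once per glyph shown.
      item.each_char.sum(0.0) { |_glyph| GLYPH_WIDTH * font_size + char_spacing }
    end
  end
end

reference   = tj_advance(["abc"], font_size: 10, char_spacing: 0, spacing_on_adjustments: false)
compressed  = tj_advance(["abc"], font_size: 10, char_spacing: -1, spacing_on_adjustments: false)
kerned      = ["a", -100, "b", -100, "c", -100]
compensated = tj_advance(kerned, font_size: 10, char_spacing: -1, spacing_on_adjustments: false)
doubled     = tj_advance(kerned, font_size: 10, char_spacing: -1, spacing_on_adjustments: true)
# reference == 15.0, compressed == 12.0, compensated == 15.0, doubled == 12.0
```

With character spacing applied only to glyphs, the `TJ` adjustments restore the width of the reference line. When it is also added to each adjustment, the compensation is swallowed and the line stays compressed, which matches the mismatch described above.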