yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Hello, world! appears as "! lro W , ll eH" #376

Closed by sebbASF 2 years ago

sebbASF commented 2 years ago

The page text parser shows "! lro W , ll eH" instead of "Hello World!" for the following PDF:

http://pdfsharp.com/PDFsharp/images/stories/samples/PDFs/HelloWorld.pdf

It seems the text is being positioned incorrectly.

sebbASF commented 2 years ago

The problem is that the X cursor position is decremented rather than incremented as each letter is processed.

Changing '-100 Tz' to ' 100 Tz' fixes the page text, but Preview reverses the output, so I suspect there is another value somewhere that is not being taken into account.
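
For reference, the PDF specification gives each glyph's horizontal displacement as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Th is the Tz operand divided by 100. A negative Tz therefore makes every advance negative. The Ruby sketch below illustrates this under simplifying assumptions: every glyph is taken to be 500/1000 units wide, word spacing and TJ adjustments are ignored, and the start position is made up. The helper names are hypothetical and are not part of pdf-reader.

```ruby
# Hypothetical helper, not part of pdf-reader: returns [char, x] pairs for
# each glyph, advancing x by (w0 * Tfs + Tc) * Th, with Th = Tz / 100.
# Assumes a fixed glyph width of 0.5 (500/1000 units) and ignores Tw and TJ.
def glyph_positions(text, start_x:, font_size:, tz:, char_spacing: 0.0, glyph_width: 0.5)
  th = tz / 100.0
  x = start_x
  text.each_char.map do |char|
    position = [char, x]
    x += (glyph_width * font_size + char_spacing) * th
    position
  end
end

# Naive layout: order the glyphs by x position, left to right.
def layout_by_x(positions)
  positions.sort_by { |_char, x| x }.map(&:first).join
end

normal   = glyph_positions("Hello", start_x: 100.0, font_size: 10.0, tz: 100)
mirrored = glyph_positions("Hello", start_x: 100.0, font_size: 10.0, tz: -100)

puts layout_by_x(normal)    # x grows with each glyph, so the order is kept
puts layout_by_x(mirrored)  # x shrinks with each glyph, so the order flips
```

With these assumptions, a layout that simply orders glyphs by x position returns the characters reversed when Tz is negative, which matches the direction of the output reported above.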

yob commented 2 years ago

When I've seen output like that in the past, it was caused by bad rotation logic, and I assumed this would be the same. However, this file very clearly has no rotation.

I just checked and none of the existing files in spec/data/*.pdf have a negative argument to Tz, or a negative font size argument to Tf. This does seem like a legit case that pdf-reader doesn't handle properly yet. Nice find!
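
A check like this can be approximated with a small script. The sketch below is only a rough illustration: the helper is hypothetical, it works on a content stream that has already been decoded to a string, and it uses plain regular expressions rather than a real tokenizer, so it could be fooled by operator-like text inside string literals.

```ruby
# Matches a negative PDF number such as -100, -1.5 or -.5
NEGATIVE_NUMBER = /-\d*\.?\d+/

# Hypothetical helper, not part of pdf-reader: returns [operator, value]
# pairs for every Tz or Tf operator with a negative numeric operand.
def negative_text_operands(content)
  found = []
  # Tz takes a single number, e.g. "-100 Tz"
  content.scan(/(#{NEGATIVE_NUMBER})\s+Tz\b/).each do |match|
    found << [:Tz, match.first.to_f]
  end
  # Tf takes a font name followed by a size, e.g. "/F1 -12 Tf"
  content.scan(/\/\S+\s+(#{NEGATIVE_NUMBER})\s+Tf\b/).each do |match|
    found << [:Tf, match.first.to_f]
  end
  found
end

# Made-up content stream for illustration
sample = "BT /F1 12 Tf -100 Tz 50 700 Td (Hello) Tj ET"
p negative_text_operands(sample)
```

To run this over real files, the string from each page's `raw_content` in pdf-reader could be passed in, looping over `Dir["spec/data/*.pdf"]`.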

sebbASF commented 2 years ago

I tried several readers on macOS (Adobe, Foxit, Skim, Preview) and they all show "!dlroW ,olleH" when parsing '100 Tz'.

As well as the direction of text, it looks like pdf-reader has some spacing issues.

Note: when testing this it is necessary to ensure that the text starts in the middle somewhere. Text that runs right to left starting from bottom left is not shown...

yob commented 2 years ago

The positioning issues in HelloWorld.pdf are addressed by the fixes I'm working on for #397. Hopefully I'll land them on main in the next few days and we can resolve this issue at the same time.