yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Fix glyph positioning in some rotation scenarios #403

Closed yob closed 2 years ago

yob commented 2 years ago

This fixes glyph positioning for some rotated pages, and for some rotated text on non-rotated pages.

When processing glyph displacement after rendering a glyph, the spec is pretty clear that the calculation should be:

      [ 1  0  0 ]
Tm =  [ 0  1  0 ]  x Tm
      [ tx ty 1 ]

However, for years pdf-reader has had it backwards:

           [ 1  0  0 ]
Tm =  Tm x [ 0  1  0 ]
           [ tx ty 1 ]

We'd also built up some compensating bugs that masked the problem for some PDFs, like using a calculated font size instead of the raw font size from the page state. There was also a divide by ctm.a that made no sense, and even a comment saying so.

Fixing the order of the matrix multiplication means those compensating bugs can also go away.
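To illustrate why the order matters, here's a minimal sketch using Ruby's stdlib Matrix class. It is not pdf-reader's own matrix code: the `translation` helper, the sample text matrix and the 10 unit displacement are all made up for the example. The text matrix rotates text by 90 degrees, so advancing along the baseline should move the origin up the page rather than to the right.

```ruby
require "matrix"

# Hypothetical helper (not part of pdf-reader): a translation matrix in the
# row-vector layout the PDF spec uses.
def translation(tx, ty)
  Matrix[[1, 0, 0], [0, 1, 0], [tx, ty, 1]]
end

# Illustrative text matrix: rotate text by 90 degrees, origin at (100, 200).
tm = Matrix[[0, 1, 0], [-1, 0, 0], [100, 200, 1]]

# Advance by 10 units along the text's own x axis after drawing a glyph.
displacement = translation(10, 0)

correct   = displacement * tm # spec order: displacement on the left
backwards = tm * displacement # the old, reversed order

# The third row of each matrix holds the resulting text origin.
origin = ->(m) { [m[2, 0], m[2, 1]] }

p origin.call(correct)   # => [100, 210], moves along the rotated baseline
p origin.call(backwards) # => [110, 200], ignores the rotation
```

With a text matrix that only translates, both orders give the same origin, which is probably part of why the bug went unnoticed for unrotated text.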

There are some minor changes to the text output of the columns spec, which I'm willing to wear. They're mostly whitespace changes, nothing significant, so I've updated the spec to match. I suspect these actually indicate some additional bugs in glyph displacement, particularly in the way we process numeric arguments to the TJ (show_text_with_positioning) operator. I think this commit is an overall net positive as it fixes some significant glyph positioning issues. We can iterate on the TJ operator handling separately.
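For reference when looking at the TJ handling, this is a rough sketch of the horizontal displacement formula from the spec (PDF 32000-1, section 9.4.4), for horizontal writing mode only. The method and parameter names are hypothetical and this is not how pdf-reader currently structures the calculation. Numbers in a TJ array are in thousandths of a text space unit and are subtracted from the current position.

```ruby
# Sketch of the spec formula (horizontal writing mode):
#   tx = ((w0 - Tj / 1000) * Tfs + Tc + Tw) * Th
# w0 is the glyph width in text space units (font width / 1000 for most fonts).
# word_spacing (Tw) should only be non-zero for the single-byte space character.
# All names here are illustrative, not pdf-reader API.
def glyph_tx(w0:, font_size:, tj: 0, char_spacing: 0, word_spacing: 0, h_scaling: 1.0)
  ((w0 - tj / 1000.0) * font_size + char_spacing + word_spacing) * h_scaling
end

# A glyph half an em wide at 12pt advances 6 units.
glyph_tx(w0: 0.5, font_size: 12)           # => 6.0

# A bare TJ number of -250 at 12pt moves the position forward by 3 units;
# a positive number moves it backwards.
glyph_tx(w0: 0, tj: -250, font_size: 12)   # => 3.0
glyph_tx(w0: 0, tj: 250, font_size: 12)    # => -3.0
```

Note the font size here is the raw Tfs value from the text state, not one scaled by the CTM or text matrix.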

Finally, there are a couple of tweaks to apply_rotation in PageTextReceiver. This method is still buggy, and I've left a comment with some details. The current version will shift the characters around so they're positioned correctly relative to each other, but the final x and y values are incorrect relative to the overall page boxes. I'll fix that up separately.

These changes were driven by a failing spec with a PDF based on the failure reported at #397. It's a page that's rotated by 270 degrees, and the rotation is undone in the BT block rather than via the CTM.

Fixes #376
Fixes #316
Fixes #271
Fixes #110