yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Improve text extraction from rotated pages #317

Closed: yob closed this 4 years ago

yob commented 4 years ago

When a page has a Rotate key, the transformation matrix used to convert co-ordinates into device space should incorporate that rotation.
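As a rough illustration of the idea (not the code in this PR), the rotation can be expressed as a standard six-element PDF matrix. The helper names `rotation_matrix` and `apply_matrix` are hypothetical, and the sketch assumes the MediaBox origin is at (0, 0) and that Rotate is a clockwise multiple of 90 degrees.

```ruby
# Hypothetical helpers for illustration only; not PDF::Reader's actual code.

# Returns [a, b, c, d, e, f] such that
#   x' = a*x + c*y + e
#   y' = b*x + d*y + f
# maps a point on the unrotated page onto the page as displayed.
# rotate is the Rotate value: clockwise degrees, a multiple of 90.
# width and height describe the unrotated page (assumed origin at 0, 0).
def rotation_matrix(rotate, width, height)
  case rotate % 360
  when 0   then [1, 0, 0, 1, 0, 0]
  when 90  then [0, -1, 1, 0, 0, width]
  when 180 then [-1, 0, 0, -1, width, height]
  when 270 then [0, 1, -1, 0, height, 0]
  else raise ArgumentError, "Rotate must be a multiple of 90, got #{rotate}"
  end
end

# Applies a six-element matrix to a single point.
def apply_matrix(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end
```

For example, on a 612 x 792 page with Rotate 90, the original bottom-left corner lands at the top-left of the displayed page: `apply_matrix(rotation_matrix(90, 612, 792), 0, 0)` returns `[0, 612]`.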

Fixing that also highlighted that PageLayout wasn't correctly handling pages where the MediaBox includes negative Y values. That doesn't happen very often, but it's technically valid.
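One way to picture the negative-Y case, as a sketch rather than what PageLayout actually does: shift every co-ordinate by the MediaBox's lower-left corner so that layout code only ever sees non-negative values. The helper names below are hypothetical.

```ruby
# Hypothetical helpers for illustration only; not PDF::Reader's actual code.

# mediabox is [x1, y1, x2, y2]. The two corners may be given in either order,
# so the lower-left corner is taken as the minimum of each pair.
def shift_to_media_box_origin(x, y, mediabox)
  x1, y1, x2, y2 = mediabox
  left   = [x1, x2].min
  bottom = [y1, y2].min
  [x - left, y - bottom]
end

# Width and height of the page, independent of where the origin sits.
def media_box_size(mediabox)
  x1, y1, x2, y2 = mediabox
  [(x2 - x1).abs, (y2 - y1).abs]
end
```

With a MediaBox of `[0, -100, 200, 100]`, a glyph at `(10, -50)` sits 50 units above the bottom edge: `shift_to_media_box_origin(10, -50, [0, -100, 200, 100])` returns `[10, 50]`, and `media_box_size` reports `[200, 200]`.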

I'm not super happy with the changes in PageLayout. They feel hacky. Still, they get the spec green so I think it's worth merging and I can refine the code later if I find the time.

The sample PDF used in the new integration spec is a simple reproduction of the issue reported in #316.

Closes #316