yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Skip text drawn outside the MediaBox #413

Status: Closed (yob closed this 2 years ago)

yob commented 2 years ago

When characters are rendered off the page, don't include them in the extracted text.

Ideally this would use the CropBox rather than the MediaBox, but I don't have easy access to the CropBox in PageLayout, and some upcoming refactors will make that easier to achieve. This is a good start.
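For readers unfamiliar with the idea, here is a rough sketch of the filtering step in plain Ruby. It is not the code from this PR: `TextRun`, `inside_box?` and `visible_runs` are hypothetical names invented for illustration, and the sketch assumes each run exposes an `x`/`y` origin in the same coordinate space as the MediaBox. It only tests a run's origin point, not its full width.

```ruby
# Hypothetical stand-in for a positioned piece of text.
# This is NOT pdf-reader's internal class, just an illustration.
TextRun = Struct.new(:x, :y, :text)

# A box is [llx, lly, urx, ury] in PDF points, as in a /MediaBox entry.
def inside_box?(run, box)
  llx, lly, urx, ury = box
  # Normalise in case the corners are listed in the opposite order.
  min_x, max_x = [llx, urx].minmax
  min_y, max_y = [lly, ury].minmax
  run.x.between?(min_x, max_x) && run.y.between?(min_y, max_y)
end

# Keep only the runs whose origin falls within the box.
def visible_runs(runs, mediabox)
  runs.select { |run| inside_box?(run, mediabox) }
end

mediabox = [0, 0, 612, 792] # US Letter, assumed for the example
runs = [
  TextRun.new(72, 720, "Hello"),
  TextRun.new(-50, 720, "off the left edge"),
  TextRun.new(72, 900, "above the top edge"),
]

puts visible_runs(runs, mediabox).map(&:text).join(" ")
```

Swapping in the CropBox later would, under these assumptions, only mean passing a different four-element array to `visible_runs`.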

I don't have a sample PDF to use in an integration test, so I've added a pending spec.