yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Incorrect Layout/Parsing #397

Closed JayNewstrom closed 2 years ago

JayNewstrom commented 2 years ago

Hi, I've got a PDF that isn't being parsed/displayed correctly.

Here's the text I get when calling text on the page.

[Image: Screen Shot 2021-11-29 at 10 06 43 AM]

And here's what the PDF is supposed to look like.

[Image: Screen Shot 2021-11-29 at 10 07 32 AM]

I can provide the issued PDF if it would be helpful.

yob commented 2 years ago

Layout bugs aren't unusual, but a letter being replaced is odd (e -> s). I'd need a copy of the PDF to see what's going on - feel free to email it to me if you like (james - at - yob - id - au)

JayNewstrom commented 2 years ago

I followed up via email, let me know if I can help :) Thanks!

yob commented 2 years ago

Thanks for the sample file, and sorry it's taken me a while to dig into it.

One particular page (page 4) in the sample is like it was designed to trigger as many pdf-reader bugs as possible 😂

The root issue is that the page has a 270-degree rotation on it, and the page instructions then jump through positioning/rotating hoops to undo the rotation so it looks normal to human eyes. pdf-reader isn't great with rotation, although we're slowly getting better.
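To make the "rotate, then undo the rotation" situation concrete, here is a minimal sketch in plain Ruby. It is not pdf-reader's code: the helper names (`rotation_matrix`, `multiply`, `apply`) are hypothetical, the undo step is assumed to be a plain 90-degree rotation, and real pages would also involve translations and the PDF spec's clockwise convention for page rotation. It only shows that the two transforms cancel out, and where a glyph would land if only one of them were taken into account.

```ruby
# Build a PDF-style matrix [a, b, c, d, e, f] for a rotation about the origin.
# Rounding removes tiny floating-point residue such as cos(90 degrees) = 6.1e-17.
def rotation_matrix(degrees)
  rad = degrees * Math::PI / 180.0
  cos = Math.cos(rad).round(10)
  sin = Math.sin(rad).round(10)
  [cos, sin, -sin, cos, 0.0, 0.0]
end

# Combine two PDF matrices; the result applies m1 first, then m2.
def multiply(m1, m2)
  a1, b1, c1, d1, e1, f1 = m1
  a2, b2, c2, d2, e2, f2 = m2
  [
    a1 * a2 + b1 * c2,
    a1 * b2 + b1 * d2,
    c1 * a2 + d1 * c2,
    c1 * b2 + d1 * d2,
    e1 * a2 + f1 * c2 + e2,
    e1 * b2 + f1 * d2 + f2
  ]
end

# Transform the point (x, y) with matrix m.
def apply(m, x, y)
  a, b, c, d, e, f = m
  [a * x + c * y + e, b * x + d * y + f]
end

page_rotation = rotation_matrix(270) # the rotation set on the page
undo_rotation = rotation_matrix(90)  # assumed counter-rotation in the page content
combined      = multiply(page_rotation, undo_rotation)

p apply(page_rotation, 100, 20) # only one transform applied → [20.0, -100.0]
p apply(combined, 100, 20)      # both transforms applied    → [100.0, 20.0]
```

If a text extractor honours only one of the two rotations, every glyph ends up a quarter turn away from where a human sees it, which fits the kind of gross layout errors described in this thread.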

I've been noodling on a spike branch (https://github.com/yob/pdf-reader/compare/page-characters?expand=1) that improves various aspects of handling rotated pages. It helps with your sample file, but there's at least one issue left that I haven't traced to a root cause.

The good news is I've managed to create a minimal PDF that exhibits the same issue: rotate-270-then-undo-inside-bt.pdf. Hopefully I can use that to get a solution soon, and polish up the spike PR into something mergeable.

JayNewstrom commented 2 years ago

Great news! Let me know if I can do anything to help!

yob commented 2 years ago

I think I've got it sorted on this branch: https://github.com/yob/pdf-reader/compare/page-characters?expand=1.

I'll send you an email with some extracted text from your sample doc to confirm, and if it looks good I'll work on breaking that spike branch into a few logical PRs with descriptions.

yob commented 2 years ago

I'm going to close this issue - the gross errors in positioning of glyphs on rotated pages like the sample document have been fixed.

There's definitely some room for improvement around the fine positioning of glyphs. In the sample doc I can see some of the table header text bumping into each other and overlapping. There are some long-standing small errors in glyph positioning that I think are at play here. The issues are small, but over a full line they can sometimes compound to be visibly wrong. #261 is a good example.

It could also be an issue with the naive algorithm in PageLayout.
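As a rough illustration of how small per-glyph positioning errors can compound over a line, here is a sketch in plain Ruby. The numbers are invented for the example (60 glyphs, a true advance of 6.0 units, a measured advance of 6.05 units) and `glyph_origins` is a hypothetical helper, not part of pdf-reader.

```ruby
# Return the x position at which each glyph starts, given per-glyph advances.
def glyph_origins(advances, start_x = 0.0)
  x = start_x
  advances.map do |advance|
    origin = x
    x += advance
    origin
  end
end

# Hypothetical line of 60 glyphs: each is really 6.0 units wide,
# but is measured as 6.05 units (an error of under 1% per glyph).
true_advances     = Array.new(60, 6.0)
measured_advances = Array.new(60, 6.05)

true_x     = glyph_origins(true_advances)
measured_x = glyph_origins(measured_advances)

# Difference between where each glyph is placed and where it should be.
drift = measured_x.zip(true_x).map { |measured, expected| measured - expected }

puts format("drift at first glyph: %.2f", drift.first)
puts format("drift at last glyph:  %.2f", drift.last)
```

Under these assumed numbers the first glyph is placed exactly, while the last one is off by about 2.95 units, roughly half a glyph width. That is enough for neighbouring words or table columns to look like they collide.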

I'd love a separate issue that captures these symptoms if you have the time ❤️