yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Text collision/overlapping issues #406

Open JayNewstrom opened 2 years ago

JayNewstrom commented 2 years ago

Some PDFs cause text collisions and overlapping.

This is a follow-up to my bug report in #397 (the same sample PDF is the one causing these issues).

[Screenshots: Screen Shot 2021-12-13 at 4 30 41 PM, Screen Shot 2021-12-13 at 4 31 30 PM]

As with the previous issue, I'm happy to help where I can.

yob commented 2 years ago

Now that the glyph positioning bugs in your sample PDF have been fixed (in #403), I took another look at the page you have a screenshot of here.

I looked at the glyph positions that pdf-reader calculated for a few words that are clearly wrong in the text extraction: "Usage and Purchase Charges".

I'm now very confident that the glyph positions are being extracted more or less accurately. There may be some very minor issues around kerning and spacing, but those will only throw the positions off by a point or two. I'm also confident all the characters are being extracted.

I think the real issue here is the naive algorithm in PageLayout, which is responsible for arranging the extracted text onto a plain text "page". By hand-tuning a few lines in PageLayout, I can get your page extracting a bit better:

[Screenshot: Screenshot from 2021-12-20 23-53-09]

Of course, the same tuning throws off the layout of other documents.
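
To make the failure mode concrete, here is a minimal Ruby sketch of how a naive grid-based layout can produce overlapping text even when every position is correct. It is an illustration only, not PageLayout's actual code: `Run`, `naive_layout`, the page width, the column count, and the coordinates are all hypothetical. The assumption is that each run's x position is scaled to a column on a fixed-width character grid, with no check for cells that are already occupied.

```ruby
# Hypothetical stand-in for a piece of extracted text: a string anchored at
# an x/y position in PDF points. This is not pdf-reader's own class.
Run = Struct.new(:x, :y, :text)

# Naive layout (an assumption for illustration): scale each run's x position
# to a column on a fixed-width grid and write the characters there. Nothing
# checks whether a cell is already taken, so later runs overwrite earlier ones.
def naive_layout(runs, page_width:, cols:)
  rows = runs.group_by { |run| run.y.round }
  lines = rows.sort_by { |y, _| -y }.map do |_, row_runs| # PDF y grows upward
    line = " " * cols
    row_runs.sort_by(&:x).each do |run|
      col = (run.x / page_width.to_f * cols).floor
      run.text.each_char.with_index do |ch, i|
        line[col + i] = ch if col + i < cols
      end
    end
    line.rstrip
  end
  lines.join("\n")
end

# Two runs set in a small font: on the page they sit side by side, but on an
# 80-column grid each character cell is 7.5 points wide, so they collide.
runs = [
  Run.new(100, 700, "Usage and"),
  Run.new(130, 700, "Purchase Charges"),
]
puts naive_layout(runs, page_width: 600, cols: 80)
```

With these made-up numbers the first run starts at column 13 and the second at column 17, so the output line is 13 spaces followed by `UsagPurchase Charges`. Widening the grid or shifting columns fixes this page, but it changes the spacing of every other document too, which is the trade-off described above.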

There are a few similar issues (#371, #362, #118). I'll continue to mull over what a better algorithm might look like. Thanks for a great bug report.

JayNewstrom commented 2 years ago

Thanks for the update! I'm happy to run tests or provide more test files if you'd like!