yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Superscript words not being returned. #362

Open RichardsonWTR opened 3 years ago

RichardsonWTR commented 3 years ago

I created a document in LibreOffice, typed "1st page test", and exported it to a PDF file.
LibreOffice automatically superscripted the letters 'st'. (Screenshot from 2021-09-08 10-58-50)

The pdf-reader gem returns "1 page test".

yob commented 3 years ago

We're not intentionally skipping superscript, but depending on how it's encoded there are a few reasons why it might be missing from the output.

The most likely is that pdf-reader's naive "render text of different sizes onto a page of fixed width plain text characters" algorithm thinks that the "st" needs to be rendered in the same position as the "1", so it skips them.
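To illustrate how text can get lost on a fixed-width grid, here is a toy model. It is not pdf-reader's code: `CellRun`, `render_row`, the 5.0-point cell width and the "later run replaces earlier cells" rule are all assumptions made for the sketch. The x positions are the ones from the test PDF.

```ruby
# Toy model only: names and behaviour are assumptions, not pdf-reader's code.
CellRun = Struct.new(:text, :x)

# Write each run into one row of fixed-width cells, in the order given.
# A later run that maps to the same cells replaces what was already there.
def render_row(runs, col_width:, cols:)
  row = Array.new(cols, " ")
  runs.each do |run|
    start_col = (run.x / col_width).floor
    run.text.each_char.with_index do |char, i|
      col = start_col + i
      row[col] = char if col < cols
    end
  end
  row.join.strip
end

runs = [
  CellRun.new("st", 62.8),          # superscript run, listed first
  CellRun.new("1 page test", 56.8)  # main run, starts one cell to the left
]

puts render_row(runs, col_width: 5.0, cols: 40)
# prints "1 page test"
```

In this model both runs land on the same cells, so only one of them survives in the plain-text output.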

Long term I'd love to improve that algorithm (it's in PDF::Reader::PageLayout), but I'm pretty short on time. If you're able to provide a copy of the PDF, I can at least take a look and confirm the root cause for you.

RichardsonWTR commented 3 years ago

Thanks for your quick feedback! Here it is, @yob!

PDF test.pdf

yob commented 3 years ago

Yup, it's the naive algorithm in PageLayout.

If I extract the text from page 1, and inspect the value of @runs at this point: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L20

It looks like this:

[
  "st" w:4.641 size:7px @62.8,778.6,
  "1 page test" w:55.928 size:12px @56.8,773.9
]

It's decided that the "st" baseline (y=778.6) is sufficiently different to the baseline of the characters near it (y=773.9) that it's a separate text run. Once that happens, it won't render the characters over each other on the final layout.

I'd happily accept a PR that improves the specific case of superscript text if you're up for it.

The test file you've provided would be perfect for a new spec in spec/integration_spec.rb. The fix may not be super easy, but you'd have to start by making this grouping by Y smarter: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L100-L104
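As a starting point for a smarter grouping by Y, here is a rough sketch that treats two baselines as the same line when they differ by less than a fraction of the font size. It is standalone Ruby, not a patch against PageLayout: `BaselineRun`, `group_by_baseline` and `tolerance_factor` are hypothetical names, and the 0.5 factor is a guess rather than a tuned value. The coordinates and sizes are the ones from the @runs dump.

```ruby
# Sketch under assumptions: BaselineRun and tolerance_factor are hypothetical.
BaselineRun = Struct.new(:text, :x, :y, :font_size)

# Group runs into lines. Two runs share a line when their baselines differ by
# no more than tolerance_factor times the larger of the two font sizes.
# Each line is returned as its run texts ordered left to right.
def group_by_baseline(runs, tolerance_factor:)
  lines = []
  runs.sort_by { |run| -run.y }.each do |run|
    line = lines.find do |candidate|
      anchor = candidate.first
      limit = tolerance_factor * [anchor.font_size, run.font_size].max
      (anchor.y - run.y).abs <= limit
    end
    if line
      line << run
    else
      lines << [run]
    end
  end
  lines.map { |line| line.sort_by(&:x).map(&:text) }
end

runs = [
  BaselineRun.new("st", 62.8, 778.6, 7),
  BaselineRun.new("1 page test", 56.8, 773.9, 12)
]

p group_by_baseline(runs, tolerance_factor: 0.0)
# prints [["st"], ["1 page test"]]
p group_by_baseline(runs, tolerance_factor: 0.5)
# prints [["1 page test", "st"]]
```

Grouping alone doesn't put "st" right after the "1": the longer run starts to the left of it and spans past it, so a real fix would probably also need to split or interleave runs by x position.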