Open RichardsonWTR opened 3 years ago
We're not intentionally skipping superscript, but depending on how they're encoded there are a few reasons why they might be missing from the output.
The most likely is that pdf-reader's naive "render text of different sizes onto a page of fixed width plain text characters" algorithm thinks that the "st" needs to be rendered in the same position as the "1", so it skips them.
Long term I'd love to improve that algorithm (it's in PDF::Reader::PageLayout), but I'm pretty short on time. If you're able to provide a copy of the PDF, I can at least take a look and confirm the root cause for you.
Thanks for your quick feedback! Here it is @yob !
Yup, it's the naive algorithm in PageLayout.
If I extract the text from page 1 and inspect the value of @runs at this point: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L20
It looks like this:
[
"st" w:4.641 size:7px @62.8,778.6,
"1 page test" w:55.928 size:12px @56.8,773.9
]
It's decided that the baseline of "st" (y=778.6) is sufficiently different from the baseline of the characters near it (y=773.9) that it's a separate text run. Once that happens, it won't render the characters over each other in the final layout.
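To make the effect concrete, here is a minimal sketch in plain Ruby. It is not pdf-reader's actual code: the `Run` struct and `naive_lines` are hypothetical stand-ins, and the rule that runs share a line only when their baselines truncate to the same integer is an assumption chosen to illustrate the behaviour described above.

```ruby
# Hypothetical stand-in for a text run; not the gem's real class.
Run = Struct.new(:text, :x, :y, :width, :font_size)

# The two runs from the @runs dump above.
runs = [
  Run.new("st", 62.8, 778.6, 4.641, 7),
  Run.new("1 page test", 56.8, 773.9, 55.928, 12),
]

# Assumed naive rule: runs belong to the same line only if their
# baselines truncate to the same integer y.
def naive_lines(runs)
  runs.group_by { |run| run.y.to_i }
end

naive_lines(runs).each do |y, group|
  puts "y=#{y}: #{group.map(&:text).inspect}"
end
```

Under that assumption the two runs land in separate groups (778 and 773), so the superscript is never treated as part of the line it visually belongs to.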
I'd happily accept a PR that improves the specific case of superscript text if you're up for it.
The test file you've provided would be perfect for a new spec in spec/integration_spec.rb. The fix may not be super easy, but you'd have to start by making this grouping by Y smarter: https://github.com/yob/pdf-reader/blob/8557768313c71de59298c5da0dac1404cf50afbb/lib/pdf/reader/page_layout.rb#L100-L104
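One possible direction for a smarter grouping, sketched in plain Ruby rather than against the gem's internals: attach a run to an existing line when its baseline is within some fraction of that line's font size. The `Run` struct, `group_into_lines`, and the 0.5 tolerance are all hypothetical and untuned; this only covers the grouping step, not how overlapping runs on one line would then be merged into text.

```ruby
# Hypothetical stand-in for a text run; not the gem's real class.
Run = Struct.new(:text, :x, :y, :width, :font_size)

# Assumed tolerance: a run joins a line if its baseline is within
# half of the font size of the run that started the line.
BASELINE_TOLERANCE = 0.5

def group_into_lines(runs)
  lines = []
  # Place larger text first so each line's baseline is set by body text,
  # and smaller super/subscript runs attach to it afterwards.
  runs.sort_by { |run| -run.font_size }.each do |run|
    line = lines.find do |candidate|
      anchor = candidate.first
      (run.y - anchor.y).abs <= anchor.font_size * BASELINE_TOLERANCE
    end
    if line
      line << run
    else
      lines << [run]
    end
  end
  # PDF y grows upwards, so the highest baseline comes first;
  # within a line, order runs left to right.
  lines.sort_by { |line| -line.first.y }.map { |line| line.sort_by(&:x) }
end
```

With the two runs from this issue, the 4.7pt baseline difference is inside the 6pt tolerance of the 12px run, so both end up in one line, while a normal next line of body text (typically more than 12pt lower) would still be kept separate.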
I've just created a document with LibreOffice: I typed "1st page test" and exported it to a PDF file.
LibreOffice automatically superscripted the "st" letters.
The pdf-reader gem returns "1 page test".