yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License
1.81k stars 271 forks source link

Lookup glyph metrics by name instead of codepoint #344

Closed sbilharz closed 3 years ago

sbilharz commented 3 years ago

When writing PDFs with German text, for example, the built-in font encodings have to be extended to get access to Umlaut characters (äöüÄÖÜ). A sensible strategy, one would think, would be to use previously unused codepoints for that, as every built-in font has some of them. Most PDF creators probably do that, but I have had cases where a codepoint that was already mapped was reused. Because the lookup mechanism only uses the default mapping from the AFM files and has no clue about the differences in the PDF font encoding, the metrics for the default glyph are returned. The width differences are small, of course. I wasn't able to craft a test PDF where the output of page.text would actually be wrong. The returned values of WidthCalculator::BuiltIn#glyph_width are incorrect nonetheless and it did make a difference in my custom PageTextReceiver where I work with the widths of the TextRuns (otherwise I clearly wouldn't have noticed). I am not sure what your reasons were to introduce lookup by codepoint (performance maybe?). I tried to somehow preserve it as the default mechanism and only use the name lookup when the glyph code hasn't been mapped (as it is implemented now) or the encoding differences include it. But as far as I can see, the Encoding#int_to_name method mostly does exacly that, so there is little reason to anticipate it and then do it again.

I also shortened the last part of the #glyph_width method. If you don't agree with that I can undo it.

yob commented 3 years ago

Yup, honouring the differences table seems like the right thing to do here. I have a small bit stylistic feedback, but otherwise happy to merge.

it did make a difference in my custom PageTextReceiver where I work with the widths of the TextRuns (otherwise I clearly wouldn't have noticed).

I must admit to wondering how you noticed. I certainly hadn't!

I am not sure what your reasons were to introduce lookup by codepoint (performance maybe?).

It's been so long since I wrote this code, I'm not sure if performance was a consideration. In any case, I'd rather accept the patch so we're more correct and then optimise later (with the safety of the new spec) if performance becomes a problem.

sbilharz commented 3 years ago

I must admit to wondering how you noticed. I certainly hadn't!

We are extracting text from a lot of PDF files from our customers and are currently using pdftotext as the "backend" for that. Meaning, we get that typical XML word list and have our own logic to assemble these to lines against which we then match some regexps. Now I am trying to get the equivalent to the pdftotext output from pdf-reader, so I wrote a custom PageTextReceiver that doesn't return the whole text but an array of TextRuns which have been merged to represent whole words and still contain exact information about their position on the page. Now we have a file being used in some specs, where a certain row of text is implemented by several TJ operators that are positioned with absolute offets (I don't know why it has been done that way; the splits are even in the middle of words). Due to the incorrect (namely smaller) width of one glyph in that text, our merging logic placed a space character where there shouldn't be one (because the absolute offset for the next TJ "box" was too far away). So the regexp the spec expected to match in fact didn't.

In other words, I got a failing spec and did a lot of debugging ;-)

yob commented 3 years ago

thanks!