yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

Strange behaviour parsing PDF File #216

Open ondrejbartas opened 7 years ago

ondrejbartas commented 7 years ago

Hi,

I have this weird error:

[Screenshot: screen shot 2017-05-02 at 14 52 07]

I get this result after opening the file with x = File.open('~/billapp.pdf', 'rb')

I am attaching the PDF here: billapp.pdf

It works fine with other PDFs, but not with this one :(

yob commented 2 years ago

Sorry I didn't get around to looking into this in 2017 😞

I just had a proper look and confirmed this issue is still happening in v2.8.0, and that evince can extract the text correctly. It's surprising because the file metadata claims it was created by prawn, and usually pdf-reader can handle prawn-generated files just fine.

The root issue appears to be this conditional: https://github.com/yob/pdf-reader/blob/951f9c2659ce3b25c7731d79d54a2ce4ae3bc8e4/lib/pdf/reader/font.rb#L54-L60

The fonts in this file have ToUnicode CMaps, so we defer all Unicode conversion to them. However, the CMaps only have a handful of mappings defined in them. I'm not sure if the CMaps should have some default mappings in them, or maybe we should be falling back to the encoding dict for glyphs not explicitly listed in the CMap 🤔
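
For illustration, here is a minimal sketch of the second option, falling back to the font's encoding for character codes the ToUnicode CMap does not list. It is plain Ruby and not pdf-reader's actual implementation: the method name decode_with_fallback and both hash arguments are hypothetical stand-ins for whatever the CMap and encoding objects would expose.

```ruby
# Hypothetical sketch, NOT pdf-reader's real code or API.
#
# to_unicode_map: character code => array of Unicode code points
#                 (assumed shape of the mappings read from a ToUnicode CMap)
# encoding_map:   character code => single Unicode code point
#                 (assumed shape of the mappings from the font's encoding dict)
REPLACEMENT_CODEPOINT = 0xFFFD # U+FFFD REPLACEMENT CHARACTER

def decode_with_fallback(codes, to_unicode_map, encoding_map)
  codepoints = codes.flat_map do |code|
    if to_unicode_map.key?(code)
      # Prefer the explicit ToUnicode mapping when one exists.
      to_unicode_map[code]
    elsif encoding_map.key?(code)
      # Otherwise fall back to the encoding instead of giving up.
      [encoding_map[code]]
    else
      # Nothing knows this code: emit a visible replacement character.
      [REPLACEMENT_CODEPOINT]
    end
  end
  # Build a UTF-8 string from the collected code points.
  codepoints.pack("U*")
end

# A sparse CMap that only maps one code, like the fonts described above.
sparse_cmap = { 0x41 => [0x0041] }
encoding    = { 0x41 => 0x0041, 0x42 => 0x0042, 0x43 => 0x0043 }

text = decode_with_fallback([0x41, 0x42, 0x43], sparse_cmap, encoding)
```

With the sparse CMap alone, only the first code would decode; with the fallback, the remaining codes are resolved through the encoding. Whether this is the right behaviour for every font type is exactly the open question in the comment above.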