yob / pdf-reader

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe.
MIT License

skip zero-width characters when rendering text to a page #372

Closed (yob closed 2 years ago)

yob commented 2 years ago

A spec added in #370 included a PDF that visually included the following text:

aaaabbbb

... but the content stream included a zero-width LF character between aaaa and bbbb, so pdf-reader's text extraction looked like this:

aaaa bbbb

This filters the zero-width LF out, so text output matches the visual appearance of the PDF:

aaaabbbb
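
As a rough illustration of the filtering idea, here is a minimal sketch. It is not the code from this PR: the `Glyph` struct and the `visible_text` helper are hypothetical names made up for the example, and pdf-reader's internal types are different.

```ruby
# Hypothetical glyph record: a character plus the width it advances the cursor.
# This is an assumption for illustration, not a pdf-reader class.
Glyph = Struct.new(:char, :width)

# Drop glyphs that take up no horizontal space, then join what is left.
def visible_text(glyphs)
  glyphs.reject { |g| g.width.zero? }.map(&:char).join
end

glyphs = "aaaa".chars.map { |c| Glyph.new(c, 5.0) } +
         [Glyph.new("\n", 0.0)] + # the zero-width LF from the content stream
         "bbbb".chars.map { |c| Glyph.new(c, 5.0) }

puts visible_text(glyphs) # prints "aaaabbbb"
```

The filter depends only on the computed width, so any glyph whose width is wrongly calculated as zero would be dropped too.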

yob commented 2 years ago

/cc @sebbASF

sebbASF commented 2 years ago

The test file in question has \LFCR in the middle of a string. The \LF is a permitted line-wrap, so removing that leaves a bare CR. This in turn must be converted to LF according to the spec. That's what the test was trying to check, but it perhaps should not have relied on how the page renderer would treat it. The removal of \LF and the replacement of CR by LF are tested in parser_spec.
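
The two steps described above can be sketched roughly as follows. This is a simplified illustration and not pdf-reader's parser: `normalize_literal_eols` is a hypothetical helper, and it ignores other escape sequences (for example an escaped backslash directly before a line break).

```ruby
# Simplified sketch of end-of-line handling inside a PDF literal string.
# Hypothetical helper, not part of pdf-reader.
def normalize_literal_eols(raw)
  raw
    .gsub(/\\(\r\n|\r|\n)/, "") # backslash + EOL is a line-wrap: remove both
    .gsub(/\r\n|\r/, "\n")      # any remaining CRLF or bare CR becomes LF
end

# "aaaa", backslash, LF, CR, "bbbb" (the \LFCR case from the test file)
raw = "aaaa\\\n\rbbbb"
p normalize_literal_eols(raw) # => "aaaa\nbbbb"
```

The backslash-LF pair disappears in the first pass, and the bare CR left behind is what turns into the LF that the page renderer then has to deal with.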

I'm not sure how an embedded LF is supposed to be displayed, but macOS Preview shows a space. So the test is currently wrong, but not in the way the PR currently suggests.

yob commented 2 years ago

yer, interesting. This is how evince renders the page (using libpoppler):

Screenshot from 2021-10-23 11-32-48

pdftotext (also using libpoppler) extracts it as a single line as well.

yob commented 2 years ago

Firefox (using pdf.js) renders like this:

Screenshot from 2021-10-23 11-34-24

Chrome renders it as a space though:

Screenshot from 2021-10-23 11-36-01

sebbASF commented 2 years ago

Screenshot 2021-10-23 at 01 35 45

This is what Preview shows

yob commented 2 years ago

Well, I'm confident that rendering it as two lines isn't the best option!

For now I think I'll merge this to get the suite green. There is a small chance that skipping zero-width characters will skip other characters that should be displayed, particularly if there are bugs in the character width calculation code.

If that happens, the other option we could explore is adding LF (and maybe other whitespace?) to the ignore logic here: https://github.com/yob/pdf-reader/blob/c849c0647ec9d97ab6de504cb2e2849eee614594/lib/pdf/reader/page_text_receiver.rb#L117-L119
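
A rough sketch of what that alternative might look like, filtering by character instead of by width. The names here (`CharRun`, `IGNORED_CHARS`, `text_ignoring_chars`) are hypothetical and do not reflect the actual code at the linked lines.

```ruby
# Hypothetical record and ignore list, for illustration only.
CharRun = Struct.new(:char, :width)
IGNORED_CHARS = ["\n"].freeze

# Skip characters on the ignore list, regardless of their computed width.
def text_ignoring_chars(runs, ignored = IGNORED_CHARS)
  runs.reject { |r| ignored.include?(r.char) }.map(&:char).join
end

runs = [
  CharRun.new("a", 5.0),
  CharRun.new("\n", 0.0), # the LF is dropped because it is on the list
  CharRun.new("b", 0.0)   # a width wrongly computed as zero is still kept
]

puts text_ignoring_chars(runs) # prints "ab"
```

With this approach a width calculation bug cannot hide a visible character, but every character that should be skipped has to be listed explicitly.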

sebbASF commented 2 years ago

However, dropping spaces causes issues with some PDFs, which can then be rendered without the necessary gaps between words. Unfortunately the only example I have cannot be made public.

sebbASF commented 2 years ago

I've just tried opening textwraplfcr.pdf with the macOS Skim app.

This also shows a space between the words.

sebbASF commented 2 years ago

Adobe Acrobat Reader and Foxit also show a space (on macOS).