Closed yob closed 2 years ago
/cc @sebbASF
The test file in question has \LFCR in the middle of a string.
The \LF is a permitted line-wrap, so removing that leaves a bare CR.
This in turn must be converted to LF according to the spec.
That's what the test was trying to check, but it should perhaps not have relied on how the page renderer would treat it.
The removal of \LF and replacement of CR by LF is tested in parser_spec.
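As a rough illustration of the two rules described above (a backslash followed by an end-of-line marker is a line continuation, and any other CR or CRLF inside a literal string is read as LF), here is a minimal sketch. `normalize_literal_string` is a hypothetical helper written for this comment, not pdf-reader's actual parser code.

```ruby
# Hypothetical helper for illustration only; pdf-reader's parser is structured differently.
def normalize_literal_string(raw)
  raw
    .gsub(/\\(?:\r\n|\r|\n)/, "") # backslash + end-of-line marker is a line continuation: drop both
    .gsub(/\r\n|\r/, "\n")        # any remaining CR or CRLF is read as a single LF
end

# \LF followed by CR, as in the test file: the \LF is removed and the bare CR becomes LF.
normalize_literal_string("aaaa\\\n\rbbbb") # => "aaaa\nbbbb"
```

The continuation has to be removed first, otherwise the CR-to-LF pass could rewrite an end-of-line marker that belongs to the continuation.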
I'm not sure how an embedded LF is supposed to be displayed, but macOS Preview shows a space. So the test is currently wrong, but not in the way the PR currently suggests.
yer, interesting. This is how evince renders the page (using libpoppler):
pdftotext (also using libpoppler) extracts it as a single line as well.
Firefox (using pdf.js) renders like this:
Chrome renders it as a space though:
This is what Preview shows:
Well, I'm confident that rendering it as two lines isn't the best option!
For now I think I'll merge this to get the suite green. There is a small chance that skipping zero-width characters will skip other characters that should be displayed - particularly if there are bugs in the character width calculation code.
If that happens, the other option we could explore is adding LF (and maybe other whitespace?) to the ignore logic here: https://github.com/yob/pdf-reader/blob/c849c0647ec9d97ab6de504cb2e2849eee614594/lib/pdf/reader/page_text_receiver.rb#L117-L119
However, dropping spaces causes issues with some PDFs, which can then be rendered without the necessary gaps between words. Unfortunately the only example I have cannot be made public.
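To make the alternative concrete, here is a small sketch of an ignore list keyed on the character itself rather than on its width. The names `IGNORED_CHARS` and `drop_ignored` are made up for illustration and are not taken from `page_text_receiver.rb`; the real ignore logic may look quite different.

```ruby
# Hypothetical ignore list: only LF is dropped, ordinary spaces are deliberately kept.
IGNORED_CHARS = ["\n"].freeze

# Returns the characters that should still be passed on for layout.
def drop_ignored(chars)
  chars.reject { |c| IGNORED_CHARS.include?(c) }
end

drop_ignored(["a", "\n", " ", "b"]) # => ["a", " ", "b"]
```

Keeping `" "` out of the list reflects the concern above: dropping spaces risks losing the gaps between words in some PDFs.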
I've just tried opening textwraplfcr.pdf with the macOS Skim app.
This also shows a space between the words.
Adobe Acrobat Reader and Foxit also show a space (on macOS)
A spec added in #370 included a PDF that visually included the following text:
... but the content stream included a zero-width LF character between aaaa and bbbb and pdf-reader text extraction looked like this:
This filters the zero-width LF out, so text output matches the visual appearance of the PDF:
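The idea can be sketched roughly as follows. This is only an illustration under assumptions: `Run` is a hypothetical stand-in for a positioned piece of text with a computed width, and `visible_runs` is not the method this PR actually changes.

```ruby
# Hypothetical stand-in for a piece of extracted text and its computed width.
Run = Struct.new(:text, :width)

# Keep only runs that occupy horizontal space on the page.
def visible_runs(runs)
  runs.reject { |run| run.width.zero? }
end

runs = [Run.new("aaaa", 24.0), Run.new("\n", 0), Run.new("bbbb", 24.0)]
visible_runs(runs).map(&:text) # => ["aaaa", "bbbb"]
```

A filter like this trusts the width calculation: any character whose width is wrongly computed as zero would be dropped too.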