wkhtmltopdf / wkhtmltopdf

Convert HTML to PDF using Webkit (QtWebKit)
https://wkhtmltopdf.org
GNU Lesser General Public License v3.0

Certain Microsoft Font Glyphs treated as "Hidden" by Acrobat #3663

Open TobalJackson opened 7 years ago

TobalJackson commented 7 years ago

Hello, I have a peculiar issue with PDFs generated by wkhtmltopdf. I'm running Arch Linux with the latest updates installed, along with the ttf-ms-win10 fonts package, which works well for using and displaying Microsoft fonts in most cases. The wkhtmltopdf version I'm using is:

community/wkhtmltopdf 0.12.4-1 [installed]
    Command line tools to render HTML into PDF and various image formats

The Adobe Acrobat version I'm using is Adobe Acrobat XI Pro 11.0.14.16.

I have a workflow that involves converting HTML to PDF and then using Adobe Acrobat (on Windows) to work on the PDFs. A new step I've been trying to integrate is the "Remove Hidden Information" tool (Tools > Protection > Remove Hidden Information), which removes metadata and hidden text from the PDF file. This is where my issue lies.

When I use this tool on a PDF that contains certain Microsoft fonts (as produced by wkhtmltopdf on my Arch system, detailed above), Acrobat for some reason identifies certain characters (predominantly narrow vertical characters like i, I, l, 1) as "hidden text" and removes them.

This is what the PDF looks like from wkhtmltopdf: [image: no_missing_letters]

And after "Remove Hidden Information" from Acrobat is run: [image: missing_letters]

As you can see, the text becomes unusable. (In this particular example only the lowercase letter l (ell) seems to be affected; in other examples the other characters listed above are affected as well.)

I've tried to track down this issue, but I've gone in circles for a long time across various Adobe forums and this project's issues, to no avail.

I've attached a sample HTML file (test.html) and the PDF resulting from using wkhtmltopdf (test.pdf) which, when opened in Acrobat and having the Remove Hidden Information tool run on it, produces the last PDF file (test_acro.pdf), which is illegible and unusable.

I've uninstalled the MS fonts, letting wkhtmltopdf fall back to my regular (non-Microsoft) DejaVu fonts, and the issue disappears. I'm unsure how to proceed, since everything except this particular scenario has worked well in the past, but going forward I'd like to figure out why these particular characters are being treated as hidden text by Adobe Acrobat.

Running pdffonts on the first pdf (output by wkhtmltopdf) reveals:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Tahoma                               CID TrueType      Identity-H       yes no  yes      7  0
Calibri                              CID TrueType      Identity-H       yes no  yes      8  0

And after running the Remove Hidden Information tool in Acrobat:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Tahoma-Bold                          CID TrueType      Identity-H       yes no  yes     20  0
Tahoma                               CID TrueType      Identity-H       yes no  yes     23  0
Calibri                              CID TrueType      Identity-H       yes no  yes     26  0
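For anyone who wants to compare the two pdffonts listings programmatically rather than by eye, here is a minimal Python sketch. It is only an illustration: `FontRow` and `parse_pdffonts` are hypothetical helper names, and the parser assumes that font names contain no whitespace and that the columns match the layout shown above.

```python
from typing import List, NamedTuple


class FontRow(NamedTuple):
    name: str
    font_type: str
    encoding: str
    embedded: bool
    subset: bool
    unicode_map: bool


def parse_pdffonts(text: str) -> List[FontRow]:
    """Parse pdffonts output (header line, dashed line, then one row per font).

    Assumption: the font name is a single token; the type may span several
    tokens (e.g. "CID TrueType"), so the fixed columns are read from the right.
    """
    rows = []
    for line in text.strip().splitlines()[2:]:  # skip header and separator
        tokens = line.split()
        if len(tokens) < 8:  # name, type, encoding, emb, sub, uni, 2 x object ID
            continue
        rows.append(FontRow(
            name=tokens[0],
            font_type=" ".join(tokens[1:-6]),
            encoding=tokens[-6],
            embedded=tokens[-5] == "yes",
            subset=tokens[-4] == "yes",
            unicode_map=tokens[-3] == "yes",
        ))
    return rows


HEADER = (
    "name type encoding emb sub uni object ID\n"
    "---- ---- -------- --- --- --- ---------\n"
)

# Rows copied from the two listings above (column padding collapsed).
before = parse_pdffonts(HEADER + (
    "Tahoma CID TrueType Identity-H yes no yes 7 0\n"
    "Calibri CID TrueType Identity-H yes no yes 8 0\n"
))
after = parse_pdffonts(HEADER + (
    "Tahoma-Bold CID TrueType Identity-H yes no yes 20 0\n"
    "Tahoma CID TrueType Identity-H yes no yes 23 0\n"
    "Calibri CID TrueType Identity-H yes no yes 26 0\n"
))

# Fonts that pdffonts reports only after Acrobat has rewritten the file.
added = {row.name for row in after} - {row.name for row in before}
print(sorted(added))  # ['Tahoma-Bold']
```

On the two listings above, the only difference this reports is the extra Tahoma-Bold entry; the embedded and subset flags are the same in both.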

Seeing that pdffonts lists the fonts as embedded, I've tried to "unembed" them (or otherwise "fix" the PDF) in various ways (using gs, qpdf, mutool, pdftocairo, pdftk), but to no avail. The "PDF Optimizer" dialog for the PDF output by wkhtmltopdf shows no embedded fonts: [image: unembed_fonts_missing]

And Acrobat shows the following for both test.pdf and test_acro.pdf: [image: adobe_fonts_wkhtmltopdf]

Which is strange, since Acrobat seems to consider the fonts subsetted while pdffonts doesn't.

I'm pretty stumped at this point. If anyone would like any additional information in relation to this issue, please let me know.
Thank you, Chris

test.html.txt test_acro.pdf test.pdf

TobalJackson commented 6 years ago

I solved this issue by switching to puppeteer.