Hello,
I have a peculiar issue with PDFs generated using this program. I'm running Arch Linux with the latest updates installed, as well as the ttf-ms-win10 fonts package (more info), which seems to work well for using and displaying Microsoft fonts in most cases. The wkhtmltopdf version I'm using is:
community/wkhtmltopdf 0.12.4-1 [installed]
Command line tools to render HTML into PDF and various image formats
And the version of Adobe Acrobat I'm using is Adobe Acrobat XI Pro 11.0.14.16.
I have a workflow which involves converting HTML to PDF, and then using Adobe Acrobat (on Windows) to work on the PDFs. A new step which I've been trying to integrate is using the "Remove Hidden Information" tool (Tools > Protection > Remove Hidden Information) to remove metadata and hidden text within the PDF file. This is where my issue lies.
When using this tool on a PDF which contains certain Microsoft fonts (as produced with wkhtmltopdf on my Arch system as detailed above), Acrobat for whatever reason identifies certain characters (predominantly narrow vertical characters like i, I, l, 1) as "hidden text" and subsequently removes them.
This is what the PDF looks like from wkhtmltopdf:
And after "Remove Hidden Information" from Acrobat is run:
As you can see, it becomes unusable. (In this particular example it seems to affect only the lowercase letter l (ell); in other examples it affects the other characters listed above.)
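To pin down exactly which characters get dropped, the text of the two PDFs can be compared character by character. The following is only a minimal sketch: it assumes the text of each PDF has already been extracted to a string (for example with a tool such as pdftotext), and the function name `missing_characters` is hypothetical, not part of any tool mentioned here.

```python
from collections import Counter


def missing_characters(before: str, after: str) -> dict:
    """Return {character: count} for characters that occur in `before`
    more often than in `after`. Whitespace is ignored, since text
    extraction tends to change spacing anyway."""
    lost = Counter(before) - Counter(after)  # keeps only positive differences
    return {ch: n for ch, n in lost.items() if not ch.isspace()}
```

Calling it with the text extracted from test.pdf as `before` and from test_acro.pdf as `after` should list the affected characters and how many of each were removed, provided the extraction itself is reliable for these fonts.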
I've tried tracking down this issue, but have gone around in circles for a long time across various Adobe forums, as well as the issues for this project, to no avail.
I've attached a sample HTML file (test.html) and the PDF resulting from using wkhtmltopdf (test.pdf) which, when opened in Acrobat and having the Remove Hidden Information tool run on it, produces the last PDF file (test_acro.pdf), which is illegible and unusable.
I've uninstalled the MS fonts, letting wkhtmltopdf fall back to my regular (non-Microsoft) DejaVu fonts, and the issue disappears. I'm unsure how to proceed, since everything except this particular scenario has worked well in the past, but going forward I'd like to figure out why these particular characters are being treated as hidden text by Adobe Acrobat.
Running pdffonts on the first pdf (output by wkhtmltopdf) reveals:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Tahoma                               CID TrueType      Identity-H       yes no  yes      7  0
Calibri                              CID TrueType      Identity-H       yes no  yes      8  0
And after running the Remove Hidden Information tool in Acrobat:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Tahoma-Bold                          CID TrueType      Identity-H       yes no  yes     20  0
Tahoma                               CID TrueType      Identity-H       yes no  yes     23  0
Calibri                              CID TrueType      Identity-H       yes no  yes     26  0
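For anyone who wants to compare the two listings mechanically, here is a rough sketch of a parser for this kind of output. It assumes the text is the fixed-width table that pdffonts prints, with a row of dashes marking the column widths; the function names `parse_pdffonts` and `new_fonts` are made up for illustration. Very long font names that overflow the first column would break the column slicing.

```python
import re


def parse_pdffonts(text: str) -> list:
    """Parse a fixed-width pdffonts table into a list of dicts.

    Column boundaries are taken from the dashed separator row, so the
    input must keep its original spacing."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sep_idx = None
    for i, ln in enumerate(lines):
        if set(ln.strip()) <= {"-", " "}:
            sep_idx = i
            break
    if sep_idx is None or sep_idx == 0:
        raise ValueError("no header/separator row found")
    header, sep = lines[sep_idx - 1], lines[sep_idx]
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", sep)]
    names = [header[a:b].strip() for a, b in spans]
    rows = []
    for ln in lines[sep_idx + 1:]:
        cells = [ln[a:b].strip() for a, b in spans[:-1]]
        cells.append(ln[spans[-1][0]:].strip())  # last column runs to end of line
        rows.append(dict(zip(names, cells)))
    return rows


def new_fonts(before: str, after: str) -> list:
    """Font names listed in `after` but not in `before`, sorted."""
    old = {row["name"] for row in parse_pdffonts(before)}
    new = {row["name"] for row in parse_pdffonts(after)}
    return sorted(new - old)
```

Fed the two listings above, this should report Tahoma-Bold as the only font name that appears after the Acrobat step. The listings themselves could be captured by running `pdffonts test.pdf` and `pdffonts test_acro.pdf` and saving the output.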
Seeing that the fonts are listed as "embedded" by pdffonts, I've tried to "unembed" them (or otherwise "fix" the PDF) in various ways (using gs, qpdf, mutool, pdftocairo, pdftk), but to no avail. Looking at the "PDF Optimizer" dialog for the PDF output by wkhtmltopdf shows no embedded fonts:
And Acrobat shows the following for both test.pdf and test_acro.pdf:
This is strange, since Acrobat seems to consider the fonts subsetted while pdffonts doesn't.
I'm pretty stumped at this point. If anyone would like any additional information in relation to this issue, please let me know.
Thank you,
Chris
test.html.txt test_acro.pdf test.pdf