wkhtmltopdf / wkhtmltopdf

Convert HTML to PDF using Webkit (QtWebKit)
https://wkhtmltopdf.org
GNU Lesser General Public License v3.0

Overlapping text when using text-align:justify #4532

Open Heloukli opened 4 years ago

Heloukli commented 4 years ago

Hello, I'm trying to convert this HTML to PDF using wkhtmltopdf 0.12.5 with the options: --stop-slow-scripts --disable-javascript --zoom 1.25

I'm getting this as a result:

image

I know that there is a CRLF between the two overlapping words; removing it solves the issue, and so does replacing text-align:justify with text-align:left.

Is there a way to correctly convert this HTML without having to edit its content?

I thank you in advance for your help.
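
For anyone trying to reproduce this, the invocation described above can be sketched as a small Python helper. This is only an illustration: `build_command` and the file names are hypothetical, and only the three flags come from the report.

```python
def build_command(html_path, pdf_path, zoom=1.25):
    """Return the argv list for the wkhtmltopdf call described in the report.

    The function name and the paths are hypothetical; the flags
    (--stop-slow-scripts, --disable-javascript, --zoom) are the ones
    quoted in the report.
    """
    return [
        "wkhtmltopdf",
        "--stop-slow-scripts",
        "--disable-javascript",
        "--zoom", str(zoom),
        html_path,
        pdf_path,
    ]
```

If the wkhtmltopdf binary is on the PATH, the returned list can be passed to `subprocess.run(cmd, check=True)`.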

PhilterPaper commented 4 years ago

I do not get the overlap you're reporting, with or without the command-line flags you used. However, I do see something odd when using your flags: the line is split (same place as yours) between "judiciaire de la" and "procédure collective" (on the next line), but the "a" in "de la" vanishes ("de l", next line "procédure").

I'm using wkHTMLtoPDF 0.12.5 with patched qt, on Windows 10.

Just for grins and giggles, what happens if you replace the HTML entities (&#233; etc.) with actual UTF-8 encoded characters (and add that encoding to the HTML <head>)? If the original HTML uses the entities, that might not help you much, but it might narrow down the problem.

Heloukli commented 4 years ago

Hello Phil,

Thank you for your answer,

So I tried replacing the HTML entities with the original characters (à, é, etc.) after adding UTF-8 to the HTML head. I also tried replacing the entities with (a, e, etc.), then with (é, à and ').

The wk converter behavior doesn't seem to change; I still have the same two words overlapping.

I'm uploading my HTML file as .txt, since the online editor seems to change the expected output.

cor1.txt

PhilterPaper commented 4 years ago

I don't know if it was your intention, but all the accented characters are gone from the HTML file (replaced by unaccented ASCII). I now get the overlapping "la" and "suite" that you originally reported, but only when using the --zoom flag. So, it looks like UTF-8 is innocent, but some combination of --zoom and text-align:justify is causing the problem.

If I change the sentence split point (physical end of line) to another word, the last word of the first line and the first word of the second line still overlap, although at different places (i.e., the second word doesn't start at the same place as the first word).
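
Since the overlap seems to follow the physical line break in the source, one possible pre-processing step is to collapse line breaks into single spaces before conversion. This is a minimal sketch, not a verified fix: `collapse_line_breaks` is a hypothetical helper, it assumes the document has no `<pre>` blocks or scripts that depend on newlines, and whether a single space in place of the CRLF avoids the overlap is an assumption.

```python
import re

# One or more line breaks (CRLF, CR or LF), plus any spaces or tabs around them.
_LINE_BREAKS = re.compile(r"[ \t]*(?:\r\n|\r|\n)+[ \t]*")


def collapse_line_breaks(html: str) -> str:
    """Replace every run of line breaks with a single space.

    Hypothetical helper; it treats the whole input as flowing text and
    does not try to protect <pre>, <script> or <textarea> content.
    """
    return _LINE_BREAKS.sub(" ", html)
```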

Heloukli commented 4 years ago

It was my intention to remove all accented characters. As I mentioned in my previous post, I was trying to understand whether accented characters are what causes the words to overlap.

And since the output remained the same (the same two words overlapping), I'm left with text-align:justify as the only possible cause of the issue.

Even without --zoom it still overlaps; it is just less visible.

Expected output: image

Without zoom: image

With zoom 1.25: image

ernst77 commented 4 years ago

I am having the same issue with Greek text. Any solutions so far? It renders fine in other browsers; even the Qt browser renders it fine.

Options I use:

'margin-top' => 25,
'margin-bottom' => 25,
'margin-right' => 0,
'margin-left' => 0,
'zoom' => 1,
'disable-smart-shrinking' => true,
'enable-javascript' => true,
'no-stop-slow-scripts' => true,
'image-quality' => 20,
'dpi' => 100,
'lowquality' => true,

image

Heloukli commented 4 years ago

@ErnestStaug The workaround I found was to pre-process my HTML files, replacing all text-align:justify with text-align:center before converting with WebKit. I wonder whether it fixes the overlapping in your case.
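
A pre-processing step like the one described above could look roughly like this in Python. It is a sketch under assumptions: `replace_justify` is a hypothetical name, and the regex only covers plain `text-align: justify` declarations in inline styles or `<style>` blocks, not external stylesheets.

```python
import re

# Matches "text-align:justify" with optional whitespace around the colon,
# in any letter case.
_JUSTIFY = re.compile(r"text-align\s*:\s*justify", re.IGNORECASE)


def replace_justify(html: str, alignment: str = "center") -> str:
    """Rewrite every text-align:justify declaration to another alignment.

    Hypothetical helper; the default "center" follows the workaround
    described above, and "left" is the alternative from the first post.
    """
    return _JUSTIFY.sub(f"text-align:{alignment}", html)
```

The rewritten string can then be saved to a temporary file and handed to the converter in place of the original HTML.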

victorlmtavares commented 2 years ago

For anyone who is still stuck on this, what did the trick for me was adding this CSS:

<style>
html, body {
  text-align: justify;
  text-justify: inter-word;
  text-rendering: optimizeLegibility;
  word-break: break-word;
}
</style>

Hope this helps!
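
If the HTML files cannot be edited by hand, the CSS above could be injected just before conversion. The following is a minimal sketch, assuming the documents are available as strings; `inject_style` and `STYLE` are hypothetical names, and the fallback of prepending the block when no `</head>` is found is an illustrative choice.

```python
import re

# The CSS from the comment above, kept on one line for easy insertion.
STYLE = (
    "<style> html, body { text-align: justify; text-justify: inter-word; "
    "text-rendering: optimizeLegibility; word-break: break-word; } </style>"
)

_HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_style(html: str, style: str = STYLE) -> str:
    """Insert the style block right before </head>.

    Hypothetical helper; if the document has no </head> tag, the block
    is prepended to the document instead.
    """
    match = _HEAD_CLOSE.search(html)
    if match is None:
        return style + html
    return html[:match.start()] + style + html[match.start():]
```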