wooio / htmltopdf-java

An HTML to PDF conversion library written in Java, based on wkhtmltopdf.
MIT License

Encoding of Right Double Quote #32

Closed: aelfric closed this issue 2 years ago

aelfric commented 5 years ago

I know there have been several other issues with character encoding that were solved by including a proper meta tag. I've found one that I can't resolve that way. Here is a minimal example to reproduce:

    pdfHtml = "<!doctype html><html><head><meta charset=\"utf-8\"></head><body><del>Statement of Financial Accounting Standards No. 7,</del><ins>FASB ASC Topic 915,</ins><span> </span><del>“Accounting and Reporting by </del><span>Development Stage </span><del>Enterprises” </del><ins>Entities, </ins></li></body></html>";
    String fileName = "temp_pdf_" + System.currentTimeMillis() + ".pdf";
    HtmlToPdf.create()
        .object(HtmlToPdfObject.forHtml(
            new String(pdfHtml.getBytes(StandardCharsets.UTF_8.name())))
            .loadImages(true)
            .pageCount(true))
        .convert(new RegMapperConfig().getTempFolderLocation() + "/" + fileName);

The output this produces is here: https://github.com/wooio/htmltopdf-java/files/3704851/change_report.60.pdf

For some reason the opening (left) double quote is rendered correctly, but the closing (right) double quote does not appear.

I've tried running wkhtmltopdf on the command line with the same input and it correctly renders both quotes. Any suggestions?

benbarkay commented 5 years ago

Hi Frank,

Sorry for the late reply; it is the holiday season here. Try replacing each right double quote with "\u201d". Alternatively, you can change the default character encoding of the machine that compiles your code to UTF-8, or pass -encoding UTF-8 to javac when you compile the classes. Personally, I recommend escaping special Unicode characters if you must keep them hardcoded or, even better, using ResourceBundle to load your strings from a localized configuration file.
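A minimal sketch of the two recommended options, using only the JDK. The class name, the method names, and the report.title key are made up for illustration and are not part of htmltopdf-java. The bundle is read from an in-memory string to keep the example self-contained; a real project would normally load a .properties file instead. The misdecode helper uses ISO-8859-1 as a stand-in for whatever single-byte default encoding might be in play, which is an assumption about the cause rather than a confirmed diagnosis.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class QuoteEscapeSketch {
    // Escaped quotes keep the source file pure ASCII, so the compiled
    // string no longer depends on the encoding javac assumes.
    static final String LEFT_QUOTE = "\u201c";
    static final String RIGHT_QUOTE = "\u201d";

    // Option 1: hardcoded text with escaped curly quotes.
    static String escapedTitle() {
        return LEFT_QUOTE
            + "Accounting and Reporting by Development Stage Enterprises"
            + RIGHT_QUOTE;
    }

    // Option 2: load the text from a resource bundle. The properties
    // format decodes the same escapes when the bundle is read.
    // The key "report.title" is hypothetical.
    static String bundledTitle() {
        String props = "report.title="
            + "\\u201cAccounting and Reporting by Development Stage Enterprises\\u201d\n";
        try {
            ResourceBundle bundle = new PropertyResourceBundle(new StringReader(props));
            return bundle.getString("report.title");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Illustration only: decode UTF-8 bytes with a single-byte charset,
    // which turns one curly quote into three unrelated characters.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(escapedTitle().equals(bundledTitle()));
        System.out.println(Integer.toHexString(escapedTitle().codePointAt(0)));
        System.out.println(misdecode(RIGHT_QUOTE).length());
    }
}
```

Either string can be passed straight to HtmlToPdfObject.forHtml. The new String(pdfHtml.getBytes(...)) round trip in the original snippet is probably unnecessary, and because new String(byte[]) decodes with the platform default charset, it can corrupt non-ASCII characters by itself on a JVM whose default is not UTF-8.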

benbarkay commented 5 years ago

The above also applies to #31