wkhtmltopdf / wkhtmltopdf

Convert HTML to PDF using Webkit (QtWebKit)
https://wkhtmltopdf.org
GNU Lesser General Public License v3.0

[Feature Request] Allow for direct footer and header html #5015

Open Chaostheorie opened 3 years ago

Chaostheorie commented 3 years ago

Description

To supply a header and/or footer, a user or developer currently has to specify a URL pointing to a file or web page. That is understandable for the CLI, but it is very restrictive when using wkhtmltox (the C library itself, not a wrapper). Passing the HTML inline, e.g. as a base64-encoded data URI, is also blocked. The requested feature is therefore to allow HTML code to be used directly as the value of the footer.htmlUrl / header.htmlUrl object settings.

Related code: https://github.com/wkhtmltopdf/wkhtmltopdf/blob/5b26acc41d8209d3b4c4018ac7350b8cea64fa73/src/lib/pdfconverter.cc#L171-L209

Example

Below is a modified version of the current C example for PDF conversion with the wkhtmltox library.

/* -*- mode: c++; tab-width: 4; indent-tabs-mode: t; eval: (progn (c-set-style "stroustrup") (c-set-offset 'innamespace 0)); -*-
 * vi:set ts=4 sts=4 sw=4 noet :
 *
 * Copyright 2010, 2011 wkhtmltopdf authors
 *
 * This file is part of wkhtmltopdf.
 *
 * wkhtmltopdf is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * wkhtmltopdf is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with wkhtmltopdf.  If not, see <http://www.gnu.org/licenses/>.
 */

/* This is a simple example program showing how to use the wkhtmltopdf c bindings */
#include <stdbool.h>
#include <stdio.h>
#include <wkhtmltox/pdf.h>

/* Print out loading progress information */
void progress_changed(wkhtmltopdf_converter * c, int p) {
    printf("%3d%%\r",p);
    fflush(stdout);
}

/* Print loading phase information */
void phase_changed(wkhtmltopdf_converter * c) {
    int phase = wkhtmltopdf_current_phase(c);
    printf("%s\n", wkhtmltopdf_phase_description(c, phase));
}

/* Print a message to stderr when an error occurs */
void error(wkhtmltopdf_converter * c, const char * msg) {
    fprintf(stderr, "Error: %s\n", msg);
}

/* Print a message to stderr when a warning is issued */
void warning(wkhtmltopdf_converter * c, const char * msg) {
    fprintf(stderr, "Warning: %s\n", msg);
}

/* Main function: convert a web page to PDF */
int main() {
    wkhtmltopdf_global_settings * gs;
    wkhtmltopdf_object_settings * os;
    wkhtmltopdf_converter * c;

    /* Init wkhtmltopdf in graphics-less mode */
    wkhtmltopdf_init(false);

    /*
     * Create a global settings object used to store options that are not
     * related to input objects. Note that control of this object is passed to
     * the converter later, which is then responsible for freeing it
     */
    gs = wkhtmltopdf_create_global_settings();
    /* We want the result to be stored in the file called test.pdf */
    wkhtmltopdf_set_global_setting(gs, "out", "test.pdf");

    wkhtmltopdf_set_global_setting(gs, "load.cookieJar", "myjar.jar");
    /*
     * Create an input object settings object that is used to store settings
     * related to an input object. Note again that control of this object is
     * passed to the converter later, which is then responsible for freeing it
     */
    os = wkhtmltopdf_create_object_settings();
    /* We want to convert the qstring documentation page */
    wkhtmltopdf_set_object_setting(os, "page", "http://doc.qt.io/qt-5/qstring.html");

    /* We want to set a custom footer from a string. The string is built similarly to the data URIs used for image srcs */
    wkhtmltopdf_set_object_setting(os, "footer.htmlUrl", "data:text/html;base64,PGgxPlA8L2gxPg==");

    /* Create the actual converter object used to convert the pages */
    c = wkhtmltopdf_create_converter(gs);

    /* Call the progress_changed function when progress changes */
    wkhtmltopdf_set_progress_changed_callback(c, progress_changed);

    /* Call the phase_changed function when the phase changes */
    wkhtmltopdf_set_phase_changed_callback(c, phase_changed);

    /* Call the error function when an error occurs */
    wkhtmltopdf_set_error_callback(c, error);

    /* Call the warning function when a warning is issued */
    wkhtmltopdf_set_warning_callback(c, warning);

    /*
     * Add the settings object describing the qstring documentation page
     * to the list of pages to convert. Objects are converted in the order in which
     * they are added
     */
    wkhtmltopdf_add_object(c, os, NULL);

    /* Perform the actual conversion */
    if (!wkhtmltopdf_convert(c))
        fprintf(stderr, "Conversion failed!\n");

    /* Output possible http error code encountered */
    printf("httpErrorCode: %d\n", wkhtmltopdf_http_error_code(c));

    /* Destroy the converter object since we are done with it */
    wkhtmltopdf_destroy_converter(c);

    /* We will no longer be needing wkhtmltopdf functionality */
    wkhtmltopdf_deinit();

    return 0;
}
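
The data URI in the example above is hard-coded. If inline HTML were accepted, the string would normally be built at runtime. Here is a minimal sketch of that step, assuming a helper named html_to_data_uri (hypothetical, not part of wkhtmltox) and the same "data:text/html;base64," prefix used in the example. It only produces the string; as described above, the library currently rejects it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of wkhtmltox): wraps an HTML string in a
 * base64 data URI. Returns a malloc'ed string the caller must free, or
 * NULL if allocation fails. */
char *html_to_data_uri(const char *html) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    static const char prefix[] = "data:text/html;base64,";
    size_t len = strlen(html);
    size_t enc_len = 4 * ((len + 2) / 3);   /* base64 output length */
    char *out = malloc(sizeof(prefix) - 1 + enc_len + 1);
    if (!out) return NULL;

    memcpy(out, prefix, sizeof(prefix) - 1);
    char *p = out + sizeof(prefix) - 1;
    const unsigned char *in = (const unsigned char *)html;

    for (size_t i = 0; i < len; i += 3) {
        /* Pack up to three input bytes into one 24-bit group */
        unsigned long v = (unsigned long)in[i] << 16;
        if (i + 1 < len) v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < len) v |= in[i + 2];

        /* Emit four 6-bit symbols, padding with '=' past the input end */
        *p++ = tbl[(v >> 18) & 0x3F];
        *p++ = tbl[(v >> 12) & 0x3F];
        *p++ = (i + 1 < len) ? tbl[(v >> 6) & 0x3F] : '=';
        *p++ = (i + 2 < len) ? tbl[v & 0x3F] : '=';
    }
    *p = '\0';
    return out;
}
```

Calling html_to_data_uri("<h1>P</h1>") yields the same value that is passed to footer.htmlUrl in the example, so the result could be handed to wkhtmltopdf_set_object_setting and freed afterwards.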

PhilterPaper commented 3 years ago

According to the "help":

      --default-header                Add a default header, with the name of the
                                      page to the left, and the page number to
                                      the right, this is short for:
                                      --header-left='[webpage]'
                                      --header-right='[page]/[toPage]' --top 2cm
                                      --header-line

Headers And Footer Options:
      --footer-center <text>          Centered footer text
      --footer-font-name <name>       Set footer font name (default Arial)
      --footer-font-size <size>       Set footer font size (default 12)
      --footer-html <url>             Adds a html footer
      --footer-left <text>            Left aligned footer text
      --footer-line                   Display line above the footer
      --no-footer-line                Do not display line above the footer
                                      (default)
      --footer-right <text>           Right aligned footer text
      --footer-spacing <real>         Spacing between footer and content in mm
                                      (default 0)
      --header-center <text>          Centered header text
      --header-font-name <name>       Set header font name (default Arial)
      --header-font-size <size>       Set header font size (default 12)
      --header-html <url>             Adds a html header
      --header-left <text>            Left aligned header text
      --header-line                   Display line below the header
      --no-header-line                Do not display line below the header
                                      (default)
      --header-right <text>           Right aligned header text
      --header-spacing <real>         Spacing between header and content in mm
                                      (default 0)
      --replace <name> <value>        Replace [name] with value in header and
                                      footer (repeatable)
Footers And Headers:
  Headers and footers can be added to the document by the --header-* and
  --footer* arguments respectfully.  In header and footer text string supplied
  to e.g. --header-left, the following variables will be substituted.

   * [page]       Replaced by the number of the pages currently being printed
   * [frompage]   Replaced by the number of the first page to be printed
   * [topage]     Replaced by the number of the last page to be printed
   * [webpage]    Replaced by the URL of the page being printed
   * [section]    Replaced by the name of the current section
   * [subsection] Replaced by the name of the current subsection
   * [date]       Replaced by the current date in system local format
   * [isodate]    Replaced by the current date in ISO 8601 extended format
   * [time]       Replaced by the current time in system local format
   * [title]      Replaced by the title of the of the current page object
   * [doctitle]   Replaced by the title of the output document
   * [sitepage]   Replaced by the number of the page in the current site being converted
   * [sitepages]  Replaced by the number of pages in the current site being converted

  As an example specifying --header-right "Page [page] of [toPage]", will result
  in the text "Page x of y" where x is the number of the current page and y is
  the number of the last page, to appear in the upper left corner in the
  document.

  Headers and footers can also be supplied with HTML documents. As an example
  one could specify --header-html header.html, and use the following content in
  header.html:

  <!DOCTYPE html>
  <html><head><script>
  function subst() {
      var vars = {};
      var query_strings_from_url = document.location.search.substring(1).split('&');
      for (var query_string in query_strings_from_url) {
          if (query_strings_from_url.hasOwnProperty(query_string)) {
              var temp_var = query_strings_from_url[query_string].split('=', 2);
              vars[temp_var[0]] = decodeURI(temp_var[1]);
          }
      }
      var css_selector_classes = ['page', 'frompage', 'topage', 'webpage', 'section', 'subsection', 'date', 'isodate', 'time', 'title', 'doctitle', 'sitepage', 'sitepages'];
      for (var css_class in css_selector_classes) {
          if (css_selector_classes.hasOwnProperty(css_class)) {
              var element = document.getElementsByClassName(css_selector_classes[css_class]);
              for (var j = 0; j < element.length; ++j) {
                  element[j].textContent = vars[css_selector_classes[css_class]];
              }
          }
      }
  }
  </script></head><body style="border:0; margin: 0;" onload="subst()">
  <table style="border-bottom: 1px solid black; width: 100%">
    <tr>
      <td class="section"></td>
      <td style="text-align:right">
        Page <span class="page"></span> of <span class="topage"></span>
      </td>
    </tr>
  </table>
  </body></html>

  As can be seen from the example, the arguments are sent to the header/footer
  html documents in get fashion.

Does this do what you want?
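
For the plain-text options quoted above, the variable substitution can be pictured as a simple search-and-replace over the header or footer string. The sketch below is only an illustration of that idea, not wkhtmltopdf's actual code. The function name replace_var is hypothetical, and it matches names case-sensitively, which may differ from the real behaviour (the help text writes both [topage] and [toPage]).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not wkhtmltopdf code: replaces every occurrence of
 * "[name]" in text with value. Returns a malloc'ed string the caller must
 * free, or NULL if allocation fails. */
char *replace_var(const char *text, const char *name, const char *value) {
    size_t vlen = strlen(value);
    size_t klen = strlen(name) + 2;          /* length of "[name]" */
    char *key = malloc(klen + 1);
    if (!key) return NULL;
    snprintf(key, klen + 1, "[%s]", name);

    /* First pass: count matches so the output buffer can be sized */
    size_t count = 0;
    for (const char *p = text; (p = strstr(p, key)) != NULL; p += klen)
        count++;

    char *out = malloc(strlen(text) - count * klen + count * vlen + 1);
    if (!out) { free(key); return NULL; }

    /* Second pass: copy the text, swapping each match for value */
    char *w = out;
    const char *p = text, *hit;
    while ((hit = strstr(p, key)) != NULL) {
        memcpy(w, p, (size_t)(hit - p));
        w += hit - p;
        memcpy(w, value, vlen);
        w += vlen;
        p = hit + klen;
    }
    strcpy(w, p);                            /* copy the remaining tail */
    free(key);
    return out;
}
```

Applying it twice to "Page [page] of [topage]", first for page and then for topage, gives a string such as "Page 3 of 7". Note that the result is still plain text: nothing in this kind of substitution interprets markup.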

Chaostheorie commented 3 years ago

@PhilterPaper I must've missed those options in the doc. Thank you for your quick and helpful response. Will try it out later.

Chaostheorie commented 3 years ago

Well, I just tried it out, and the option doesn't support HTML or any other markup. In my case that is required, since the footer contains a link to the license of the converted document.

Example content:

Created by Cobalt - Jun  3, 2021 - <a href="https://creativecommons.org/licenses/by/4.0/">Licensed under CC-BY 4.0</a>

PhilterPaper commented 3 years ago

It appears to be disabling HTML tags in header and footer entries (see the tags as ordinary text). I'm not sure why they would have done that, but apparently they did. I don't see a flag to disable that behavior, so it would probably require a code change.

Chaostheorie commented 3 years ago

Would you mind pointing me to the relevant line? I would love to work on this issue, but I'm unable to find the piece of code that parses the header/footer values and puts them into objects. I have only been able to pinpoint the lines that draw them, not the logic that parses them. Any help would be appreciated.

https://github.com/wkhtmltopdf/wkhtmltopdf/blob/5b26acc41d8209d3b4c4018ac7350b8cea64fa73/src/lib/pdfconverter.cc#L649-L651

EDIT: After some more greps, it seems like the painter->drawText part should probably be replaced with something that renders HTML instead.

PhilterPaper commented 3 years ago

I don't know where in the code it's doing this. I'm just inferring, from the evidence that HTML tags come out as ordinary text, that someone must have decided to disable HTML in the header/footer code. I can't see it as being a security gain, so it was probably done without thinking it through. Or, someone didn't consider that HTML might be in a header or footer field, and wrote the code to handle only plain text. In the former case, you can probably look for something that grabs < and turns it into a &lt; or similar, before sending the text on to "normal" processing; in the latter, you would have to send the text to the normal HTML processing (inline tags only; you don't want to put a list or paragraph or other block items in a header or footer).
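
To make the "former case" concrete, here is roughly the kind of escaping routine one would be grepping for. This is a generic sketch with a made-up name (escape_html); it is not taken from wkhtmltopdf, and the drawText finding mentioned earlier in the thread suggests the code may simply draw plain text without any such step.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration, not wkhtmltopdf code: escapes the characters
 * that would otherwise start tags or entities. Returns a malloc'ed string
 * the caller must free, or NULL if allocation fails. */
char *escape_html(const char *text) {
    /* Worst case: every character becomes "&amp;" (5 bytes) */
    char *out = malloc(strlen(text) * 5 + 1);
    if (!out) return NULL;

    char *w = out;
    for (const char *p = text; *p; ++p) {
        const char *rep = NULL;
        switch (*p) {
            case '<': rep = "&lt;";  break;
            case '>': rep = "&gt;";  break;
            case '&': rep = "&amp;"; break;
        }
        if (rep) {
            size_t n = strlen(rep);
            memcpy(w, rep, n);               /* write the entity */
            w += n;
        } else {
            *w++ = *p;                       /* copy ordinary characters */
        }
    }
    *w = '\0';
    return out;
}
```

If text is passed through something like this before rendering, a footer such as the license link above shows up with its tags visible as literal characters, which matches the observed output.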

lstarky commented 1 year ago

I agree with this; using a static HTML file for header-html and footer-html is very limiting for a mail-merge type of application. All of the content, including header and footer, should be able to be generated on the fly, and the basic header and footer text options are not sufficient to allow formatting and images.

I especially like the idea of the header and footer content being embedded right in the main HTML and extracted somehow.
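
Until inline HTML is supported, one possible workaround for generated content is to write the footer to a file for each document and pass its URL. Below is a minimal sketch, assuming a helper named write_footer_file (hypothetical, not part of wkhtmltox), an absolute path that needs no percent-encoding, and that the library accepts a file:// URL for footer.htmlUrl. It has not been tested against the library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of wkhtmltox): writes html to path and
 * returns a malloc'ed "file://" URL for it, or NULL on failure.
 * Assumes path is absolute and needs no percent-encoding. */
char *write_footer_file(const char *path, const char *html) {
    FILE *f = fopen(path, "w");
    if (!f) return NULL;

    /* Treat a failed write or a failed close as an error */
    int failed = (fputs(html, f) == EOF);
    if (fclose(f) != 0) failed = 1;
    if (failed) return NULL;

    static const char scheme[] = "file://";
    size_t n = sizeof(scheme) - 1 + strlen(path) + 1;
    char *url = malloc(n);
    if (!url) return NULL;
    snprintf(url, n, "%s%s", scheme, path);
    return url;
}
```

The returned string would be passed to wkhtmltopdf_set_object_setting(os, "footer.htmlUrl", url) as in the example at the top of the thread, and the file removed once the conversion has finished. This still goes through the file system, so it does not replace the requested feature.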
