wooio / htmltopdf-java

An HTML to PDF conversion library written in Java, based on wkhtmltopdf.
MIT License
172 stars 97 forks

Concurrency support #9

Open benbarkay opened 6 years ago

benbarkay commented 6 years ago

Currently, all requests to wkhtmltopdf are synchronized to a single thread, thus it is not possible to execute conversions concurrently.

lgabeskiria commented 6 years ago

@benbarkay any updates?

benbarkay commented 6 years ago

@lgabeskiria it appears that there aren't many straightforward options. The idea that I'm currently examining is to load multiple separate copies of wkhtmltopdf.so, using the RTLD_LOCAL flag. However, I was not able to get JNA to load the library's dependencies with RTLD_LOCAL, so this might take a while or might not be reasonably possible.

Other projects that I've encountered that use wkhtmltopdf as a shared library seem to have stopped at synchronization (what 1.0.4 currently does) rather than supporting any concurrency. That makes supporting concurrency a very attractive goal, but it also suggests that making it happen may be impractical.

lgabeskiria commented 6 years ago

You can have a look at this project: https://github.com/rdvojmoc/DinkToPdf

benbarkay commented 6 years ago

@lgabeskiria they do not support concurrency. They are thread-safe, though (in a similar fashion to 1.0.4).

ymohammad commented 5 years ago

Hi @benbarkay ,

I have one question: how can I reproduce the concurrency issue? I tried the following code and it works fine; it creates all the PDFs. Please suggest how I can reproduce the problem. I am using Windows 10 Pro and Java 8.

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import io.woo.htmltopdf.HtmlToPdf;
import io.woo.htmltopdf.HtmlToPdfObject;

public class PDFToHTMLUsingLib
{
    public static void main(String[] args) throws MalformedURLException, InterruptedException {
        boolean isProcessBased = false;
        String pdfFilePath = "D:/htmltopdf/multithread/";

        ArrayList<HashMap<String, String>> list = getDataList(pdfFilePath);

        PDFToHTMLUsingLib obj = new PDFToHTMLUsingLib();
        obj.startAllThreads(list, isProcessBased);
        System.out.println("All threads have been started.");
    }

    public void startAllThreads(List<HashMap<String, String>> hashList, boolean isProcessBased) throws InterruptedException {
        for ( HashMap<String, String> eachMap : hashList) {
            Thread th = new Thread(new MyThread(eachMap.get("HTML_PATH"), eachMap.get("PDF_PATH"), isProcessBased));
            th.start();
            //th.join();
        }
    }
    class MyThread implements Runnable {

        private String htmlFilePath = "";
        private String outputFilePath = "";
        private boolean isProcessedBased = false;
        public MyThread(String htmlPath, String pdfPath, boolean isProcessBased) {
            this.htmlFilePath = htmlPath;
            this.outputFilePath = pdfPath;
            this.isProcessedBased = isProcessBased;
        }
        @Override
        public void run()
        {
            if (this.isProcessedBased) {
                createPDFUsingProcess();
            } else {
                createPDFUsingLib();
            }

        }
        private void createPDFUsingProcess()
        {
            try {
                String threadName = Thread.currentThread().getName();
                System.out.println("[" + threadName + "] Started the execution to generate PDF using Process.");
                ProcessBuilder processBuilder = new ProcessBuilder("wkhtmltopdf", htmlFilePath, outputFilePath);
                processBuilder.redirectErrorStream(true);
                Process process = processBuilder.start();
                BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
                String line;
                while ((line = reader.readLine()) != null)
                    System.out.println("[" + threadName + "] Process Output: " + line);
                process.waitFor();
                System.out.println("[" + threadName + "] Execution is Completed.");
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
        private void createPDFUsingLib()
        {
            String threadName = Thread.currentThread().getName();
            System.out.println("[" + threadName + "] Started the execution to generate PDF using Lib.");
            File file = new File(outputFilePath);
            boolean result = HtmlToPdf.create()
                    .object(HtmlToPdfObject.forUrl(htmlFilePath))
                    .convert(file.getPath());
            System.out.println("[" + threadName + "] Is converted.. " + result);
        }
    }

    private static ArrayList<HashMap<String, String>> getDataList(String pdfFilePath)
    {
        ArrayList<HashMap<String, String>> list = new ArrayList<HashMap<String, String>>();
        HashMap<String, String> eachMap = null;

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://github.com/jglick/jkillthread");
        eachMap.put("PDF_PATH", pdfFilePath + "jkillthread.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://developer.paypal.com/docs/api/invoicing/v1/");
        eachMap.put("PDF_PATH", pdfFilePath + "paypalinvoice.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://www.class-central.com/report/mooc-mba-top-b-schools/");
        eachMap.put("PDF_PATH", pdfFilePath + "mooc_mba.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://developers.google.com/machine-learning/crash-course/prereqs-and-prework#prerequisites");
        eachMap.put("PDF_PATH", pdfFilePath + "ml.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-expressions-and-variables/cc-6th-evaluating-expressions/v/expression-terms-factors-and-coefficients");
        eachMap.put("PDF_PATH", pdfFilePath + "kacademy.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://machinelearningmastery.com/machine-learning-in-python-step-by-step/");
        eachMap.put("PDF_PATH", pdfFilePath + "mlstepbystep.pdf");
        list.add(eachMap);

        eachMap = new HashMap<String, String>();
        eachMap.put("HTML_PATH", "https://www.class-central.com/report/mooc-mba-top-b-schools/");
        eachMap.put("PDF_PATH", pdfFilePath + "new_mooc_mba.pdf");
        list.add(eachMap);

        return list;
    }
}

owexroasia commented 5 years ago

https://wkhtmltopdf.org/libwkhtmltox/

These bindings are well documented and do not depend on Qt. Using them is the recommended way of interfacing with the PDF portion of libwkhtmltox.

fancywriter commented 4 years ago

@ymohammad your code should indeed create the PDFs without any issues, but are you sure it really runs in parallel? I mean, when creating hundreds of random PDFs on a 4-core processor, is it 4 times faster with 4 threads than with 1 thread? That's the idea. Are you saying there is no difference in performance between createPDFUsingProcess and createPDFUsingLib?

If the library has one synchronized block, it will be a bottleneck and the actual conversion will be sequential (and 3 of 4 threads will starve).
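The bottleneck described above can be seen without wkhtmltopdf at all. Here is a minimal sketch, assuming a stand-in "conversion" that only sleeps; the class name, `fakeConvert`, and `runBatch` are hypothetical and not part of htmltopdf-java. It times the same batch once with a shared lock around each task and once without it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SynchronizedBottleneckDemo {
    // A single shared lock, standing in for a library-wide synchronized block.
    private static final Object LOCK = new Object();

    // Hypothetical stand-in for a conversion: it only sleeps, it does not call wkhtmltopdf.
    static void fakeConvert(long millis, boolean serialized) throws InterruptedException {
        if (serialized) {
            synchronized (LOCK) {
                Thread.sleep(millis); // only one thread at a time can be in here
            }
        } else {
            Thread.sleep(millis);
        }
    }

    // Runs `tasks` fake conversions on `threads` threads and returns the elapsed wall time in ms.
    static long runBatch(int tasks, int threads, long millisPerTask, boolean serialized) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    fakeConvert(millisPerTask, serialized);
                    return null;
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for completion and surface any failure
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("with shared lock:    " + runBatch(4, 4, 200, true) + " ms");
        System.out.println("without shared lock: " + runBatch(4, 4, 200, false) + " ms");
    }
}
```

If the locked run takes roughly `tasks * millisPerTask` while the unlocked run takes roughly `millisPerTask`, the extra threads add nothing under the lock. Timing a real batch of conversions with 1 thread and with 4 threads in the same way would show whether the library call behaves like the locked case.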

@owexroasia it would be very strange if it didn't depend on Qt, since it uses qt-webkit to render HTML, am I right?

@benbarkay do you have any news? Am I right that the only current way to do it in parallel is to actually run wkhtmltopdf in separate processes (which some other libraries without JNA do)? Is that simply because of how the native library itself is implemented?
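The separate-process approach mentioned above could be sketched roughly as follows. This is only an illustration, assuming a `wkhtmltopdf` executable is on the PATH and is invoked as `wkhtmltopdf <input> <output>`, the same invocation as in the test program earlier in this thread. The class and method names are hypothetical and not part of htmltopdf-java.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProcessPoolConverter {

    // Builds the command line for one conversion: <executable> <source> <output.pdf>.
    static List<String> buildCommand(String executable, String source, String outputPdf) {
        return Arrays.asList(executable, source, outputPdf);
    }

    // Starts one child process and waits for it; returns its exit code.
    static int runOne(List<String> command) throws Exception {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)                       // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.INHERIT) // avoid blocking on a full pipe
                .start();
        return process.waitFor();
    }

    // Converts every (source -> output PDF) job, running at most `parallelism` processes at once.
    // Returns the exit code of each job, keyed by output path.
    static Map<String, Integer> convertAll(String executable, Map<String, String> jobs, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            Map<String, Future<Integer>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> job : jobs.entrySet()) {
                List<String> command = buildCommand(executable, job.getKey(), job.getValue());
                Callable<Integer> task = () -> runOne(command);
                pending.put(job.getValue(), pool.submit(task));
            }
            Map<String, Integer> exitCodes = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> entry : pending.entrySet()) {
                exitCodes.put(entry.getKey(), entry.getValue().get());
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> jobs = new LinkedHashMap<>();
        jobs.put("https://github.com/jglick/jkillthread", "jkillthread.pdf");
        jobs.put("https://www.class-central.com/report/mooc-mba-top-b-schools/", "mooc_mba.pdf");
        System.out.println(convertAll("wkhtmltopdf", jobs, 2));
    }
}
```

Each conversion runs in its own process, so no state is shared between conversions inside one JVM. The trade-off is process start-up cost per document and the need to have the executable installed.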