Why is converting PDF to PDF slower than using gs directly?

gerritgriebel commented 10 years ago

Our Java webapp takes a form PDF, adds form data, and sends it out by e-mail. On the receiver's side, OmniPage 6 processes such PDF attachments and transforms them into PDF/A. Unfortunately, the forms are empty in the converted PDF attachment. We have no way to influence processing on the receiver's side.

So I tried Ghostscript's PDF-to-PDF functionality on the command line and was surprised that it does a good job of producing a static PDF. I learned that ghost4j's high-level API is thread safe and could be used in a webapp. So I unzipped ghost4j on my development machine (Mac) and added three files to the root directory:

input.pdf, a sample 5 MB PDF with forms.
PDF2PDFExample.java

import java.io.File;
import java.io.FileOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.ByteArrayInputStream;
import org.apache.commons.io.IOUtils;
import org.ghost4j.converter.PDFConverter;
import org.ghost4j.converter.PSConverter;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.document.PSDocument;

public class PDF2PDFExample {

    public static void main(String[] args) {

        FileOutputStream fos = null;
        try{
            PDFDocument pdfInDocument = new PDFDocument();
            PSDocument psTempDocument = new PSDocument();

            // load PDF
            pdfInDocument.load(new File("input.pdf"));
            ByteArrayOutputStream psOutStream = new ByteArrayOutputStream();

            // convert to PostScript document
            PSConverter psConverter = new PSConverter();
            psConverter.convert(pdfInDocument, psOutStream);
            byte[] psByteArray = psOutStream.toByteArray();
            ByteArrayInputStream psInStream = new ByteArrayInputStream(psByteArray);
            psTempDocument.load(psInStream);

            // convert back to PDF document
            PDFConverter pdfConverter = new PDFConverter();
            pdfConverter.setPDFSettings(PDFConverter.OPTION_PDFSETTINGS_DEFAULT);
            // assign to fos so the finally block actually closes the output file
            fos = new FileOutputStream(new File("output-java.pdf"));
            pdfConverter.convert(psTempDocument, fos);

        } catch (Exception e) {
            System.out.println("ERROR: " + e.getMessage());
        } finally{
            IOUtils.closeQuietly(fos);
        }
    }
}

and

test.sh

#!/bin/bash
export LD_LIBRARY_PATH=/opt/local/lib
export CLASSPATH=$(find . -name \*.jar | tr '[:space:]' ':')
javac PDF2PDFExample.java
time java PDF2PDFExample
time gs -sDEVICE=pdfwrite -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output-gs.pdf input.pdf
ls -l input.pdf output-java.pdf output-gs.pdf

Ghostscript was already installed on my system via the MacPorts port "ghostscript"; I had to set $LD_LIBRARY_PATH though, see the script above. Then I ran the script. Output:

$ ./test.sh

real    0m11.407s
user    0m10.217s
sys 0m0.749s

real    0m4.808s
user    0m4.256s
sys 0m0.129s
-rw-r--r--@ 1 gg  staff  5157540 20 Feb 14:01 input.pdf
-rw-r--r--  1 gg  staff  1048975 21 Feb 19:02 output-gs.pdf
-rw-r--r--  1 gg  staff  1253399 21 Feb 19:02 output-java.pdf

Questions:

It seems to take twice as long using my approach. Is there a better way to convert PDF to PDF using thread-safe Java?
Why is the output of those two invocations different? I thought calling gs would also convert PDF to PS and PS to PDF, as I did in the Java code above.

zippy1978 commented 10 years ago

Hi,

Using the high-level API is slower because the input document is loaded and parsed... If you don't care about manipulating the document, you should use the core API like this: http://www.ghost4j.org/coreapisamples.html.

However, to be thread safe, consider using a queue to process one document at a time.
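
One way to get such a queue, as a minimal sketch: a single-threaded executor runs submitted jobs strictly one after another, so only one job talks to Ghostscript at a time. The class name SerialConversionQueue is hypothetical and not part of ghost4j; the job body (the actual core API call) is left to the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not part of ghost4j): runs conversion jobs one at a time. */
class SerialConversionQueue implements AutoCloseable {

    // A single worker thread means at most one job runs at any moment;
    // further jobs wait in the executor's internal queue in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Queues a job; the caller can block on the returned Future for its result. */
    <T> Future<T> submit(Callable<T> job) {
        return worker.submit(job);
    }

    /** Stops accepting new jobs; jobs that are already queued still run. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

In a webapp, each request thread would submit its conversion as a Callable and call get() on the returned Future, so requests block until their document has been processed.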

Regards

gerritgriebel commented 10 years ago

Hi, thanks for your prompt answer and for ghost4j, of course :). Ah, I added two more lines to the test script above to let gs do the same, i.e. separate PDF-to-PS and PS-to-PDF steps:

time gs -sDEVICE=ps2write -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output-gs.ps input.pdf
time gs -sDEVICE=pdfwrite -dPDFSETTINGS=/default -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output-gs-ps.pdf output-gs.ps

so the file is now processed the same way without Java. Output:

real    0m12.183s
user    0m10.186s
sys 0m0.785s

real    0m5.171s
user    0m4.242s
sys 0m0.131s

real    0m3.855s
user    0m2.953s
sys 0m0.370s

real    0m6.094s
user    0m5.170s
sys 0m0.216s
-rw-r--r--@ 1 gg  staff  5157540 20 Feb 14:01 input.pdf
-rw-r--r--@ 1 gg  staff  1253400 23 Feb 18:52 output-gs-ps.pdf
-rw-r--r--  1 gg  staff  1048975 23 Feb 18:51 output-gs.pdf
-rw-r--r--  1 gg  staff  1253399 23 Feb 18:51 output-java.pdf

So now we have about 6+4=10 seconds for invoking gs twice and 12 seconds with the ghost4j high-level API. And the output files output-gs-ps.pdf and output-java.pdf are nearly the same (1 byte difference). I see three possible ways:

1. use the Java code above and live with the performance as it is
2. implement queueing and call gs using the core API, as you suggested
3. implement a new converter that does PDF2PDF in one go

I think for the time being I stick with 1 and leave 3 as a suggestion for you :)
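
For reference, a rough sketch of what a one-pass conversion would need as input: the same switches as the direct gs call in test.sh, assembled as an argument list. The class Pdf2PdfArgs is hypothetical. The list could be handed to java.lang.ProcessBuilder to run gs as an external process, or converted to a String[] for the core API described on the samples page linked above (an assumption on my side; there the first element only serves as a program-name placeholder).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: argument list for a one-pass PDF to PDF conversion. */
class Pdf2PdfArgs {

    /** Mirrors the direct gs invocation from test.sh for the given paths. */
    static List<String> build(String inputPath, String outputPath) {
        List<String> args = new ArrayList<>();
        args.add("gs");                          // executable / program name
        args.add("-sDEVICE=pdfwrite");           // write PDF directly, no PS step
        args.add("-dPDFSETTINGS=/default");
        args.add("-dNOPAUSE");
        args.add("-dQUIET");
        args.add("-dBATCH");
        args.add("-sOutputFile=" + outputPath);
        args.add(inputPath);                     // input file goes last
        return args;
    }
}
```

Running it as a separate process would look like new ProcessBuilder(Pdf2PdfArgs.build("input.pdf", "output.pdf")).inheritIO().start().waitFor(), which avoids the thread-safety question at the cost of one process start per document.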

zippy1978 commented 10 years ago

Thank you for your feedback!

zippy1978 / ghost4j

Why is converting PDF to PDF slower than using gs directly? #30