xpmethod / opensyllabus

Compare Python PDF extraction libraries with sample files #30

Open mgorenstein opened 10 years ago

mgorenstein commented 10 years ago

Write up a mini-paper comparing the performance of various text extractors on a document with available plaintext (possibly a particular edition of the Bible).
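One way to score an extractor against a known plaintext is a word-level similarity ratio. The sketch below is only an illustration using the standard library's `difflib`; the function names `words` and `similarity` and the sample strings are assumptions, not part of the project's code.

```python
import difflib


def words(text):
    """Lowercase and split on any whitespace so layout differences are ignored."""
    return text.lower().split()


def similarity(extracted, reference):
    """Return a 0..1 word-level similarity between extracted and reference text."""
    matcher = difflib.SequenceMatcher(
        None, words(extracted), words(reference), autojunk=False
    )
    return matcher.ratio()


# Made-up sample strings standing in for the reference plaintext and extractor output.
reference = "the quick brown fox jumps over the lazy dog"
same_words_different_layout = "The quick  brown fox\njumps over the lazy dog"
missing_words = "the quick fox jumps over the dog"

print(similarity(same_words_different_layout, reference))
print(similarity(missing_words, reference))
```

Whitespace and case are normalized before comparison, so an extractor is penalized for dropped or reordered words rather than for line-break differences.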

grahamsack commented 10 years ago

Hi Mark -- I experimented a bit with PyPDF and PDFMiner for syllabus extraction. PyPDF seemed to be smoother to work with. I tried extracting syllabi into plain text and HTML. The extractors captured most of the text correctly but, in cases where the formatting was complicated or had lots of tables, they jumbled the reading order.

Best,

Graham
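The failure mode described above (text captured, order jumbled) can be told apart from missing text by comparing an order-insensitive score with an order-sensitive one. This is a rough sketch under that assumption; `token_recall`, `order_score`, and the sample strings are hypothetical names and data chosen for illustration.

```python
from collections import Counter
import difflib


def token_recall(extracted, reference):
    """Fraction of reference words that also appear in the extraction (order ignored)."""
    ref_counts = Counter(reference.lower().split())
    ext_counts = Counter(extracted.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count of each shared word.
    found = sum((ref_counts & ext_counts).values())
    return found / total


def order_score(extracted, reference):
    """Word-sequence similarity (0..1); it drops when the reading order is scrambled."""
    return difflib.SequenceMatcher(
        None, extracted.lower().split(), reference.lower().split(), autojunk=False
    ).ratio()


# Made-up example: a table read column by column instead of row by row.
reference = "week one reading week two essay week three exam"
jumbled = "week one week two week three reading essay exam"

print(token_recall(jumbled, reference), order_score(jumbled, reference))
```

High recall combined with a low order score suggests the extractor found the words but not the layout, which matches the table-heavy case.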

On Thu, May 22, 2014 at 12:41 PM, Mark Gorenstein notifications@github.com wrote:

Write up mini-paper comparing performance of various text-extractors on a document with available plaintext (possibly a particular edition of the bible).

denten commented 10 years ago

We are planning to do a more formal comparison. Stay tuned.

grahamsack commented 10 years ago

If you want to leverage it, I put my code for the extractor in the opensyllabus/Classifiers folder.

mgorenstein commented 10 years ago

Thanks, Graham.

mgorenstein commented 10 years ago

Libraries

Source Texts

mgorenstein commented 10 years ago

I'm going to move ahead with V1 given these resources. I'll make the platform flexible enough to support the addition of other PDF extractors in case we come across any serious contenders that I've missed.
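A platform that accepts new extractors later could be as small as a name-to-function registry. The following is a minimal sketch of that idea, not the project's actual design; `EXTRACTORS`, `register`, `run_all`, and both sample extractors are hypothetical, and the stand-ins do not call any real PDF library.

```python
EXTRACTORS = {}


def register(name):
    """Decorator that adds an extractor function to the registry under `name`."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("dummy")
def dummy_extractor(path):
    # Hypothetical stand-in; a real entry would call a PDF library here.
    return "text from " + path


@register("broken")
def broken_extractor(path):
    # Simulates a library that fails, for example on an unresolved dependency.
    raise ImportError("dependency not available")


def run_all(path):
    """Run every registered extractor on one file, recording errors instead of crashing."""
    results = {}
    for name, func in EXTRACTORS.items():
        try:
            results[name] = func(path)
        except Exception as exc:
            results[name] = "ERROR: {}".format(exc)
    return results


results = run_all("sample.pdf")
print(results)
```

Adding another contender then means writing one decorated function, and a single failing library does not stop the rest of the comparison.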

Graham and Dennis: let me know if you have any suggestions, especially with the selection of source texts. I went with P&P because it's in the public domain, was written in English, and has a range of released PDFs.

grahamsack commented 10 years ago

I had read about slate while looking into PDFMiner and I thought it sounded very good and comparatively user-friendly, but I wasn't able to get it working due to a dependency issue I was never able to resolve. If you can get it working, that's great, as it sounds like a good library.

On Sun, May 25, 2014 at 1:29 PM, Mark Gorenstein notifications@github.com wrote:

I'm going to move ahead with V1 given these resources. I'll make the platform flexible enough to support the addition of other PDF extractors in case we come across any serious contenders that I've missed.

Graham and Dennis: let me know if you have any suggestions, especially with the selection of source texts. I went with P&P because it's in the public domain, was written in English, and has a range of released PDFs.

mrenoch commented 10 years ago

This could be worth checking out, to unify and maybe simplify text extraction:

http://datascopeanalytics.com/what-we-think/2014/07/27/extract-text-from-any-document-no-muss-no-fuss

https://github.com/deanmalmgren/textract

samzhang111 commented 9 years ago

Jumping in after not contributing very much... I'm familiar with some of the people who are maintaining Apache Tika out at NASA JPL. It is a project that has a strong core team of developers, and has overlapping goals with textract. The advantage to Tika (and textract) is that you don't need separate logic for each document format, and you also get standard metadata for each document.

Tika wraps PDFBox for PDF documents, which showed a 6-second extraction in the benchmarking stats file. I bet the slowness was caused by the boot-up time of the JVM, though. If you separate the JVM initialization from the conversion, I imagine it's more in the range of the pure-Python extractors. This is how I've used Tika in Python in the past: www.hackzine.org/using-apache-tika-from-python-with-jnius.html

Cheers, Sam
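To check the startup-versus-conversion hypothesis, a benchmark can time the one-time initialization separately from each extraction. The sketch below only illustrates the measurement pattern; `SlowStartExtractor` is a hypothetical stand-in whose `time.sleep` call simulates startup cost, and it does not start a JVM or call Tika.

```python
import time


class SlowStartExtractor:
    """Hypothetical stand-in for an extractor with an expensive one-time startup."""

    def __init__(self, startup_seconds=0.05):
        # Simulated startup cost; this is not a real JVM measurement.
        time.sleep(startup_seconds)

    def extract(self, path):
        return "text from " + path


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Time the one-time startup and the per-document conversion separately.
extractor, startup_time = timed(SlowStartExtractor)
text, convert_time = timed(extractor.extract, "sample.pdf")

print("startup: {:.3f}s, conversion: {:.6f}s".format(startup_time, convert_time))
```

Reporting the two numbers side by side shows whether a slow total comes from initialization, which is paid once per run, or from the conversion itself, which is paid per document.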

chrismattmann commented 9 years ago

Thanks @samzhang111 yep, happy to provide any info here on Tika if it helps

chrismattmann commented 9 years ago

Coming back here: just FYI, we have a fully supported Tika port to Python that uses the JAX-RS REST server: https://github.com/chrismattmann/tika-python