Investigate pulling text directly from PDFs instead of via poppler -> HTML

wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

MIT License


Investigate pulling text directly from PDFs instead of via poppler -> HTML #232

Closed hblanks closed 5 years ago

hblanks commented 5 years ago

As @jdu noted on https://github.com/wellcometrust/policytool/issues/227, we'd do well to look into working directly from PDFs instead of the current PDF -> HTML -> BeautifulSoup route.

Scope for a 3-5 day investigation, I'd think..?

jdu commented 5 years ago

Sorry, in #227 I mentioned pikepdf, but pikepdf is more for manipulating PDFs than for reading text. The library I've used most recently (for Python) that worked fairly well for this was pdfminer.six.

Another option is to look at libraries implemented in Go, Rust, or C++ and see whether Python wrappers are available for them, or to implement wrappers/FFIs for them in Python, as those libraries will likely be more performant than pure-Python implementations.
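
For anyone wanting to try the pdfminer.six route mentioned above, a minimal sketch follows. It assumes pdfminer.six is installed and relies on its `extract_text` helper; the function names `extract_pdf_text` and `split_pages` are hypothetical, and splitting on the form feed character assumes the extractor emits one at the end of each page, which is worth checking against real output.

```python
def extract_pdf_text(pdf_path):
    """Return the text layer of a PDF as one string (requires pdfminer.six)."""
    # Imported lazily so this module still loads when pdfminer.six is absent.
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path))


def split_pages(text):
    """Split extracted text into pages.

    Assumption: pages are separated by a form feed ("\f").
    Empty pages are dropped.
    """
    return [page.strip() for page in text.split("\f") if page.strip()]


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some_document.pdf
    for number, page in enumerate(split_pages(extract_pdf_text(sys.argv[1])), 1):
        print(f"--- page {number}: {len(page)} characters ---")
```

This only reads an existing text layer; scanned documents would still need OCR.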

aCampello commented 5 years ago

I have been successful using Tika in the past; it has a Python wrapper (https://github.com/chrismattmann/tika-python).

ivyleavedtoadflax commented 5 years ago

I've also had good results using pdftotext (https://pypi.org/project/pdftotext/) for PDFs containing text, and tesseract (https://github.com/tesseract-ocr/tesseract) for OCR.
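
A rough sketch of how the two could be combined: read the text layer with the pdftotext package, and flag a document for OCR only when too little text comes back. The helper names and the `min_chars_per_page` threshold are assumptions for illustration, not something tested on our documents, and the OCR step itself is left out.

```python
def read_text_layer(pdf_path):
    """Return a list of page strings using the pdftotext package."""
    # Imported lazily so this module still loads when pdftotext is absent.
    import pdftotext

    with open(pdf_path, "rb") as handle:
        return list(pdftotext.PDF(handle))


def needs_ocr(pages, min_chars_per_page=20):
    """Guess whether a PDF is a scan with no usable text layer.

    Hypothetical heuristic: treat the document as needing OCR when the
    average number of non-whitespace characters per page falls below
    `min_chars_per_page` (an arbitrary illustrative threshold).
    """
    if not pages:
        return True
    total = sum(len("".join(page.split())) for page in pages)
    return total / len(pages) < min_chars_per_page


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some_document.pdf
    pages = read_text_layer(sys.argv[1])
    print("send to OCR" if needs_ocr(pages) else "text layer looks usable")
```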

nsorros commented 5 years ago

I have heard good things about Tika and pdftotext and, funnily enough, I will have to use them for my firebreak project.

ivyleavedtoadflax commented 5 years ago

I've done some relevant work on this in https://github.com/wellcometrust/datalabs/issues/404 and https://github.com/wellcometrust/datalabs/pull/417 (still WIP).

ivyleavedtoadflax commented 5 years ago

I've finished some analysis of this, and it looks like a combination of pdftohtml (as we use currently) and lxml will provide near-perfect results. In absolute terms, this combination found over 1000 more references than the current Reach baseline on four troublesome documents. I've opened https://github.com/wellcometrust/policytool/pull/243 for discussion.
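
To give a feel for the pdftohtml plus lxml idea, here is a minimal sketch. It is not the code in the pull request; it assumes pdftohtml is run with its `-xml` option, which writes `<page>` elements containing positioned `<text>` elements. The function name, the fallback to the standard library parser, and the sample data are all made up for illustration.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback for illustration only, so the sketch runs without lxml.
    import xml.etree.ElementTree as etree


def lines_from_pdftohtml_xml(xml_bytes):
    """Return (page number, top offset, text) for each non-empty <text> node."""
    root = etree.fromstring(xml_bytes)
    lines = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also collects text nested in <b> or <i> children.
            text = "".join(node.itertext()).strip()
            if text:
                lines.append((page_number, int(node.get("top")), text))
    return lines


# Made-up sample in the shape produced by `pdftohtml -xml`.
SAMPLE = b"""<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="50" width="200" height="12" font="0"><b>References</b></text>
<text top="120" left="50" width="400" height="12" font="0">1. Smith J. Example title. 2010.</text>
<text top="140" left="50" width="10" height="12" font="0"> </text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for page_number, top, text in lines_from_pdftohtml_xml(SAMPLE):
        print(page_number, top, text)
```

Keeping the `top` offset alongside each line means headings such as "References" can be located by position rather than by scraping HTML markup.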