Closed (hblanks closed this issue 5 years ago)
Sorry, in #227 I mentioned pikepdf, but pikepdf is more for manipulating PDFs than for reading text. The library I've used most recently (for Python) that worked fairly well for this was pdfminer.six.
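For anyone who wants to try pdfminer.six quickly, here is a minimal sketch rather than a tested pipeline. It assumes `pip install pdfminer.six` and uses its high-level `extract_text` function; `tidy_lines` is a hypothetical helper of my own, not something from the library or from our codebase.

```python
import re


def extract_pdf_text(path):
    """Return all text in the PDF at `path` as one string."""
    # Imported lazily so tidy_lines() below can be used without
    # pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def tidy_lines(text):
    """Collapse runs of spaces/tabs and drop blank lines.

    Hypothetical clean-up step: extracted PDF text tends to contain
    irregular spacing and empty lines.
    """
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return [line for line in lines if line]
```

Usage would be something like `tidy_lines(extract_pdf_text("some.pdf"))`, where the file name is just a placeholder.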
Another option is to look at libraries implemented in Go, Rust, or C++ and see whether Python wrappers are available for them, or to implement wrappers/FFIs for them in Python, as those libraries will likely be more performant than native Python implementations.
I have been successful using Tika in the past; it has a Python wrapper (https://github.com/chrismattmann/tika-python).
I've also had good results using pdftotext (https://pypi.org/project/pdftotext/) for PDFs containing text, and tesseract (https://github.com/tesseract-ocr/tesseract) for OCR.
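A rough sketch of how the pdftotext side might look, assuming the `pdftotext` package from PyPI (which needs the poppler development libraries installed). The `needs_ocr` check and its `min_chars` threshold are my own illustrative assumptions for deciding when a PDF has no usable text layer and should go to tesseract instead; they are not part of either tool.

```python
def extract_pages(path):
    """Return a list with one text string per page of the PDF at `path`."""
    # Imported lazily so needs_ocr() can be used without pdftotext installed.
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return list(pdf)


def needs_ocr(pages, min_chars=20):
    """Guess whether a PDF lacks a text layer (hypothetical heuristic).

    Returns True when every page has fewer than `min_chars` non-blank
    characters, which suggests scanned images rather than embedded text.
    """
    return all(len(page.strip()) < min_chars for page in pages)
```

The threshold would need tuning against real documents; it is only there to show where the text/OCR split could happen.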
I have heard good things about Tika and pdftotext and, funnily enough, I will have to use them for my firebreak project.
I've done some relevant work on this in https://github.com/wellcometrust/datalabs/issues/404 and https://github.com/wellcometrust/datalabs/pull/417 (still WIP).
I've finished some analysis of this, and it looks like a combination of pdftohtml (as we use currently) and lxml will provide near-perfect results. In absolute terms, this combination found over 1000 more references than the current Reach baseline on four troublesome documents. I've opened https://github.com/wellcometrust/policytool/pull/243 for discussion.
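To make the pdftohtml + lxml idea concrete, here is a hedged sketch and not the code in the PR. It assumes poppler's `pdftohtml` is on the PATH and is run in its `-xml` mode (the PR may well use the HTML output instead), and that the XML has `<page number="...">` elements containing `<text>` elements. The function names are hypothetical.

```python
import subprocess

from lxml import etree


def pdf_to_xml(pdf_path):
    """Run pdftohtml in XML mode and return the raw XML bytes.

    -i skips image extraction; -stdout writes the result to stdout.
    """
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        check=True,
        capture_output=True,
    )
    return result.stdout


def lines_by_page(xml_bytes):
    """Map page number -> list of text strings, one per <text> element."""
    # recover=True lets lxml carry on past malformed markup where it can.
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(xml_bytes, parser)
    pages = {}
    for page in root.iter("page"):
        number = int(page.get("number"))
        # itertext() flattens inline tags such as <b> and <i>.
        pages[number] = [
            "".join(el.itertext()).strip() for el in page.iter("text")
        ]
    return pages
```

The `<text>` elements also carry position and font attributes, which is presumably where lxml helps with picking out reference sections, but that part is left out here.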
As @jdu noted on https://github.com/wellcometrust/policytool/issues/227, we'd do well to look into working directly from PDFs instead of the current PDF -> HTML -> BeautifulSoup route.
Scope for a 3-5 day investigation, I'd think..?