Fix PDF extraction for OS X(in pdf_parser.py)

wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning

MIT License

25 stars 4 forks source link

Fix PDF extraction for OS X(in pdf_parser.py) #273

Open lizgzil opened 5 years ago

lizgzil commented 5 years ago

After some tests failing when I ran make test on my mac computer @ivyleavedtoadflax realised there is an error in reach/pdf_parser/pdf_parser.py in which the text extraction works as desired for Linux only.

Line 89 in pdf_parser.py: full_text = '\n'.join([_flatten_text(text) for text in tree.xpath('//text')])

e.g. for the test pdf in https://github.com/wellcometrust/reach/blob/master/reach/scraper/tests/pdfs/test_pdf_multipage.pdf, my mac outputs 18 lines:

Test Page 1
All bold line.
Partly
bold
 line.
All italic line.
Partly
italic
 line.
Test Page 2
All bold line.
Partly
bold
 line.
All italic line.
Partly
italic
 line.

whereas we want to output the be 10 lines

Test Page 1
All bold line.
Partly bold line.
All italic line.
Partly italic line.
Test Page 2
All bold line.
Partly bold line.
All italic line.
Partly italic line.

ivyleavedtoadflax commented 5 years ago

It looks like the formatted portions - i.e. bold and italic have a text attribute that is being found by tree.xpath('\\text'). Why that is present on mac and not linux is anyone's guess, but either pdf2html or lxml is performing differently on mac.

nsorros commented 5 years ago

that sounds super weird and important to resolve.

jdu commented 5 years ago

This is I think actually down to the version of poppler you have installed and not to do with a difference between Linux / macos. Homebrew on mac seems to be distributing version 0.81 of poppler which has some regressions/issues from the look of their issue list, especially around line breaks and things.

The linux distributions are around version 0.71 which don't seem to be a problem, so as long as we pin our install of poppler to <0.81 on linux we should be OK.

Edit: Just to clarify what the root cause is here.

Specifically, the issue is around how poppler is breaking up lines into XML. It seems to prioritize style differences over proximity in >=0.81. Which means that "Partly bold line" is considered as three distinct text strings.

<text>Partly</text>
<text>bold</text>
<text>line</text>

So an XPath query in lxml will return three distinct elements (correctly).

So the culprit here is poppler and how it interprets PDF object positioning and styling to form the output XML.

hblanks commented 5 years ago

Alas. Some options I can think of here:

Annotating or otherwise altering the poppler-dependent tests so they run with make docker-test but not make test when run on OS X.
(More painfully) requiring the earlier version of poppler be installed on OS X
(Hard, but maybe best going forward?) Advancing the poppler version in Linux and fixing the code/tests to work with this.

Are there other ways, though? Or is one of these a good choice? What do people think?

(Aside: It's interesting to note that our current test, which passes on 0.71, is essentially a "Progression test" in that it verifies that the library we're using has not been upgraded to a version that we can't use yet.)

jdu commented 5 years ago

Rather than trying to work around the versioning from package managers, popplers github repo does use version tags, so we could just compile the version we want for linux and macos and carry the dependencies rather than relying on the system package management. Then we're exerting a more strict control over the environment to ensure consistency for everyone.

In reach, we're calling it directly from a subprocess I believe so there wouldn't be any issues around having to monkey-patch an intermediary python library or anything, we just make sure that the path called in the subprocess call aligns with the path for our pre-compiled binaries on the target system.

There is a version 0.82 in the poppler repo as well which might resolve this issue, but hasn't been propagated to the various package systems (i.e. homebrew, apt, etc...).

ivyleavedtoadflax commented 5 years ago

(Hard, but maybe best going forward?) Advancing the poppler version in Linux and fixing the code/tests to work with this.

Agree this is probably the best solution going forward. I will have a look since I wrote the switch to lxml, but it probably will not be this week. Very happy if someone else wants to!

jdu commented 5 years ago

Update: We're on stretch in debian which means we're actually way back on poppler version 0.48...

I'm actually not 100% confident this is a bug in poppler yet to be honest. The difference in versions between what's in stretch and what's in homebrew is massive, so I'm wondering if somewhere between those two versions this became normal behaviour, there are lots of complaints in the issues tracker for poppler but nothing I've seen that links up with what we're seeing.

In terms of the options @hblanks proposed:

Annotating or otherwise altering the poppler-dependent tests so they run with make docker-test but not make test when run on OS X.

(More painfully) requiring the earlier version of poppler be installed on OS X

(Hard, but maybe best going forward?) Advancing the poppler version in Linux and fixing the code/tests to work with this.

This feels a bit like shunting the problem down the line and i'm worried we'll have to deal with this later, but this right now is probably the easiest, fastest and lowest impact option of the three.
I don't think that homebrew has alternative versions for poppler at the moment and the version available in debian-stretch in the docker images is actually much much older than 0.71 on second look (the most recent version listed in apt for stretch is 0.48.0). So installing on macos may have to involve getting a distribution from somewhere else.
streatch does not have anything more recent than 0.48.0 for poppler-utils in its repos, in fact none of the debian releases currently have a version newer than 0.71 in their repos so even upgrading to jessie or later would only land us with 0.71 and we would have to find a way to install 0.71 on macos still.

The only other options I can think of:

Custom-compile the binaries for macos and debian-stretch for a static newer version of poppler-utils and include them in the repo. This has the added bonus of us getting more fine-grained control over which version of poppler we're using as well as potentially simplifying deployment as we don't need to install poppler-utils potentially (although we may have to install it's dependencies like cairo). But this could be an utter mess to get compilation working statically for the different platforms.
Switch to alpine for the docker image as it has 0.81 and 0.82 versions of poppler available and then address the issue of line aggregates. https://pkgs.alpinelinux.org/packages?name=poppler-utils&branch=edge This could open up a can of worms though as we have to update all our container setups, etc... to be in line with alpine. We should be addressing this anyway as debian-stretch reaches EOL at end of January as well.

Right now I'm leaning towards Hunters Option 1 or my Option 2 as this has the potential to be a rabbit-warren problem, and we can look a little further down the line at how we address our dependency on poppler overall based on the options we have.

ivyleavedtoadflax commented 5 years ago

Oh man :woman_facepalming: this is turning into an annoyingly complex bug.

I guess it's worth asking the question of whether we actually need to fix it at all right now. Its only a problem for development, and if do that in docker this problem goes away. Given that the switch to lxml must have been month or so ago, and no one has noticed it until now (meaning no one ran the tests on a Mac) I just wonder how much of a priority it is? But I would say that... I'm on Linux...

hblanks commented 5 years ago

A couple more points of info:

It's probably easiest to stay on Debian (almost all other images are, and the package tooling is a bit more straightforward), though upgrading to the 3.6.9-slim-buster wouldn't hurt.
Installing a more recent poppler version into the Docker image isn't all that bad.
But, even if we do this, you still have the challenge of keeping versions in sync b/w Docker and local platforms like OS X or Matt's linux distro.

One other path would be to detect the poppler version and work around what you've got.

However it is, though, it'd be good to get unit tests working OK for all platforms.

jdu commented 5 years ago

This might not be all that bad, I think if we can at least as a first start get all three contexts (i.e. whatever distribution @ivyleavedtoadflax is running, macos and debian) at least close in versionings then we can get a clearer picture of the disparaties between the different versions across our contexts.

Some of distributions vary in the backend that they are using (qt4, qt5, cairo, etc...) and I'm not sure if that's OS specific or packagers choice, but it would be good to get as close to major version equivalence here to at least see if the behaviour above is normal and address it as it's going to directly impact the parser and we'll need to implement some changes to accomodate how the updated version outputs XML if that ends up being normal behaviour in later versions.

I don't mind playing whack-a-mole once in a while getting the versions across the different platforms to parity every once in a blue moon, I'm more concerned with having stable, reliable and repdroducible tests, builds, etc...

ivyleavedtoadflax commented 5 years ago

So I'm on Ubuntu 18.04.3 LTS but can of course update poppler above 0.62.0 (which is currently in this distribution). So what's a good way forward? Shall we start by trying to get all these systems onto the same version of poppler (presumably the latest), and then I can look again at fixing the text extraction logic if it remains broken?

jdu commented 5 years ago

Actually, overlooking a fairly simple fix here.

Generally in a lot of places the development environment is normalized. Teams are on similar OSes grouped around the contexts they're working under (at least in enterprise).

Instead of fighting with the versioning between host OS, docker images too much. We can normalize the Host OS by setting up a Vagrant environment for the project which gives us better control over the Host OS for the docker-runtime and allows us to simplify and and remove the need for installing/polluting our host OS.

This means we could set up a Vagrantfile with debian-slim to wrap the project and normalize the version of poppler installed between the docker OS and the Host OS more easily.

I know vagrant has fallen out of popularity a bit in recent years, but I still use it pretty heavily for easier access to building artefacts across different distros. It might be an easy fix in the near-term for this so that we can get tests passing locally across everyones machines and smooth out any inconsistencies between Host environments.

hblanks commented 5 years ago

Thanks for the comment, @jdu! Although, is this different than make docker-test? That seems to give us the same level of control, modulo kernel version at least.

If so, the change here could be renaming make test -> make local-test and make docker-test -> make test.

jdu commented 5 years ago

Vagrant is a bit more persistent than a docker container I think but it's along the same lines. Essentially what we would end up with is a VM running debian-slim that you can vagrant ssh in and out of with a folder in the vagrant image mapped to the working directory you're in. You can provision the VM in the Vagrantfile and it's all nice and isolated, i've pasted a previous one below:

$script = <<-SCRIPT
sudo add-apt-repository "deb https://apt.dockerproject.org/repo/ ubuntu-$(lsb_release -cs) main"
curl -fsSL https://yum.dockerproject.org/gpg | sudo apt-key add -

sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose && \
  sudo chmod +x /usr/local/bin/docker-compose

curl -SL https://github.com/grammarly/rocker/releases/download/1.3.0/rocker_linux_amd64.tar.gz | sudo tar -xzC /usr/local/bin && \
  sudo chmod +x /usr/local/bin/rocker

sudo apt-get update;
sudo apt-get install -y make python docker-engine;
sudo apt-get install -y docker-ce;
sudo docker login -u "jeff.uren@<x>.com" -p "<x>" dev.<x>.com;
sudo docker login -u "jeff.uren" -p "<x>" dev-registry.<x>.com;
SCRIPT

Vagrant.configure("2") do |config|
    config.vm.box = "ubuntu/xenial64"
    # We need a larger disk size as appcheck is pretty massive
    config.disksize.size = "100GB"

    # Configure the provider
    config.vm.provider "virtualbox" do |v|
        v.memory = 8192
        v.cpus = 2
        v.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
    end

    # Provision tools for building in the vm
    config.vm.provision "shell", inline: $script

    # Forward internal ports to accessible ports in host
    config.vm.network "forwarded_port", guest: 80, host: 8082
    config.vm.network "forwarded_port", guest: 443, host: 8083
    config.vm.network "forwarded_port", guest: 15672, host: 15672
    config.vm.network "forwarded_port", guest: 4443, host: 8084

    # Use synced folders as NFS mount, bog standard vagrant mount is too slow and causes the
    # application to be unusable
    config.vm.synced_folder "./", "/opt/a<project>", create: true, type: "nfs"
    config.vm.network "private_network", ip: "192.168.10.100"
    config.vm.network "public_network", ip: "192.168.20.200"
end

Inside that VM we would have a consistent version of the docker runtime, consistent builds of the dependencies for tests, etc... You can easily pause the vagrant vm and recreate it without running the overall docker runtime on the Host OS. So it creates clearer delineation from the working space and the host OS. We can map the ports from the various docker containers inside the vagrant vm to ports on the host OS, etc..., install useful utilities into the vagrant image for debugging, etc... without having to reason about installing those dependencies on differing Host OSes.

Some benefits of this off the top of my head:

1) Reduce onboarding time as a new dev only needs to run vagrant up to get the dev environment running. 2) Smooth out inconsistencies like poppler 3) Avoid issues with differing python versions, library / package versions on the Host OS further down the road 4) Use a version of docker in line with the one AWS is using for builds to avoid any version inconsistencies there. 5) Easily test new versions of dependencies without messing with the Host OS packages, etc... as you can just do vagrant destroy && vagrant up to kill the image and re-provision it.

I've had a couple projects where what I've done is set up the make file to trigger functionality in the vagrant VM from the users Host OS so it feels pretty seamless and they don't need to SSH into the image to execute things like for instance running evaluate_algo, which could be a make command which calls the method inside the vagrant file over ssh.

It also means that the make docker-test can just run in vagrant instead of spinning up a container to run it as we can install it's dependencies into the vagrant VM.

The only con I can think off of the top of my head is the indirection, running things like docker-compose logs means you need to have vagrant ssh'd into the vagrant VM, but that might be a small problem in comparison to the smoothing that comes out of it.

hblanks commented 5 years ago

All fair points -- and I'll say this is more for you all to pick than me.

But, if it's OK to offer a few points, as someone who has maintained full-stack Vagrant dev environments for teams of 6-10 (and who's happy not to be going back). I'll start with notes to your 5 above, then add a few others, if that's OK:

Onboarding: this one's complicated, but it's probably good to keep in mind that (1) getting Vagrant up involves a few other steps, including installing VirtualBox -- arguably as many or more than Docker, and (2) some devs, but particularly data scientists, will want to run code locally anyway, especially if they need to test things on a GPU, which will be much harder to do within a VM.
Library inconsistencies: we do have this at present with any Docker-defined environment, such as the one used for the live localdev environment (via docker-compose) or for testing (via make docker-test).
Differing Python versions: as above.
Docker version inconsistency: I don't think I've ever seen, or for that matter, heard of an issue here. This may have to do with the relative stability of Docker as an enterprise (and production-targeted) product -- and also with the more limited nature of its APIs and DSL (i.e., Dockerfile). Another thing to keep in mind is that the container runtime in production is not necessarily Docker anyway. It will, however, conform to the Open Containers Initiative's specs, so I don't think we have to worry too much on this.
Easily test new versions of dependencies: this also seems to be well covered with Docker, and faster, since specifying a new image (and running commands in a Dockerfile, each of which gets cached into a layer) tends to be faster than scripting out configuration of a complete OS. Similarly, downloading and applying images is usually at least twice as fast as having the OS's package manager apply a larger number of individual package downloads and installation scripts.

The other notes:

It's probably changed now that Vagrant is less heavily maintained, but a fairly consistent issue was keeping people up to date on Vagrant versions. Like terraform and most of Mitchell's projects, Vagrant's coupled to a DSL that seems to (or necessarily must) change and break earlier usage.
On indirection: this may have changed, but sharing file edits between the host environment and the guest was a genuine bane for frontend development in Vagrant, thanks to the difficult coupling of VirtualBox's shared folders with OS X and the Linux host. I can't promise that Docker's voluming works better on OS X, but it does seem to. (And on Linux, of course, it's a non-issue.)
The deployment environment uses Docker images, and the CI environment as well has Docker (but not Vagrant), so we would still want docker-test to run in Docker. Similarly, you're still going to need Docker for the different dependencies brought up in docker-compose. We just can't get this kind of easy functionality (networking, off-the-shelf postgres & ElasticSearch images, etc.) out of Vagrant.
Persistent state tends to be a pretty big issue in Vagrant -- I can't really count the number of times I've had to ask people to run vagrant destroy to wipe their state. These things become non-issues with Docker, since container state is discarded on exit by default.
From an open source perspective, I suspect you'll have easier time getting contributors with a dependency on Docker (which most any systems project requires) than on Vagrant and VirtualBox. To some extent that's just a question of popularity -- I don't think either Vagrant or VirtualBox are all that often used anymore -- but it does also stem from the faster run times, fewer gotchas (no persistent state, a simpler DSL and standardized spec), and clear tie-in to production environments that come with docker development and deployment.

All this said, it's still also true that my own experience maintaining a localdev Vagrant setup for teams over the years jaded me. Still, I do suggest you think somewhat carefully about this before picking it up. Docker's a dependency we can't get away from, and it will allow us all the specificity we need. Vagrant may be a dependency you can avoid, and if so, you may be best doing so.

jdu commented 5 years ago

Sorry, this is going out into left-field a bit and diverging from the original issue into too many other things a bit too much I think (totally my fault).

I think at least for now, if we drop the poppler tests out of the local testing as you @hblanks originally suggested we can at least get tests passing, we should be more concerned that those tests pass in the deployment environment (i.e. inside the docker images) than whether they work locally on the given devs environment and we look at addressing how we get poppler inside the deployment environment to a newer version at least close to what we've got in the host Oses (0.71, 0.81) so we can suss out whether we need to change how we handle the text coming out of poppler based on the output we're getting from the newer version above.