Open lizgzil opened 5 years ago
It looks like the formatted portions - i.e. bold and italic have a text attribute that is being found by tree.xpath('\\text')
. Why that is present on mac and not linux is anyone's guess, but either pdf2html
or lxml
is performing differently on mac.
that sounds super weird and important to resolve.
This is I think actually down to the version of poppler you have installed and not to do with a difference between Linux / macos. Homebrew on mac seems to be distributing version 0.81 of poppler which has some regressions/issues from the look of their issue list, especially around line breaks and things.
The linux distributions are around version 0.71 which don't seem to be a problem, so as long as we pin our install of poppler to <0.81 on linux we should be OK.
Edit: Just to clarify what the root cause is here.
Specifically, the issue is around how poppler is breaking up lines into XML. It seems to prioritize style differences over proximity in >=0.81. Which means that "Partly bold line" is considered as three distinct text strings.
<text>Partly</text>
<text>bold</text>
<text>line</text>
So an XPath query in lxml will return three distinct elements (correctly).
So the culprit here is poppler and how it interprets PDF object positioning and styling to form the output XML.
Alas. Some options I can think of here:
make docker-test
but not make test
when run on OS X.Are there other ways, though? Or is one of these a good choice? What do people think?
(Aside: It's interesting to note that our current test, which passes on 0.71, is essentially a "Progression test" in that it verifies that the library we're using has not been upgraded to a version that we can't use yet.)
Rather than trying to work around the versioning from package managers, popplers github repo does use version tags, so we could just compile the version we want for linux and macos and carry the dependencies rather than relying on the system package management. Then we're exerting a more strict control over the environment to ensure consistency for everyone.
In reach, we're calling it directly from a subprocess I believe so there wouldn't be any issues around having to monkey-patch an intermediary python library or anything, we just make sure that the path called in the subprocess call aligns with the path for our pre-compiled binaries on the target system.
There is a version 0.82 in the poppler repo as well which might resolve this issue, but hasn't been propagated to the various package systems (i.e. homebrew, apt, etc...).
(Hard, but maybe best going forward?) Advancing the poppler version in Linux and fixing the code/tests to work with this.
Agree this is probably the best solution going forward. I will have a look since I wrote the switch to lxml, but it probably will not be this week. Very happy if someone else wants to!
Update: We're on stretch in debian which means we're actually way back on poppler version 0.48...
I'm actually not 100% confident this is a bug in poppler yet to be honest. The difference in versions between what's in stretch and what's in homebrew is massive, so I'm wondering if somewhere between those two versions this became normal behaviour, there are lots of complaints in the issues tracker for poppler but nothing I've seen that links up with what we're seeing.
In terms of the options @hblanks proposed:
- Annotating or otherwise altering the poppler-dependent tests so they run with make docker-test but not make test when run on OS X.
- (More painfully) requiring the earlier version of poppler be installed on OS X
- (Hard, but maybe best going forward?) Advancing the poppler version in Linux and fixing the code/tests to work with this.
The only other options I can think of:
Right now I'm leaning towards Hunters Option 1 or my Option 2 as this has the potential to be a rabbit-warren problem, and we can look a little further down the line at how we address our dependency on poppler overall based on the options we have.
Oh man :woman_facepalming: this is turning into an annoyingly complex bug.
I guess it's worth asking the question of whether we actually need to fix it at all right now. Its only a problem for development, and if do that in docker this problem goes away. Given that the switch to lxml must have been month or so ago, and no one has noticed it until now (meaning no one ran the tests on a Mac) I just wonder how much of a priority it is? But I would say that... I'm on Linux...
A couple more points of info:
3.6.9-slim-buster
wouldn't hurt.One other path would be to detect the poppler version and work around what you've got.
However it is, though, it'd be good to get unit tests working OK for all platforms.
This might not be all that bad, I think if we can at least as a first start get all three contexts (i.e. whatever distribution @ivyleavedtoadflax is running, macos and debian) at least close in versionings then we can get a clearer picture of the disparaties between the different versions across our contexts.
Some of distributions vary in the backend that they are using (qt4, qt5, cairo, etc...) and I'm not sure if that's OS specific or packagers choice, but it would be good to get as close to major version equivalence here to at least see if the behaviour above is normal and address it as it's going to directly impact the parser and we'll need to implement some changes to accomodate how the updated version outputs XML if that ends up being normal behaviour in later versions.
I don't mind playing whack-a-mole once in a while getting the versions across the different platforms to parity every once in a blue moon, I'm more concerned with having stable, reliable and repdroducible tests, builds, etc...
So I'm on Ubuntu 18.04.3 LTS but can of course update poppler above 0.62.0 (which is currently in this distribution). So what's a good way forward? Shall we start by trying to get all these systems onto the same version of poppler (presumably the latest), and then I can look again at fixing the text extraction logic if it remains broken?
Actually, overlooking a fairly simple fix here.
Generally in a lot of places the development environment is normalized. Teams are on similar OSes grouped around the contexts they're working under (at least in enterprise).
Instead of fighting with the versioning between host OS, docker images too much. We can normalize the Host OS by setting up a Vagrant environment for the project which gives us better control over the Host OS for the docker-runtime and allows us to simplify and and remove the need for installing/polluting our host OS.
This means we could set up a Vagrantfile with debian-slim to wrap the project and normalize the version of poppler installed between the docker OS and the Host OS more easily.
I know vagrant has fallen out of popularity a bit in recent years, but I still use it pretty heavily for easier access to building artefacts across different distros. It might be an easy fix in the near-term for this so that we can get tests passing locally across everyones machines and smooth out any inconsistencies between Host environments.
Thanks for the comment, @jdu! Although, is this different than make docker-test
? That seems to give us the same level of control, modulo kernel version at least.
If so, the change here could be renaming make test
-> make local-test
and make docker-test
-> make test
.
Vagrant is a bit more persistent than a docker container I think but it's along the same lines. Essentially what we would end up with is a VM running debian-slim that you can vagrant ssh
in and out of with a folder in the vagrant image mapped to the working directory you're in. You can provision the VM in the Vagrantfile and it's all nice and isolated, i've pasted a previous one below:
$script = <<-SCRIPT
sudo add-apt-repository "deb https://apt.dockerproject.org/repo/ ubuntu-$(lsb_release -cs) main"
curl -fsSL https://yum.dockerproject.org/gpg | sudo apt-key add -
sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose && \
sudo chmod +x /usr/local/bin/docker-compose
curl -SL https://github.com/grammarly/rocker/releases/download/1.3.0/rocker_linux_amd64.tar.gz | sudo tar -xzC /usr/local/bin && \
sudo chmod +x /usr/local/bin/rocker
sudo apt-get update;
sudo apt-get install -y make python docker-engine;
sudo apt-get install -y docker-ce;
sudo docker login -u "jeff.uren@<x>.com" -p "<x>" dev.<x>.com;
sudo docker login -u "jeff.uren" -p "<x>" dev-registry.<x>.com;
SCRIPT
Vagrant.configure("2") do |config|
config.vm.box = "ubuntu/xenial64"
# We need a larger disk size as appcheck is pretty massive
config.disksize.size = "100GB"
# Configure the provider
config.vm.provider "virtualbox" do |v|
v.memory = 8192
v.cpus = 2
v.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
end
# Provision tools for building in the vm
config.vm.provision "shell", inline: $script
# Forward internal ports to accessible ports in host
config.vm.network "forwarded_port", guest: 80, host: 8082
config.vm.network "forwarded_port", guest: 443, host: 8083
config.vm.network "forwarded_port", guest: 15672, host: 15672
config.vm.network "forwarded_port", guest: 4443, host: 8084
# Use synced folders as NFS mount, bog standard vagrant mount is too slow and causes the
# application to be unusable
config.vm.synced_folder "./", "/opt/a<project>", create: true, type: "nfs"
config.vm.network "private_network", ip: "192.168.10.100"
config.vm.network "public_network", ip: "192.168.20.200"
end
Inside that VM we would have a consistent version of the docker runtime, consistent builds of the dependencies for tests, etc... You can easily pause the vagrant vm and recreate it without running the overall docker runtime on the Host OS. So it creates clearer delineation from the working space and the host OS. We can map the ports from the various docker containers inside the vagrant vm to ports on the host OS, etc..., install useful utilities into the vagrant image for debugging, etc... without having to reason about installing those dependencies on differing Host OSes.
Some benefits of this off the top of my head:
1) Reduce onboarding time as a new dev only needs to run vagrant up
to get the dev environment running.
2) Smooth out inconsistencies like poppler
3) Avoid issues with differing python versions, library / package versions on the Host OS further down the road
4) Use a version of docker in line with the one AWS is using for builds to avoid any version inconsistencies there.
5) Easily test new versions of dependencies without messing with the Host OS packages, etc... as you can just do vagrant destroy && vagrant up
to kill the image and re-provision it.
I've had a couple projects where what I've done is set up the make file to trigger functionality in the vagrant VM from the users Host OS so it feels pretty seamless and they don't need to SSH into the image to execute things like for instance running evaluate_algo, which could be a make command which calls the method inside the vagrant file over ssh.
It also means that the make docker-test
can just run in vagrant instead of spinning up a container to run it as we can install it's dependencies into the vagrant VM.
The only con I can think off of the top of my head is the indirection, running things like docker-compose logs
means you need to have vagrant ssh
'd into the vagrant VM, but that might be a small problem in comparison to the smoothing that comes out of it.
All fair points -- and I'll say this is more for you all to pick than me.
But, if it's OK to offer a few points, as someone who has maintained full-stack Vagrant dev environments for teams of 6-10 (and who's happy not to be going back). I'll start with notes to your 5 above, then add a few others, if that's OK:
make docker-test
).The other notes:
terraform
and most of Mitchell's projects, Vagrant's coupled to a DSL that seems to (or necessarily must) change and break earlier usage.docker-test
to run in Docker. Similarly, you're still going to need Docker for the different dependencies brought up in docker-compose
. We just can't get this kind of easy functionality (networking, off-the-shelf postgres & ElasticSearch images, etc.) out of Vagrant.vagrant destroy
to wipe their state. These things become non-issues with Docker, since container state is discarded on exit by default.All this said, it's still also true that my own experience maintaining a localdev Vagrant setup for teams over the years jaded me. Still, I do suggest you think somewhat carefully about this before picking it up. Docker's a dependency we can't get away from, and it will allow us all the specificity we need. Vagrant may be a dependency you can avoid, and if so, you may be best doing so.
Sorry, this is going out into left-field a bit and diverging from the original issue into too many other things a bit too much I think (totally my fault).
I think at least for now, if we drop the poppler tests out of the local testing as you @hblanks originally suggested we can at least get tests passing, we should be more concerned that those tests pass in the deployment environment (i.e. inside the docker images) than whether they work locally on the given devs environment and we look at addressing how we get poppler inside the deployment environment to a newer version at least close to what we've got in the host Oses (0.71, 0.81) so we can suss out whether we need to change how we handle the text coming out of poppler based on the output we're getting from the newer version above.
After some tests failing when I ran
make test
on my mac computer @ivyleavedtoadflax realised there is an error inreach/pdf_parser/pdf_parser.py
in which the text extraction works as desired for Linux only.Line 89 in pdf_parser.py:
full_text = '\n'.join([_flatten_text(text) for text in tree.xpath('//text')])
e.g. for the test pdf in https://github.com/wellcometrust/reach/blob/master/reach/scraper/tests/pdfs/test_pdf_multipage.pdf, my mac outputs 18 lines:
whereas we want to output the be 10 lines