wellcometrust / reach

Wellcome tool to parse references scraped from policy documents using machine learning
MIT License
26 stars 4 forks source link

Integrate multitask splitter+parser model #505

Closed lizgzil closed 4 years ago

lizgzil commented 4 years ago

Description

Note: No need to review anything from reach/ as this will all be deleted after the merge.

Addresses issue https://github.com/wellcometrust/reach/issues/500.

In this PR I replace SplitSection with SplitParser in refparse.py. This means instead of using the deep reference splitter model plus the naive bayes parser model, we now combine splitting and parsing by using the deep reference multitask model.

This has knock-on effects for various other scripts - changing the tests, the init and settings files, the functions in parse.py (now reduced to just one), adding a file (test_config_multitask.ini) which will make the tests on this model a bit quicker than they would be with the default config file.

Note: I make changes to both the scripts in reach/refparse and split_reach/extracter/refparse, which are essentially the same changes for both file locations. In each commit I add each pair of scripts (e.g. reach/refparse/utils/__init__.py and split_reach/extracter/refparse/utils/__init__.py) so in terms of reviewing more easily you might want to go through commit by commit.

Type of change

Please delete options that are not relevant.

How Has This Been Tested?

make test

Checklist:

nsorros commented 4 years ago

This PR moves into a combined split / parse logic but note that this is not necessary. The multi task model does not make joint predictions, just trains on two tasks i.e. multitask. You can still have a separate split and parse method which would make integrating a bit easier.

lizgzil commented 4 years ago

@nsorros šŸ‘ for the review, on your points:

nsorros commented 4 years ago

@nsorros šŸ‘ for the review, on your points:

  • "having a test config file here" - see my response above, I was a bit unsure what you meant
  • "combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.
  • "remove pool_map which should be avoided or better understood" : I've added an issue #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

ideally the test file should be in the deep reference end not here. reach should use deep reference parser as a package without needing to define such a config to test.

split and parse could still be used separately, this would mean predicting twice, it would make integrating easier but i understand this is inefficient, just making the point that an easier transition could be made before combining the two.

removing pool_map should be left to devs, it is up to them but it is not as simple as saying that map is used at the moment, as you are breaking an easy way to parallelise the code in case running reach takes longer than needed.

lizgzil commented 4 years ago

@nsorros good point about the testing! Of course it should just be tested in the deep reference parser repo. I've removed these tests and the config (from split_reach/ only since reach/ will be deleted anyway) in this commit https://github.com/wellcometrust/reach/pull/505/commits/c623c5b2f378714bb7cf1e8cdaa5e470a8e32512

nsorros commented 4 years ago

I'm happy with the requirement change, but I'll leave the final approval to @nsorros

could you comment on the removal of pool_map as well?

SamDepardieu commented 4 years ago

@nsorros Sorry, just saw your comment. It's okay to remove it for the time being, but it'd probably be good to put it back at some point, to speed things up. In the first time, we can monitor this to see if it actually needs this