lizgzil commented 4 years ago

Description

Note: No need to review anything from reach/ as this will all be deleted after the merge.

Addresses issue https://github.com/wellcometrust/reach/issues/500.

In this PR I replace SplitSection with SplitParser in refparse.py. This means instead of using the deep reference splitter model plus the naive bayes parser model, we now combine splitting and parsing by using the deep reference multitask model.

This has knock-on effects for various other scripts - changing the tests, the init and settings files, the functions in parse.py (now reduced to just one), adding a file (test_config_multitask.ini) which will make the tests on this model a bit quicker than they would be with the default config file.

Note: I make changes to both the scripts in reach/refparse and split_reach/extracter/refparse, which are essentially the same changes for both file locations. In each commit I add each pair of scripts (e.g. reach/refparse/utils/__init__.py and split_reach/extracter/refparse/utils/__init__.py) so in terms of reviewing more easily you might want to go through commit by commit.

Type of change

Please delete options that are not relevant.

[ ] :bug: Bug fix (Add Fix #(issue) to your PR)
[x] :sparkles: New feature
[ ] :fire: Breaking change
[ ] :memo: Documentation update

How Has This Been Tested?

make test

Checklist:

[ ] My code follows the style guidelines of this project (pep8 AND pyflakes)
[ ] I have commented my code, particularly in hard-to-understand areas
[ ] If needed, I changed related parts of the documentation
[ ] I included tests in my PR
[ ] New and existing unit tests pass locally with my changes
[ ] Any dependent changes have been merged and published in downstream modules
[ ] If my PR aims to fix an issue, I referenced it using #(issue)

nsorros commented 4 years ago

This PR moves into a combined split / parse logic but note that this is not necessary. The multi task model does not make joint predictions, just trains on two tasks i.e. multitask. You can still have a separate split and parse method which would make integrating a bit easier.

lizgzil commented 4 years ago

@nsorros 👍 for the review, on your points:

"having a test config file here" - see my response above, I was a bit unsure what you meant
"combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.
"remove pool_map which should be avoided or better understood" : I've added an issue https://github.com/wellcometrust/reach/issues/507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

nsorros commented 4 years ago

@nsorros 👍 for the review, on your points:

"having a test config file here" - see my response above, I was a bit unsure what you meant

"combining split and parse which is not strictly necessary" - As we discussed the new multitask model does indeed make predictions of both tasks, so it is necessary to combine.

"remove pool_map which should be avoided or better understood" : I've added an issue #507, but for now the pool_map isn't used for multiprocessing since pool_map = map, so I hope it's ok to address this at another point.

ideally the test file should be in the deep reference end not here. reach should use deep reference parser as a package without needing to define such a config to test.

split and parse could still be used separately, this would mean predicting twice, it would make integrating easier but i understand this is inefficient, just making the point that an easier transition could be made before combining the two.

removing pool_map should be left to devs, it is up to them but it is not as simple as saying that map is used at the moment, as you are breaking an easy way to parallelise the code in case running reach takes longer than needed.

lizgzil commented 4 years ago

@nsorros good point about the testing! Of course it should just be tested in the deep reference parser repo. I've removed these tests and the config (from split_reach/ only since reach/ will be deleted anyway) in this commit https://github.com/wellcometrust/reach/pull/505/commits/c623c5b2f378714bb7cf1e8cdaa5e470a8e32512

nsorros commented 4 years ago

I'm happy with the requirement change, but I'll leave the final approval to @nsorros

could you comment on the removal of pool_map as well?

SamDepardieu commented 4 years ago

@nsorros Sorry, just saw your comment. It's okay to remove it for the time being, but it'd probably be good to put it back at some point, to speed things up. In the first time, we can monitor this to see if it actually needs this

wellcometrust / reach

Integrate multitask splitter+parser model #505

Description

Type of change

How Has This Been Tested?

Checklist: