wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License

Implement multitask training #25

Closed: ivyleavedtoadflax closed this 4 years ago

ivyleavedtoadflax commented 4 years ago

What this PR contains

NOTE: This version includes changes both to the way that model artefacts are packaged and saved, and to the way that data are loaded and parsed from tsv files. This results in a significantly faster training time (c. 14 hours -> c. 0.5 hours), but older models will no longer be compatible. For compatibility you must use multitask models > 2020.3.19, splitting models > 2020.3.6, and parsing models > 2020.3.8. These models currently perform less well than previous versions (#27), but performance is expected to improve with experimentation, predominantly around sequence length, and with more annotated data.
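For context, the tsv files referred to here are token-per-line files with the label in a second column and blank lines between sequences. The following is a minimal, self-contained Python sketch of loading that layout; it is illustrative only, and the package's actual load_tsv may handle columns and edge cases differently.

# Illustrative sketch only: load a token-per-line TSV (token <tab> label),
# with blank lines separating sequences. Not the package's actual load_tsv.
def load_tsv_sketch(path):
    token_seqs, label_seqs = [], []
    tokens, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks the end of a sequence
                if tokens:
                    token_seqs.append(tokens)
                    label_seqs.append(labels)
                    tokens, labels = [], []
                continue
            token, label = line.split("\t", 1)
            tokens.append(token)
            labels.append(label)
    if tokens:  # flush a trailing sequence with no final blank line
        token_seqs.append(tokens)
        label_seqs.append(labels)
    return token_seqs, label_seqs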

How you can test it

# Create a test file:
cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/multitask_2

# Run the splitter

python -m deep_reference_parser split "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

# Run the splitter and parser in one step

python -m deep_reference_parser split_parse "$(cat references.txt)"
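If it's easier to drive these from Python (e.g. in a notebook), the same CLI can be wrapped with subprocess. This is just a convenience sketch around the commands above, not part of the package's API; run_drp is a made-up helper name.

# Convenience sketch: shell out to the same CLI entry point from Python.
# Not part of the package's API; `command` is one of split / parse / split_parse.
import subprocess
from pathlib import Path

def run_drp(command, text_file="references.txt"):
    text = Path(text_file).read_text(encoding="utf-8")
    result = subprocess.run(
        ["python", "-m", "deep_reference_parser", command, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(run_drp("split"))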
codecov-io commented 4 years ago

Codecov Report

Merging #25 into master will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master      #25   +/-   ##
=======================================
  Coverage   80.94%   80.94%           
=======================================
  Files          17       17           
  Lines        1312     1312           
=======================================
  Hits         1062     1062           
  Misses        250      250           


ivyleavedtoadflax commented 4 years ago

@lizgzil I've made some progress on this today. I'm not sure what caused the mismatch in dims, but it seems to be resolved when using a different config/model (2020.3.19 works ok). This may be because of changes to load_tsv made in this PR which are not reflected in 2020.3.18. In any case, I think the way that the data were being handled in 2020.3.18 was not optimal, as you point out in https://github.com/wellcometrust/deep_reference_parser/issues/26#issuecomment-607338595.

I would be inclined to forget that model run, and to set the default to be 2020.3.19 for now. It doesn't perform as well as 2020.3.18, but I think with some experimentation you will be able to get the scores up in a future model run using this new logic, and for now it is important just to get it running.

Summary of what I have done today:

lizgzil commented 4 years ago

@ivyleavedtoadflax woop! great that the mismatch issue isn't happening :) Classic that you fix something and then it causes the older things to break, eh!

So do you think that once this PR is closed, I should train a model using the config for 2020.3.18 (i.e. use adam)? But for now I will just use the 2020.3.19 model if I need to.

ivyleavedtoadflax commented 4 years ago

Yes, I would train a new model entirely. I've been doing some experimenting, and what is having more of an effect than anything is the sequence length. In this PR I added some logic to give us finer-grained control of sequence length, both in data generation and in the model itself, and it has affected performance, so I think we might need to do a bit of experimentation to work out the best settings.
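To make the sequence-length point concrete, here is a minimal sketch of what truncating/padding to a configurable length does to the data a model sees. It uses the standard Keras pad_sequences utility; the max_len value and variable names are illustrative, not the PR's actual config.

# Illustrative only: effect of a configurable max sequence length.
# `max_len` is a made-up value, not the PR's actual setting.
from tensorflow.keras.preprocessing.sequence import pad_sequences

token_ids = [[4, 12, 7, 9], [3, 5]]  # toy, already-indexed sequences
max_len = 3                          # hypothetical sequence length

padded = pad_sequences(token_ids, maxlen=max_len, padding="post", truncating="post")
# array([[ 4, 12,  7],   first sequence truncated to 3 tokens
#        [ 3,  5,  0]])  second sequence padded with zeros

A longer max_len keeps more context per example but pads short references heavily; a shorter one trains faster but truncates long references, which is why the setting can move the scores noticeably.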

But for now, yes I would just use a model that works, to get the logic working.

ivyleavedtoadflax commented 4 years ago

Hey @lizgzil spent a little more time on this today:

I think now that tests are passing, I'd be inclined to merge this in even though the split_parse command is not 100% complete, and then add the final functionality in a new smaller PR. There's a fair bit of refactoring that could be done too, and improving test coverage, which I think are best done in a future PR.

ivyleavedtoadflax commented 4 years ago

> should be data/multitask I assume?

:facepalm: yep - will fix