wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License

Implement multitask training #25

Closed: ivyleavedtoadflax closed this 4 years ago

ivyleavedtoadflax commented 4 years ago

What this PR contains

NOTE: This version includes changes both to the way that model artefacts are packaged and saved, and to the way that data are loaded and parsed from tsv files. This results in a significantly faster training time (c. 14 hours -> c. 0.5 hours), but older models will no longer be compatible. For compatibility you must use multitask models > 2020.3.19, splitting models > 2020.3.6, and parsing models > 2020.3.8. These models currently perform less well than previous versions (#27), but performance is expected to improve with experimentation, predominantly around sequence length, and with more annotated data.
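For context, the tsv files referred to here are token-per-line files with the label in a second column and blank lines between sequences. The following is a minimal, self-contained Python sketch of loading that layout; it is illustrative only, and the package's actual load_tsv may handle columns and edge cases differently.

# Illustrative sketch only: load a token-per-line TSV (token <tab> label),
# with blank lines separating sequences. Not the package's actual load_tsv.
def load_tsv_sketch(path):
    token_seqs, label_seqs = [], []
    tokens, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks the end of a sequence
                if tokens:
                    token_seqs.append(tokens)
                    label_seqs.append(labels)
                    tokens, labels = [], []
                continue
            token, label = line.split("\t", 1)
            tokens.append(token)
            labels.append(label)
    if tokens:  # flush a trailing sequence with no final blank line
        token_seqs.append(tokens)
        label_seqs.append(labels)
    return token_seqs, label_seqs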

How you can test it

# Create a test file:
cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/multitask_2

# Run the splitter

python -m deep_reference_parser split "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

# Run the splitter and parser in one step

python -m deep_reference_parser split_parse "$(cat references.txt)"
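If it's easier to drive these from Python (e.g. in a notebook), the same CLI can be wrapped with subprocess. This is just a convenience sketch around the commands above, not part of the package's API; run_drp is a made-up helper name.

# Convenience sketch: shell out to the same CLI entry point from Python.
# Not part of the package's API; `command` is one of split / parse / split_parse.
import subprocess
from pathlib import Path

def run_drp(command, text_file="references.txt"):
    text = Path(text_file).read_text(encoding="utf-8")
    result = subprocess.run(
        ["python", "-m", "deep_reference_parser", command, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(run_drp("split"))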
codecov-io commented 4 years ago

Codecov Report

Merging #25 into master will not change coverage. The diff coverage is n/a.


@@           Coverage Diff           @@
##           master      #25   +/-   ##
=======================================
  Coverage   80.94%   80.94%           
=======================================
  Files          17       17           
  Lines        1312     1312           
=======================================
  Hits         1062     1062           
  Misses        250      250           


ivyleavedtoadflax commented 4 years ago

@lizgzil I've made some progress on this today. I'm not sure what caused the mismatch in dims, but it seems to be resolved when using a different config/model (2020.3.19 works ok). This may be because of changes to load_tsv made in this PR which are not reflected in 2020.3.18. In any case, I think the way that the data were being handled in 2020.3.18 was not optimal, as you point out in https://github.com/wellcometrust/deep_reference_parser/issues/26#issuecomment-607338595.

I would be inclined to forget that model run, and to set the default to be 2020.3.19 for now. It doesn't perform as well as 2020.3.18, but I think with some experimentation you will be able to get the scores up in a future model run using this new logic, and for now it is important just to get it running.

Summary of what I have done today:

lizgzil commented 4 years ago

@ivyleavedtoadflax woop! great that the mismatch issue isn't happening :) Classic that you fix something and then it causes the older things to break, eh!

So do you think that once this PR is closed, I should train a model using the config for 2020.3.18 (i.e. use adam)? But for now I will just use the 2020.3.19 model if I need to.

ivyleavedtoadflax commented 4 years ago

Yes, I would train a new model entirely. I've been doing some experimenting, and what is having more of an effect than anything is the sequence length. In this PR I added some logic to give us finer-grained control of sequence length, both in data generation and in the model itself, and it has affected performance, so I think we might need to do a bit of experimentation to work out the best settings.
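To make the sequence-length point concrete, here is a minimal sketch of what truncating/padding to a configurable length does to the data a model sees. It uses the standard Keras pad_sequences utility; the max_len value and variable names are illustrative, not the PR's actual config.

# Illustrative only: effect of a configurable max sequence length.
# `max_len` is a made-up value, not the PR's actual setting.
from tensorflow.keras.preprocessing.sequence import pad_sequences

token_ids = [[4, 12, 7, 9], [3, 5]]  # toy, already-indexed sequences
max_len = 3                          # hypothetical sequence length

padded = pad_sequences(token_ids, maxlen=max_len, padding="post", truncating="post")
# array([[ 4, 12,  7],   first sequence truncated to 3 tokens
#        [ 3,  5,  0]])  second sequence padded with zeros

A longer max_len keeps more context per example but pads short references heavily; a shorter one trains faster but truncates long references, which is why the setting can move the scores noticeably.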

But for now, yes I would just use a model that works, to get the logic working.

ivyleavedtoadflax commented 4 years ago

Hey @lizgzil spent a little more time on this today:

I think now that tests are passing, I'd be inclined to merge this in even though the split_parse command is not 100% complete, and then add the final functionality in a new smaller PR. There's a fair bit of refactoring that could be done too, and improving test coverage, which I think are best done in a future PR.

ivyleavedtoadflax commented 4 years ago

> should be data/multitask I assume?

:facepalm: yep - will fix