wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License

Add parsing model #10

Closed ivyleavedtoadflax closed 4 years ago

ivyleavedtoadflax commented 4 years ago

What does this PR contain

  • Adds the python -m deep_reference_parser parse command
  • Renames the predict command to python -m deep_reference_parser split
  • Bumps version to 2020.3.0
  • Improves clarity of CLI output by suppressing DeprecationWarnings caused by using keras and CRF from keras_contrib. It is currently non-trivial to fix these warnings properly (the usual suppression mechanism is sketched below).
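
For context, the usual way to silence these in Python is a warnings filter installed before the offending imports; a minimal sketch, not necessarily exactly what this PR does:

import warnings

# Suppress DeprecationWarnings emitted by keras / keras_contrib.
# The filter must be installed before the offending modules are imported.
warnings.filterwarnings("ignore", category=DeprecationWarning)

import keras  # imported after the filter, so its warnings are silenced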

How can you test it

# Create a test file:

cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53. 
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540. 
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv

virtualenv temp && source temp/bin/activate
pip install git+https://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter

python -m deep_reference_parser split "$(cat references.txt)"

# Run the parser

python -m deep_reference_parser parse "$(cat references.txt)"

nsorros commented 4 years ago

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us who have not updated our default python to 3

nsorros commented 4 years ago

Running the code takes a while since it seems to download the weights every time. Any way to cache them, in a location that is made clear to the user? Also, the output of split seems good for a demo, but I wonder whether the output should just be one reference per line. And for parse, maybe JSON output, or at least the tokens grouped into their categories?
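
Something like the following, say (purely illustrative; the label names and schema here are made up, not the package's actual ones):

import json
from collections import defaultdict

# Hypothetical (token, label) pairs as the parser might emit them.
tagged = [("Silva", "author"), ("J", "author"), ("2003", "year"),
          ("Plant", "journal"), ("and", "journal"), ("Soil", "journal")]

# Group tokens by predicted category and emit JSON.
grouped = defaultdict(list)
for token, label in tagged:
    grouped[label].append(token)

print(json.dumps(grouped, indent=2))
# {"author": ["Silva", "J"], "year": ["2003"], "journal": ["Plant", "and", "Soil"]}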

ivyleavedtoadflax commented 4 years ago

Running the code takes a while since it seems to download the weights every time. Any way to cache them, in a location that is made clear to the user?

Weights are cached; the package will only download them if they are not found in the location specified by the config you are using. If you don't specify a config, it uses the defaults that ship with the package. It could certainly be clearer about where it is looking for these files, though.
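
For example, you can see what is already cached under the default location with something like this (the directory name is the one shown in the CLI feedback further down this thread):

from pathlib import Path

# List the model artefacts currently cached in the default location.
model_dir = Path("models/parsing/2020.3.2_parsing")
for artefact in sorted(model_dir.glob("*")):
    print("cached:", artefact.name)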

Also, the output of split seems good for a demo, but I wonder whether the output should just be one reference per line.

This is the default verbose output. For tokens, if you specify --output <file> you will get one token per line. Let me check about references though; ideally these would be one per line too, but I can't remember whether I implemented it.

More generally, I'm strongly considering moving away from the tsv format used by Rodrigues, and presenting labels and tokens in separate files with one token/label per line. This would remove a lot of duplicated code.
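
Roughly this kind of conversion (file names are illustrative; a sketch rather than the final design):

# Convert Rodrigues-style token<TAB>label tsv into two parallel files
# with one token / one label per line; blank lines separate sequences.
with open("train.tsv") as tsv, \
     open("tokens.txt", "w") as tok_out, \
     open("labels.txt", "w") as lab_out:
    for line in tsv:
        line = line.rstrip("\n")
        if not line:
            tok_out.write("\n")
            lab_out.write("\n")
            continue
        token, label = line.split("\t")
        tok_out.write(token + "\n")
        lab_out.write(label + "\n")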

And for parse, maybe JSON output, or at least the tokens grouped into their categories?

Yes, I agree that the output is sub-optimal. I'd been putting this off until the multi-task model is ready, so that I can combine the splitter output with the parser output and we get an idea of where a reference starts and finishes.
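
The idea is roughly this (all label names here are hypothetical):

# Hypothetical per-token output: the splitter marks where references begin
# ("b-r") and continue ("i-r"); the parser assigns each token a role.
tokens       = ["Silva", "J", "2003", "Sims", "R", "2007"]
split_labels = ["b-r", "i-r", "i-r", "b-r", "i-r", "i-r"]
parse_labels = ["author", "author", "year", "author", "author", "year"]

# Combine the two streams: start a new reference at each "b-r" token,
# attaching the parse label to every token within the reference.
references, current = [], []
for token, split, parse in zip(tokens, split_labels, parse_labels):
    if split == "b-r" and current:
        references.append(current)
        current = []
    current.append((token, parse))
if current:
    references.append(current)

print(references)
# [[('Silva', 'author'), ('J', 'author'), ('2003', 'year')],
#  [('Sims', 'author'), ('R', 'author'), ('2007', 'year')]]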

ivyleavedtoadflax commented 4 years ago

virtualenv temp && source temp/bin/activate should be virtualenv -p python3 temp && source temp/bin/activate for those of us who have not updated our default python to 3

:+1:

ivyleavedtoadflax commented 4 years ago

Also, the output of split seems good for a demo, but I wonder whether the output should just be one reference per line.

This is the default verbose output. For tokens, if you specify --output <file> you will get one token per line. Let me check about references though; ideally these would be one per line too, but I can't remember whether I implemented it.

Ah OK, yes, I remember now. When --outfile is specified, the file written is valid JSON. Because the tokens need to be JSON (they carry both a label and a token), I'd like to keep the references in JSON too until we come to a decision about the tsv question. If we decide to split tokens and labels into two separate files (which is what this project does, by the way - it was the approach used in Rodrigues' implementation), then let's move to a line-delimited text file format for references, tokens, and labels alike.
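
In other words, the token output is shaped roughly like this (illustrative values; the real field names may differ):

# Illustrative only: one object per token, carrying both the token and its label.
tokens_json = [
    {"token": "Silva", "label": "author"},
    {"token": "2003", "label": "year"},
]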

ivyleavedtoadflax commented 4 years ago

Running the code takes a while since it seems to download the weights every time. Any way to cache them, in a location that is made clear to the user?

Weights are cached; the package will only download them if they are not found in the location specified by the config you are using. If you don't specify a config, it uses the defaults that ship with the package. It could certainly be clearer about where it is looking for these files, though.

I've added some additional feedback on where it will look for the artefacts:

ℹ Using config file:
/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/configs/2020.3.2_parsing.ini
ℹ Attempting to download model artefacts if they are not found locally
in models/parsing/2020.3.2_parsing/. This may take some time...
✔ Found models/parsing/2020.3.2_parsing/char2ind.pickle
✔ Found models/parsing/2020.3.2_parsing/ind2label.pickle
✔ Found models/parsing/2020.3.2_parsing/ind2word.pickle
✔ Found models/parsing/2020.3.2_parsing/label2ind.pickle
✔ Found models/parsing/2020.3.2_parsing/maxes.pickle
✔ Found models/parsing/2020.3.2_parsing/weights.h5
✔ Found models/parsing/2020.3.2_parsing/word2ind.pickle
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt