ivyleavedtoadflax closed this pull request 4 years ago
`virtualenv temp && source temp/bin/activate`

should be

`virtualenv -p python3 temp && source temp/bin/activate`

for those of us that have not updated their default python to 3
What does this PR contain?

- Adds the `python -m deep_reference_parser parse` command
- Renames the predict command to `python -m deep_reference_parser split`
- Bumps version to 2020.3.0
- Improves clarity of CLI output by suppressing `DeprecationWarning`s caused by using keras and `CRF` from `keras_contrib`. It is currently non-trivial to fix these warnings.

How can you test it?
```shell
# Create a test file:
cat > references.txt <<EOF
1 Sibbald, A, Eason, W, McAdam, J, and Hislop, A (2001). The establishment phase of a silvopastoral national network experiment in the UK. Agroforestry systems, 39, 39–53.
2 Silva, J and Rego, F (2003). Root distribution of a Mediterranean shrubland in Portugal. Plant and Soil, 255 (2), 529–540.
3 Sims, R, Schock, R, Adegbululgbe, A, Fenhann, J, Konstantinaviciute, I, Moomaw, W, Nimir, H, Schlamadinger, B, Torres-Martínez, J, Turner, C, Uchiyama, Y, Vuori, S, Wamukonya, N, and X. Zhang (2007). Energy Supply. In Metz, B, Davidson, O, Bosch, P, Dave, R, and Meyer, L (eds.), Climate Change 2007: Mitigation. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.
EOF

# Test the default model that ships with the package in a new venv
virtualenv temp && source temp/bin/activate
pip install git+git://github.com/wellcometrust/deep_reference_parser.git@feature/ivyleavedtoadflax/parsing

# Run the splitter
python -m deep_reference_parser split "$(cat references.txt)"

# Run the parser
python -m deep_reference_parser parse "$(cat references.txt)"
```
Running the code takes a while since it seems to download the weights every time. Is there any way to cache them, in a location that is made clear to the user? Also, the output of `split` seems good for a demo, but I wonder whether the output should just be one reference per line. And for `parse`, maybe JSON output, or at least the tokens grouped into their categories?
> Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.
Weights are cached; it will only download if you don't have them in the right place, which is specified in the config you are using. If you didn't specify a config, it uses the default one that ships with the package. Could certainly be clearer about where it is looking for these files though.
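The download-if-missing caching described above can be sketched roughly like this. This is a minimal illustration, not the package's actual API: the function name, paths, and URL handling are all assumptions.

```python
import os
import urllib.request

def get_artefact(local_path, remote_url):
    """Return local_path, downloading from remote_url only if it is not
    already cached on disk. Hypothetical sketch, not the real API."""
    if not os.path.exists(local_path):
        # Only hit the network when the artefact is missing locally
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(remote_url, local_path)
    return local_path
```

If the file already exists at `local_path`, no network request is made, which is why a second run should be fast.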
> Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.
This is the default verbose output. For tokens, if you specify `--output <file>` you will get one token per line. Let me check about references though; ideally this would be one per line too, but I can't remember if I implemented it.

More generally, I'm strongly considering moving away from the tsv format used by Rodrigues and presenting labels and tokens in separate files with one token/label per line. This would remove a lot of duplicated code.
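As a sketch of what that split-file format might look like, assuming one token/label per line with a blank line between references (the function name, file names, and the blank-line separator are all hypothetical, not part of the package):

```python
def write_split_files(references, token_path, label_path):
    """Write tokens and labels to two parallel files, one item per line.
    Proposal sketch only; the separator convention is an assumption."""
    with open(token_path, "w") as tf, open(label_path, "w") as lf:
        for ref in references:  # each ref is a list of (token, label) pairs
            for token, label in ref:
                tf.write(token + "\n")
                lf.write(label + "\n")
            tf.write("\n")  # blank line marks the end of a reference
            lf.write("\n")
```

Keeping the two files line-aligned means a reader can zip them back together without any tsv parsing.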
> Also for parse maybe a JSON or at least the tokens grouped in their categories?
Yes, agree that output is sub-optimal. I'd been putting this off until the multi-task model was ready, so that I could combine the splitter output with the parser output and we would have an idea of where a reference starts and finishes.
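For illustration, grouping the parser's tokens into their categories for a JSON output could look something like the sketch below. The function name and the label strings are hypothetical; the package's real label set may differ.

```python
import json
from collections import defaultdict

def group_tokens_by_label(tagged_tokens):
    """Group (token, label) pairs into {label: [tokens]} and serialise
    to JSON. Illustrative sketch, not the package's output format."""
    groups = defaultdict(list)
    for token, label in tagged_tokens:
        groups[label].append(token)
    return json.dumps(groups)
```

This keeps token order within each category, which matters for fields like author lists.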
> `virtualenv temp && source temp/bin/activate`
>
> should be `virtualenv -p python3 temp && source temp/bin/activate`
>
> for those of us that have not updated their default python to 3
:+1:
> Also the output of the split seems good for a demo but I wonder whether the outputs should just be one reference per line.

> This is the default verbose output. For tokens if you specify `--output <file>` you will get one token per line. Let me check about references though, ideally this would be one per line too, but I can't remember if I implemented it
Ah ok, yes, I remember now. When `--outfile` is specified, the file written is valid JSON. Because the tokens need to be JSON (they have both label and token), I'd like to keep the references in JSON too until we come to a decision about the tsv question. If we decide to split tokens and labels into two separate files (which is what this project does btw, and which was used in Rodrigues' implementation), then let's move to a line-delimited text file format for references, tokens, and labels.
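A minimal sketch of the kind of JSON token output being discussed, keeping token and label together per record. This is an assumption about the shape of the data; the real `--outfile` format may differ.

```python
import json

def tokens_to_json(tagged_tokens, outfile):
    """Write (token, label) pairs as a JSON list of objects.
    Hypothetical sketch of the --outfile format under discussion."""
    records = [{"token": t, "label": l} for t, l in tagged_tokens]
    with open(outfile, "w") as f:
        json.dump(records, f)
```

The advantage over line-delimited text is that each record carries both fields, so the file stays self-describing.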
> Running the code takes a while since it seems to download weights every time. Any way to cache them? in a location that is made clear to the user.

> Weights are cached, it will only download if you don't have them in the right place which is specified in the config you are using. If you didn't specify a config, it uses the default ones which ship with the package. Could certainly be more clear about where it is looking for these files though.
I've added some additional feedback on where it will look for the artefacts:
```
ℹ Using config file:
/home/matthew/Documents/wellcome/deep_reference_parser/deep_reference_parser/configs/2020.3.2_parsing.ini
ℹ Attempting to download model artefacts if they are not found locally
in models/parsing/2020.3.2_parsing/. This may take some time...
✔ Found models/parsing/2020.3.2_parsing/char2ind.pickle
✔ Found models/parsing/2020.3.2_parsing/ind2label.pickle
✔ Found models/parsing/2020.3.2_parsing/ind2word.pickle
✔ Found models/parsing/2020.3.2_parsing/label2ind.pickle
✔ Found models/parsing/2020.3.2_parsing/maxes.pickle
✔ Found models/parsing/2020.3.2_parsing/weights.h5
✔ Found models/parsing/2020.3.2_parsing/word2ind.pickle
✔ Found embeddings/2020.1.1-wellcome-embeddings-300.txt
```