wellcometrust / deep_reference_parser

A deep learning model for extracting references from text
MIT License
25 stars 1 forks source link

Corrections post release #33

Closed lizgzil closed 4 years ago

lizgzil commented 4 years ago

After the latest release and trying to use it in Reach, I found a few problems/things I should have added pre-release!

As a hacky fix so this release of the model will work, I uploaded indices.pickle and weights.h5 to the location the incorrect config file expects, i.e. https://s3.console.aws.amazon.com/s3/buckets/datalabs-public/deep_reference_parser/data/models/multitask/2020.4.5_multitask/?region=eu-west-2&tab=overview as well as the correct location of https://s3.console.aws.amazon.com/s3/buckets/datalabs-public/deep_reference_parser/models/multitask/2020.4.5_multitask/?region=eu-west-2&tab=overview

NEW COMMITS:

ivyleavedtoadflax commented 4 years ago

Ahh yes this is a gotcha. Basically every time i created a release I copied the data from the s3://datalabs-data/deep_reference_parser/data..etc to s3://datalabs-public/deep_reference_parser - so not a hacky fix at all - that's just how i did. I think it is worth continuing to do this so that there s3://datalabs-public ends up having a canonical store of a few model versions linked to releaseses, while the experiments in the datalabs repo are separate.

codecov-io commented 4 years ago

Codecov Report

Merging #33 into master will decrease coverage by 2.28%. The diff coverage is 64.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   83.75%   81.46%   -2.29%     
==========================================
  Files          16       17       +1     
  Lines        1200     1338     +138     
==========================================
+ Hits         1005     1090      +85     
- Misses        195      248      +53     
Impacted Files Coverage Δ
deep_reference_parser/train.py 0.00% <0.00%> (ø)
deep_reference_parser/parse.py 70.11% <36.36%> (-3.96%) :arrow_down:
deep_reference_parser/split.py 67.32% <41.66%> (-3.20%) :arrow_down:
deep_reference_parser/split_parse.py 67.00% <67.00%> (ø)
deep_reference_parser/deep_reference_parser.py 87.40% <75.00%> (-0.48%) :arrow_down:
deep_reference_parser/io/io.py 97.22% <100.00%> (ø)
deep_reference_parser/model_utils.py 94.71% <100.00%> (-0.03%) :arrow_down:
deep_reference_parser/tokens_to_references.py 91.89% <100.00%> (+25.22%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b301d12...e4de5a4. Read the comment docs.

lizgzil commented 4 years ago

@aCampello the splitting references up logic isn't mine, it's just the same as tokens_to_reference in tokens_to_reference.py that Matt wrote, just with a different output. I didn't want to modify the tokens_to_reference function for fear of messing up things downstream, but I will actually slightly refactor in the next commit. I'll add an issue to revisit the logic for the splitting up and try to talk to Matt about why he came to those decisions.

ivyleavedtoadflax commented 4 years ago

that tokens to references logic is pretty naive to be honest, so now that there is both splitting and parsing tokens being output by the model it could probably do with with some attention.

aCampello commented 4 years ago

Looks good - if tests pass and downstream tasks seem good, I am happy to approve - we can address this in a later issue then.