Closed lizgzil closed 4 years ago
Ahh yes, this is a gotcha. Basically, every time I created a release I copied the data from s3://datalabs-data/deep_reference_parser/data... etc. to s3://datalabs-public/deep_reference_parser, so it's not a hacky fix at all; that's just how I did it. I think it is worth continuing to do this, so that s3://datalabs-public ends up having a canonical store of a few model versions linked to releases, while the experiments in the datalabs repo stay separate.
Merging #33 into master will decrease coverage by 2.28%. The diff coverage is 64.22%.
```diff
@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   83.75%   81.46%   -2.29%
==========================================
  Files          16       17       +1
  Lines        1200     1338     +138
==========================================
+ Hits         1005     1090      +85
- Misses        195      248      +53
```
Impacted Files | Coverage Δ | |
---|---|---|
deep_reference_parser/train.py | 0.00% <0.00%> (ø) | |
deep_reference_parser/parse.py | 70.11% <36.36%> (-3.96%) | :arrow_down: |
deep_reference_parser/split.py | 67.32% <41.66%> (-3.20%) | :arrow_down: |
deep_reference_parser/split_parse.py | 67.00% <67.00%> (ø) | |
deep_reference_parser/deep_reference_parser.py | 87.40% <75.00%> (-0.48%) | :arrow_down: |
deep_reference_parser/io/io.py | 97.22% <100.00%> (ø) | |
deep_reference_parser/model_utils.py | 94.71% <100.00%> (-0.03%) | :arrow_down: |
deep_reference_parser/tokens_to_references.py | 91.89% <100.00%> (+25.22%) | :arrow_up: |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update b301d12...e4de5a4. Read the comment docs.
@aCampello the logic for splitting references up isn't mine; it's the same as `tokens_to_references` in `tokens_to_references.py` that Matt wrote, just with a different output. I didn't want to modify the `tokens_to_references` function for fear of messing up things downstream, but I will actually slightly refactor it in the next commit. I'll add an issue to revisit the splitting-up logic and try to talk to Matt about why he came to those decisions.
That tokens-to-references logic is pretty naive, to be honest, so now that the model outputs both splitting and parsing tokens it could probably do with some attention.
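For readers unfamiliar with the approach: splitting a token stream into references typically just means starting a new reference whenever a begin-reference label appears. A minimal sketch of that kind of naive splitter (the function name and the `b-r`/`i-r` label values here are my assumptions, not the repo's actual implementation):

```python
def naive_tokens_to_references(tokens, labels):
    """Naive splitter: start a new reference every time a 'b-r'
    (begin-reference) label is seen; 'i-r' tokens continue the
    current reference. Illustrative only."""
    references = []
    current = []
    for token, label in zip(tokens, labels):
        if label == "b-r" and current:
            # a new reference starts, so flush the one in progress
            references.append(" ".join(current))
            current = []
        if label in ("b-r", "i-r"):
            current.append(token)
    if current:
        references.append(" ".join(current))
    return references
```

The naivety is visible here: a single mislabelled `b-r` splits one reference in two, and a missed one merges two references, which is presumably part of why the logic deserves revisiting.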
Looks good - if tests pass and downstream tasks seem good, I am happy to approve - we can address this in a later issue then.
After the latest release and trying to use it in Reach, I found a few problems/things I should have added pre-release!
As a hacky fix so this release of the model will work, I uploaded `indices.pickle` and `weights.h5` to the location the incorrect config file expects, i.e. https://s3.console.aws.amazon.com/s3/buckets/datalabs-public/deep_reference_parser/data/models/multitask/2020.4.5_multitask/?region=eu-west-2&tab=overview, as well as to the correct location, https://s3.console.aws.amazon.com/s3/buckets/datalabs-public/deep_reference_parser/models/multitask/2020.4.5_multitask/?region=eu-west-2&tab=overview.
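For context, the mismatch is between a `data/models/...` prefix and a `models/...` prefix under the public bucket. The offending config entry presumably looks something like this (the section and key names below are illustrative guesses, not copied from the actual `.ini`):

```ini
; hypothetical sketch of the path mismatch - key names are assumptions
[build]
; what the released config pointed at (incorrect):
; output_path = data/models/multitask/2020.4.5_multitask/
; what it should point at:
output_path = models/multitask/2020.4.5_multitask/
```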
NEW COMMITS:

- `tokens_to_reference_lists`, which is like `tokens_to_reference` but gives you the components as well as the reference string (`[[(token, attribute), ... , (token, attribute)], ..., [(token, attribute), ...]]`)
- `return_tokens=False` output in `split_parse.py` using `tokens_to_reference_lists`
- `2020.4.5_multitask.ini` (I made a small version of the embeddings 300 file for the test)
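To make the `[[(token, attribute), ...], ...]` shape concrete, here is a minimal sketch of producing it from split and parse labels. The function name, and the `b-r` begin-reference label, are my assumptions for illustration, not the actual `tokens_to_reference_lists` implementation:

```python
def reference_lists_sketch(tokens, split_labels, parse_labels):
    """Pair each token with its parsed attribute and group the pairs
    into one inner list per reference, giving
    [[(token, attribute), ...], ...]. Illustrative only."""
    references = []
    current = []
    for token, split_label, attribute in zip(tokens, split_labels, parse_labels):
        if split_label == "b-r" and current:
            # begin-reference label: close the current reference
            references.append(current)
            current = []
        current.append((token, attribute))
    if current:
        references.append(current)
    return references
```

The point of returning components alongside the reference string is that downstream consumers (e.g. Reach) can pick out attributes like authors or years per reference without re-parsing the joined string.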