ivyleavedtoadflax opened this issue 4 years ago (status: Open)
So I think this is actually part of a larger issue with how the data are formatted. In the most recent versions, since the implementation of the multitask model, I fixed an issue in the data prep step so that all the training data have the same sequence length, both the Rodrigues data and Wellcome's own data, and this seems to have caused a dip in performance. It's curious, because the best performing multitask model, 2020.3.18, was built on Reach data with a sequence length of 250 tokens, but with the Rodrigues data loaded as one continuous list of tokens. I think this means that only one Rodrigues example will be loaded, as the string of tokens gets truncated to 250 by the tokenizer.
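To illustrate the suspected truncation (a minimal sketch, not the actual data prep code; the sequence length is from the 2020.3.18 model, but the token count here is made up):

```python
# Hypothetical illustration: loading the Rodrigues data as one
# continuous token list yields a single, truncated training example.
MAX_LEN = 250  # sequence length used for the 2020.3.18 model

rodrigues_tokens = ["tok"] * 100_000  # stand-in for one continuous token list

# Treated as a single example, everything past MAX_LEN is discarded:
truncated = [rodrigues_tokens[:MAX_LEN]]
print(len(truncated))  # -> 1 training example

# Chunking into fixed-length sequences instead keeps all of the data:
chunks = [
    rodrigues_tokens[i : i + MAX_LEN]
    for i in range(0, len(rodrigues_tokens), MAX_LEN)
]
print(len(chunks))  # -> 400 training examples
```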
I retrained the multitask model with the data created for the 2020.3.18 model, flawed as it may be, and it creates a big jump in model performance.
@lizgzil shall we use this model as default in the meantime, and get to the bottom of the sequence length issues at a later date?
yeh I think that makes sense. Does it kind of make sense that adding lots of Rodrigues data to the training data causes a dip in model performance, because it was created for a different task?
@ivyleavedtoadflax I see this is the latest deep reference parser wheel in S3: https://s3.console.aws.amazon.com/s3/object/datalabs-public/deep_reference_parser/deep_reference_parser-2020.3.1-py3-none-any.whl?region=eu-west-2
It was uploaded on the 18th March. Does that mean there is no wheel for the latest training of the model (which does use the 18th March data, but will be different from the 18th March wheel, since it was retrained and bits of the training code have changed since then)?
Sorry for the dim question, but how do you go about uploading a new wheel to S3? I'm not sure where this code lives.
> yeh I think that makes sense. Does it kind of make sense that adding lots of Rodrigues data to the training data causes a dip in model performance, because it was created for a different task?
I did wonder... but then when I included no Rodrigues data, it performed worse. I think it probably needs a bit of experimentation.
> @ivyleavedtoadflax I see this is the latest deep reference parser wheel in S3: https://s3.console.aws.amazon.com/s3/object/datalabs-public/deep_reference_parser/deep_reference_parser-2020.3.1-py3-none-any.whl?region=eu-west-2
> It was uploaded on the 18th March. Does that mean there is no wheel for the latest training of the model (which does use the 18th March data, but will be different from the 18th March wheel, since it was retrained and bits of the training code have changed since then)?
> Sorry for the dim question, but how do you go about uploading a new wheel to S3? I'm not sure where this code lives.
I usually create a Makefile recipe like:
```make
.PHONY: dist
dist:
	# Clear out anything cached from previous builds (the leading '-'
	# tells make to ignore errors if the paths don't exist yet)
	-rm -r build/bin build/bdist.linux-x86_64
	-rm dist/*
	-rm -r src.egg-info
	# Build source and wheel distributions, then clean up
	$(VIRTUALENV)/bin/python3 setup.py sdist bdist_wheel
	$(VIRTUALENV)/bin/python3 setup.py clean
```
There may already be one in the datalabs/../deep_reference_parser/ project folder. Note that I delete anything that might cache previous builds (like dist/) first; otherwise files from previous versions can end up in the current dist. There is probably a better way!
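Once built, getting the wheel into S3 could look something like this (a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the bucket and key prefix are taken from the wheel URL above):

```python
# Sketch: upload the freshly built wheel to the datalabs-public bucket.
# Assumes `make dist` has already produced a wheel in dist/.
import glob
import os

import boto3

wheel = glob.glob("dist/deep_reference_parser-*.whl")[0]
s3 = boto3.client("s3")
s3.upload_file(
    wheel,                                               # local file
    "datalabs-public",                                   # bucket
    f"deep_reference_parser/{os.path.basename(wheel)}",  # key
)
```

You could equally do the same from the command line with the aws CLI.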
I think when you integrate with Reach you can probably just point directly at GitHub, not s3://datalabs-public. When you create a release on GitHub it will package up a tar.gz, and you can add a wheel there too.
So typically what I did in the past was to create a release, make a wheel locally with make dist, and then upload that wheel manually to GitHub to be stored with the release. There is probably an automated way of doing this, but I have not looked into it. You then get a fixed URL for the wheel, and you can download it with:
pip install https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.3.1/deep_reference_parser-2020.3.1-py3-none-any.whl
or install straight from the repository (note the git+ prefix):
pip install git+https://github.com/wellcometrust/deep_reference_parser.git@master
specifying the release tag, branch, or commit after the @.
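For the Reach integration, that fixed URL can then be pinned in a requirements file (an illustrative snippet, reusing the 2020.3.1 wheel URL above):

```
# requirements.txt (illustrative): pin the released wheel by its fixed URL
https://github.com/wellcometrust/deep_reference_parser/releases/download/2020.3.1/deep_reference_parser-2020.3.1-py3-none-any.whl
```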
this might be a good way to automatically create releases and attach artifacts btw https://github.com/marketplace/actions/automatic-releases
ooh nice!
Note that this issue occurs only in #25, not the master branch. The splitting model implemented in 3e48684e9e769d6bb352bfaafd74f89a8b4779a3 performs badly. It was expected that this model would perform less well than the model implemented in 2020.3.1; however, it is worse than expected.
The new model, 2020.3.6, is required to ensure compatibility with the changes implemented in #25. Changes to the Rodrigues data format mean that this model runs in less than one hour, instead of around 16 hours. Some experimentation with hyperparameters is probably all that is needed to bring this model up to scratch, and in any case it is largely superseded by the multitask split_parse model. If a high quality splitting model is required immediately, revert to an earlier pre-release version for now, all of which perform very well.