This PR is for retraining the MeSH classifier. The repository is pretty bulky (given its history), but hopefully the changes aren't too bad.
The main additions are:
a script create_inclusion_list.py, which effectively excludes terms we no longer want to use in our tagging:
tags that are marked in the MeSH dictionary with a 'DO NOT USE' attribute. These MeSH tags exist because they help with cataloguing or with structuring the tree, or because they are contentious, such as Male and Female. There are 200 of these MeSH tags.
labels listed in a new file, descriptors_not_to_use_manual.csv in data/processed, where we can add labels that will be excluded from training. For example, Kirstin asked me to take out Humans, which is the only term in there so far.
an updated dvc.yaml file, which contains three pipelines: one for retraining the MeSH classifier using the above inclusion list, one for retraining the MeSH classifier on terms used by WT (which has slightly better accuracy), and one with toy data, because running either of the other two pipelines takes a while.
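For reviewers, the filtering described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in create_inclusion_list.py: the record keys "name" and "do_not_use", the helper names, and the single-column, header-less CSV layout are all hypothetical.

```python
import csv
import io


def read_manual_exclusions(csv_file):
    """Read one label per row from a CSV file object.

    Assumed layout (hypothetical): a single column with no header.
    """
    return {row[0].strip() for row in csv.reader(csv_file) if row and row[0].strip()}


def build_inclusion_list(descriptors, manual_exclusions):
    """Keep every descriptor name that is neither flagged nor manually excluded."""
    return sorted(
        d["name"]
        for d in descriptors
        if not d["do_not_use"] and d["name"] not in manual_exclusions
    )


# Toy records; the "name" and "do_not_use" keys are hypothetical.
descriptors = [
    {"name": "Humans", "do_not_use": False},
    {"name": "Female", "do_not_use": True},
    {"name": "Malaria", "do_not_use": False},
]

# Stand-in for descriptors_not_to_use_manual.csv, which currently holds only Humans.
manual = read_manual_exclusions(io.StringIO("Humans\n"))

inclusion_list = build_inclusion_list(descriptors, manual)
print(inclusion_list)  # ['Malaria']
```

Here Female is dropped because of its flag and Humans because it appears in the manual file, so only Malaria remains available for training.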
Checklist
Release checklist
To release:
make build
make deploy