wellcometrust / grants_tagger

Tag grants with MeSH and other tags
MIT License

Add light grants tagger for inference as the default 🪶 #246

Closed pdan93 closed 1 year ago

pdan93 commented 1 year ago

Description

This PR makes the default grants tagger environment light and responsible only for inference. To do that:

We also

You can install the default environment and test by running

make virtualenv
make test

Fixes #201

Checklist

Release checklist

nsorros commented 1 year ago

Also, tests should pass.

nsorros commented 1 year ago

When we get tests to pass and implement a recipe to update both requirements files, let's mark the tests according to which dependencies they need. Run the tests that need the dev dependencies after installing the dev virtualenv, and run the others after installing the default one, to ensure that the repo will work with the light deps.
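A sketch of how that split could look in a test file, using the `inference_time` marker name that this thread later settles on; the test bodies here are placeholders, not the repo's actual tests:

```python
import pytest

# Light (default) install: only the inference dependencies are present,
# so only tests marked `inference_time` should run there.
@pytest.mark.inference_time
def test_predict_smoke():
    # placeholder: would call predict on a tiny input
    assert True

# Unmarked test: needs the dev virtualenv with the full dependency set.
def test_preprocess_pipeline():
    # placeholder: would exercise preprocessing/training code
    assert True
```

Registering the marker in `pytest.ini` (or `setup.cfg`) avoids the unknown-marker warning and lets `pytest -m` select either set.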

AndreiXYZ commented 1 year ago

Build is now passing. Tests are working, including locally via running pytest. @nsorros Please review the changes I made when you have time

nsorros commented 1 year ago

Are those dependencies needed in unpinned_requirements for predict and download to work? If not, move them to dev.

matplotlib
gensim==4.0.0
scispacy
scikit-multilearn
streamlit
seaborn

nsorros commented 1 year ago

Also, create two sets of tests that you run independently using a pytest mark. For the light installation, run only the predict/download tests; for all other installations, run everything. For the light run, install only the light version.
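The two runs could then look like this, assuming an `inference_time` marker (the name used later in this thread) is registered in `pytest.ini`:

```shell
# Light install: run only the tests that work with the light dependencies.
pytest -m inference_time

# Dev install: run the full suite.
pytest
```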

nsorros commented 1 year ago

Also, I do not think download or predict works. Can you provide a sample that works?

In the download case, there is a wrong version in the package, but even when using version = "0.2.4" this only downloads xlinear. If anything, we should download bertmesh, although we can discuss whether that is needed at this point.

In the predict case, it requires a path to a label binarizer, which is not needed for bert mesh. I ran grants_tagger predict malaria Wellcome/WellcomeBertMesh models/xlinear/label_binarizer-2022.12.0.pkl, i.e. using the xlinear label binarizer, but this did not work either.

I suggest

nsorros commented 1 year ago

We should also use this opportunity to simplify the default requirements. Here is the list I used locally that mostly worked

pandas
xlrd
scikit-learn
numpy
transformers
scipy
wasabi
typer
tqdm
requests
openpyxl
torch

this is for unpinned requirements.

nsorros commented 1 year ago

After all that is done, also check which packages take the most space in the virtualenv, say the top 5? For me, at this point, these are

562912  venv/lib/python3.8/site-packages//torch
208728  venv/lib/python3.8/site-packages//scipy
117504  venv/lib/python3.8/site-packages//pandas
117160  venv/lib/python3.8/site-packages//transformers
111576  venv/lib/python3.8/site-packages//numpy
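A listing like the one above can be produced with a `du` one-liner (sizes in KB, matching the numbers shown; adjust the Python version in the path to whatever your venv uses):

```shell
# Five largest packages in the virtualenv, size in KB, largest first.
du -sk venv/lib/python3.8/site-packages/* | sort -rn | head -5
```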

The heaviest is torch, which is 500MB for me but gets into the GBs on some Linux variants. It would be good to force a CPU installation of torch for inference in order to make this really light 🪶
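One way to force that is PyTorch's CPU-only wheel index (this is the officially documented index URL; whether it belongs in the make recipe or in the requirements file is an open choice here):

```shell
# Install a CPU-only torch wheel, avoiding the CUDA runtime packages
# that push the install into the GBs on Linux.
pip install torch --index-url https://download.pytorch.org/whl/cpu
```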

AndreiXYZ commented 1 year ago

> also create two sets of the tests that you run independently using pytest mark. for the light installation run only the predict / download tests and for all other tests run everything. for the light only install the light version

The predict test I decided to skip for now; it was a bit too much hassle to adapt it to the new model.

AndreiXYZ commented 1 year ago

You can run tests reserved for inference time via: pytest -m inference_time

nsorros commented 1 year ago

Also, the CUDA libraries are not removed. Add a grep -v in the make recipe.

ERROR: Could not find a version that satisfies the requirement nvidia-cublas-cu11==11.10.3.66 (from versions: 0.0.1.dev5, 0.0.1)
ERROR: No matching distribution found for nvidia-cublas-cu11==11.10.3.66
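A sketch of that filtering, assuming the recipe regenerates the pins via `pip freeze` (the output file name is illustrative):

```shell
# Drop the CUDA runtime wheels (nvidia-*) and triton from the pinned
# requirements so the light install does not try to fetch them.
pip freeze | grep -v -E '^(nvidia-|triton)' > requirements.txt
```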

nsorros commented 1 year ago

At the moment the light virtualenv takes 4.8GB on Ubuntu, and this is why:

2647480 venv/lib/python3.8/site-packages/nvidia
1359556 venv/lib/python3.8/site-packages/torch
188196  venv/lib/python3.8/site-packages/triton
92672   venv/lib/python3.8/site-packages/pydantic
87564   venv/lib/python3.8/site-packages/scipy

If we can force a cpu installation of torch the size will reduce dramatically.

nsorros commented 1 year ago

We are now at 1.4GB for the default virtualenv, which is quite light 🪶 (compared to the almost 6GB before)

1424296 venv/lib/python3.8/site-packages/
730120  venv/lib/python3.8/site-packages/torch
92672   venv/lib/python3.8/site-packages/pydantic
87564   venv/lib/python3.8/site-packages/scipy
67852   venv/lib/python3.8/site-packages/sympy
63360   venv/lib/python3.8/site-packages/pandas