wellcometrust / grants_tagger

Tag grants with MeSH and other tags
MIT License

Add light grants tagger for inference as the default 🪶 #246

Closed pdan93 closed 1 year ago

pdan93 commented 1 year ago

Description

This PR makes the default grants tagger environment light and responsible only for inference. To do that:

We also

You can install the default environment and test by running

make virtualenv
make test

Fixes #201

Checklist

Release checklist

nsorros commented 1 year ago

Also, tests should pass.

nsorros commented 1 year ago

When we get tests to pass and implement a recipe to update both requirements files, let's mark the tests according to which dependencies they need. Run the tests that need the dev dependencies after installing the dev virtualenv, and run the others after installing the default one, to ensure that the repo will work with the light deps.
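A sketch of how that split could look in a test file, using the `inference_time` marker name that this thread later settles on; the test bodies here are placeholders, not the repo's actual tests:

```python
import pytest

# Light (default) install: only the inference dependencies are present,
# so only tests marked `inference_time` should run there.
@pytest.mark.inference_time
def test_predict_smoke():
    # placeholder: would call predict on a tiny input
    assert True

# Unmarked test: needs the dev virtualenv with the full dependency set.
def test_preprocess_pipeline():
    # placeholder: would exercise preprocessing/training code
    assert True
```

Registering the marker in `pytest.ini` (or `setup.cfg`) avoids the unknown-marker warning and lets `pytest -m` select either set.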

AndreiXYZ commented 1 year ago

Build is now passing. Tests are working, including locally via running pytest. @nsorros Please review the changes I made when you have time

nsorros commented 1 year ago

Are those dependencies needed in unpinned_requirements for predict and download to work? If not, move them to dev.

matplotlib
gensim==4.0.0
scispacy
scikit-multilearn
streamlit
seaborn

nsorros commented 1 year ago

Also, create two sets of tests that you run independently using a pytest mark. For the light installation, run only the predict/download tests; for all other installations, run everything. For the light run, install only the light version.
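The two runs could then look like this, assuming an `inference_time` marker (the name used later in this thread) is registered in `pytest.ini`:

```shell
# Light install: run only the tests that work with the light dependencies.
pytest -m inference_time

# Dev install: run the full suite.
pytest
```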

nsorros commented 1 year ago

Also, I do not think download or predict works. Can you provide a sample that works?

In the download case, there is a wrong version in the package, but even when using version = "0.2.4" this only downloads xlinear. If anything, we should download bertmesh, although we can discuss whether that is needed at this point.

In the predict case, it requires a path to a label binarizer, which is not needed for bert mesh. I ran grants_tagger predict malaria Wellcome/WellcomeBertMesh models/xlinear/label_binarizer-2022.12.0.pkl, i.e. using the xlinear label binarizer, but this did not work either.

I suggest

nsorros commented 1 year ago

We should also use this opportunity to simplify the default requirements. Here is the list I used locally that mostly worked

pandas
xlrd
scikit-learn
numpy
transformers
scipy
wasabi
typer
tqdm
requests
openpyxl
torch

this is for unpinned requirements.

nsorros commented 1 year ago

After all that is done, also check which packages take the most space in the virtualenv, say the top 5? For me, at this point, these are

562912  venv/lib/python3.8/site-packages//torch
208728  venv/lib/python3.8/site-packages//scipy
117504  venv/lib/python3.8/site-packages//pandas
117160  venv/lib/python3.8/site-packages//transformers
111576  venv/lib/python3.8/site-packages//numpy
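A listing like the one above can be produced with a `du` one-liner (sizes in KB, matching the numbers shown; adjust the Python version in the path to whatever your venv uses):

```shell
# Five largest packages in the virtualenv, size in KB, largest first.
du -sk venv/lib/python3.8/site-packages/* | sort -rn | head -5
```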

The heaviest is torch, which is 500MB for me but gets into the GBs on some Linux variants. It would be good to force a CPU installation of torch for inference in order to make this really light 🪶
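One way to force that is PyTorch's CPU-only wheel index (this is the officially documented index URL; whether it belongs in the make recipe or in the requirements file is an open choice here):

```shell
# Install a CPU-only torch wheel, avoiding the CUDA runtime packages
# that push the install into the GBs on Linux.
pip install torch --index-url https://download.pytorch.org/whl/cpu
```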

AndreiXYZ commented 1 year ago

> also create two sets of the tests that you run independently using pytest mark. for the light installation run only the predict / download tests and for all other tests run everything. for the light only install the light version

The predict test I decided to skip for now; it was a bit too much hassle to adapt it to the new model.

AndreiXYZ commented 1 year ago

You can run tests reserved for inference time via: pytest -m inference_time

nsorros commented 1 year ago

Also, the CUDA libraries are not removed. Add a grep -v in the make recipe.

ERROR: Could not find a version that satisfies the requirement nvidia-cublas-cu11==11.10.3.66 (from versions: 0.0.1.dev5, 0.0.1)
ERROR: No matching distribution found for nvidia-cublas-cu11==11.10.3.66
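A sketch of that filtering, assuming the recipe regenerates the pins via `pip freeze` (the output file name is illustrative):

```shell
# Drop the CUDA runtime wheels (nvidia-*) and triton from the pinned
# requirements so the light install does not try to fetch them.
pip freeze | grep -v -E '^(nvidia-|triton)' > requirements.txt
```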

nsorros commented 1 year ago

At the moment the light virtualenv takes 4.8GB on Ubuntu, and this is why:

2647480 venv/lib/python3.8/site-packages/nvidia
1359556 venv/lib/python3.8/site-packages/torch
188196  venv/lib/python3.8/site-packages/triton
92672   venv/lib/python3.8/site-packages/pydantic
87564   venv/lib/python3.8/site-packages/scipy

If we can force a cpu installation of torch the size will reduce dramatically.

nsorros commented 1 year ago

We are now at 1.4GB for the default virtualenv, which is quite light 🪶 (compared to the almost 6GB before)

1424296 venv/lib/python3.8/site-packages/
730120  venv/lib/python3.8/site-packages/torch
92672   venv/lib/python3.8/site-packages/pydantic
87564   venv/lib/python3.8/site-packages/scipy
67852   venv/lib/python3.8/site-packages/sympy
63360   venv/lib/python3.8/site-packages/pandas