y-hwang / gLM

Genomic language model predicts protein co-regulation and function
https://www.biorxiv.org/content/10.1101/2023.04.07.536042v3

Model downloading error. Is it possible to download it to a user-defined path? #10

Open: Jigyasa3 opened this issue 1 month ago

Jigyasa3 commented 1 month ago

Hi, thanks again for a great model! I am running the following example code to generate the test.esm.embs.pkl file, for which gLM downloads the esm2_t33_650M_UR50D.pt file. But I am running into an [Errno 122] Disk quota exceeded error. Is it possible to download the model to a user-defined path that has more storage space?

Code:

conda activate glm-env
sbatch --partition gpu --gpus 1 --wrap "python /home/jigyasaa/downloads/gLM/data/plm_embed.py /home/jigyasaa/downloads/gLM/data/example_data/inference_example/test.fa /groups/rubin/projects/jigyasa/eCIS/results/gLM_MLmodel/example_data/inference_example/test.esm.embs.pkl"

Error:

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm2_t33_650M_UR50D.pt" to /home/jigyasaa/.cache/torch/hub/checkpoints/esm2_t33_650M_UR50D.pt
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jigyasaa/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/hub.py", line 658, in download_url_to_file
[rank0]:     f.write(buffer)  # type: ignore[possibly-undefined]
[rank0]:     ^^^^^^^^^^^^^^^
[rank0]: OSError: [Errno 122] Disk quota exceeded

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/jigyasaa/downloads/gLM/data/plm_embed.py", line 29, in <module>
[rank0]:     model_data, regression_data = esm.pretrained._download_model_and_regression_data(model_name)
[rank0]:                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jigyasaa/.pyenv/versions/3.11.3/lib/python3.11/site-packages/esm/pretrained.py", line 54, in _download_model_and_regression_data
[rank0]:     model_data = load_hub_workaround(url)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jigyasaa/.pyenv/versions/3.11.3/lib/python3.11/site-packages/esm/pretrained.py", line 33, in load_hub_workaround
[rank0]:     data = torch.hub.load_state_dict_from_url(url, progress=False, map_location="cpu")
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/jigyasaa/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/hub.py", line 765, in load_state_dict_from_url
[rank0]:     download_url_to_file(url, cached_file, hash_prefix, progress=progress)
[rank0]:   File "/home/jigyasaa/.pyenv/versions/3.11.3/lib/python3.11/site-packages/torch/hub.py", line 670, in download_url_to_file
[rank0]:     f.close()
[rank0]: OSError: [Errno 122] Disk quota exceeded
riveSunder commented 1 month ago

Hi!

It looks like you're running out of space in ~/.cache/torch/hub, the torch hub directory. You can check and set your hub directory with torch.hub.get_dir() and torch.hub.set_dir("/path/with/space/hub") from within Python.
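For example (a minimal sketch; /path/with/space/hub is a placeholder for a directory where you have more quota):

import torch

# check where torch hub currently stores downloads (defaults to ~/.cache/torch/hub)
print(torch.hub.get_dir())

# redirect downloads to a directory with more space, for this Python session only
torch.hub.set_dir("/path/with/space/hub")
print(torch.hub.get_dir())  # now prints /path/with/space/hub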

If you use torch.hub.set_dir and then leave and re-enter Python, you'll notice that the change is not persistent (it will reset to the default value). You can also set the directory torch will use from the command line:

export TORCH_HOME=/path/with/space

Note that torch.hub.set_dir sets the path to the hub directory, but TORCH_HOME refers to the directory that contains hub, one level above.
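You can see that relationship from within Python as well (again a sketch, with /path/with/space as a placeholder; setting the variable with os.environ only affects the current process, whereas export affects the shell and every program it launches):

import os
import torch

# TORCH_HOME is the parent directory; torch appends "hub" to it
os.environ["TORCH_HOME"] = "/path/with/space"
print(torch.hub.get_dir())  # /path/with/space/hub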

Setting environment variables from the command line with export is not persistent either, but you can add the line above to your .bashrc file so that TORCH_HOME is set every time you open a new shell.

Alternatively, you can use a more complicated command to set TORCH_HOME when activating the environment.

eval "$(conda shell.bash activate glm-env) && export TORCH_HOME=/path/with/space"

If you use the last method, TORCH_HOME will still be set to your custom path after calling conda deactivate, but you can unset the variable in the same way:

# deactivate the conda env and unset the TORCH_HOME environment variable
eval "$(conda shell.bash deactivate) && unset TORCH_HOME"

At that point your shell should be back to normal (you can verify that the variable was unset with echo $TORCH_HOME).

I think the options above should let you control where model parameters are downloaded and stored, but I haven't fully replicated your issue to verify, so let me know if something goes wrong!

Jigyasa3 commented 3 weeks ago

Thank you so much @riveSunder for suggesting an option and explaining it. It works!