Closed AtiqEmenent closed 1 year ago
I get a similar error when running inference on a Linux machine with a local GPU. Do you have a solution yet? Thank you!
/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
Failures:
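The deprecation warning above asks scripts to read the local rank from the environment instead of from a `--local_rank` argument. A minimal sketch of that change, assuming the training script parses its arguments with `argparse`; the helper name `resolve_local_rank` is hypothetical and does not come from this repository:

```python
import argparse
import os


def resolve_local_rank(argv=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    torchrun exports LOCAL_RANK for each worker. The legacy launcher passed
    the rank as a command-line flag instead, so that flag is kept as a fallback.
    """
    parser = argparse.ArgumentParser()
    # Accept both spellings of the legacy flag; the parsed value is stored
    # as args.local_rank and defaults to 0 for single-process runs.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args ignores options that belong to the training script.
    args, _unknown = parser.parse_known_args(argv)

    env_value = os.environ.get("LOCAL_RANK")
    return int(env_value) if env_value is not None else args.local_rank
```

Calling `resolve_local_rank()` with no arguments reads `sys.argv`, so the same helper should behave the same under `torchrun` and under the deprecated `torch.distributed.launch`. Note that this only addresses the warning; it is not necessarily the cause of the failure reported here.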
I just set up the environment as instructed and tested the code on a new machine with a V100. Everything seems to be working fine.
Thank you! It turned out to be a driver issue.
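For anyone who wants to rule out a driver problem before digging into the training code, a quick check along these lines may help. This is only a sketch under assumptions: it relies on `nvidia-smi` being on the `PATH`, and the function name `gpu_driver_report` is made up for illustration.

```python
import shutil
import subprocess


def gpu_driver_report():
    """Collect basic facts about the NVIDIA driver and PyTorch's view of CUDA."""
    report = {
        "nvidia_smi_found": False,
        "driver_version": None,
        "torch_installed": False,
        "cuda_available": None,
        "torch_cuda_build": None,
    }

    # Ask the driver for its version through nvidia-smi, if it is installed.
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        report["nvidia_smi_found"] = True
        try:
            result = subprocess.run(
                [smi, "--query-gpu=driver_version", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired):
            result = None
        if result is not None and result.returncode == 0 and result.stdout.strip():
            report["driver_version"] = result.stdout.strip().splitlines()[0]

    # PyTorch is optional here so the check also runs in a bare environment.
    try:
        import torch
    except (ImportError, OSError):
        return report

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    report["torch_cuda_build"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in gpu_driver_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi_found` is true but `cuda_available` is false, a mismatch between the installed driver and the CUDA version PyTorch was built against is one possible explanation worth checking.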
I am trying to train your model on Google Colab using the following command: `!python -m torch.distributed.launch --nproc_per_node=1 train_bit.py --config ./configs/bit++_rbi.yaml` But I get the following error (most likely related to the GPU available on Colab):
`ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7212) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_bit.py FAILED
Failures: