zzh-tech / BiT

[CVPR2023] Blur Interpolation Transformer for Real-World Motion from Blur
https://zzh-tech.github.io/BiT/
MIT License

Issue: When I try to train BiT++ on Colab #6

Closed: AtiqEmenent closed this issue 1 year ago

AtiqEmenent commented 1 year ago

I am trying to train your model on Google Colab using the following command: `!python -m torch.distributed.launch --nproc_per_node=1 train_bit.py --config ./configs/bit++_rbi.yaml` But I get the following error (most likely related to the GPU available on Colab):

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7212) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_bit.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-19_09:52:32
  host      : aea723e3180b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7212)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

As I don't know which backend settings are needed to run this code on the GPU, could you please guide me on how to resolve this error?
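The child process exits with code 1 before any Python traceback is printed, so the launcher output alone does not reveal the cause. A minimal first check, assuming a standard Colab runtime with PyTorch already installed (this snippet is illustrative and not part of the BiT code base), is to confirm that the runtime actually exposes a CUDA device before invoking the distributed launcher:

```python
# Quick sanity check for the Colab runtime: confirm a CUDA device is visible
# before launching distributed training. Assumes PyTorch is installed.
import torch

print("CUDA available :", torch.cuda.is_available())
print("Device count   :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device name    :", torch.cuda.get_device_name(0))
    print("Built for CUDA :", torch.version.cuda)
```

If this prints `False` or a device count of 0, the Colab runtime type has to be switched to a GPU instance before `--nproc_per_node=1` can succeed.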
patsun commented 1 year ago

I have a similar error when running inference on a Linux computer with a local GPU. Do you have a solution yet? Thank you!

```
/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
Warning! No module named 'sounddevice'
Warning! No module named 'keras'
local_rank: 0
Traceback (most recent call last):
  File "./tools/inference/inference.py", line 59, in <module>
    torch.cuda.set_device(local_rank)
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/cuda/__init__.py", line 314, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 53740) of binary: /home/riselab/anaconda3/envs/BiT/bin/python
Traceback (most recent call last):
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/riselab/anaconda3/envs/BiT/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/inference/inference.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-11_07:48:34
  host      : riselab
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 53740)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
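`Error 804: forward compatibility was attempted on non supported HW` generally means the CUDA runtime that PyTorch was built against is newer than what the installed NVIDIA driver supports. A rough way to compare the two versions, assuming `nvidia-smi` is on PATH (this snippet is illustrative and not part of the BiT tools):

```python
# Compare the CUDA version PyTorch was compiled against with the CUDA version
# the installed NVIDIA driver supports (printed in the nvidia-smi header).
# torch.version.cuda is a build-time constant and does not initialize CUDA,
# so it still works even when cudaGetDeviceCount() fails.
import subprocess

import torch

print("PyTorch built with CUDA:", torch.version.cuda)
subprocess.run(["nvidia-smi"], check=False)  # assumes nvidia-smi is installed
```

If the driver-supported CUDA version reported by `nvidia-smi` is lower than the version PyTorch was built with, updating the driver (or installing a PyTorch build that matches the driver) typically resolves the error, which is consistent with the driver fix reported below.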
zzh-tech commented 1 year ago

I just installed the environment as instructed and tested the code on a new machine with a V100. Everything seems to be working fine.

patsun commented 1 year ago

Thank you! It turned out to be a driver issue.