Problem with multi-gpus

SMohammadi89 commented 2 years ago

Hi,

Thanks for your great work. I would like to train the model on multiple GPUs, but I get this error: "RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect."

It happens when I run this command:

CUDA_VISIBLE_DEVICES=0,1 singularity exec --nv --writable-tmpfs -B /work/myname/ /work/myname/pointr.sif bash ./scripts/dist_train.sh 2 13232 --config ./cfgs/PCN_models/PoinTr.yaml --exp_name example

Note that I do not have any problems when using a single GPU.

yuxumin commented 2 years ago

Hi, can you provide more details about your issue, such as the logs, the CUDA version, and the number of GPUs on your server?

SMohammadi89 commented 2 years ago

This is the complete error. The CUDA version is 10.2, and I have 4 Tesla V100 GPUs.

  File "main.py", line 68, in <module>
    main()
  File "main.py", line 64, in main
    run_net(args, config, train_writer, val_writer)
  File "/work/semohammadi/PoinTr/tools/runner.py", line 26, in run_net
    base_model.to(args.local_rank)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 899, in to
    return self._apply(convert)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 570, in _apply
    module._apply(fn)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 593, in _apply
    param_applied = fn(param)
  File "/home/semohammadi/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 897, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
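
The failing call is base_model.to(args.local_rank), and "invalid device ordinal" is what CUDA reports when a process asks for a device index that is not visible to it. One possible cause, not confirmed anywhere in this thread, is that the process inside the container sees fewer GPUs than the number of ranks being launched. The sketch below is a minimal way to check that condition. The helper names visible_gpu_ids, torch_device_count, and rank_is_valid are hypothetical and are not part of PoinTr or PyTorch; only torch.cuda.device_count() is a real PyTorch call.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper: it only parses the variable and does not query CUDA.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: no restriction is applied through this variable
    return [tok.strip() for tok in value.split(",") if tok.strip()]


def torch_device_count():
    """Return torch.cuda.device_count(), or None if PyTorch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.device_count()


def rank_is_valid(local_rank, device_count):
    """A local rank can be used as a device index only if 0 <= rank < count."""
    return 0 <= local_rank < device_count
```

Running print(visible_gpu_ids(), torch_device_count()) inside the container, with the same environment as the training command, shows what each process can actually see. If the reported device count is smaller than the number of processes being launched, rank_is_valid returns False for the higher ranks, which would be consistent with the error above.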

yuxumin / PoinTr

Problem with multi-gpus #31