xingyizhou / CenterTrack

Simultaneous object detection and tracking using center points.
MIT License

Error with multi gpu training #38

Open likojack opened 4 years ago

likojack commented 4 years ago

I can train the network on KITTI with a single GPU. However, when I added "--gpus 2,3" for multi-GPU training, with the full command as follows:

python main.py tracking --exp_id kitti_fulltrain --dataset kitti_tracking --dataset_version train --pre_hm --same_aug --hm_disturb 0.05 --lost_disturb 0.2 --fp_disturb 0.1 --batch_size 4 --load_model ../models/nuScenes_3Ddetection_e140.pth --gpus 2,3

I got the following error:

error in modulated_deformable_im2col_cuda: no kernel image is available for execution on the device
Traceback (most recent call last):
  File "main.py", line 101, in <module>
    main(opt)
  File "main.py", line 70, in main
    log_dict_train, _ = trainer.train(epoch, train_loader)
  File "/home/kejie/CenterTrack/src/lib/trainer.py", line 317, in train
    return self.run_epoch('train', epoch, data_loader)
  File "/home/kejie/CenterTrack/src/lib/trainer.py", line 149, in run_epoch
    output, loss, loss_stats = model_with_loss(batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kejie/CenterTrack/src/lib/trainer.py", line 98, in forward
    outputs = self.model(batch['image'], pre_img, pre_hm)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kejie/CenterTrack/src/lib/model/networks/base_model.py", line 75, in forward
    feats = self.imgpre2feats(x, pre_img, pre_hm)
  File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 633, in imgpre2feats
    x = self.dla_up(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 572, in forward
    ida(layers, len(layers) -i - 2, len(layers))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 543, in forward
    layers[i] = upsample(project(layers[i]))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 778, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: an illegal memory access was encountered

Any clues?

xingyizhou commented 4 years ago

This comes from the DCNv2 layers. It is probably caused by a mismatch between the CUDA version / GPU type of the machine you are running on and the CUDA version / GPU type that DCNv2 was compiled for. Please try re-compiling DCNv2 in the same environment you train in.
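
One way to check for this kind of mismatch is to list the compute capability of every GPU that PyTorch can see. The sketch below is only an illustration, not part of CenterTrack: the helper names (`group_by_capability`, `arch_list`, `query_capabilities`) are made up for this example, and it assumes that the DCNv2 extension is built through PyTorch's C++/CUDA extension builder, which reads the `TORCH_CUDA_ARCH_LIST` environment variable. If the selected GPUs report different capabilities, a build that targeted only one of them could plausibly produce "no kernel image is available" on the others.

```python
from collections import defaultdict


def group_by_capability(capabilities):
    """Map each compute capability (major, minor) to the sorted GPU indices that have it."""
    groups = defaultdict(list)
    for index, capability in capabilities.items():
        groups[tuple(capability)].append(index)
    return {cap: sorted(ids) for cap, ids in groups.items()}


def arch_list(capabilities):
    """Build a TORCH_CUDA_ARCH_LIST-style string such as "6.1;7.5" (hypothetical helper)."""
    unique = sorted({tuple(cap) for cap in capabilities.values()})
    return ";".join(f"{major}.{minor}" for major, minor in unique)


def query_capabilities():
    """Return {gpu_index: (major, minor)} for the GPUs visible to this process."""
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    # Indices are those seen by PyTorch, i.e. after any CUDA_VISIBLE_DEVICES remapping.
    return {
        i: torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    }


if __name__ == "__main__":
    caps = query_capabilities()
    if not caps:
        print("No CUDA devices visible to PyTorch.")
    groups = group_by_capability(caps)
    for (major, minor), ids in groups.items():
        print(f"compute capability {major}.{minor}: GPUs {ids}")
    if len(groups) > 1:
        # Assumption: the extension build honors TORCH_CUDA_ARCH_LIST.
        print("Mixed GPU architectures; consider rebuilding DCNv2 with "
              f"TORCH_CUDA_ARCH_LIST=\"{arch_list(caps)}\"")
```

Run this in the same environment (same container, same PyTorch install) as the training command. If all GPUs report the same capability, the mismatch is more likely between the CUDA toolkit used for compilation and the one PyTorch was built against, and a clean rebuild of DCNv2 in that environment is the first thing to try.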