likojack opened 4 years ago
This error comes from the DCNv2 layers. It is probably caused by a mismatch between the CUDA version or GPU type of the machine you are running on and the CUDA version or GPU type DCNv2 was compiled for. Please try recompiling DCNv2 in the same environment you train in.
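One way to check for such a mismatch is to compare the compute capability of every GPU against the architectures the extension was built for. The sketch below is only an illustration: the helper names (`can_run`, `devices_without_kernel`, `query_capabilities`) are hypothetical, the compatibility rule is the usual CUDA binary rule (same major version, device minor version at least the built one), and it ignores any PTX that could be JIT-compiled for a newer GPU.

```python
from typing import Dict, Iterable, List, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (6, 1) for sm_61


def can_run(device: Capability, built: Capability) -> bool:
    """Binary built for `built` runs on `device` if the major versions
    match and the device minor version is not older (PTX JIT ignored)."""
    return device[0] == built[0] and device[1] >= built[1]


def devices_without_kernel(
    device_caps: Dict[int, Capability], built_for: Iterable[Capability]
) -> List[int]:
    """Return the device indices that none of the built architectures cover."""
    built = list(built_for)
    return [
        idx
        for idx, cap in sorted(device_caps.items())
        if not any(can_run(cap, b) for b in built)
    ]


def query_capabilities() -> Dict[int, Capability]:
    """Read the capability of every visible GPU (requires PyTorch)."""
    import torch  # imported lazily so the helpers above work without it

    return {
        i: torch.cuda.get_device_capability(i)
        for i in range(torch.cuda.device_count())
    }
```

On the training machine, `query_capabilities()` shows whether GPUs 2 and 3 are a different type from the GPU used for single-GPU training, and `torch.version.cuda` shows the CUDA version PyTorch was built with. If the capabilities differ, that would be consistent with the "no kernel image is available" message.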
I can train the network on KITTI with a single GPU. However, when I add "--gpus 2,3" for multi-GPU training, using the following full command:
python main.py tracking --exp_id kitti_fulltrain --dataset kitti_tracking --dataset_version train --pre_hm --same_aug --hm_disturb 0.05 --lost_disturb 0.2 --fp_disturb 0.1 --batch_size 4 --load_model ../models/nuScenes_3Ddetection_e140.pth --gpus 2,3
I got the following error:
error in modulated_deformable_im2col_cuda: no kernel image is available for execution on the device
Traceback (most recent call last):
File "main.py", line 101, in <module>
main(opt)
File "main.py", line 70, in main
log_dict_train, _ = trainer.train(epoch, train_loader)
File "/home/kejie/CenterTrack/src/lib/trainer.py", line 317, in train
return self.run_epoch('train', epoch, data_loader)
File "/home/kejie/CenterTrack/src/lib/trainer.py", line 149, in run_epoch
output, loss, loss_stats = model_with_loss(batch)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/kejie/CenterTrack/src/lib/trainer.py", line 98, in forward
outputs = self.model(batch['image'], pre_img, pre_hm)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/kejie/CenterTrack/src/lib/model/networks/base_model.py", line 75, in forward
feats = self.imgpre2feats(x, pre_img, pre_hm)
File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 633, in imgpre2feats
x = self.dla_up(x)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 572, in forward
ida(layers, len(layers) -i - 2, len(layers))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/kejie/CenterTrack/src/lib/model/networks/dla.py", line 543, in forward
layers[i] = upsample(project(layers[i]))
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 778, in forward
output_padding, self.groups, self.dilation)
RuntimeError: CUDA error: an illegal memory access was encountered
Any clues?
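A note on reading the traceback: "replica 0 on device 0" does not necessarily mean physical GPU 0. If the training script restricts the visible GPUs through the CUDA_VISIBLE_DEVICES environment variable (an assumption here, the traceback alone does not confirm it), the GPUs passed as "2,3" are renumbered from 0 inside the process. A minimal sketch of that renumbering, with a hypothetical helper name:

```python
from typing import Dict


def logical_to_physical(visible_devices: str) -> Dict[int, str]:
    """Map the in-process CUDA device index to the physical GPU id
    listed in a CUDA_VISIBLE_DEVICES-style string such as "2,3"."""
    ids = [part.strip() for part in visible_devices.split(",") if part.strip()]
    return dict(enumerate(ids))


# With "--gpus 2,3", logical device 0 would be physical GPU 2.
mapping = logical_to_physical("2,3")
print(mapping)
```

Under that assumption, the failing "device 0" would be physical GPU 2, so that is the card whose type should be compared against the one DCNv2 was compiled for.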