zaiweizhang / H3DNet

MIT License

training error #4

Closed mmahdavian closed 4 years ago

mmahdavian commented 4 years ago

Hi

I am getting the following errors when training starts. Do you know what might be the reason?

Environment: Ubuntu 16, PyTorch 1.1, tensorflow-gpu 1.14, CUDA 10, cuDNN 7.4, GPU: RTX 2080 Ti

Thank you

**** EPOCH 000 ****
Current learning rate: 0.001000
Current BN decay momentum: 0.500000
2020-08-02 19:38:48.510935
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=11 : invalid argument
Traceback (most recent call last):
  File "train.py", line 382, in <module>
    train(start_epoch)
  File "train.py", line 361, in train
    train_one_epoch()
  File "train.py", line 257, in train_one_epoch
    end_points = net(inputs, end_points)            
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sahar/Mohammad_ws/H3DNet/models/hdnet.py", line 185, in forward
    end_points = self.pnet_final(proposal_xyz, proposal_features, center_z, feature_z, center_xy, feature_xy, center_line, feature_line, end_points)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sahar/Mohammad_ws/H3DNet/models/proposal_module_refine.py", line 276, in forward
    obj_surface_center, obj_line_center = get_surface_line_points_batch_pytorch(obj_size, pred_heading, obj_center)
  File "/home/sahar/Mohammad_ws/H3DNet/utils/box_util.py", line 353, in get_surface_line_points_batch_pytorch
    surface_3d = torch.matmul(surface_3d.unsqueeze(-2), surface_rot.transpose(3,2)).squeeze(-2)
RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:450

zaiweizhang commented 4 years ago

I searched for that error online, and I think it might be an issue specific to the RTX 2080: https://github.com/pytorch/pytorch/issues/17334

I have not encountered this problem before. Can you double-check that you are running entirely on CUDA 10? Also, it looks like the problem appears randomly. If you run it multiple times, do you always get it at epoch 0, and always on that line?
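
For anyone hitting the same error, here is a rough sketch of how both questions could be checked outside of `train.py`. It is not part of H3DNet: the helper names `collect_env_report` and `repro_batched_matmul` are made up for illustration, and the tensor shapes are only an assumption about what reaches the failing `torch.matmul` call in `utils/box_util.py`. The first helper reports the CUDA version the installed PyTorch binary was built against, which can differ from the system-wide CUDA install. The second repeats a batched matmul with the same `unsqueeze`/`transpose` pattern to see whether the failure reproduces in isolation.

```python
import importlib


def collect_env_report():
    """Return a dict describing the CUDA stack that PyTorch actually sees."""
    report = {"torch_installed": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report  # nothing more to check without PyTorch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU-only builds).
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["cudnn_version"] = torch.backends.cudnn.version()
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


def repro_batched_matmul(trials=10, batch=8, num_proposals=256):
    """Repeat a batched matmul shaped like the failing call.

    Returns True if every trial passes, False on the first RuntimeError,
    and None if PyTorch is not installed. The shapes are assumed, not
    taken from the H3DNet code.
    """
    try:
        import torch
    except ImportError:
        return None

    # Falls back to the CPU so the script still runs on machines without a GPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for _ in range(trials):
        try:
            points = torch.randn(batch, num_proposals, 3, device=device)
            rot = torch.randn(batch, num_proposals, 3, 3, device=device)
            # Same pattern as the traceback: (B, N, 1, 3) @ (B, N, 3, 3) -> (B, N, 3)
            out = torch.matmul(points.unsqueeze(-2), rot.transpose(3, 2)).squeeze(-2)
            assert out.shape == (batch, num_proposals, 3)
            if device.type == "cuda":
                # CUDA kernels run asynchronously; force errors to surface here.
                torch.cuda.synchronize()
        except RuntimeError:
            # Stop at the first failure, since later calls may fail as well.
            return False
    return True


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
    print("batched matmul ok:", repro_batched_matmul())
```

If `built_with_cuda` does not start with `10`, the PyTorch binary is not a CUDA 10 build, regardless of which toolkit is installed on the system. Running the script several times would also show whether the failure is intermittent.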

mmahdavian commented 4 years ago

Thank you for the help.