Hi, thanks for sharing the codebase for your work. I am trying to train the network on custom data, but I got the following error:
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [46,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [47,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [48,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [49,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [50,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [51,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [52,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [53,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [54,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [55,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [56,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [57,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [58,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [59,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [60,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [61,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [62,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:102: operator(): block: [1,0,0], thread: [63,0,0] Assertion `input_val >= zero && input_val <= one` failed.
0%| | 1/15356 [00:01<5:05:31, 1.19s/it]
Traceback (most recent call last):
File "/media/home/C/NgeNet/train.py", line 224, in
main()
File "/media/home/C/NgeNet/train.py", line 124, in main
loss_dict = model_loss(coords_src=coords_src,
File "/home/anaconda3/envs/Negnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/home/C/NgeNet/losses/loss.py", line 130, in forward
overlap_loss_v = 0.5 * self.overlap_loss(ol_scores_src, ol_gt_src) + \
File "/media/home/C/NgeNet/losses/loss.py", line 65, in overlap_loss
weights[ol_gt > 0.5] = 1 - ratio
RuntimeError: CUDA error: device-side assert triggered
Process finished with exit code 1
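For context, the failing assertion at Loss.cu:102 is the input-range check inside `binary_cross_entropy`: the predicted scores must lie in [0, 1], and a NaN score fails the check as well. Since CUDA device-side asserts are raised asynchronously, the Python line in the traceback (`weights[ol_gt > 0.5] = 1 - ratio`) is probably just where the error surfaced, not where it originated. A minimal sketch that reproduces the same class of failure (the tensor values here are made up for illustration; recent PyTorch versions apply the same range check on CPU):

```python
import torch
import torch.nn.functional as F

# nn.BCELoss / F.binary_cross_entropy require inputs in [0, 1];
# a NaN in the scores violates the check just like an out-of-range value.
scores = torch.tensor([0.3, float('nan'), 0.9])  # NaN sneaks into the scores
target = torch.tensor([0.0, 1.0, 1.0])
F.binary_cross_entropy(scores, target)  # raises a RuntimeError: input outside [0, 1]
```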
It seems that during training the network weights grow too large, which makes the variable `q_feats_local` blow up; the `leaky_relu` output then becomes NaN, so the loss receives values outside [0, 1] and back-propagation fails with the assert above. Have you encountered this situation in your experiments?
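In case it helps with debugging, this is the kind of instrumentation I have in mind to confirm where the non-finite values first appear (`model` below is a placeholder for the NgeNet model instance, and `nan_hook` is a name of my own):

```python
import torch

# Make autograd report which op produced a non-finite gradient.
torch.autograd.set_detect_anomaly(True)

def nan_hook(module, inputs, output):
    # Flag the first layer whose forward output goes NaN/inf.
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        raise RuntimeError(f"non-finite output in {module.__class__.__name__}")

for m in model.modules():
    m.register_forward_hook(nan_hook)
```

As a stopgap against the weights exploding, gradient clipping (e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` after the backward pass) or a lower learning rate might also help, though it would not explain the root cause.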