Hi,
Thanks for the very interesting code. I am trying to run train_embedding.py, but it gives me the following error. I have tried searching online, and it seems there may be a version mismatch or a logic error somewhere. Below are the logs:
number of gpu : 2
sequence length : 1
train batch size: 100
valid batch size: 100
optimizer choice: 0
multiple optim : 1
num of epochs : 20
num of workers : 8
test crop type : 1
whether to flip : 1
learning rate : 0.0005
momentum for sgd: 0.9000
weight decay : 0.0005
dampening : 0.0000
use nesterov : 0
method for sgd : 1
step for sgd : 5
gamma for sgd : 0.1000
train_paths_80 : 26560
train_labels_80 : 26560
valid_paths_80 : 3040
valid_labels_80 : 3040
test_paths_80 : 20875
test_labels_80 : 20875
Invalid MIT-MAGIC-COOKIE-1 key
num train start idx 80: 26560
num of all train use: 26560
num of all valid use: 3040
num of all test use: 20875
/usr/lib/python3/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [17,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:257: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [25,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "/home/nisarg/Downloads/research/cataract/code/Trans-SVNet/train_embedding.py", line 783, in <module>
main()
File "/home/nisarg/Downloads/research/cataract/code/Trans-SVNet/train_embedding.py", line 776, in main
train_model((train_dataset_80),
File "/home/nisarg/Downloads/research/cataract/code/Trans-SVNet/train_embedding.py", line 547, in train_model
loss.backward()
File "/usr/lib/python3/dist-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/lib/python3/dist-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
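Since the assertion is about t >= 0 && t < n_classes, my plan is to sanity-check the label range in my dataset against the classifier's output size before the loss is computed. Here is a minimal sketch of that check; train_labels_80 and num_classes below are placeholders for my own label list and the final layer's output size, not values taken from the repo:

```python
import torch

# Placeholder values: in my run these would come from the same label files
# that produce the "train_labels_80 : 26560" line in the log above, and from
# the output size of the model's final classification layer.
train_labels_80 = [0, 1, 2, 3]   # hypothetical phase labels from my dataset
num_classes = 7                  # hypothetical number of output classes

labels = torch.as_tensor(train_labels_80)
print("label min:", labels.min().item(), "label max:", labels.max().item())

# nll_loss requires every target t to satisfy 0 <= t < n_classes, so this
# check should fail on the same data that triggers the device-side assert.
assert labels.min() >= 0 and labels.max() < num_classes, \
    "labels out of range for the classifier output"
```

I will also re-run with CUDA_LAUNCH_BLOCKING=1, as the error message suggests, to get a synchronous stack trace.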
I am running it on my own dataset. It would be great if you could help me with this, and also if you could share which PyTorch and dependency versions were used.
Thanks