RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

The error occurs when training on the WHU dataset with:

python main.py --file_root WHU --max_steps 80000 --model_type tiny --batch_size 8 --lr 2e-4 --gpu_id 0

Epoch No. 4:    Train Loss = 0.3881     Val Loss = 0.3136        F1(tr) = 0.8638         F1(val) = 0.8978          
iteration: [5300/80920] f1: 0.398 lr: 0.0001882 loss: 0.660 time:4.223 h
../aten/src/ATen/native/cuda/Loss.cu:94: operator(): block: [96,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:94: operator(): block: [96,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:94: operator(): block: [96,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "/home/incar/software/ChangeViT/main.py", line 315, in <module>
    trainValidateSegmentation(args)
  File "/home/incar/software/ChangeViT/main.py", line 237, in trainValidateSegmentation
    train(args, trainLoader, model, optimizer, epoch, max_batches, cur_iter)
  File "/home/incar/software/ChangeViT/main.py", line 111, in train
    loss.backward()
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 301, in apply
    return user_fn(self, *args)
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 619, in wrapper
    outputs = fn(ctx, *args)
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 152, in backward
    grads = _memory_efficient_attention_backward(
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 522, in _memory_efficient_attention_backward
    grads = op.apply(ctx, inp, grad)
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/xformers/ops/fmha/cutlass.py", line 446, in apply
    (grad_q, grad_k, grad_v, grad_bias) = cls.OPERATOR(
  File "/home/incar/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
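
As the message says, CUDA errors are reported asynchronously, so the xformers frames in the traceback may not be where the failure happened; the assertion text itself points at the loss kernel in Loss.cu. A minimal sketch of enabling synchronous launches from Python, assuming the variable is set before CUDA is initialised (for example at the very top of main.py, before torch is imported):

```python
import os

# Must be set before CUDA is initialised, so do this before importing torch.
# With blocking launches, the Python traceback should point at the call
# that actually triggered the device-side assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

Prefixing the training command with CUDA_LAUNCH_BLOCKING=1 should have the same effect. Training will run slower in this mode, so it is only meant for debugging.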

I used the data at https://www.dropbox.com/scl/fi/8gczkg78fh95yofq5bs7p/WHU.zip?rlkey=05bpczx0gdp99hl6o2xr1zvyj&dl=0 as the dataset and reused the same data as the test set. Training fails with the error above at the 4th epoch. How can I solve it?
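
The assertion `input_val >= zero && input_val <= one` in Loss.cu is the range check that binary cross-entropy applies to its input. It fails when a prediction lies outside [0, 1], and also when a prediction is NaN, since every comparison with NaN is false. The sketch below shows one way to catch this on the Python side before the loss is computed. It assumes the loss in main.py is a binary cross-entropy on probabilities, which is not confirmed here; `check_bce_input` and the example tensors are hypothetical names used only for illustration.

```python
import torch
import torch.nn.functional as F


def check_bce_input(pred: torch.Tensor, name: str = "pred") -> None:
    """Raise a readable error if `pred` would trip the BCE range assertion."""
    if torch.isnan(pred).any():
        raise ValueError(f"{name} contains NaN")
    lo, hi = pred.min().item(), pred.max().item()
    if lo < 0.0 or hi > 1.0:
        raise ValueError(f"{name} outside [0, 1]: min={lo}, max={hi}")


# Hypothetical tensors standing in for the model output and the label mask.
logits = torch.tensor([[2.0, -1.0], [0.5, float("nan")]])
target = torch.tensor([[1.0, 0.0], [1.0, 0.0]])

# sigmoid keeps finite values inside [0, 1] but passes NaN through unchanged.
probs = torch.sigmoid(logits)
try:
    check_bce_input(probs)
    loss = F.binary_cross_entropy(probs, target)
except ValueError as err:
    message = str(err)
    print("bad BCE input:", message)
```

Calling such a check on the model output right before the loss in the training loop would show whether the predictions have become NaN (for example after a diverging step) or are simply not passed through a sigmoid.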
