wurenkai / UltraLight-VM-UNet

[arXiv] The official code for "UltraLight VM-UNet: Parallel Vision Mamba Significantly Reduces Parameters for Skin Lesion Segmentation".

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` #6

Closed · gexinyuan1997 closed this issue 7 months ago

gexinyuan1997 commented 7 months ago

Thanks to the authors for the excellent work. I tried to train the model on my own dataset; however, I ran into two problems.

My environment is: Ubuntu 18.04, RTX 3090 (24 GB), CUDA 11.7.

The dataset was prepared with Prepare_your_dataset.py, using scipy==1.2.1.

Problem #1 is:

----------Creating logger----------

----------GPU init----------

----------Preparing dataset----------

----------Prepareing Models----------

SC_Att_Bridge was used

----------Prepareing loss, opt, sch and amp----------

----------Set other params----------

----------Training----------

../aten/src/ATen/native/cuda/Loss.cu:92: operator(): block: [321,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
../aten/src/ATen/native/cuda/Loss.cu:92: operator(): block: [321,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.
(... the same assertion repeats for threads [98,0,0] through [127,0,0] of block [321,0,0] and threads [32,0,0] through [63,0,0] of block [566,0,0] ...)

Traceback (most recent call last):
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/train.py", line 189, in <module>
    main(config)
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/train.py", line 132, in main
    train_one_epoch(
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/engine.py", line 40, in train_one_epoch
    loss.backward()
  File "/home/sar8186/anaconda3/envs/vmunet/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/sar8186/anaconda3/envs/vmunet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Problem #2 came up after I reduced the batch size to 2:

----------Creating logger----------

----------GPU init----------

----------Preparing dataset----------

----------Prepareing Models----------

SC_Att_Bridge was used

----------Prepareing loss, opt, sch and amp----------

----------Set other params----------

----------Training----------

train: epoch 1, iter:0, loss: 1.8564, lr: 0.001
Traceback (most recent call last):
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/train.py", line 189, in <module>
    main(config)
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/train.py", line 132, in main
    train_one_epoch(
  File "/home/sar8186/vmunet/UltraLight-VM-UNet-main/engine.py", line 40, in train_one_epoch
    loss.backward()
  File "/home/sar8186/anaconda3/envs/vmunet/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/sar8186/anaconda3/envs/vmunet/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Process finished with exit code 1

Using the same environment, I was able to train VM-UNet.

wurenkai commented 7 months ago

Hi. There are a couple of things you can check for the two issues you are running into. For the first problem, check that the number of categories 'num_classes' in 'config_setting.py' is 1 and that the model output keeps the sigmoid activation, i.e. 'return torch.sigmoid(out0)'. Also make sure the data is correctly loaded into the '.npy' files. When 'Prepare_your_dataset.py' runs correctly, it prints the file numbers:

1
2
3
...

Also, you might want to check the discussion in issue #3. As for the second problem, I have not encountered anything similar in my runs; I have kept the batch size at 8 throughout. I hope my answer helps you.
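
As a quick way to verify the two points above (that the masks are binary in [0, 1] and that the tensor fed to the loss went through a sigmoid, which is what the `input_val >= zero && input_val <= one` assertion is about), here is a minimal diagnostic sketch; it is not part of the repository, and the mask file name and tensor shape are assumptions:

```python
import numpy as np
import torch

# Hypothetical file name -- substitute whatever Prepare_your_dataset.py wrote
# for the training masks in your setup.
masks = np.asarray(np.load('mask_train.npy', allow_pickle=True), dtype=np.float32)

# BCE-style losses assert that every value is in [0, 1]; 0/255 PNG masks must be
# scaled down before training, and num_classes in config_setting.py should be 1.
print('mask value range:', masks.min(), masks.max())
assert masks.min() >= 0.0 and masks.max() <= 1.0, 'normalize masks to [0, 1]'

# The tensor handed to the loss must also be in [0, 1], i.e. the model's forward
# pass should end with torch.sigmoid(out0) as in the repository.
logits = torch.randn(2, 1, 256, 256)   # stand-in for a raw single-channel decoder output
probs = torch.sigmoid(logits)
assert probs.min() >= 0.0 and probs.max() <= 1.0
```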

wurenkai commented 7 months ago

We tried setting the batch size to 2 this morning and it ran without issue. Have you ever run the code correctly on the ISIC2017 dataset? That could help you with your own dataset. It is more likely that you are having problems processing the data images and labels, which causes the data to fail to load correctly into the '.npy' files.
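
If it is unclear whether the '.npy' files were built correctly, a quick sanity check is to load them and compare the image and mask counts. The file names below are assumptions; substitute the files your run of Prepare_your_dataset.py actually produced:

```python
import numpy as np

# Assumed output names for the three splits.
pairs = [('data_train.npy', 'mask_train.npy'),
         ('data_val.npy',   'mask_val.npy'),
         ('data_test.npy',  'mask_test.npy')]

for data_name, mask_name in pairs:
    data = np.load(data_name, allow_pickle=True)
    mask = np.load(mask_name, allow_pickle=True)
    print(f'{data_name}: {len(data)} images, {mask_name}: {len(mask)} masks')
    # An empty array, or an image/mask count mismatch, means the split did not
    # line up with the files on disk and the '.npy' files were not built correctly.
    assert len(data) == len(mask) and len(data) > 0
```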

gexinyuan1997 commented 7 months ago

Many thanks to the author for the reply. I have found the problem: the numbers of training, validation, and test samples need to match the actual amount of data exactly when the dataset is split.
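
For reference, one way to avoid this mismatch is to derive the split sizes from the actual number of files instead of hard-coding them. This is only a sketch; the directory layout and the 7:1:2 train/val/test ratio are assumptions, not the repository's exact code:

```python
import os

image_dir = 'data/your_dataset/images'   # assumption: all images in a single folder
files = sorted(os.listdir(image_dir))
n = len(files)

# Assumed 7:1:2 train/val/test ratio; the point is that the three counts are
# derived from n, so they always sum to the actual number of samples on disk.
n_train = int(n * 0.7)
n_val = int(n * 0.1)
n_test = n - n_train - n_val

train_files = files[:n_train]
val_files = files[n_train:n_train + n_val]
test_files = files[n_train + n_val:]
print(f'total={n}, train={n_train}, val={n_val}, test={n_test}')
```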

chenyucong1 commented 7 months ago

I have the same problem. How can I solve it?


wurenkai commented 7 months ago

@chenyucong1 Hi, we have summarized the earlier points related to this issue, and you can check them one by one. We recommend first reproducing the results on the ISIC2017 data, which helps you troubleshoot data preprocessing and environment issues case by case. The ISIC2017 dataset also gives you a template to follow when preparing your own dataset. Most of the issues reported here were resolved this way.