zju3dv / snake

Code for "Deep Snake for Real-Time Instance Segmentation" CVPR 2020 oral

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #246

Closed yilinliu610730 closed 1 year ago

yilinliu610730 commented 1 year ago

loading annotations into memory...
Done (t=0.01s)
creating index...
index created!
loading annotations into memory...
Done (t=0.05s)
creating index...
index created!
loading annotations into memory...
Done (t=0.07s)
creating index...
index created!
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=383 error=8 : invalid device function
Traceback (most recent call last):
  File "/home/VANDERBILT/liuy99/Documents/snake/train_net.py", line 54, in <module>
    main()
  File "/home/VANDERBILT/liuy99/Documents/snake/train_net.py", line 50, in main
    train(cfg, network)
  File "/home/VANDERBILT/liuy99/Documents/snake/train_net.py", line 25, in train
    trainer.train(epoch, train_loader, optimizer, recorder)
  File "/home/VANDERBILT/liuy99/Documents/snake/lib/train/trainers/trainer.py", line 38, in train
    output, loss, loss_stats, image_stats = self.network(batch)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/train/trainers/snake.py", line 19, in forward
    output = self.net(batch['inp'], batch)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/ct_snake.py", line 54, in forward
    output, cnn_feature = self.dla(x)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 469, in forward
    x = self.base(x)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "lib/networks/snake/dla.py", line 289, in forward
    x = self.base_layer(x)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/VANDERBILT/liuy99/anaconda3/envs/snake/lib/python3.7/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Process finished with exit code 1

Setup: CUDA 9.0, GPU NVIDIA RTX A5000, PyTorch 1.1.0


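What I have tried:

  1. Installing torch with pip3 install -U https://download.pytorch.org/whl/cu90/torch-1.1.0-cp37-cp37m-linux_x86_64.whl
  2. Rebooting
  3. Reducing batch_size; it is now batch_size = 1

I have read that CUDA 9.0 may not support this card, and that GPUs from the 3070 upward only work with CUDA 11 or later. Is that right? I tried CUDA 11.4 earlier, but dcn would not install and installing the packages in requirements.txt produced many errors, so I switched to the official Deep Snake setup of CUDA 9.0 and PyTorch 1.1.0. Any suggestions?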

pengsida commented 1 year ago

It does seem that CUDA 9.0 cannot support GPUs from the 3070 upward.
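For anyone who wants to check this on their own machine, here is a minimal sketch of the compatibility rule that is likely at play. The RTX A5000 is an Ampere card with compute capability 8.6, and an "invalid device function" error usually means the installed binary contains no kernels for the GPU's architecture. The helper names `parse_arch` and `build_supports_device` are made up for illustration, and the `old_build` architecture list is an assumed example, not the verified contents of the torch 1.1.0 cu90 wheel. `torch.cuda.get_arch_list()` only exists in newer PyTorch releases, so the live check is guarded.

```python
import re


def parse_arch(arch):
    """Turn 'sm_86' or 'compute_80' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognised architecture string: {arch!r}")
    kind, major, minor = match.groups()
    return kind, int(major), int(minor)


def build_supports_device(capability, arch_list):
    """Return True if a build compiled for arch_list can run on the device.

    capability is a (major, minor) tuple such as (8, 6).
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        kind, major, minor = parse_arch(arch)
        # Compiled kernels (sm_XY) run on the same major version
        # with an equal or higher minor version.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # Embedded PTX (compute_XY) can be JIT-compiled for any
        # device of equal or newer capability.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical architecture list for an old CUDA 9.0 era build.
    old_build = ["sm_35", "sm_50", "sm_60", "sm_70"]
    print("old build runs on an 8.6 device:",
          build_supports_device((8, 6), old_build))

    # Live check, only if a recent PyTorch and a GPU are present.
    try:
        import torch
    except ImportError:
        torch = None
    if (torch is not None and torch.cuda.is_available()
            and hasattr(torch.cuda, "get_arch_list")):
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(cap, archs, build_supports_device(cap, archs))
```

If the live check prints False, the PyTorch build has to be replaced with one compiled against a CUDA version that knows the card; reinstalling the same wheel, rebooting, or shrinking batch_size would not be expected to help.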

Brion112233 commented 1 year ago

dcn can be installed with CUDA 11.4. Try PyTorch 1.11.0 together with the dcn_v2 (pytorch_1.11.0) version.
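After switching environments, a quick smoke test can confirm that cuDNN works before rebuilding dcn and launching a full training run. The sketch below pushes a tiny Conv2d + BatchNorm2d stack through the GPU, loosely mimicking the layer type where the traceback above ends. The function name `cudnn_smoke_test` and the layer sizes are arbitrary choices for illustration, not taken from the snake code.

```python
def cudnn_smoke_test():
    """Run a tiny conv + batch norm forward/backward pass on the GPU.

    Returns a short status string instead of raising, so it can be
    called safely on machines without torch or without a GPU.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    try:
        net = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, kernel_size=7, padding=3, bias=False),
            torch.nn.BatchNorm2d(16),
            torch.nn.ReLU(),
        ).cuda()
        x = torch.randn(1, 3, 64, 64, device="cuda")
        net(x).sum().backward()
        # Kernel launches are asynchronous; synchronize so that any
        # failure surfaces here rather than at a later call.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return "ok"


if __name__ == "__main__":
    print(cudnn_smoke_test())
```

A result of "ok" suggests the PyTorch build matches the GPU, so any remaining errors are more likely to come from the dcn_v2 build or the other requirements.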