xiaoaoran / SemanticSTF

(CVPR 2023) The official project of "3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds"

CUDA error: an illegal memory access was encountered #3

Closed: sunnyHelen closed this issue 8 months ago

sunnyHelen commented 1 year ago

Hi, thanks for sharing your great work. When I run the code, I hit "CUDA error: an illegal memory access was encountered". It happens during the evaluation of epoch 8 with batch size 2, and during the evaluation of epoch 1 with batch size 4. Could you help me figure this out? Here is the complete output:

[2023-09-29 23:19:55.970] Epoch 8/50 started.
[loss] = 0.256, [loss_1] = 0.219, [loss_2] = 0.373: 100% 2271/2271 [05:10<00:00, 7.31it/s]
[2023-09-29 23:25:06.476] Training finished in 5 minutes 10 seconds.
19% 24/125 [00:08<00:37, 2.72it/s]
Traceback (most recent call last):
File "train.py", line 110, in
main()
File "train.py", line 91, in main
trainer.train_with_defaults(
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/train/trainer.py", line 37, in train_with_defaults
self.train(dataflow=dataflow,
File "/home/zcc/zsl/SemanticSTF-master/PointDR/core/trainers.py", line 207, in train
self.trigger_epoch()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/train/trainer.py", line 156, in trigger_epoch
self.callbacks.trigger_epoch()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
self._trigger_epoch()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/callbacks/callback.py", line 308, in _trigger_epoch
callback.trigger_epoch()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
self._trigger_epoch()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/callbacks/inference.py", line 29, in _trigger_epoch
self._trigger()
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/callbacks/inference.py", line 38, in _trigger
output_dict = self.trainer.run_step(feed_dict)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torchpack/train/trainer.py", line 125, in run_step
output_dict = self._run_step(feed_dict)
File "/home/zcc/zsl/SemanticSTF-master/PointDR/core/trainers.py", line 68, in _run_step
outputs_1, feat_1 = self.model(inputs_1)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zcc/zsl/SemanticSTF-master/PointDR/core/models/semantic_kitti/minkunet_dr.py", line 200, in forward
x1 = self.stage1(x0)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zcc/zsl/SemanticSTF-master/PointDR/core/models/semantic_kitti/minkunet_dr.py", line 74, in forward
out = self.relu(self.net(x) + self.downsample(x))
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/home/zcc/.conda/envs/stf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "torchsparse/nn/modules/conv.pyx", line 99, in torchsparse.nn.modules.conv.Conv3d.forward
File "torchsparse/nn/functional/conv/conv.pyx", line 89, in torchsparse.nn.functional.conv.conv.conv3d
File "torchsparse/nn/functional/conv/kmap/build_kmap.pyx", line 83, in torchsparse.nn.functional.conv.kmap.build_kmap.build_kernel_map
File "torchsparse/nn/functional/conv/kmap/func/hashmap_on_the_fly.pyx", line 63, in torchsparse.nn.functional.conv.kmap.func.hashmap_on_the_fly.build_kmap_imp licit_GEMM_hashmap_on_the_fly
RuntimeError: CUDA error: an illegal memory access was encountered

My environment: Python 3.8.16, CUDA 11.1, torch 1.10.0, TorchSparse 2.0.0b0.
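
A generic way to localize this kind of asynchronous CUDA failure (standard PyTorch debugging practice, not something prescribed in this thread) is to force synchronous kernel launches so the traceback points at the call that actually faults. A minimal sketch:

```python
# Force synchronous CUDA kernel launches so the Python traceback lands on the
# op that actually faults, not on a later call. The variable must be set
# before torch initializes CUDA, i.e. before importing torch in train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the flag is set
```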

jerry-dream-fu commented 10 months ago

Python 3.8.18, CUDA 11.8, torch 1.13.0+cu116, TorchSparse 2.0.0b0. I think this config is OK.
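
If it helps to compare setups, here is a quick sketch for printing the relevant versions from inside the environment; it only relies on the standard `__version__` / `torch.version.cuda` attributes, and `torchsparse.__version__` is assumed to be defined (it is in recent releases):

```python
# Print the versions relevant to this issue: Python, torch, the CUDA toolkit
# torch was built against, and torchsparse.
import platform

import torch
import torchsparse

print("Python:", platform.python_version())
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("torchsparse:", torchsparse.__version__)
print("CUDA available:", torch.cuda.is_available())
```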

Barcaaaa commented 10 months ago

Have you resolved this issue? I am also facing the same problem.

weihao1115 commented 8 months ago

Hi @sunnyHelen @Barcaaaa

Thank you for your interest in our work! We also encountered this problem when we tried to migrate the code from torchsparse 2.0.0b0 to 2.1. However, we did not face this problem when we used versions other than 2.1.

Yesterday, I followed the issue in https://github.com/mit-han-lab/torchsparse/issues/239 and successfully solved the problem. I guess it's a common problem in the current version of torchsparse. Please check it.

Lzyin commented 8 months ago

> Hi @sunnyHelen @Barcaaaa
>
> Thank you for your interest in our work! We also encountered this problem when we tried to migrate the code from torchsparse 2.0.0b0 to 2.1. However, we did not face this problem when we used versions other than 2.1.
>
> Yesterday, I followed the issue in mit-han-lab/torchsparse#239 and successfully solved the problem. I guess it's a common problem in the current version of torchsparse. Please check it.

Could you please advise on how to resolve this issue? I've adjusted the size as suggested, yet the problem persists.

weihao1115 commented 8 months ago

> > Hi @sunnyHelen @Barcaaaa Thank you for your interest in our work! We also encountered this problem when we tried to migrate the code from torchsparse 2.0.0b0 to 2.1. However, we did not face this problem when we used versions other than 2.1. Yesterday, I followed the issue in mit-han-lab/torchsparse#239 and successfully solved the problem. I guess it's a common problem in the current version of torchsparse. Please check it.
>
> Could you please advise on how to resolve this issue? I've adjusted the size as suggested, yet the problem persists.

I only encountered this problem when I was trying to use torchsparse++. As the authors of torchsparse say in the aforementioned issue, you can set kmap_mode to 'hashmap'. That's all I did to solve this bug. Actually, I have not encountered this problem with previous versions of torchsparse, so if you are facing it while using torchsparse 2.0 or below, I don't know how to solve it :).
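
For anyone looking for the concrete change, the sketch below shows roughly how that workaround can be applied with torchsparse 2.1 (torchsparse++). The `conv_config` helper names are taken from my reading of that release and may differ in your install, so treat this as an assumption to verify rather than the project's official fix:

```python
# Sketch (torchsparse 2.1 assumed): switch the kernel-map builder from the
# default "hashmap_on_the_fly" (the mode visible in the traceback above) to
# "hashmap" before building the model. get_default_conv_config,
# set_global_conv_config, and the kmap_mode field are assumed from the
# torchsparse 2.1 API; verify them against your installed version.
from torchsparse.nn import functional as F

conv_config = F.conv_config.get_default_conv_config()
conv_config.kmap_mode = "hashmap"
F.conv_config.set_global_conv_config(conv_config)

# ...then construct the MinkUNet model and run training/evaluation as usual.
```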

Lzyin commented 8 months ago

> > > Hi @sunnyHelen @Barcaaaa Thank you for your interest in our work! We also encountered this problem when we tried to migrate the code from torchsparse 2.0.0b0 to 2.1. However, we did not face this problem when we used versions other than 2.1. Yesterday, I followed the issue in mit-han-lab/torchsparse#239 and successfully solved the problem. I guess it's a common problem in the current version of torchsparse. Please check it.
> >
> > Could you please advise on how to resolve this issue? I've adjusted the size as suggested, yet the problem persists.
>
> I only encountered this problem when I was trying to use torchsparse++. As the authors of torchsparse say in the aforementioned issue, you can set kmap_mode to 'hashmap'. That's all I did to solve this bug. Actually, I have not encountered this problem with previous versions of torchsparse, so if you are facing it while using torchsparse 2.0 or below, I don't know how to solve it :).

I get it, thank you for your reply!