Open Jonyond-lin opened 1 year ago
Hello, I have a question about a crash during training. I ran the following command, and the full error output follows it:
python tools/train_net.py --config-file configs/DPText_DETR/Pretrain/R_50_poly.yaml --num-gpus 2
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: misaligned address
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fb64ea53a22 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10aa3 (0x7fb64ee10aa3 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fb64ee12147 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fb64ea3d5a4 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fb6f47612e9 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x276 (0x7fb6f4757d16 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fb6f4786e32 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x1d (0x7fb6f478b47d in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xd891ef (0x7fb6f47891ef in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x4ff8d0 (0x7fb6f3eff8d0 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x500b3e (0x7fb6f3f00b3e in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]
frame #14: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]
frame #15: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]
frame #16: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]
frame #17: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]
frame #18: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x5a726b]

Traceback (most recent call last):
  File "tools/train_net.py", line 291, in <module>
    launch(
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR
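My PyTorch version matches the requirement in your README. My cudatoolkit version (as reported by nvcc -V) is 11.6, whereas yours is 11.1. I have read online that CUDA releases within the same major version are compatible with each other. Do I need to switch to 11.1? Alternatively, could you offer some debugging suggestions based on the error output? Many thanks!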
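On the version question, it may help to compare the toolkit that nvcc reports with the CUDA and cuDNN versions the installed PyTorch build was compiled against, since the two can differ. PyTorch binary packages normally ship their own CUDA runtime, and the nvcc toolkit mainly matters when compiling custom CUDA extensions. The sketch below is only an illustrative check, not a diagnosis of this crash; the helper name collect_versions is made up for this example, and it tolerates nvcc or torch being absent.

```python
import importlib
import shutil
import subprocess


def collect_versions():
    """Collect nvcc and PyTorch CUDA/cuDNN version info (hypothetical helper)."""
    info = {
        "nvcc": None,           # raw text printed by `nvcc -V`, if nvcc is on PATH
        "torch": None,          # PyTorch version string
        "torch_cuda": None,     # CUDA version PyTorch was built against
        "cudnn": None,          # cuDNN version PyTorch sees, as an integer
        "gpu_available": None,  # whether PyTorch can reach a CUDA device
    }

    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        try:
            result = subprocess.run(
                [nvcc, "-V"], capture_output=True, text=True, check=False
            )
            info["nvcc"] = result.stdout.strip() or None
        except OSError:
            pass  # nvcc exists but could not be executed; leave as None

    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda
    info["cudnn"] = torch.backends.cudnn.version()
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

If torch_cuda and the nvcc release disagree, any extension compiled with that nvcc was built against a different toolkit than the one PyTorch uses, which is one plausible (but unconfirmed) source of low-level CUDA errors.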
I tried switching CUDA to 11.1, but the problem persists. I can't figure out what is going on, so any help would be greatly appreciated.
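One general way to narrow down an error like this: CUDA kernels are launched asynchronously, so the Python frame where the error surfaces (here losses.backward()) is not necessarily the operation that faulted. Setting the environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, and running on a single GPU takes the distributed reducer out of the picture. The sketch below only builds such a command; build_debug_command is a hypothetical helper, and whether this configuration trains correctly on one GPU is an assumption.

```python
import os
import sys


def build_debug_command(config_file, num_gpus=1):
    """Return (cmd, env) for a synchronous, single-process debugging run."""
    env = dict(os.environ)
    # Force synchronous kernel launches so the failing op appears in the traceback.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [
        sys.executable,
        "tools/train_net.py",
        "--config-file",
        config_file,
        "--num-gpus",
        str(num_gpus),
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_command(
        "configs/DPText_DETR/Pretrain/R_50_poly.yaml"
    )
    print("CUDA_LAUNCH_BLOCKING=1", " ".join(cmd))
    # To actually launch it from the repository root, uncomment:
    # import subprocess
    # subprocess.run(cmd, env=env, check=False)
```

Synchronous launches slow training considerably, so this is meant for a short reproduction run rather than for normal training.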