ymy-k / DPText-DETR

[AAAI'23 Oral] DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Followed the instructions up to the training step and got a "misaligned address" error #27

Open Jonyond-lin opened 1 year ago

Jonyond-lin commented 1 year ago

Hello, I have a question. I ran training with the command python tools/train_net.py --config-file configs/DPText_DETR/Pretrain/R_50_poly.yaml --num-gpus 2 and got the error below. The full output is:

terminate called after throwing an instance of 'c10::CUDAError'                                                                                                                                   
  what():  CUDA error: misaligned address                                                                                                                                                         
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):                                                                                
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fb64ea53a22 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)                 
frame #1: <unknown function> + 0x10aa3 (0x7fb64ee10aa3 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)                                          
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7fb64ee12147 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)            
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7fb64ea3d5a4 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libc10.so)                                
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7fb6f47612e9 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x276 (0x7fb6f4757d16 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7fb6f4786e32 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x1d (0x7fb6f478b47d in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7fb6f3ef70f6 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xd891ef (0x7fb6f47891ef in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #11: <unknown function> + 0x4ff8d0 (0x7fb6f3eff8d0 in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #12: <unknown function> + 0x500b3e (0x7fb6f3f00b3e in /home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/lib/libtorch_python.so)                                    
frame #13: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #14: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]                                                                                                                       
frame #15: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #16: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4f9606]                                                                                                                       
frame #17: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x4d3abe]                                                                                                                       
frame #18: /home/cunjian/anaconda3/envs/DPText-DETR/bin/python() [0x5a726b]                                                                                                                       

Traceback (most recent call last):                                                               
  File "tools/train_net.py", line 291, in <module>                                               
    launch(                                     
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(                                   
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')                                                                                                                  
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():                                                                    
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)                                                                                                                            
torch.multiprocessing.spawn.ProcessRaisedException:                                              

-- Process 0 terminated with the following error:                                                
Traceback (most recent call last):                                                               
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)                                
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)                                                     
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()                    
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()                           
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)                                                                                                            
  File "/home/cunjian/anaconda3/envs/DPText-DETR/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(                                                     
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

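My PyTorch version matches the requirement in your README. My cudatoolkit version (that is, what nvcc -V reports) is 11.6, while yours is 11.1. I have read online that CUDA releases within the same major version are compatible with each other. Do I need to switch to 11.1? Or could you suggest some debugging steps based on the error output? I would really appreciate it!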
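
Since the question is about which CUDA version is in play, it may help to print the versions that PyTorch itself reports. `nvcc -V` shows the system toolkit, which can differ from the CUDA runtime a PyTorch build ships with. A minimal sketch, assuming only that it is run inside the DPText-DETR environment (the helper name `collect_env` is hypothetical, not part of this repo):

```python
import platform


def collect_env():
    """Collect the versions relevant to a CUDA/cuDNN mismatch into a dict."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter; report that and stop.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU builds).
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    # cuDNN version as an integer, or None if cuDNN is unavailable.
    info["cudnn"] = (
        torch.backends.cudnn.version()
        if torch.backends.cudnn.is_available()
        else None
    )
    if info["cuda_available"]:
        info["gpus"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Posting this output next to the `nvcc -V` result would make it easier to see whether the toolkit used to compile the custom operators matches the one PyTorch was built with.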

Jonyond-lin commented 1 year ago

I tried switching CUDA to 11.1, but the problem is still there. I have no idea what is going on; could you please take a look? orz
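
Because CUDA kernels run asynchronously, the Python traceback for an error like this may point at a later call than the one that actually failed. One common way to narrow it down is to rerun with `CUDA_LAUNCH_BLOCKING=1` on a single GPU. The sketch below only builds that command from the one quoted above. The helper `build_debug_run` is hypothetical, and `--num-gpus 1` is an assumption about what this setup allows (the batch size in the config may need adjusting for one GPU).

```python
import os
import sys


def build_debug_run(config_file, num_gpus=1):
    """Return (cmd, env) for a synchronous, easier-to-debug training run."""
    env = dict(os.environ)
    # Make CUDA kernel launches synchronous so errors surface at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [
        sys.executable,
        "tools/train_net.py",
        "--config-file",
        config_file,
        "--num-gpus",
        str(num_gpus),
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_run("configs/DPText_DETR/Pretrain/R_50_poly.yaml")
    print("CUDA_LAUNCH_BLOCKING=1", " ".join(cmd))
    # To actually launch it from the repository root:
    # import subprocess
    # subprocess.run(cmd, env=env, check=True)
```

If the single-GPU run fails the same way, the distributed setup is probably not the cause, and the new traceback should point closer to the operator that triggered the misaligned address.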