yuantianyuan01 / StreamMapNet

Error: RuntimeError: Default process group has not been initialized, please make sure to call init_process_group #31

Open chenxinhe1 opened 1 month ago

chenxinhe1 commented 1 month ago

When I run the code in debug mode, the following error occurs:

2024-07-18 08:37:26,678 - mmdet - INFO - Checkpoints will be saved to /home/cxh/StreamMapNet/work_dirs/nusc_newsplit_480_60x30_24e by HardDiskBackend.
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
  File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 1500, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/cxh/StreamMapNet/tools/train.py", line 272, in <module>
    main()
  File "/home/cxh/StreamMapNet/tools/train.py", line 261, in main
    custom_train_model(
  File "/home/cxh/StreamMapNet/plugin/core/apis/train.py", line 30, in custom_train_model
    custom_train_detector(
  File "/home/cxh/StreamMapNet/plugin/core/apis/mmdet_train.py", line 203, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 125, in train_step
    loss, log_vars, num_samples = self(**data_dict)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 93, in forward
    return self.forward_train(*args, **kwargs)
  File "/home/cxh/StreamMapNet/plugin/models/mapers/StreamMapNet.py", line 173, in forward_train
    _bev_feats = self.backbone(img, img_metas=img_metas, points=points)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 173, in forward
    mlvl_feats = self.extract_img_feat(img=img, img_metas=img_metas)
  File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 144, in extract_img_feat
    img_feats = self.img_neck(img_feats)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 157, in forward
    laterals = [
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 158, in <listcomp>
    lateral_conv(inputs[i + self.start_level])
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 209, in forward
    x = self.norm(x)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
    world_size = torch.distributed.get_world_size(process_group)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
python-BaseException
Traceback (most recent call last):
  File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_frame.py", line 828, in trace_dispatch
    if main_debugger.in_project_scope(frame.f_code.co_filename):
  File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 612, in in_project_scope
    return pydevd_utils.in_project_roots(filename)
AttributeError: 'NoneType' object has no attribute 'in_project_roots'

Process finished with exit code 1

How can I resolve this? Thank you, and I look forward to your reply.

yuantianyuan01 commented 1 month ago

Hi, thanks for your interest in our project.

The error appears to be caused by synchronized batchnorm (SyncBN), which requires an initialized distributed process group, so it fails when training is started without distributed mode. We suggest always launching training with tools/dist_train.sh, even for debugging; you can set the world size (number of GPUs) to 1 and still use breakpoints. In addition, our dataset sampler can only be initialized correctly in DDP mode.
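For anyone who wants to see what "world size 1" amounts to inside a debugger, here is a minimal sketch, not the project's own launcher. It assumes PyTorch's standard `env://` rendezvous; the helper names `single_process_dist_env` and `init_single_process_group` and the port 29500 are made up for illustration, and whether tools/train.py can pick up such a group directly is an assumption that has not been checked against this repository.

```python
import os


def single_process_dist_env(port=29500):
    """Return the environment variables PyTorch's env:// rendezvous reads,
    filled in for a single process (hypothetical helper, illustrative port)."""
    return {
        "MASTER_ADDR": "127.0.0.1",  # rendezvous on the local machine
        "MASTER_PORT": str(port),    # any free TCP port
        "RANK": "0",                 # this is the only process
        "WORLD_SIZE": "1",           # one process in total
        "LOCAL_RANK": "0",           # first (and only) GPU on this node
    }


def init_single_process_group(backend="nccl", port=29500):
    """Create a default process group of size 1 in the current process."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch.distributed as dist

    os.environ.update(single_process_dist_env(port))
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size()
```

Calling `init_single_process_group()` before the model's first forward pass gives `torch.distributed.get_world_size()` a default group to query, which is the call that fails in the traceback. The `nccl` backend needs a CUDA device. Launching through tools/dist_train.sh with one GPU, as suggested above, remains the simpler route because it also sets up the sampler.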
