Open Amireux52 opened 4 months ago
Hi, thanks for your interest in our project.
The error appears to come from synchronized batch norm (SyncBN), which queries the distributed process group and therefore fails when training is not launched in distributed mode. We suggest always starting training with tools/dist_train.sh, even for debugging; you can set the world size (number of GPUs) to 1 so that breakpoints still work. Our dataset sampler also initializes correctly only under DDP.
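For anyone who needs to start the script from an IDE debugger rather than from the shell script, here is a minimal sketch of the "world size 1" idea. It only prepares the environment variables that PyTorch's `env://` initialization reads and then creates a one-process group. The helper names, the `gloo` backend, and port 29500 are illustrative assumptions, not part of this repository; whether tools/train.py accepts a launcher option that performs this initialization for you has not been checked here.

```python
import os


def single_process_dist_env(port=29500):
    """Return the variables torch.distributed's env:// init method reads,
    filled in for a single process (rank 0, world size 1).

    Hypothetical helper for illustration; the port is an arbitrary choice.
    """
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
    }


def init_single_process_group(backend="gloo"):
    """Create a default process group containing only this process.

    Once a default group exists, torch.distributed.get_world_size() returns 1
    instead of raising "Default process group has not been initialized".
    The backend is an assumption; use whatever your setup normally uses.
    """
    import torch.distributed as dist  # imported lazily so the env helper works without torch

    os.environ.update(single_process_dist_env())
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_world_size()
```

The same five variables can also be entered by hand in the debugger's run configuration. This only addresses the missing process group; the dataset sampler mentioned above may still expect the rest of the distributed launch path.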
When I debug, the following problem occurs:
2024-07-18 08:37:26,678 - mmdet - INFO - Checkpoints will be saved to /home/cxh/StreamMapNet/work_dirs/nusc_newsplit_480_60x30_24e by HardDiskBackend.
Backend TkAgg is interactive backend. Turning interactive mode on.
Traceback (most recent call last):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 1500, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/cxh/StreamMapNet/tools/train.py", line 272, in <module>
main()
File "/home/cxh/StreamMapNet/tools/train.py", line 261, in main
custom_train_model(
File "/home/cxh/StreamMapNet/plugin/core/apis/train.py", line 30, in custom_train_model
custom_train_detector(
File "/home/cxh/StreamMapNet/plugin/core/apis/mmdet_train.py", line 203, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 125, in train_step
loss, log_vars, num_samples = self(**data_dict)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/mapers/base_mapper.py", line 93, in forward
return self.forward_train(*args, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/mapers/StreamMapNet.py", line 173, in forward_train
_bev_feats = self.backbone(img, img_metas=img_metas, points=points)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 173, in forward
mlvl_feats = self.extract_img_feat(img=img, img_metas=img_metas)
File "/home/cxh/StreamMapNet/plugin/models/backbones/bevformer_backbone.py", line 144, in extract_img_feat
img_feats = self.img_neck(img_feats)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 157, in forward
laterals = [
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmdet/models/necks/fpn.py", line 158, in <listcomp>
lateral_conv(inputs[i + self.start_level])
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/mmcv/cnn/bricks/conv_module.py", line 209, in forward
x = self.norm(x)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
return _get_group_size(group)
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
default_pg = _get_default_group()
File "/home/cxh/anaconda3/envs/streammapnet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
python-BaseException
Traceback (most recent call last):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_frame.py", line 828, in trace_dispatch
if main_debugger.in_project_scope(frame.f_code.co_filename):
File "/home/cxh/pycharm-community-2023.2.3/plugins/python-ce/helpers/pydev/pydevd.py", line 612, in in_project_scope
return pydevd_utils.in_project_roots(filename)
AttributeError: 'NoneType' object has no attribute 'in_project_roots'
Process finished with exit code 1
How can I resolve this? Thanks, I look forward to your reply.