Open yaosx425 opened 4 months ago
There should be some other info telling you which file and which line make this error. Could you provide more details about the error?
Thank you for your reply! The error message is very long:
(the same traceback was printed by every worker process and interleaved in the log; shown once here)

Traceback (most recent call last):
  File "./tools/train.py", line 199, in <module>
    main()
  File "./tools/train.py", line 129, in main
    init_dist(args.launcher, **cfg.dist_params)
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 41, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/site-packages/mmcv/runner/dist_utils.py", line 64, in _init_dist_pytorch
    dist.init_process_group(backend=backend, **kwargs)
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729138878/work/torch/lib/c10d/ProcessGroupNCCL.cpp:31, unhandled cuda error, NCCL version 2.7.8
Traceback (most recent call last):
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data2/shuxuan_anaconda/envs/openmmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
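When 8 workers crash at once, the log interleaves 8 copies of the same traceback, which is why the output above is so hard to read. As a small stdlib-only sketch (not part of MMRotate), you can deduplicate the final error lines from such a log:

```python
import re

def unique_error_lines(log_text):
    """Collect unique 'SomeError: message' lines from a multi-process log,
    keeping first-seen order, so the primary error surfaces once even when
    every worker prints the same traceback."""
    seen = []
    for match in re.finditer(r"\b([A-Z][A-Za-z]*Error): ([^\n]*)", log_text):
        line = f"{match.group(1)}: {match.group(2).strip()}"
        if line not in seen:
            seen.append(line)
    return seen

# Two workers printing the same failure collapse to one line.
log = (
    "RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:31, unhandled cuda error\n"
    "RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:31, unhandled cuda error\n"
)
print(unique_error_lines(log))
# → ['RuntimeError: NCCL error in: ProcessGroupNCCL.cpp:31, unhandled cuda error']
```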
The main error is "RuntimeError: NCCL error in: ..., unhandled cuda error, NCCL version 2.7.8".
I recommend trying a simpler baseline first, such as faster_rcnn_r50_fpn_1x_coco in MMDetection or rotated_faster_rcnn_r50_fpn_1x_dota_le90 in MMRotate. If the error persists, it is probably because your nvcc CUDA version conflicts with your GPU driver, or mismatches the CUDA build of PyTorch in conda.
Hello, I ran rotated_faster_rcnn_r50_fpn_1x_dota_le90 as you suggested, but it still reports the same error.
Could you provide your GPU name, nvidia-smi CUDA version, nvcc CUDA version, and conda PyTorch version? For example, our GPU environment is:
nvidia-smi
# ...
# Driver Version: 535.171.04 CUDA Version: 12.2
# ...
# NVIDIA A100-SXM4-80GB
# ...
nvcc -V
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2022 NVIDIA Corporation
# Built on Tue_Mar__8_18:18:20_PST_2022
# Cuda compilation tools, release 11.6, V11.6.124
# Build cuda_11.6.r11.6/compiler.31057947_0
conda list
# ...
# mmcv-full 1.6.1 pypi_0 pypi
# mmdet 2.25.1 pypi_0 pypi
# mmengine 0.10.3 pypi_0 pypi
# mmrotate 0.3.3 dev_0 <develop>
# ...
# pytorch 1.12.1 py3.8_cuda11.6_cudnn8.3.2_0 pytorch
# ...
# torchaudio 0.12.1 py38_cu116 pytorch
# torchvision 0.13.1 py38_cu116 pytorch
# ...
Yes, but we still do not know your GPU type (e.g. 2080Ti? A100? 3090? 4090?) and your nvcc CUDA version (with shell command "nvcc -V").
Sorry, I just saw the message. My GPU type is 3090, and my nvcc CUDA version is 11.5 (with a CUDA 10.2 build of PyTorch).
This is exactly where the problem is: nvcc CUDA 11.5 with a CUDA 10.2 build of PyTorch. The CUDA version your PyTorch build was compiled against must match your nvcc CUDA version.
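As a quick sanity check, you can compare the release reported by `nvcc -V` against `torch.version.cuda` (the CUDA version PyTorch was built with). This is a sketch using plain string parsing; `nvcc_release` and `versions_compatible` are illustrative helpers, not part of any library:

```python
import re

def nvcc_release(nvcc_output):
    """Extract the CUDA release (e.g. '11.5') from `nvcc -V` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None

def versions_compatible(nvcc_version, torch_cuda_version):
    """mmcv/mmrotate CUDA ops are compiled with nvcc, so the nvcc release
    should match the CUDA version of the installed PyTorch build."""
    return nvcc_version == torch_cuda_version

# The mismatch from this thread: nvcc 11.5 vs. torch.version.cuda == "10.2".
print(versions_compatible("11.5", "10.2"))  # → False
```

In practice you would feed `nvcc_release` the output of `nvcc -V` and compare it with `torch.version.cuda` directly.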
Thank you for your patient answer!
subprocess.CalledProcessError: Command '['/data2/shuxuan_anaconda/envs/openmmlab/bin/python', '-u', './tools/train.py', '--local_rank=7', './configs/rotated_imted/rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v.py', '--seed', '0', '--launcher', 'pytorch']' returned non-zero exit status 1.
Why do I get the above error when I train with the following commands?
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_train.sh ./configs/rotated_imted/rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v.py 8

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./tools/dist_test.sh ./configs/rotated_imted/rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v.py ./work_dirs/rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v/epoch_12.pth 8 --format-only --eval-options submission_dir="./work_dirs/Task1_rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v_epoch_12/"

python "../DOTA_devkit/dota_evaluation_task1.py" --mergedir "./work_dirs/Task1_rotated_imted_faster_rcnn_vit_small_1x_dota_le90_8h_stdc_xyawh321v_epoch_12/" --imagesetdir "./data/DOTA/val/" --use_07_metric True
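For context on the `--local_rank=7` in the CalledProcessError above: `dist_train.sh` wraps `torch.distributed.launch`, which spawns one `train.py` process per GPU and passes each its rank. The CalledProcessError only says that worker 7 exited with status 1; the real cause is the traceback printed earlier in the log. A rough sketch of how the launcher composes each worker's command line (illustrative, not launch.py's actual internals):

```python
import sys

def worker_command(script, local_rank, script_args):
    # torch.distributed.launch (PyTorch 1.x) runs each worker roughly as:
    #   <python> -u <script> --local_rank=<rank> <remaining args>
    return [sys.executable, "-u", script,
            f"--local_rank={local_rank}", *script_args]

# Reconstructing the failing worker's invocation from this thread:
cmd = worker_command(
    "./tools/train.py", 7,
    ["--seed", "0", "--launcher", "pytorch"])
print(cmd)
```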