Open lqh964165950 opened 1 week ago
Could you please provide more details about the error log and the environment?
Error message:
Using port 22203 for synchronization.
Training command is /home/gxu4090x2/.conda/envs/sod/bin/python3.11 -m torch.distributed.launch --nproc_per_node=1 --master_port=22203 /home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py configs/dior/catnet_r50_3x_dior.py --launcher pytorch.
/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.1.1 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last):
  File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py", line 10, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmengine/config/config.py", line 182, in fromfile
    import_modules_from_strings(**cfg_dict['custom_imports'])
  File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmengine/utils/misc.py", line 84, in import_modules_from_strings
    raise ImportError(f'Failed to import {imp}')
ImportError: Failed to import models
The above exception was the direct cause of the following exception:
PYTHONPATH to make sys.path include the directory which contains your custom module
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 629591) of binary: /home/gxu4090x2/.conda/envs/sod/bin/python3.11
Traceback (most recent call last):
File "Failures:
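Regarding the "Failed to import models" line and the PYTHONPATH hint in the log above, here is a minimal sketch for checking whether a custom module is visible on sys.path. The helper name `module_visible` and its `project_root` parameter are hypothetical, and the sketch assumes that `models` is a package living in the project directory. Adding that directory to sys.path inside the process has roughly the same effect as exporting PYTHONPATH before launching.

```python
import importlib
import importlib.util
import sys
from pathlib import Path
from typing import Optional


def module_visible(name: str, project_root: Optional[str] = None) -> bool:
    """Return True if `name` can be located on sys.path.

    If `project_root` is given, it is prepended to sys.path first,
    which mimics setting PYTHONPATH to that directory.
    """
    if project_root is not None:
        root = str(Path(project_root).resolve())
        if root not in sys.path:
            sys.path.insert(0, root)
    # Make sure newly added directories are picked up by the import system.
    importlib.invalidate_caches()
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing or broken parent package ends up here.
        return False


if __name__ == "__main__":
    # Assumption: run from the project root that contains the `models` package.
    print("models visible:", module_visible("models", project_root="."))
```

If this prints False from the directory that holds the config, the custom module is probably somewhere else, and that location is what PYTHONPATH should point to.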
It looks like the problem is on the mmcv side. Please make sure mmcv is correctly installed.
After running pip show mmcv, I get the following output:
Name: mmcv
Version: 2.0.1
Summary: OpenMMLab Computer Vision Foundation
Home-page: https://github.com/open-mmlab/mmcv
Author: MMCV Contributors
Author-email: openmmlab@gmail.com
License:
Location: /home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages
Requires: addict, mmengine, numpy, opencv-python, packaging, Pillow, pyyaml, yapf
Required-by:
How can I tell whether mmcv is correctly installed?
Your error log says ModuleNotFoundError: No module named 'mmcv._ext', which means the CUDA extensions were not compiled successfully and only the Python part was installed. You may refer to mmcv's repo for details.
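A quick way to check this locally, as a rough sketch: try importing each package in the same conda environment and print the exception if one is raised. `try_import` is a hypothetical helper name, not part of mmcv, and the list of modules is an assumption based on the log above. pip show only confirms that the Python package is present, so an explicit import of mmcv._ext is the more telling test.

```python
import importlib


def try_import(name):
    """Try to import `name`; return (ok, error_text)."""
    try:
        importlib.import_module(name)
    except Exception as exc:
        # Extension modules can fail with ImportError or other errors
        # (for example an ABI mismatch), so catch broadly and report.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # Assumption: these are the packages relevant to the log in this thread.
    for mod in ("numpy", "torch", "mmcv", "mmcv._ext"):
        ok, err = try_import(mod)
        print(f"{mod}: {'OK' if ok else 'FAILED'} {err}")
```

If mmcv imports but mmcv._ext does not, that matches the diagnosis that only the Python part was installed.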
What should I do, given that mmcv is installed correctly?
Please create an issue in mmcv's repo.
subprocess.CalledProcessError: Command '['/home/gxu4090x2/.conda/envs/cat/bin/python3.11', '-m', 'torch.distributed.launch', '--nproc_per_node=1', '--master_port=26968', '/home/gxu4090x2/.conda/envs/cat/lib/python3.11/site-packages/mmdet/.mim/tools/train.py', 'configs/dior/catnet_r50_3x_dior.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
How can I resolve this problem?
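A subprocess.CalledProcessError with exit status 1 only reports that the child process failed. The actual cause is normally in the traceback the child prints to stderr. As a small, generic sketch (the helper name `run_and_report` is hypothetical and not part of mim or mmdet), the following runs a command and returns the last lines of its stderr, which is usually where the real exception appears:

```python
import subprocess
import sys


def run_and_report(cmd, tail_lines=5):
    """Run `cmd` and return (exit_code, last lines of its stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    tail = proc.stderr.strip().splitlines()[-tail_lines:]
    return proc.returncode, tail


if __name__ == "__main__":
    # Stand-in child process that fails the same way a broken import would.
    code, tail = run_and_report(
        [sys.executable, "-c", "raise ImportError('Failed to import models')"]
    )
    print("exit code:", code)
    print("\n".join(tail))
```

For the command in this thread, the same idea applies: look above the CalledProcessError for the child's own traceback, or run the train.py command directly to see it.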