xiuqhou / Salience-DETR

[CVPR 2024] Official implementation of the paper "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement"
https://arxiv.org/abs/2403.16131
Apache License 2.0
105 stars 7 forks

raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #4

Closed z972778371 closed 3 months ago

z972778371 commented 3 months ago

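The first time I ran accelerate main.py, the program got as far as downloading the pretrained resnet50 model, but the download did not finish and it failed with RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory. This may have been a network problem. However, after I closed the terminal and tried to run it again, the program no longer downloads the file and instead fails with the following error: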

[2024-05-08 08:47:13 det.models.backbones.base_backbone]: Backbone architecture: resnet50
Loading extension module MultiScaleDeformableAttention...
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 124, in train
    model = Config(cfg.model_path).model
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in __init__
    exec(code, name_space)
  File "<string>", line 34, in <module>
  File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in __new__
    weights = load_checkpoint(default_weight if weights is None else weights)
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
    return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 124, in train
    model = Config(cfg.model_path).model
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in __init__
    exec(code, name_space)
  File "<string>", line 34, in <module>
  File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in __new__
    weights = load_checkpoint(default_weight if weights is None else weights)
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
    return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 124, in train
    model = Config(cfg.model_path).model
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in __init__
    exec(code, name_space)
  File "<string>", line 34, in <module>
  File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in __new__
    weights = load_checkpoint(default_weight if weights is None else weights)
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
    return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 124, in train
    model = Config(cfg.model_path).model
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in __init__
    exec(code, name_space)
  File "<string>", line 34, in <module>
  File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in __new__
    weights = load_checkpoint(default_weight if weights is None else weights)
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
    return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[2024-05-08 08:47:31,260] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5937 closing signal SIGTERM
[2024-05-08 08:47:31,261] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5938 closing signal SIGTERM
[2024-05-08 08:47:31,877] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 5939) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
  time      : 2024-05-08_08:47:31
  host      : ubuntu-X640-G30
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 5940)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-05-08_08:47:31
  host      : ubuntu-X640-G30
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 5939)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

What should I do now?
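
The symptom above (no new download on the second run, then "failed finding central directory") is what one would expect if the interrupted download left a truncated file in the torch.hub checkpoint cache, since torch.hub.load_state_dict_from_url only downloads when the cached file does not exist. The following is a minimal sketch for spotting such files, not part of this repository: the helper names default_checkpoint_dir and find_broken_checkpoints are hypothetical, and the cache path assumes the default layout (TORCH_HOME, else XDG_CACHE_HOME, else ~/.cache/torch), which may differ if torch.hub.set_dir was called.

```python
import os
import zipfile
from pathlib import Path
from typing import List

# Magic bytes at the start of a zip local file header.
ZIP_MAGIC = b"PK\x03\x04"


def default_checkpoint_dir() -> Path:
    """Guess torch.hub's checkpoint cache directory (assumes default settings)."""
    torch_home = os.environ.get("TORCH_HOME")
    if torch_home is None:
        cache_home = os.environ.get("XDG_CACHE_HOME", os.path.join("~", ".cache"))
        torch_home = os.path.join(cache_home, "torch")
    return Path(torch_home).expanduser() / "hub" / "checkpoints"


def find_broken_checkpoints(directory: Path) -> List[Path]:
    """Return files that start like a zip archive but cannot be read as one.

    A truncated download keeps the zip header at the start of the file but
    loses the central directory at the end, which is the condition the
    PytorchStreamReader error complains about. Files that never were zip
    archives (for example legacy pickle checkpoints) are not reported.
    """
    broken = []
    if not directory.is_dir():
        return broken
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        with open(path, "rb") as handle:
            starts_like_zip = handle.read(len(ZIP_MAGIC)) == ZIP_MAGIC
        if starts_like_zip and not zipfile.is_zipfile(path):
            broken.append(path)
    return broken


if __name__ == "__main__":
    # Only report; deleting is left to the user after checking the list.
    for broken_file in find_broken_checkpoints(default_checkpoint_dir()):
        print(f"possibly truncated checkpoint: {broken_file}")
```

If a resnet50 weight file shows up in the output, deleting it and rerunning should trigger a fresh download. With several processes started at once, it may also help to let the download finish in a single process first.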