Closed: MonkeyTB closed this issue 1 year ago
[2023-04-19 06:55:37,947] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-19 06:55:38,294] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-19 06:55:39,105] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 37386) of binary: /home/jovyan/.conda/envs/glm/bin/python
Traceback (most recent call last):
  File "/home/jovyan/.conda/envs/glm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_deepspeed.py FAILED
Could someone advise how I should go about troubleshooting this problem?
Hi, I'm running into the same problem. Could you share how you solved it? @MonkeyTB
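A possible first step for troubleshooting: the log above reports `exitcode: -7`. For Python child processes, a negative exit code conventionally means the process was terminated by the signal with that number, so decoding it can narrow down the cause. The sketch below is only an illustration; `describe_exitcode` is a hypothetical helper, not part of torch, accelerate, or DeepSpeed.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Explain a child-process exit code (hypothetical helper, not a library API)."""
    if exitcode == 0:
        return "exited normally"
    if exitcode > 0:
        return f"exited with status {exitcode}"
    # Negative code: by convention the process was killed by signal number -exitcode.
    signum = -exitcode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not defined on this platform.
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


if __name__ == "__main__":
    # The exit code reported in the log above.
    print(describe_exitcode(-7))
```

On most Linux systems signal 7 is SIGBUS. In containerized training setups this is often associated with shared memory (`/dev/shm`) running out, but that is an assumption to verify against your own environment, not a confirmed diagnosis for this issue.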