yanqiangmiffy / InstructGLM

ChatGLM-6B instruction learning | instruction data | Instruct
MIT License

torch.distributed.elastic.multiprocessing.errors.ChildFailedError #26

Closed: MonkeyTB closed this issue 1 year ago

MonkeyTB commented 1 year ago
[2023-04-19 06:55:37,947] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-19 06:55:38,294] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-04-19 06:55:39,105] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 37386) of binary: /home/jovyan/.conda/envs/glm/bin/python
Traceback (most recent call last):
  File "/home/jovyan/.conda/envs/glm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/.conda/envs/glm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
train_deepspeed.py FAILED

Could I ask for some advice: how should I go about troubleshooting this problem?

SCAUapc commented 1 year ago

Hi! I ran into the same problem. Could you tell me how you solved it? @MonkeyTB
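
One possible starting point for troubleshooting, offered as a sketch rather than a confirmed diagnosis: the log reports `exitcode: -7`, and a negative exit code from a Python-launched child conventionally means the process was terminated by that signal number. On Linux, signal 7 is usually SIGBUS, which in containerized setups is often (though not always) associated with a too-small shared-memory mount. The helper names `describe_exitcode` and `shm_free_gib` below are hypothetical, and the `/dev/shm` check is an assumption about the environment, not something the thread confirms.

```python
import shutil
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a child exit code into a readable description.

    A negative exit code -N means the child was terminated by signal N
    (the convention used by Python's subprocess and multiprocessing modules).
    """
    if exitcode >= 0:
        return f"exited normally with status {exitcode}"
    signum = -exitcode
    try:
        # Signal numbers are platform-dependent, so look the name up locally.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


def shm_free_gib(path: str = "/dev/shm") -> float:
    """Free space in GiB at the given path (default: the shared-memory mount)."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # Exit code taken from the log above.
    print(describe_exitcode(-7))
    try:
        print(f"/dev/shm free: {shm_free_gib():.2f} GiB")
    except FileNotFoundError:
        print("/dev/shm not found on this system")
```

If the script names SIGBUS and the shared-memory mount has very little free space, enlarging it (for example via the container runtime's shared-memory setting) may be worth trying before digging into `train_deepspeed.py` itself.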