yxuansu / PandaGPT

[TLLM'23] PandaGPT: One Model To Instruction-Follow Them All
https://panda-gpt.github.io/
Apache License 2.0

Deepspeed error: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). #15

Open Dongzhikang opened 1 year ago

Dongzhikang commented 1 year ago

Hi, I am training PandaGPT on 8 V100 GPUs. When I run ./scripts/train.sh, I get the following error:

Traceback (most recent call last):
  File "user/test_panda/PandaGPT/code/train_sft.py", line 97, in <module>
    main(**args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 55, in main
    config_env(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 45, in config_env
    initialize_distributed(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 29, in initialize_distributed
    deepspeed.init_distributed(dist_backend='nccl')
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:28457 (errno: 98 - Address already in use).
[2023-08-10 15:50:31,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15743
[2023-08-10 15:50:31,172] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15744
[2023-08-10 15:50:31,180] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15745
[2023-08-10 15:50:31,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15746
[2023-08-10 15:50:31,239] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15747
[2023-08-10 15:50:31,291] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15748
[2023-08-10 15:50:31,344] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15749
[2023-08-10 15:50:31,396] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15750

Do you have any idea how to solve this? Thank you so much!

gmftbyGMFTBY commented 1 year ago

Hi, according to the error log, port 28457 is already in use in your environment (errno 98), so the TCPStore rendezvous cannot bind to it. You could select another free port for running the code.
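In case it helps, here is a minimal sketch for checking whether port 28457 is actually taken and for picking a free one. It uses only Python's standard `socket` module. The helper names `is_port_free` and `find_free_port` are made up for this illustration and are not part of PandaGPT or DeepSpeed. Note that the check is only a snapshot: another process could still grab the port between the check and the training launch.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical helper for illustration; a failed bind (for example
    errno 98, "Address already in use") is reported as "not free".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0.

    Hypothetical helper for illustration; the socket is closed before
    returning, so the port is free only at the moment of the call.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 28457 is the port reported in the error message above.
    print("port 28457 free:", is_port_free(28457))
    print("suggested free port:", find_free_port())
```

The port printed by the script can then be passed to the launcher. The DeepSpeed launcher accepts a `--master_port` argument; I am assuming here that `./scripts/train.sh` is where the port is set, so please check the script itself. If port 28457 is held by a leftover process from an earlier training run that did not exit cleanly, stopping that process should also free the port.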