Deepspeed error: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use).
#15 · Open
Dongzhikang opened this issue on Aug 10, 2023 · 1 comment
Hi, I am training PandaGPT on 8 V100 GPUs. When I run `./scripts/train.sh`, I get the following error:
```
Traceback (most recent call last):
  File "user/test_panda/PandaGPT/code/train_sft.py", line 97, in <module>
    main(**args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 55, in main
    config_env(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 45, in config_env
    initialize_distributed(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 29, in initialize_distributed
    deepspeed.init_distributed(dist_backend='nccl')
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:28457 (errno: 98 - Address already in use).
[2023-08-10 15:50:31,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15743
[2023-08-10 15:50:31,172] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15744
[2023-08-10 15:50:31,180] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15745
[2023-08-10 15:50:31,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15746
[2023-08-10 15:50:31,239] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15747
[2023-08-10 15:50:31,291] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15748
[2023-08-10 15:50:31,344] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15749
[2023-08-10 15:50:31,396] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15750
```
Do you have any idea how to solve this? Thank you so much!
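In case it helps: `errno 98 - Address already in use` means the rendezvous port (28457 here) is still bound, usually by a stale process left over from an earlier run. Finding and killing the holder (e.g. `lsof -i :28457`), or relaunching with a different port via the DeepSpeed launcher's `--master_port` flag, typically clears it. As a minimal sketch (not part of PandaGPT's code), you can also let the OS pick an unused port before `deepspeed.init_distributed` runs, since the env rendezvous reads `MASTER_PORT`:

```python
import os
import socket

def find_free_port() -> int:
    # Binding to port 0 makes the OS assign an unused TCP port;
    # we read it back and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Hypothetical usage: set MASTER_PORT before initializing distributed
# training. Only the launcher should choose the port, so that every
# rank rendezvouses on the same value.
os.environ.setdefault("MASTER_PORT", str(find_free_port()))
print(os.environ["MASTER_PORT"])
```

Note that a port freed this way is not reserved, so in rare cases another process can grab it before training starts; passing an explicit known-free `--master_port` is the more deterministic fix.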