Deepspeed error: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). #15

Open
Dongzhikang opened this issue Aug 10, 2023 · 1 comment

Comments

@Dongzhikang

Hi, I am training PandaGPT on 8 V100 GPUs. When I run ./scripts/train.sh, I get the following error:

Traceback (most recent call last):
  File "user/test_panda/PandaGPT/code/train_sft.py", line 97, in <module>
    main(**args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 55, in main
    config_env(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 45, in config_env
    initialize_distributed(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 29, in initialize_distributed
    deepspeed.init_distributed(dist_backend='nccl')
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:28457 (errno: 98 - Address already in use).
[2023-08-10 15:50:31,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15743
[2023-08-10 15:50:31,172] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15744
[2023-08-10 15:50:31,180] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15745
[2023-08-10 15:50:31,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15746
[2023-08-10 15:50:31,239] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15747
[2023-08-10 15:50:31,291] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15748
[2023-08-10 15:50:31,344] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15749
[2023-08-10 15:50:31,396] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15750

Do you have any idea how to solve this? Thank you so much!

@gmftbyGMFTBY
Collaborator

Hi, according to the error log, port 28457 is already in use in your environment (errno 98, "Address already in use"), so the TCPStore server cannot bind to it. Please choose another free port as the master port and run the code again.
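
For anyone hitting the same error, here is a minimal sketch for checking whether a port is taken and for asking the OS for a free one. It uses only Python's standard `socket` module. The helper names `port_is_free` and `find_free_port` are made up for this example and are not part of PandaGPT, DeepSpeed, or PyTorch.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # e.g. errno 98 (EADDRINUSE) on Linux when the port is taken
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # 28457 is the port reported in the error message above.
    print("28457 free:", port_is_free(28457))
    print("suggested port:", find_free_port())
```

The DeepSpeed launcher accepts a `--master_port` option, so if `scripts/train.sh` passes `--master_port 28457` (an assumption, please check the script), replacing that value with a free port should avoid the conflict. A port returned by `find_free_port` can in principle be taken by another process before training starts, so treat it as a suggestion. It is also worth checking whether a leftover process from an earlier run is still holding port 28457.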
