Deepspeed error: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use).
#15 · Open
Dongzhikang opened this issue on Aug 10, 2023 · 1 comment
Hi, I am training PandaGPT on 8 V100 GPUs. When I run `./scripts/train.sh`, I get the following error:
```
Traceback (most recent call last):
  File "user/test_panda/PandaGPT/code/train_sft.py", line 97, in <module>
    main(**args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 55, in main
    config_env(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 45, in config_env
    initialize_distributed(args)
  File "user/test_panda/PandaGPT/code/train_sft.py", line 29, in initialize_distributed
    deepspeed.init_distributed(dist_backend='nccl')
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 624, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 60, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 86, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/tiger/.local/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:28457 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:28457 (errno: 98 - Address already in use).
[2023-08-10 15:50:31,171] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15743
[2023-08-10 15:50:31,172] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15744
[2023-08-10 15:50:31,180] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15745
[2023-08-10 15:50:31,187] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15746
[2023-08-10 15:50:31,239] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15747
[2023-08-10 15:50:31,291] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15748
[2023-08-10 15:50:31,344] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15749
[2023-08-10 15:50:31,396] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 15750
```
Do you have any idea how to solve this? Thank you so much!
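In case it helps: `errno 98 - Address already in use` means the rendezvous port (28457 here) is still bound, usually by a stale process left over from an earlier run. Finding and killing the holder (e.g. `lsof -i :28457`), or relaunching with a different port via the DeepSpeed launcher's `--master_port` flag, typically clears it. As a minimal sketch (not part of PandaGPT's code), you can also let the OS pick an unused port before `deepspeed.init_distributed` runs, since the env rendezvous reads `MASTER_PORT`:

```python
import os
import socket

def find_free_port() -> int:
    # Binding to port 0 makes the OS assign an unused TCP port;
    # we read it back and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Hypothetical usage: set MASTER_PORT before initializing distributed
# training. Only the launcher should choose the port, so that every
# rank rendezvouses on the same value.
os.environ.setdefault("MASTER_PORT", str(find_free_port()))
print(os.environ["MASTER_PORT"])
```

Note that a port freed this way is not reserved, so in rare cases another process can grab it before training starts; passing an explicit known-free `--master_port` is the more deterministic fix.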