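The code seems to run only on a single machine with 3 GPUs and fails otherwise. Can it be changed to run on a single machine with a single GPU? #1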

Open
springfall2018 opened this issue Nov 16, 2023 · 0 comments

Comments

@springfall2018

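Hello. Your code appears to run only on a single machine with 3 GPUs; otherwise it fails.
For example, I ran the following on Google Colab:
!python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0,1,3

After training for a long time, it exits with a timeout error because the server stops receiving messages from the clients. Changing the command to !python fed_seed_run.py /content/drive/MyDrive/workspace/ fedavg mrpc adapter 1100 0 also fails.
Colab provides only one GPU. Is there a way to run the code on a single-GPU machine? (Or do you know of a cloud computing platform that offers multiple GPUs?)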
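
To make the single-GPU request concrete, here is a minimal sketch of what I have in mind: folding the requested GPU ids onto the devices that actually exist, so that a list like 0,1,3 becomes 0,0,0 on Colab. The helpers `parse_gpu_arg` and `remap_gpu_ids` are hypothetical and not part of FedETuning. I have not verified that fed_seed_run.py accepts repeated ids, or that several processes fit in one GPU's memory.

```python
def parse_gpu_arg(arg):
    """Parse a comma-separated GPU list such as '0,1,3' into integers."""
    return [int(part) for part in arg.split(",") if part.strip()]


def remap_gpu_ids(requested, available_count):
    """Fold each requested GPU id onto a device that exists (hypothetical helper).

    Ids wrap around with modulo, so every process still gets a device
    even when fewer GPUs are present than were requested.
    """
    if available_count < 1:
        raise ValueError("at least one GPU is required")
    return [gpu_id % available_count for gpu_id in requested]


if __name__ == "__main__":
    requested = parse_gpu_arg("0,1,3")
    # With a single GPU, every process is assigned device 0.
    print(remap_gpu_ids(requested, available_count=1))
```

If the launcher does assign one process per listed id, passing the remapped list might let all processes share device 0, at the cost of slower training and higher memory pressure.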

11-13/23:28:18|INFO |base_client.py:292|MRPC Train, Client:14, Loss:0.318, Accuracy:0.939
11-13/23:28:36|INFO |base_client.py:292|MRPC Train, Client:3, Loss:0.313, Accuracy:0.970
11-13/23:28:54|INFO |base_client.py:292|MRPC Train, Client:35, Loss:0.324, Accuracy:0.939
11-13/23:29:13|INFO |base_client.py:292|MRPC Train, Client:31, Loss:0.309, Accuracy:0.970
Traceback (most recent call last):
File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 20, in <module>
main()
File "/content/drive/MyDrive/workspace/code/FedETuning/main.py", line 16, in main
trainer.train()
File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/FedBaseTrainer.py", line 91, in train
self.server_manger.run()
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network_manager.py", line 38, in run
self.main_loop()
File "/content/drive/MyDrive/workspace/code/FedETuning/trainers/BaseServer/base_server.py", line 253, in main_loop
sender_rank, message_code, payload = self._network.recv()
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/network.py", line 102, in recv
sender_rank, message_code, content = PackageProcessor.recv_package(
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 118, in recv_package
sender_rank, _, slices_size, message_code, data_type = recv_header(
File "/content/drive/MyDrive/workspace/code/FedETuning/fedlab/core/communicator/processor.py", line 96, in recv_header
dist.recv(buffer, src=src)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1632, in recv
work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 18000000ms for recv operation to complete
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
sh: 0: getcwd() failed: Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
python3: can't open file 'main.py': [Errno 107] Transport endpoint is not connected
Exception ignored in: <function Pool.__del__ at 0x7cf9930e17e0>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 271, in __del__
File "/usr/lib/python3.10/multiprocessing/queues.py", line 371, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
