Hi,

I am running code based on this FL test code: https://github.com/dmlc/xgboost/tree/master/tests/test_distributed/test_federated

Using XGBoost 2.0.0 with `world_size=5`:
```
[18:28:59] Insecure federated server listening on 0.0.0.0:9091, world size 5
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:04] [0] eval-logloss:0.22646 train-logloss:0.23316
[18:29:06] [1] eval-logloss:0.13776 train-logloss:0.13654
[18:29:07] [2] eval-logloss:0.08036 train-logloss:0.08243
[18:29:09] [3] eval-logloss:0.05830 train-logloss:0.05645
[18:29:11] [4] eval-logloss:0.03825 train-logloss:0.04148
[18:29:12] [5] eval-logloss:0.02660 train-logloss:0.02958
[18:29:14] [6] eval-logloss:0.01386 train-logloss:0.01918
[18:29:16] [7] eval-logloss:0.01018 train-logloss:0.01331
[18:29:17] [8] eval-logloss:0.00847 train-logloss:0.01112
[18:29:19] [9] eval-logloss:0.00691 train-logloss:0.00662
[18:29:21] [10] eval-logloss:0.00543 train-logloss:0.00503
[18:29:23] [11] eval-logloss:0.00445 train-logloss:0.00420
[18:29:24] [12] eval-logloss:0.00336 train-logloss:0.00355
[18:29:26] [13] eval-logloss:0.00277 train-logloss:0.00280
[18:29:28] [14] eval-logloss:0.00252 train-logloss:0.00244
[18:29:30] [15] eval-logloss:0.00177 train-logloss:0.00193
[18:29:31] [16] eval-logloss:0.00156 train-logloss:0.00161
[18:29:33] [17] eval-logloss:0.00135 train-logloss:0.00142
[18:29:35] [18] eval-logloss:0.00123 train-logloss:0.00125
[18:29:37] [19] eval-logloss:0.00106 train-logloss:0.00107
[18:29:37] Finished training
```
⚠️ ❗ However, after upgrading to XGBoost 2.1.0, the same code results in the following:
```
[18:33:03] Insecure federated server listening on 0.0.0.0:5, world size 9091
E0628 18:33:03.598596287 331756 chttp2_server.cc:1053] UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:5' {file:"/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc", file_line:963, created_time:"2024-06-28T18:33:03.597969893+02:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-06-28T18:33:03.597933892+02:00", file_line:363, file:"/grpc/src/core/lib/iomgr/tcp_server_posix.cc", children:[UNKNOWN:Unable to configure socket {fd:8, created_time:"2024-06-28T18:33:03.597880883+02:00", file_line:220, file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", children:[UNKNOWN:Permission denied {syscall:"bind", os_error:"Permission denied", errno:13, created_time:"2024-06-28T18:33:03.597840858+02:00", file_line:194, file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc"}]}, UNKNOWN:Unable to configure socket {file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", file_line:220, created_time:"2024-06-28T18:33:03.597922845+02:00", fd:8, children:[UNKNOWN:Permission denied {file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", file_line:194, created_time:"2024-06-28T18:33:03.597916832+02:00", errno:13, os_error:"Permission denied", syscall:"bind"}]}]}]}
[18:33:04] Rank 0
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
```
Each of the remaining workers (Process-3 through Process-6) prints the same `[18:33:04] Rank 0` line followed by an identical traceback ending in `AssertionError`.
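Note that the server banner shows the two positional arguments swapped relative to the 2.0.0 run: the script passes `(9091, 5)`, yet 2.1.0 reports "listening on 0.0.0.0:5, world size 9091", and binding port 5 (a privileged port) then fails with `errno:13`. This is consistent with the first two parameters of `run_federated_server` having been reordered between releases. A minimal sketch of that reinterpretation, using hypothetical stand-in functions (the names below are assumptions, not the real API; check the installed `xgboost.federated` module for the actual 2.1.0 parameter names):

```python
# Hypothetical stand-ins showing how the same positional call
# run_federated_server(9091, 5) would be reinterpreted if the first
# two parameters were reordered between releases:
def run_server_old(port, world_size):
    # 2.0.0-style order assumed by the test script
    return {"port": port, "world_size": world_size}

def run_server_new(n_workers, port):
    # 2.1.0-style order suggested by the server banner
    return {"port": port, "world_size": n_workers}

before = run_server_old(9091, 5)
after = run_server_new(9091, 5)
print(before)  # {'port': 9091, 'world_size': 5}   -> matches the 2.0.0 banner
print(after)   # {'port': 5, 'world_size': 9091}   -> matches the 2.1.0 banner
```

If this is indeed the cause, passing both values by keyword (once the real 2.1.0 parameter names are confirmed) would make the call unambiguous across versions.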
For reference, the adapted test_federated.py is:
```python
#!/usr/bin/python
import multiprocessing
import sys
import time

import xgboost as xgb
import xgboost.federated

SERVER_KEY = 'server-key.pem'
SERVER_CERT = 'server-cert.pem'
CLIENT_KEY = 'client-key.pem'
CLIENT_CERT = 'client-cert.pem'


def run_server(port: int, world_size: int, with_ssl: bool) -> None:
    if with_ssl:
        xgboost.federated.run_federated_server(port, world_size, SERVER_KEY, SERVER_CERT,
                                               CLIENT_CERT)
    else:
        xgboost.federated.run_federated_server(port, world_size)


def run_worker(port: int, world_size: int, rank: int, with_ssl: bool, with_gpu: bool) -> None:
    communicator_env = {
        'xgboost_communicator': 'federated',
        'federated_server_address': f'localhost:{port}',
        'federated_world_size': world_size,
        'federated_rank': rank
    }
    if with_ssl:
        communicator_env['federated_server_cert'] = SERVER_CERT
        communicator_env['federated_client_key'] = CLIENT_KEY
        communicator_env['federated_client_cert'] = CLIENT_CERT

    # Always call this before using distributed module
    with xgb.collective.CommunicatorContext(**communicator_env):
        # Load file, file will not be sharded in federated mode.
        dtrain = xgb.DMatrix('agaricus.txt.train-%02d?format=libsvm' % rank)
        dtest = xgb.DMatrix('agaricus.txt.test-%02d?format=libsvm' % rank)

        # Specify parameters via map, definition are same as c++ version
        param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
        if with_gpu:
            param['tree_method'] = 'hist'
            param['device'] = f"cuda:{rank}"

        # Specify validations set to watch performance
        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        num_round = 20

        # Run training, all the features in training API is available.
        bst = xgb.train(param, dtrain, num_round, evals=watchlist,
                        early_stopping_rounds=2)

        # Save the model, only ask process 0 to save the model.
        if xgb.collective.get_rank() == 0:
            bst.save_model("test.model.json")
        xgb.collective.communicator_print("Finished training\n")


def run_federated(with_ssl: bool = True, with_gpu: bool = False) -> None:
    port = 9091
    world_size = int(sys.argv[1])

    server = multiprocessing.Process(target=run_server, args=(port, world_size, with_ssl))
    server.start()
    time.sleep(1)
    if not server.is_alive():
        raise Exception("Error starting Federated Learning server")

    workers = []
    for rank in range(world_size):
        worker = multiprocessing.Process(target=run_worker,
                                         args=(port, world_size, rank, with_ssl, with_gpu))
        workers.append(worker)
        worker.start()
    for worker in workers:
        worker.join()
    server.terminate()


if __name__ == '__main__':
    run_federated(with_ssl=False, with_gpu=False)
```
And the adapted shell script:
```bash
#!/bin/bash
# world_size=$(nvidia-smi -L | wc -l)
world_size=$1

# Split train and test files manually to simulate a federated environment.
split -n l/"${world_size}" -d agaricus.txt.train agaricus.txt.train-
split -n l/"${world_size}" -d agaricus.txt.test agaricus.txt.test-

python test_federated.py "${world_size}"
```
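For environments without GNU `split`, the sharding step can be approximated in Python. This sketch splits a text file into `world_size` contiguous, line-aligned shards with the same `-00`, `-01` suffixes the script above relies on; note that `split -n l/N` balances chunks by bytes while this sketch balances by line count, so shard boundaries may differ slightly:

```python
import os
import tempfile

def split_file(path: str, n: int) -> list:
    # Split `path` into n contiguous shards, never breaking a line,
    # written as path-00, path-01, ... (like `split -d` numbering).
    with open(path) as f:
        lines = f.readlines()
    per_shard = -(-len(lines) // n)  # ceiling division
    shards = []
    for i in range(n):
        shard = f"{path}-{i:02d}"
        with open(shard, "w") as out:
            out.writelines(lines[i * per_shard:(i + 1) * per_shard])
        shards.append(shard)
    return shards

# Demo on a throwaway 10-line file split 5 ways:
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "agaricus.txt.train")
with open(src, "w") as f:
    f.write("".join(f"row {i}\n" for i in range(10)))
shards = split_file(src, 5)
rejoined = "".join(open(s).read() for s in shards)
```

Concatenating the shards in order reproduces the original file, which is the property the federated test depends on.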
🚧 ⌛ Interim solution: Downgrade to XGBoost 2.0.0
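Concretely, the workaround can be expressed as a `requirements.txt`-style pin until the 2.1.x federated server API change is accounted for in the test script:

```
xgboost==2.0.0
```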