Federated Learning not working with 2.1.0 #10500

Closed
gubertoli opened this issue Jun 28, 2024 · 1 comment · Fixed by #10503
@gubertoli
Hi,
I am running code based on this federated learning test: https://github.com/dmlc/xgboost/tree/master/tests/test_distributed/test_federated

Using XGBoost 2.0.0 with world_size=5:

[18:28:59] Insecure federated server listening on 0.0.0.0:9091, world size 5
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:00] XGBoost federated mode detected, not splitting data among workers
[18:29:04] [0]  eval-logloss:0.22646    train-logloss:0.23316
[18:29:06] [1]  eval-logloss:0.13776    train-logloss:0.13654
[18:29:07] [2]  eval-logloss:0.08036    train-logloss:0.08243
[18:29:09] [3]  eval-logloss:0.05830    train-logloss:0.05645
[18:29:11] [4]  eval-logloss:0.03825    train-logloss:0.04148
[18:29:12] [5]  eval-logloss:0.02660    train-logloss:0.02958
[18:29:14] [6]  eval-logloss:0.01386    train-logloss:0.01918
[18:29:16] [7]  eval-logloss:0.01018    train-logloss:0.01331
[18:29:17] [8]  eval-logloss:0.00847    train-logloss:0.01112
[18:29:19] [9]  eval-logloss:0.00691    train-logloss:0.00662
[18:29:21] [10] eval-logloss:0.00543    train-logloss:0.00503
[18:29:23] [11] eval-logloss:0.00445    train-logloss:0.00420
[18:29:24] [12] eval-logloss:0.00336    train-logloss:0.00355
[18:29:26] [13] eval-logloss:0.00277    train-logloss:0.00280
[18:29:28] [14] eval-logloss:0.00252    train-logloss:0.00244
[18:29:30] [15] eval-logloss:0.00177    train-logloss:0.00193
[18:29:31] [16] eval-logloss:0.00156    train-logloss:0.00161
[18:29:33] [17] eval-logloss:0.00135    train-logloss:0.00142
[18:29:35] [18] eval-logloss:0.00123    train-logloss:0.00125
[18:29:37] [19] eval-logloss:0.00106    train-logloss:0.00107
[18:29:37] Finished training

⚠️ ❗ However, after upgrading to XGBoost 2.1.0, the same code results in the following:

[18:33:03] Insecure federated server listening on 0.0.0.0:5, world size 9091
E0628 18:33:03.598596287  331756 chttp2_server.cc:1053]      UNKNOWN:No address added out of total 1 resolved for '0.0.0.0:5' {file:"/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc", file_line:963, created_time:"2024-06-28T18:33:03.597969893+02:00", children:[UNKNOWN:Failed to add any wildcard listeners {created_time:"2024-06-28T18:33:03.597933892+02:00", file_line:363, file:"/grpc/src/core/lib/iomgr/tcp_server_posix.cc", children:[UNKNOWN:Unable to configure socket {fd:8, created_time:"2024-06-28T18:33:03.597880883+02:00", file_line:220, file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", children:[UNKNOWN:Permission denied {syscall:"bind", os_error:"Permission denied", errno:13, created_time:"2024-06-28T18:33:03.597840858+02:00", file_line:194, file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc"}]}, UNKNOWN:Unable to configure socket {file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", file_line:220, created_time:"2024-06-28T18:33:03.597922845+02:00", fd:8, children:[UNKNOWN:Permission denied {file:"/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc", file_line:194, created_time:"2024-06-28T18:33:03.597916832+02:00", errno:13, os_error:"Permission denied", syscall:"bind"}]}]}]}
[18:33:04] Rank 0
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
[18:33:04] Rank 0
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
[18:33:04] Rank 0
Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
[18:33:04] Rank 0
Process Process-5:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
[18:33:04] Rank 0
Process Process-6:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../refs/test_federated.py", line 36, in run_worker
    with xgb.collective.CommunicatorContext(**communicator_env):
  File ".../.venv/lib/python3.10/site-packages/xgboost/collective.py", line 280, in __enter__
    assert is_distributed()
AssertionError
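Note the banner in the failing run prints the port and world size swapped relative to 2.0.0 (`0.0.0.0:5, world size 9091` instead of `0.0.0.0:9091, world size 5`), which suggests the first two positional parameters of `run_federated_server` changed order between releases. A minimal sketch of how such a swap would produce exactly this log line (the `banner` functions below are stand-ins, not xgboost code):

```python
def banner(port: int, world_size: int) -> str:
    # Stand-in for the 2.0.0-style startup log line: (port, world_size).
    return f"Insecure federated server listening on 0.0.0.0:{port}, world size {world_size}"


def banner_swapped(world_size: int, port: int) -> str:
    # Same log line, but with the first two parameters in the opposite order.
    return f"Insecure federated server listening on 0.0.0.0:{port}, world size {world_size}"


# A 2.0.0-style caller passes (port=9091, world_size=5) positionally.
args = (9091, 5)

print(banner(*args))          # port=9091, world size=5, as in the 2.0.0 run
print(banner_swapped(*args))  # port=5, world size=9091, as in the 2.1.0 run
```

Port 5 is a privileged port, which also explains the subsequent gRPC `bind: Permission denied` (errno 13) and the workers failing to connect.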

For reference, the adapted test_federated.py is:

#!/usr/bin/python
import multiprocessing
import sys
import time

import xgboost as xgb
import xgboost.federated

SERVER_KEY = 'server-key.pem'
SERVER_CERT = 'server-cert.pem'
CLIENT_KEY = 'client-key.pem'
CLIENT_CERT = 'client-cert.pem'


def run_server(port: int, world_size: int, with_ssl: bool) -> None:
    if with_ssl:
        xgboost.federated.run_federated_server(port, world_size, SERVER_KEY, SERVER_CERT,
                                               CLIENT_CERT)
    else:
        xgboost.federated.run_federated_server(port, world_size)


def run_worker(port: int, world_size: int, rank: int, with_ssl: bool, with_gpu: bool) -> None:
    communicator_env = {
        'xgboost_communicator': 'federated',
        'federated_server_address': f'localhost:{port}',
        'federated_world_size': world_size,
        'federated_rank': rank
    }
    if with_ssl:
        communicator_env['federated_server_cert'] = SERVER_CERT
        communicator_env['federated_client_key'] = CLIENT_KEY
        communicator_env['federated_client_cert'] = CLIENT_CERT

    # Always call this before using the distributed module.
    with xgb.collective.CommunicatorContext(**communicator_env):
        # Load the data file; it will not be sharded in federated mode.
        dtrain = xgb.DMatrix('agaricus.txt.train-%02d?format=libsvm' % rank)
        dtest = xgb.DMatrix('agaricus.txt.test-%02d?format=libsvm' % rank)

        # Specify parameters via a dict; definitions are the same as in the C++ version.
        param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
        if with_gpu:
            param['tree_method'] = 'hist'
            param['device'] = f"cuda:{rank}"

        # Specify validation sets to watch performance.
        watchlist = [(dtest, 'eval'), (dtrain, 'train')]
        num_round = 20

        # Run training; all the features of the training API are available.
        bst = xgb.train(param, dtrain, num_round, evals=watchlist,
                        early_stopping_rounds=2)

        # Only ask rank 0 to save the model.
        if xgb.collective.get_rank() == 0:
            bst.save_model("test.model.json")
            xgb.collective.communicator_print("Finished training\n")


def run_federated(with_ssl: bool = True, with_gpu: bool = False) -> None:
    port = 9091
    world_size = int(sys.argv[1])

    server = multiprocessing.Process(target=run_server, args=(port, world_size, with_ssl))
    server.start()
    time.sleep(1)
    if not server.is_alive():
        raise Exception("Error starting Federated Learning server")

    workers = []
    for rank in range(world_size):
        worker = multiprocessing.Process(target=run_worker,
                                         args=(port, world_size, rank, with_ssl, with_gpu))
        workers.append(worker)
        worker.start()
    for worker in workers:
        worker.join()
    server.terminate()


if __name__ == '__main__':
    run_federated(with_ssl=False, with_gpu=False)

And the adapted shell script:

#!/bin/bash

# world_size=$(nvidia-smi -L | wc -l)
world_size=$1

# Split train and test files manually to simulate a federated environment.
split -n l/"${world_size}" -d agaricus.txt.train agaricus.txt.train-
split -n l/"${world_size}" -d agaricus.txt.test agaricus.txt.test-

python test_federated.py "${world_size}"

🚧 ⌛ Interim solution: Downgrade to XGBoost 2.0.0

@gubertoli
Author

gubertoli commented Aug 18, 2024

@trivialfis I tested with 2.1.0 and 2.1.1 and the issue persists: #10716
