Support FATE v1.11.2 (#898)
* Update version tag

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* update FATE config

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Add all algorithm adaptations and fix container permissions issue

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Fix docker_deploy.sh --delete when serving_ip_list does not exist

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Update doc of docker compose

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Update chart support llm

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Update fluentd to fluent-bit

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Update docs,
fixed #892

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Fix spark image suffix

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* Remove the LLM tag

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

* add volume for llm

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>

---------

Signed-off-by: Chenlong Ma <chenlongm@vmware.com>
owlet42 authored Jul 3, 2023
1 parent 6bd9ddc commit e552f5a
Showing 62 changed files with 667 additions and 331 deletions.
2 changes: 1 addition & 1 deletion docker-deploy/.env
@@ -1,5 +1,5 @@
RegistryURI=
TAG=1.11.1-release
TAG=1.11.2-release
SERVING_TAG=2.1.6-release
SSH_PORT=22

37 changes: 23 additions & 14 deletions docker-deploy/README.md
@@ -10,7 +10,7 @@ The nodes (target nodes) to install FATE must meet the following requirements:
2. Docker: 19.03.0+
3. Docker Compose: 1.27.0+
4. The deployment machine has access to the Internet, and the hosts can communicate with each other;
5. Network connection to the Internet to pull container images from Docker Hub. If no Internet connection is available, consider setting up [Harbor as a local registry](../registry/README.md) or using [offline images](https://github.com/FederatedAI/FATE/tree/master/build/docker-build).
5. Network connection to the Internet to pull container images from Docker Hub. If no Internet connection is available, consider setting up [Harbor as a local registry](../registry/README.md) or using [offline images](https://github.com/FederatedAI/FATE-Builder/tree/main/docker-build).
6. A host running FATE is recommended to have 8 CPUs and 16 GB of RAM.

## Deploying FATE
@@ -175,21 +175,30 @@ bash ./docker_deploy.sh 10000
bash ./docker_deploy.sh exchange
```

Once the commands finish, log in to any host and use `docker ps` to verify the status of the cluster. A sample output is as follows:
Once the commands finish, log in to any host and use `docker compose ps` to verify the status of the cluster. A sample output is as follows:

```bash
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5d2e84ba4c77 federatedai/serving-server:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp serving-9999_serving-server_1
3dca43f3c9d5 federatedai/serving-admin:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8350->8350/tcp, :::8350->8350/tcp serving-9999_serving-admin_1
fe924918509b federatedai/serving-proxy:2.1.5-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp serving-9999_serving-proxy_1
b62ed8ba42b7 bitnami/zookeeper:3.7.0 "/opt/bitnami/script…" 5 minutes ago Up 5 minutes 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp, 0.0.0.0:49226->2888/tcp, :::49226->2888/tcp, 0.0.0.0:49225->3888/tcp, :::49225->3888/tcp serving-9999_serving-zookeeper_1
3c643324066f federatedai/client:1.11.1-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.11.1-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.11.1-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
aa0a0002de93 mysql:8.0.28 "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 3306/tcp, 33060/tcp confs-9999_mysql_1
ssh fate@192.168.7.1
```

Verify the instance status using the following commands:

```bash
cd /data/projects/fate/confs-10000
docker compose ps
```

The output looks like the following. If every component has the status `Up` and fateflow is additionally reported as `(healthy)`, the deployment succeeded.

```bash
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
confs-10000-client-1 federatedai/client:1.11.2-release "bash -c 'pipeline i…" client About a minute ago Up About a minute 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp
confs-10000-clustermanager-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" clustermanager About a minute ago Up About a minute 4670/tcp
confs-10000-fateboard-1 federatedai/fateboard:1.11.2-release "/bin/sh -c 'java -D…" fateboard About a minute ago Up About a minute 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
confs-10000-fateflow-1 federatedai/fateflow:1.11.2-release "/bin/bash -c 'set -…" fateflow About a minute ago Up About a minute (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp
confs-10000-mysql-1 mysql:8.0.28 "docker-entrypoint.s…" mysql About a minute ago Up About a minute 3306/tcp, 33060/tcp
confs-10000-nodemanager-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" nodemanager About a minute ago Up About a minute 4671/tcp
confs-10000-rollsite-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" rollsite About a minute ago Up About a minute 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp
```
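
The check described above (every row `Up`, fateflow also `(healthy)`) can be scripted. The following is a minimal sketch and not part of KubeFATE: `check_fate_status` is a hypothetical helper name, and it assumes the default table output of `docker compose ps`, with a header row and a row containing the word `fateflow`.

```bash
# Hypothetical helper (not part of KubeFATE): reads `docker compose ps`
# output on stdin and returns 0 only if every service row is "Up" and
# the fateflow row is additionally marked "(healthy)".
check_fate_status() {
  awk '
    NR == 1 || NF == 0 { next }                # skip the header row and blank lines
    !/ Up /            { bad = 1 }             # a row that is not Up fails the check
    /fateflow/         { seen = 1 }            # remember that fateflow was listed
    /fateflow/ && !/\(healthy\)/ { bad = 1 }   # fateflow must also be healthy
    END { exit ((bad || !seen) ? 1 : 0) }
  '
}
```

One possible use, run from the `confs-<party-id>` directory, is `docker compose ps | check_fate_status && echo "deployment looks healthy"`. The helper only inspects text, so it can also be tried against saved output.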

### Verifying the deployment
82 changes: 58 additions & 24 deletions docker-deploy/README_zh.md
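@@ -20,7 +20,7 @@ Compose is a tool for defining and running multi-container Docker applications. With Comp
2. Docker installed on all hosts, version 19.03.0+;
3. Docker Compose installed on all hosts, version 1.27.0+;
4. The deployment machine has Internet access, and all hosts can reach each other over the network;
5. The target hosts have already downloaded the FATE component images. If Docker Hub is unreachable, consider using Harbor ([Harbor as a local registry](../registry/README.md)) or an offline deployment (to build images offline, see [building images](https://github.com/FederatedAI/FATE/tree/master/build/docker-build)).
5. The target hosts have already downloaded the FATE component images. If Docker Hub is unreachable, consider using Harbor ([Harbor as a local registry](../registry/README.md)) or an offline deployment (to build images offline, see [building images](https://github.com/FederatedAI/FATE-Builder/tree/main/docker-build)).
6. Hosts running FATE are recommended to have 8 CPUs and 16 GB of RAM.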

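### Downloading the deployment scripts
@@ -171,44 +171,73 @@ GPU support in FATE is used only by the fateflow component, so each party needs at least one GP

### Running the deployment scripts

**Note:** Before running the commands below, all target hosts must

* allow passwordless SSH access using SSH keys (otherwise the password has to be entered several times for each host);
* meet the requirements listed in [Prerequisites](#准备工作).

To deploy FATE to all configured target hosts, use the following commands:

The following changes can be made on any machine.

Enter the `kubeFATE\docker-deploy` directory, then run: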

```bash
bash ./generate_config.sh # generate the deployment files
bash ./docker_deploy.sh all # deploy FATE on every party
```

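The script generates deployment files for two parties, 10000 and 9999, and packs them into tar files. It then copies `confs-<party-id>.tar` and `serving-<party-id>.tar` to each party's host and unpacks them; the unpacked files are placed under `/data/projects/fate` by default. Finally, the script logs in to those hosts remotely and starts the FATE instances with docker compose.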
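
The copy, unpack, and start sequence described above can be pictured with a dry-run sketch. This is not the code in `docker_deploy.sh`: `plan_party_deploy` is a hypothetical helper, the `scp` and `tar -xf` invocations are assumptions about how such a step might look, and for simplicity it assumes that a party's training and serving bundles go to the same host.

```bash
# Hypothetical dry-run helper (not part of docker_deploy.sh): prints the
# commands that would ship and start one party's bundles, without running them.
# Arguments: party id, target host, optional base directory.
plan_party_deploy() {
  local party_id="$1" host="$2" base_dir="${3:-/data/projects/fate}"
  local kind
  for kind in confs serving; do
    # copy the bundle to the party's host
    echo "scp ${kind}-${party_id}.tar ${host}:${base_dir}/"
    # unpack it and start the services with docker compose
    echo "ssh ${host} \"cd ${base_dir} && tar -xf ${kind}-${party_id}.tar && cd ${kind}-${party_id} && docker compose up -d\""
  done
}
```

Calling `plan_party_deploy 10000 fate@192.168.7.1` prints four lines, one copy and one start command for each of the `confs` and `serving` bundles, which makes the per-party layout under `/data/projects/fate` easy to inspect before running the real script.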

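After the commands return successfully, log in to any one of the hosts:
By default, the script starts both the training and the serving clusters. To start them separately, pass `--training` or `--serving` to `docker_deploy.sh` as shown below.

(Optional) To deploy each party's training cluster, use the following command: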

```bash
bash ./docker_deploy.sh all --training
```

(Optional) To deploy each party's serving cluster, use the following command:

```bash
bash ./docker_deploy.sh all --serving
```

(Optional) To deploy FATE to a single target host, use the following command with the party's ID (10000 in the example below):

```bash
bash ./docker_deploy.sh 10000
```

(Optional) To deploy the exchange node to a target host, use the following command:

```bash
ssh root@192.168.7.1
bash ./docker_deploy.sh exchange
```

Once the commands finish, log in to any host and use `docker compose ps` to verify the status of the cluster. A sample output is as follows:

```bash
ssh fate@192.168.7.1
```

Verify the instance status using the following commands:

```bash
docker ps
````
cd /data/projects/fate/confs-10000
docker compose ps
```

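The output looks like the following. If every component is running (up), the deployment succeeded.
The output looks like the following. If every component has the status `Up` and fateflow is additionally reported as (healthy), the deployment succeeded.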

```bash
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5d2e84ba4c77 federatedai/serving-server:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8000->8000/tcp, :::8000->8000/tcp serving-9999_serving-server_1
3dca43f3c9d5 federatedai/serving-admin:2.1.5-release "/bin/sh -c 'java -c…" 5 minutes ago Up 5 minutes 0.0.0.0:8350->8350/tcp, :::8350->8350/tcp serving-9999_serving-admin_1
fe924918509b federatedai/serving-proxy:2.1.5-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8059->8059/tcp, :::8059->8059/tcp, 0.0.0.0:8869->8869/tcp, :::8869->8869/tcp, 8879/tcp serving-9999_serving-proxy_1
b62ed8ba42b7 bitnami/zookeeper:3.7.0 "/opt/bitnami/script…" 5 minutes ago Up 5 minutes 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 8080/tcp, 0.0.0.0:49226->2888/tcp, :::49226->2888/tcp, 0.0.0.0:49225->3888/tcp, :::49225->3888/tcp serving-9999_serving-zookeeper_1
3c643324066f federatedai/client:1.11.1-release "/bin/sh -c 'flow in…" 5 minutes ago Up 5 minutes 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp confs-9999_client_1
3fe0af1ebd71 federatedai/fateboard:1.11.1-release "/bin/sh -c 'java -D…" 5 minutes ago Up 5 minutes 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp confs-9999_fateboard_1
635b7d99357e federatedai/fateflow:1.11.1-release "container-entrypoin…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 8080/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp confs-9999_fateflow_1
8b515f08add3 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 8080/tcp, 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp confs-9999_rollsite_1
108cc061c191 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4670/tcp, 8080/tcp confs-9999_clustermanager_1
f10575e76899 federatedai/eggroll:1.11.1-release "/tini -- bash -c 'j…" 5 minutes ago Up 5 minutes 4671/tcp, 8080/tcp confs-9999_nodemanager_1
aa0a0002de93 mysql:8.0.28 "docker-entrypoint.s…" 5 minutes ago Up 5 minutes 3306/tcp, 33060/tcp confs-9999_mysql_1
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
confs-10000-client-1 federatedai/client:1.11.2-release "bash -c 'pipeline i…" client About a minute ago Up About a minute 0.0.0.0:20000->20000/tcp, :::20000->20000/tcp
confs-10000-clustermanager-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" clustermanager About a minute ago Up About a minute 4670/tcp
confs-10000-fateboard-1 federatedai/fateboard:1.11.2-release "/bin/sh -c 'java -D…" fateboard About a minute ago Up About a minute 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
confs-10000-fateflow-1 federatedai/fateflow:1.11.2-release "/bin/bash -c 'set -…" fateflow About a minute ago Up About a minute (healthy) 0.0.0.0:9360->9360/tcp, :::9360->9360/tcp, 0.0.0.0:9380->9380/tcp, :::9380->9380/tcp
confs-10000-mysql-1 mysql:8.0.28 "docker-entrypoint.s…" mysql About a minute ago Up About a minute 3306/tcp, 33060/tcp
confs-10000-nodemanager-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" nodemanager About a minute ago Up About a minute 4671/tcp
confs-10000-rollsite-1 federatedai/eggroll:1.11.2-release "/tini -- bash -c 'j…" rollsite About a minute ago Up About a minute 0.0.0.0:9370->9370/tcp, :::9370->9370/tcp
```

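### Verifying the deployment
@@ -218,9 +247,12 @@ After FATE on docker-compose has started, you need to verify whether each service is
Pick the node 192.168.7.1 for verification and run the following commands: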

```bash
# Run the following commands on 192.168.7.1
$ docker exec -it confs-10000_client_1 bash # enter the client container
$ flow test toy --guest-party-id 10000 --host-party-id 9999 # verify
# Run the following commands on 192.168.7.1

# Enter the client container
$ docker compose exec client bash
# toy verification
$ flow test toy --guest-party-id 10000 --host-party-id 9999
```

If the test passes, the screen shows a message similar to the following:
@@ -243,7 +275,8 @@ $ flow test toy --guest-party-id 10000 --host-party-id 9999 # verify
##### Entering the party 10000 client container

```bash
docker exec -it confs-10000_client_1 bash
cd /data/projects/fate/confs-10000
docker compose exec client bash
```

##### Uploading the host data
@@ -257,7 +290,8 @@ flow data upload -c fateflow/examples/upload/upload_host.json
##### Entering the party 9999 client container

```bash
docker exec -it confs-9999_client_1 bash
cd /data/projects/fate/confs-9999
docker compose exec client bash
```

##### Uploading the guest data
12 changes: 12 additions & 0 deletions docker-deploy/docker_deploy.sh
@@ -166,6 +166,9 @@ cd confs-$target_party_id
docker compose down
docker volume rm -f confs-${target_party_id}_shared_dir_examples
docker volume rm -f confs-${target_party_id}_shared_dir_federatedml
docker volume rm -f confs-${target_party_id}_sdownload_dir
docker volume rm -f confs-${target_party_id}_fate_flow_logs
docker compose up -d
cd ../
rm -f confs-${target_party_id}.tar
@@ -239,13 +242,18 @@ DeleteCluster() {
fi
done
fi

# echo "target_party_ip: $target_party_ip"

for ((i = 0; i < ${#party_list[*]}; i++)); do
if [ "${party_list[$i]}" = "$target_party_id" ]; then
target_party_serving_ip=${serving_ip_list[$i]}
fi
done

# echo "target_party_ip: $target_party_ip"
# echo "cluster_type: $cluster_type"

# delete training cluster
if [ "$cluster_type" == "--training" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_ip <<eeooff
@@ -272,16 +280,20 @@ docker compose down
exit
eeooff
else
if [ "$target_party_serving_ip" != "" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_serving_ip <<eeooff
cd $dir/serving-$target_party_id
docker compose down
exit
eeooff
fi
if [ "$target_party_ip" != "" ]; then
ssh -p ${SSH_PORT} -tt $user@$target_party_ip <<eeooff
cd $dir/confs-$target_party_id
docker compose down
exit
eeooff
fi
echo "party $target_party_id training cluster is deleted!"
echo "party $target_party_id serving cluster is deleted!"
fi
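
The guards added in this hunk skip the remote `docker compose down` when no IP was found for the party, which is the `--delete` fix named in the commit message. The same pattern can be isolated as a small function. This is a sketch under assumptions, not the script's code: `compose_down_if_configured` and the `REMOTE_RUN` variable are hypothetical names introduced here so that the remote call can be replaced by `echo` for a dry run.

```bash
# Hypothetical helper (not the code in docker_deploy.sh): stop one cluster
# directory on a remote host, but skip the step when no host is configured.
# REMOTE_RUN is an assumption of this sketch; it defaults to ssh and can be
# set to `echo` for a dry run.
compose_down_if_configured() {
  local ip="$1" remote_dir="$2"
  if [ -z "$ip" ]; then
    # nothing to do for a party without a configured host
    echo "skip: no host configured for ${remote_dir}"
    return 0
  fi
  ${REMOTE_RUN:-ssh} "$ip" "cd '${remote_dir}' && docker compose down"
}
```

With `REMOTE_RUN=echo`, calling the function with an empty first argument prints only the skip message, while a call with an IP prints the command that would be sent to that host.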