
Prerequisites

This Kubeflow workshop requires the following software: aws cli, eksctl, kubectl, and aws-iam-authenticator, plus the IAM permissions for the corresponding EKS operations. For installation steps, see 2020_EKS_Launch_Workshop/步骤1-准备实验环境.
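
As a quick sanity check before starting, each of the following commands should print a version string if the tool is installed and on your PATH:

# Verify the required CLI tools are installed
aws --version
eksctl version
kubectl version --client
aws-iam-authenticator version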

This workshop uses the default kfctl configuration file to simplify the Kubeflow installation. For production environments, however, we recommend installing the Cognito configuration, which adds authentication and SSL (via ACM). For the additional steps required to enable Cognito, see the Kubeflow documentation.

Set environment variables

Run the following commands to set environment variables and the Kubeflow application directory for this deployment. Set CLUSTER_NAME to the name of your EKS cluster.

export REGION=cn-northwest-1
export CLUSTER_NAME=eks
export BASE_DIR=$(pwd)
export KF_NAME=${CLUSTER_NAME}
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml

Create an EKS cluster with eksctl

Kubeflow needs a fair amount of resources to deploy, so create an EKS cluster with 4 managed worker nodes by running the command below. Creation takes about 15 minutes; please be patient.

eksctl create cluster --name=${CLUSTER_NAME} --nodes=4 --managed --alb-ingress-access --region=${REGION}
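
Once eksctl returns, you can confirm the cluster is reachable; eksctl writes the new cluster's credentials into ~/.kube/config, so kubectl should show every worker node in the Ready state:

# All worker nodes should report STATUS Ready
kubectl get nodes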

Get the EKS worker node role and store it in environment variables for later use:

export STACK_NAME=$(eksctl get nodegroup --cluster $CLUSTER_NAME --region $REGION  -o json | jq -r '.[].StackName')
export NODE_INSTANCE_ROLE=$(aws cloudformation describe-stack-resources --region $REGION --stack-name $STACK_NAME | jq -r '.StackResources[] | select(.LogicalResourceId=="NodeInstanceRole") | .PhysicalResourceId' ) 
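
It is worth echoing both variables to make sure the jq filters found the nodegroup stack and its NodeInstanceRole; both should be non-empty, with the role name ending in something like NodeInstanceRole-XXXXXXXXXX:

# Both variables should print non-empty values
echo $STACK_NAME
echo $NODE_INSTANCE_ROLE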

Install Kubeflow

Download and install kfctl:

curl --silent --location "https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/kfctl /usr/local/bin
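
Confirm the binary is on your PATH:

# Should print the kfctl release version, e.g. v1.0.1
kfctl version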

Configure and download the Kubeflow manifest. This workshop uses the non-Cognito version, which performs no authentication by default. (Note: if the file fails to download, try re-running wget -O ${KF_DIR}/kfctl_aws.yaml $CONFIG_URI.)

export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.1.yaml"
export KF_NAME=${CLUSTER_NAME}
mkdir -p ${BASE_DIR}
export KF_DIR=${BASE_DIR}/${KF_NAME}
mkdir -p ${KF_DIR}
wget -O ${KF_DIR}/kfctl_aws.yaml $CONFIG_URI
export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml

Replace the region and role in kfctl_aws.yaml with the region in which the EKS cluster was created and the role used by its worker nodes:

sed -i.bak -e "s/eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx/$NODE_INSTANCE_ROLE/g" ${CONFIG_FILE}
sed -i -e "s/us-west-2/$REGION/g" ${CONFIG_FILE}

Check that kfctl_aws.yaml was updated correctly; the region and roles fields should now look like this (your role name will differ):

region: cn-northwest-1
roles:
- eksctl-kubeflow-example-nodegroup-ng-185-NodeInstanceRole-1DDJJXQBG9EM6
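
If you prefer a quick command-line check instead of opening the file, a grep along these lines works too:

# Print the region and worker-node role lines from the generated config
grep -E "region:|NodeInstanceRole" ${CONFIG_FILE}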

Under the hood, kfctl uses kustomize to install Kubeflow. Run kfctl build to generate the Kubeflow kustomize configuration files:

kfctl build -f ${CONFIG_FILE}

Because of firewall or network restrictions, images hosted on gcr.io and quay.io may fail to download from within China. Work around this by rewriting the image URLs to point at the AWS China mirror site:

sed -i "s/gcr.io/048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn\/gcr/g" `grep "gcr.io" -rl ${KF_DIR}`
sed -i "s/quay.io/048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn\/quay/g" `grep "quay.io" -rl ${KF_DIR}`

Deploy Kubeflow

kfctl apply -V -f ${CONFIG_FILE}

Installing Kubeflow and its toolset may take several minutes. Some pods may initially show an Error or CrashLoopBackOff status; given some time, they will recover on their own and reach the Running state.
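
You can watch the pods converge in real time; press Ctrl-C to exit the watch once everything reports Running (or Completed, for one-shot jobs):

# Stream pod status changes in the kubeflow namespace
kubectl -n kubeflow get pods -w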

Verify that Kubeflow deployed successfully

Run the following command to check the status:

kubectl -n kubeflow get all

If pods are still unhealthy after some time, troubleshoot by inspecting their events and logs.
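
For example, to inspect a pod that stays in Error or CrashLoopBackOff (substitute the actual pod name from the listing below):

# Pod events often reveal image-pull or scheduling problems
kubectl -n kubeflow describe pod <pod-name>

# Container logs; add --previous to see the last crashed container's output
kubectl -n kubeflow logs <pod-name> --previous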

Sample status output
$ kubectl -n kubeflow get all                         
NAME                                                               READY   STATUS      RESTARTS   AGE
pod/admission-webhook-bootstrap-stateful-set-0                     1/1     Running     0          5m50s
pod/admission-webhook-deployment-64cb96ddbf-x2zfm                  1/1     Running     0          5m12s
pod/alb-ingress-controller-c76dd95d-z2kc7                          1/1     Running     0          5m45s
pod/application-controller-stateful-set-0                          1/1     Running     0          6m32s
pod/argo-ui-778676df64-w4lpj                                       1/1     Running     0          5m51s
pod/centraldashboard-7dd7dd685d-fjnr2                              1/1     Running     0          5m51s
pod/jupyter-web-app-deployment-89789fd5-pmwmf                      1/1     Running     0          5m50s
pod/katib-controller-6b789b6cb5-rc7xz                              1/1     Running     1          5m48s
pod/katib-db-manager-64f548b47c-6p6nv                              1/1     Running     0          5m48s
pod/katib-mysql-57884cb488-6g9zk                                   1/1     Running     0          5m48s
pod/katib-ui-5c5cc6bd77-mwmrl                                      1/1     Running     0          5m48s
pod/metacontroller-0                                               1/1     Running     0          5m51s
pod/metadata-db-76c9f78f77-pjvh8                                   1/1     Running     0          5m49s
pod/metadata-deployment-674fdd976b-946k6                           1/1     Running     0          5m49s
pod/metadata-envoy-deployment-5688989bd6-j5bdh                     1/1     Running     0          5m49s
pod/metadata-grpc-deployment-5579bdc87b-fc88k                      1/1     Running     2          5m49s
pod/metadata-ui-9b8cd699d-drm2p                                    1/1     Running     0          5m49s
pod/minio-755ff748b-hdfwk                                          1/1     Running     0          5m47s
pod/ml-pipeline-79b4f85cbc-hcttq                                   1/1     Running     5          5m47s
pod/ml-pipeline-ml-pipeline-visualizationserver-5fdffdc5bf-nqjb5   1/1     Running     0          5m46s
pod/ml-pipeline-persistenceagent-645cb66874-rgrt4                  1/1     Running     1          5m47s
pod/ml-pipeline-scheduledworkflow-6c978b6b85-dxgw4                 1/1     Running     0          5m46s
pod/ml-pipeline-ui-6995b7bccf-ktwb2                                1/1     Running     0          5m47s
pod/ml-pipeline-viewer-controller-deployment-8554dc7b9f-n4ccc      1/1     Running     0          5m46s
pod/mpi-operator-5bf8b566b7-gkbz9                                  1/1     Running     0          5m45s
pod/mysql-598bc897dc-srtpt                                         1/1     Running     0          5m47s
pod/notebook-controller-deployment-7db57b9ccf-4pqkw                1/1     Running     0          5m49s
pod/nvidia-device-plugin-daemonset-4s9tv                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-5p8kn                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-84jv6                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-d7x5f                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-m8cpr                           1/1     Running     0          5m46s
pod/profiles-deployment-b45dbc6f-7jfqw                             2/2     Running     0          5m46s
pod/pytorch-operator-5fd5f94bdd-dbddk                              1/1     Running     0          5m49s
pod/seldon-controller-manager-679fc777cd-58vzl                     1/1     Running     0          5m45s
pod/spark-operatorcrd-cleanup-tc4nw                                0/2     Completed   0          5m50s
pod/spark-operatorsparkoperator-c7b64b87f-w6glw                    1/1     Running     0          5m50s
pod/spartakus-volunteer-5b7d86d9cd-2z4dn                           1/1     Running     0          5m49s
pod/tensorboard-6544748d94-dr87g                                   1/1     Running     0          5m48s
pod/tf-job-operator-7d7c8fb8bb-bh2j9                               1/1     Running     0          5m48s
pod/workflow-controller-945c84565-ctx84                            1/1     Running     0          5m51s

NAME                                                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/admission-webhook-service                             ClusterIP   10.100.34.137    <none>        443/TCP             5m50s
service/application-controller-service                        ClusterIP   10.100.122.252   <none>        443/TCP             6m32s
service/argo-ui                                               NodePort    10.100.56.77     <none>        80:32722/TCP        5m51s
service/centraldashboard                                      ClusterIP   10.100.122.184   <none>        80/TCP              5m51s
service/jupyter-web-app-service                               ClusterIP   10.100.184.50    <none>        80/TCP              5m50s
service/katib-controller                                      ClusterIP   10.100.96.16     <none>        443/TCP,8080/TCP    5m48s
service/katib-db-manager                                      ClusterIP   10.100.161.38    <none>        6789/TCP            5m48s
service/katib-mysql                                           ClusterIP   10.100.186.115   <none>        3306/TCP            5m48s
service/katib-ui                                              ClusterIP   10.100.110.39    <none>        80/TCP              5m48s
service/metadata-db                                           ClusterIP   10.100.92.177    <none>        3306/TCP            5m49s
service/metadata-envoy-service                                ClusterIP   10.100.17.145    <none>        9090/TCP            5m49s
service/metadata-grpc-service                                 ClusterIP   10.100.238.212   <none>        8080/TCP            5m49s
service/metadata-service                                      ClusterIP   10.100.183.244   <none>        8080/TCP            5m49s
service/metadata-ui                                           ClusterIP   10.100.28.97     <none>        80/TCP              5m49s
service/minio-service                                         ClusterIP   10.100.185.36    <none>        9000/TCP            5m48s
service/ml-pipeline                                           ClusterIP   10.100.45.162    <none>        8888/TCP,8887/TCP   5m48s
service/ml-pipeline-ml-pipeline-visualizationserver           ClusterIP   10.100.211.60    <none>        8888/TCP            5m47s
service/ml-pipeline-tensorboard-ui                            ClusterIP   10.100.150.113   <none>        80/TCP              5m47s
service/ml-pipeline-ui                                        ClusterIP   10.100.135.60    <none>        80/TCP              5m47s
service/mysql                                                 ClusterIP   10.100.37.144    <none>        3306/TCP            5m48s
service/notebook-controller-service                           ClusterIP   10.100.250.183   <none>        443/TCP             5m49s
service/profiles-kfam                                         ClusterIP   10.100.24.246    <none>        8081/TCP            5m47s
service/pytorch-operator                                      ClusterIP   10.100.104.208   <none>        8443/TCP            5m49s
service/seldon-webhook-service                                ClusterIP   10.100.68.153    <none>        443/TCP             5m46s
service/tensorboard                                           ClusterIP   10.100.25.5      <none>        9000/TCP            5m49s
service/tf-job-operator                                       ClusterIP   10.100.165.41    <none>        8443/TCP            5m48s

NAME                                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nvidia-device-plugin-daemonset    5         5         5       5            5           <none>          5m46s

NAME                                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/admission-webhook-deployment                   1/1     1            1           5m50s
deployment.apps/alb-ingress-controller                         1/1     1            1           5m46s
deployment.apps/argo-ui                                        1/1     1            1           5m51s
deployment.apps/centraldashboard                               1/1     1            1           5m51s
deployment.apps/jupyter-web-app-deployment                     1/1     1            1           5m50s
deployment.apps/katib-controller                               1/1     1            1           5m48s
deployment.apps/katib-db-manager                               1/1     1            1           5m48s
deployment.apps/katib-mysql                                    1/1     1            1           5m48s
deployment.apps/katib-ui                                       1/1     1            1           5m48s
deployment.apps/metadata-db                                    1/1     1            1           5m49s
deployment.apps/metadata-deployment                            1/1     1            1           5m49s
deployment.apps/metadata-envoy-deployment                      1/1     1            1           5m49s
deployment.apps/metadata-grpc-deployment                       1/1     1            1           5m49s
deployment.apps/metadata-ui                                    1/1     1            1           5m49s
deployment.apps/minio                                          1/1     1            1           5m48s
deployment.apps/ml-pipeline                                    1/1     1            1           5m48s
deployment.apps/ml-pipeline-ml-pipeline-visualizationserver    1/1     1            1           5m47s
deployment.apps/ml-pipeline-persistenceagent                   1/1     1            1           5m48s
deployment.apps/ml-pipeline-scheduledworkflow                  1/1     1            1           5m47s
deployment.apps/ml-pipeline-ui                                 1/1     1            1           5m47s
deployment.apps/ml-pipeline-viewer-controller-deployment       1/1     1            1           5m47s
deployment.apps/mpi-operator                                   1/1     1            1           5m46s
deployment.apps/mysql                                          1/1     1            1           5m48s
deployment.apps/notebook-controller-deployment                 1/1     1            1           5m49s
deployment.apps/profiles-deployment                            1/1     1            1           5m47s
deployment.apps/pytorch-operator                               1/1     1            1           5m49s
deployment.apps/seldon-controller-manager                      1/1     1            1           5m46s
deployment.apps/spark-operatorsparkoperator                    1/1     1            1           5m50s
deployment.apps/spartakus-volunteer                            1/1     1            1           5m49s
deployment.apps/tensorboard                                    1/1     1            1           5m49s
deployment.apps/tf-job-operator                                1/1     1            1           5m48s
deployment.apps/workflow-controller                            1/1     1            1           5m51s

NAME                                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/admission-webhook-deployment-64cb96ddbf                   1         1         1       5m50s
replicaset.apps/alb-ingress-controller-c76dd95d                           1         1         1       5m45s
replicaset.apps/argo-ui-778676df64                                        1         1         1       5m51s
replicaset.apps/centraldashboard-7dd7dd685d                               1         1         1       5m51s
replicaset.apps/jupyter-web-app-deployment-89789fd5                       1         1         1       5m50s
replicaset.apps/katib-controller-6b789b6cb5                               1         1         1       5m48s
replicaset.apps/katib-db-manager-64f548b47c                               1         1         1       5m48s
replicaset.apps/katib-mysql-57884cb488                                    1         1         1       5m48s
replicaset.apps/katib-ui-5c5cc6bd77                                       1         1         1       5m48s
replicaset.apps/metadata-db-76c9f78f77                                    1         1         1       5m49s
replicaset.apps/metadata-deployment-674fdd976b                            1         1         1       5m49s
replicaset.apps/metadata-envoy-deployment-5688989bd6                      1         1         1       5m49s
replicaset.apps/metadata-grpc-deployment-5579bdc87b                       1         1         1       5m49s
replicaset.apps/metadata-ui-9b8cd699d                                     1         1         1       5m49s
replicaset.apps/minio-755ff748b                                           1         1         1       5m47s
replicaset.apps/ml-pipeline-79b4f85cbc                                    1         1         1       5m47s
replicaset.apps/ml-pipeline-ml-pipeline-visualizationserver-5fdffdc5bf    1         1         1       5m46s
replicaset.apps/ml-pipeline-persistenceagent-645cb66874                   1         1         1       5m47s
replicaset.apps/ml-pipeline-scheduledworkflow-6c978b6b85                  1         1         1       5m46s
replicaset.apps/ml-pipeline-ui-6995b7bccf                                 1         1         1       5m47s
replicaset.apps/ml-pipeline-viewer-controller-deployment-8554dc7b9f       1         1         1       5m46s
replicaset.apps/mpi-operator-5bf8b566b7                                   1         1         1       5m45s
replicaset.apps/mysql-598bc897dc                                          1         1         1       5m47s
replicaset.apps/notebook-controller-deployment-7db57b9ccf                 1         1         1       5m49s
replicaset.apps/profiles-deployment-b45dbc6f                              1         1         1       5m46s
replicaset.apps/pytorch-operator-5fd5f94bdd                               1         1         1       5m49s
replicaset.apps/seldon-controller-manager-679fc777cd                      1         1         1       5m45s
replicaset.apps/spark-operatorsparkoperator-c7b64b87f                     1         1         1       5m50s
replicaset.apps/spartakus-volunteer-5b7d86d9cd                            1         1         1       5m49s
replicaset.apps/tensorboard-6544748d94                                    1         1         1       5m48s
replicaset.apps/tf-job-operator-7d7c8fb8bb                                1         1         1       5m48s
replicaset.apps/workflow-controller-945c84565                             1         1         1       5m51s

NAME                                                        READY   AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set   1/1     5m50s
statefulset.apps/application-controller-stateful-set        1/1     6m32s
statefulset.apps/metacontroller                             1/1     5m51s

NAME                                  COMPLETIONS   DURATION   AGE
job.batch/spark-operatorcrd-cleanup   1/1           42s        5m50s

Optional: create a profile

Kubeflow supports multi-tenancy; users cannot create notebooks in Kubeflow's default namespace.

When you first access Kubeflow, you can use an anonymous namespace. If you want separate Jupyter user spaces, create a profile manifest like the one below and apply it with kubectl apply -f profile.yaml. The Kubeflow profile controller will then create a new namespace and service account that allow notebooks to be created in that namespace.

apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: aws-sample-user
spec:
  owner:
    kind: User
    name: aws-sample-user
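
A minimal usage sketch, assuming you save the manifest above as profile.yaml:

# Create the profile; the profile controller then creates the
# aws-sample-user namespace and its service accounts
kubectl apply -f profile.yaml

# Verify the profile and the namespace it produced
kubectl get profiles
kubectl get ns aws-sample-user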