This Kubeflow workshop requires the following software environment: aws cli, eksctl, kubectl, aws-iam-authenticator, plus IAM permissions for the corresponding EKS operations. For installation steps, see 2020_EKS_Launch_Workshop/步骤1-准备实验环境 (Step 1: Prepare the lab environment).
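Before starting, it may help to confirm that each tool is installed and on your PATH. The version checks below are a minimal sanity check (if your aws-iam-authenticator build lacks a version subcommand, running it with help also confirms the binary works):
aws --version
eksctl version
kubectl version --client
aws-iam-authenticator version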
This workshop uses the default kfctl configuration file to simplify the Kubeflow installation. For production environments, however, we recommend installing the Cognito configuration and adding authentication and SSL (via ACM). For the additional steps required to enable Cognito, see the Kubeflow documentation.
Run the following commands to set the environment variables and the Kubeflow application directory for this deployment. Set CLUSTER_NAME to the name of your EKS cluster.
export REGION=cn-northwest-1
export CLUSTER_NAME=eks
export BASE_DIR=$(pwd)
export KF_NAME=${CLUSTER_NAME}
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml
Because Kubeflow needs a fair amount of resources to deploy, create an EKS cluster with 4 worker nodes by running the command below. This takes about 15 minutes, so please be patient.
eksctl create cluster --name=${CLUSTER_NAME} --nodes=4 --managed --alb-ingress-access --region=${REGION}
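Once eksctl reports success, a quick way to confirm that the worker nodes have joined the cluster and are Ready:
kubectl get nodes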
Get the EKS worker node role and store it in environment variables for later use.
export STACK_NAME=$(eksctl get nodegroup --cluster $CLUSTER_NAME --region $REGION -o json | jq -r '.[].StackName')
export NODE_INSTANCE_ROLE=$(aws cloudformation describe-stack-resources --region $REGION --stack-name $STACK_NAME | jq -r '.StackResources[] | select(.LogicalResourceId=="NodeInstanceRole") | .PhysicalResourceId' )
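Before moving on, it is worth checking that the variable was actually populated; it should print a non-empty role name:
echo $NODE_INSTANCE_ROLE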
Download and install kfctl.
curl --silent --location "https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/kfctl /usr/local/bin
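Confirm that the kfctl binary is on your PATH:
kfctl version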
Configure and download the Kubeflow manifest. This lab uses the non-Cognito version, which performs no authentication by default. (Note: if the file fails to download, retry with wget -O ${KF_DIR}/kfctl_aws.yaml $CONFIG_URI.)
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.1.yaml"
export KF_NAME=${CLUSTER_NAME}
mkdir -p ${BASE_DIR}
export KF_DIR=${BASE_DIR}/${KF_NAME}
mkdir -p ${KF_DIR}
wget -O ${KF_DIR}/kfctl_aws.yaml $CONFIG_URI
export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml
Replace the region and role in kfctl_aws.yaml with the region in which the EKS cluster was created and the role used by the worker nodes.
sed -i'.bak' -e "s/eksctl-kubeflow-aws-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx/$NODE_INSTANCE_ROLE/g" ${CONFIG_FILE}
sed -i -e 's/us-west-2/'"$REGION"'/' ${CONFIG_FILE}
Check that kfctl_aws.yaml has been updated correctly; the relevant section should look like this:
region: cn-northwest-1
roles:
- eksctl-kubeflow-example-nodegroup-ng-185-NodeInstanceRole-1DDJJXQBG9EM6
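A quick command-line spot check of the same thing (the role suffix in your file will differ from the example above):
grep -E "region:|NodeInstanceRole" ${CONFIG_FILE}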
kfctl essentially uses kustomize under the hood to perform the installation; run kfctl build to generate the Kubeflow kustomize configuration files.
kfctl build -f ${CONFIG_FILE}
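To inspect what kfctl build generated, list the application directory; with the default kfctl v1.0 layout the manifests land in a kustomize subdirectory (path assumed from the default layout):
ls ${KF_DIR}/kustomize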
Because of firewall or network restrictions, images hosted overseas on gcr.io and quay.io may not be downloadable. Work around this by rewriting the image URLs to point at the AWS China mirror registry:
sed -i "s/gcr.io/048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn\/gcr/g" `grep "gcr.io" -rl ${KF_DIR}`
sed -i "s/quay.io/048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn\/quay/g" `grep "quay.io" -rl ${KF_DIR}`
Deploy Kubeflow:
kfctl apply -V -f ${CONFIG_FILE}
Installing Kubeflow and its toolset can take several minutes. Some pods may initially show an Error or CrashLoopBackOff status; given some time, they will recover on their own and reach the Running state.
Run the following command to check the status:
kubectl -n kubeflow get all
If things are still unhealthy after a while, troubleshoot by inspecting the pod events and logs.
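For example, to inspect a pod that stays unhealthy (replace <pod-name> with the failing pod reported by the command above):
kubectl -n kubeflow describe pod <pod-name>
kubectl -n kubeflow logs <pod-name>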
Example status output:
$ kubectl -n kubeflow get all
NAME                                                               READY   STATUS      RESTARTS   AGE
pod/admission-webhook-bootstrap-stateful-set-0                     1/1     Running     0          5m50s
pod/admission-webhook-deployment-64cb96ddbf-x2zfm                  1/1     Running     0          5m12s
pod/alb-ingress-controller-c76dd95d-z2kc7                          1/1     Running     0          5m45s
pod/application-controller-stateful-set-0                          1/1     Running     0          6m32s
pod/argo-ui-778676df64-w4lpj                                       1/1     Running     0          5m51s
pod/centraldashboard-7dd7dd685d-fjnr2                              1/1     Running     0          5m51s
pod/jupyter-web-app-deployment-89789fd5-pmwmf                      1/1     Running     0          5m50s
pod/katib-controller-6b789b6cb5-rc7xz                              1/1     Running     1          5m48s
pod/katib-db-manager-64f548b47c-6p6nv                              1/1     Running     0          5m48s
pod/katib-mysql-57884cb488-6g9zk                                   1/1     Running     0          5m48s
pod/katib-ui-5c5cc6bd77-mwmrl                                      1/1     Running     0          5m48s
pod/metacontroller-0                                               1/1     Running     0          5m51s
pod/metadata-db-76c9f78f77-pjvh8                                   1/1     Running     0          5m49s
pod/metadata-deployment-674fdd976b-946k6                           1/1     Running     0          5m49s
pod/metadata-envoy-deployment-5688989bd6-j5bdh                     1/1     Running     0          5m49s
pod/metadata-grpc-deployment-5579bdc87b-fc88k                      1/1     Running     2          5m49s
pod/metadata-ui-9b8cd699d-drm2p                                    1/1     Running     0          5m49s
pod/minio-755ff748b-hdfwk                                          1/1     Running     0          5m47s
pod/ml-pipeline-79b4f85cbc-hcttq                                   1/1     Running     5          5m47s
pod/ml-pipeline-ml-pipeline-visualizationserver-5fdffdc5bf-nqjb5   1/1     Running     0          5m46s
pod/ml-pipeline-persistenceagent-645cb66874-rgrt4                  1/1     Running     1          5m47s
pod/ml-pipeline-scheduledworkflow-6c978b6b85-dxgw4                 1/1     Running     0          5m46s
pod/ml-pipeline-ui-6995b7bccf-ktwb2                                1/1     Running     0          5m47s
pod/ml-pipeline-viewer-controller-deployment-8554dc7b9f-n4ccc      1/1     Running     0          5m46s
pod/mpi-operator-5bf8b566b7-gkbz9                                  1/1     Running     0          5m45s
pod/mysql-598bc897dc-srtpt                                         1/1     Running     0          5m47s
pod/notebook-controller-deployment-7db57b9ccf-4pqkw                1/1     Running     0          5m49s
pod/nvidia-device-plugin-daemonset-4s9tv                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-5p8kn                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-84jv6                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-d7x5f                           1/1     Running     0          5m46s
pod/nvidia-device-plugin-daemonset-m8cpr                           1/1     Running     0          5m46s
pod/profiles-deployment-b45dbc6f-7jfqw                             2/2     Running     0          5m46s
pod/pytorch-operator-5fd5f94bdd-dbddk                              1/1     Running     0          5m49s
pod/seldon-controller-manager-679fc777cd-58vzl                     1/1     Running     0          5m45s
pod/spark-operatorcrd-cleanup-tc4nw                                0/2     Completed   0          5m50s
pod/spark-operatorsparkoperator-c7b64b87f-w6glw                    1/1     Running     0          5m50s
pod/spartakus-volunteer-5b7d86d9cd-2z4dn                           1/1     Running     0          5m49s
pod/tensorboard-6544748d94-dr87g                                   1/1     Running     0          5m48s
pod/tf-job-operator-7d7c8fb8bb-bh2j9                               1/1     Running     0          5m48s
pod/workflow-controller-945c84565-ctx84                            1/1     Running     0          5m51s

NAME                                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/admission-webhook-service                     ClusterIP   10.100.34.137    <none>        443/TCP             5m50s
service/application-controller-service                ClusterIP   10.100.122.252   <none>        443/TCP             6m32s
service/argo-ui                                       NodePort    10.100.56.77     <none>        80:32722/TCP        5m51s
service/centraldashboard                              ClusterIP   10.100.122.184   <none>        80/TCP              5m51s
service/jupyter-web-app-service                       ClusterIP   10.100.184.50    <none>        80/TCP              5m50s
service/katib-controller                              ClusterIP   10.100.96.16     <none>        443/TCP,8080/TCP    5m48s
service/katib-db-manager                              ClusterIP   10.100.161.38    <none>        6789/TCP            5m48s
service/katib-mysql                                   ClusterIP   10.100.186.115   <none>        3306/TCP            5m48s
service/katib-ui                                      ClusterIP   10.100.110.39    <none>        80/TCP              5m48s
service/metadata-db                                   ClusterIP   10.100.92.177    <none>        3306/TCP            5m49s
service/metadata-envoy-service                        ClusterIP   10.100.17.145    <none>        9090/TCP            5m49s
service/metadata-grpc-service                         ClusterIP   10.100.238.212   <none>        8080/TCP            5m49s
service/metadata-service                              ClusterIP   10.100.183.244   <none>        8080/TCP            5m49s
service/metadata-ui                                   ClusterIP   10.100.28.97     <none>        80/TCP              5m49s
service/minio-service                                 ClusterIP   10.100.185.36    <none>        9000/TCP            5m48s
service/ml-pipeline                                   ClusterIP   10.100.45.162    <none>        8888/TCP,8887/TCP   5m48s
service/ml-pipeline-ml-pipeline-visualizationserver   ClusterIP   10.100.211.60    <none>        8888/TCP            5m47s
service/ml-pipeline-tensorboard-ui                    ClusterIP   10.100.150.113   <none>        80/TCP              5m47s
service/ml-pipeline-ui                                ClusterIP   10.100.135.60    <none>        80/TCP              5m47s
service/mysql                                         ClusterIP   10.100.37.144    <none>        3306/TCP            5m48s
service/notebook-controller-service                   ClusterIP   10.100.250.183   <none>        443/TCP             5m49s
service/profiles-kfam                                 ClusterIP   10.100.24.246    <none>        8081/TCP            5m47s
service/pytorch-operator                              ClusterIP   10.100.104.208   <none>        8443/TCP            5m49s
service/seldon-webhook-service                        ClusterIP   10.100.68.153    <none>        443/TCP             5m46s
service/tensorboard                                   ClusterIP   10.100.25.5      <none>        9000/TCP            5m49s
service/tf-job-operator                               ClusterIP   10.100.165.41    <none>        8443/TCP            5m48s

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nvidia-device-plugin-daemonset   5         5         5       5            5           <none>          5m46s

NAME                                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/admission-webhook-deployment                  1/1     1            1           5m50s
deployment.apps/alb-ingress-controller                        1/1     1            1           5m46s
deployment.apps/argo-ui                                       1/1     1            1           5m51s
deployment.apps/centraldashboard                              1/1     1            1           5m51s
deployment.apps/jupyter-web-app-deployment                    1/1     1            1           5m50s
deployment.apps/katib-controller                              1/1     1            1           5m48s
deployment.apps/katib-db-manager                              1/1     1            1           5m48s
deployment.apps/katib-mysql                                   1/1     1            1           5m48s
deployment.apps/katib-ui                                      1/1     1            1           5m48s
deployment.apps/metadata-db                                   1/1     1            1           5m49s
deployment.apps/metadata-deployment                           1/1     1            1           5m49s
deployment.apps/metadata-envoy-deployment                     1/1     1            1           5m49s
deployment.apps/metadata-grpc-deployment                      1/1     1            1           5m49s
deployment.apps/metadata-ui                                   1/1     1            1           5m49s
deployment.apps/minio                                         1/1     1            1           5m48s
deployment.apps/ml-pipeline                                   1/1     1            1           5m48s
deployment.apps/ml-pipeline-ml-pipeline-visualizationserver   1/1     1            1           5m47s
deployment.apps/ml-pipeline-persistenceagent                  1/1     1            1           5m48s
deployment.apps/ml-pipeline-scheduledworkflow                 1/1     1            1           5m47s
deployment.apps/ml-pipeline-ui                                1/1     1            1           5m47s
deployment.apps/ml-pipeline-viewer-controller-deployment      1/1     1            1           5m47s
deployment.apps/mpi-operator                                  1/1     1            1           5m46s
deployment.apps/mysql                                         1/1     1            1           5m48s
deployment.apps/notebook-controller-deployment                1/1     1            1           5m49s
deployment.apps/profiles-deployment                           1/1     1            1           5m47s
deployment.apps/pytorch-operator                              1/1     1            1           5m49s
deployment.apps/seldon-controller-manager                     1/1     1            1           5m46s
deployment.apps/spark-operatorsparkoperator                   1/1     1            1           5m50s
deployment.apps/spartakus-volunteer                           1/1     1            1           5m49s
deployment.apps/tensorboard                                   1/1     1            1           5m49s
deployment.apps/tf-job-operator                               1/1     1            1           5m48s
deployment.apps/workflow-controller                           1/1     1            1           5m51s

NAME                                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/admission-webhook-deployment-64cb96ddbf                   1         1         1       5m50s
replicaset.apps/alb-ingress-controller-c76dd95d                           1         1         1       5m45s
replicaset.apps/argo-ui-778676df64                                        1         1         1       5m51s
replicaset.apps/centraldashboard-7dd7dd685d                               1         1         1       5m51s
replicaset.apps/jupyter-web-app-deployment-89789fd5                       1         1         1       5m50s
replicaset.apps/katib-controller-6b789b6cb5                               1         1         1       5m48s
replicaset.apps/katib-db-manager-64f548b47c                               1         1         1       5m48s
replicaset.apps/katib-mysql-57884cb488                                    1         1         1       5m48s
replicaset.apps/katib-ui-5c5cc6bd77                                       1         1         1       5m48s
replicaset.apps/metadata-db-76c9f78f77                                    1         1         1       5m49s
replicaset.apps/metadata-deployment-674fdd976b                            1         1         1       5m49s
replicaset.apps/metadata-envoy-deployment-5688989bd6                      1         1         1       5m49s
replicaset.apps/metadata-grpc-deployment-5579bdc87b                       1         1         1       5m49s
replicaset.apps/metadata-ui-9b8cd699d                                     1         1         1       5m49s
replicaset.apps/minio-755ff748b                                           1         1         1       5m47s
replicaset.apps/ml-pipeline-79b4f85cbc                                    1         1         1       5m47s
replicaset.apps/ml-pipeline-ml-pipeline-visualizationserver-5fdffdc5bf    1         1         1       5m46s
replicaset.apps/ml-pipeline-persistenceagent-645cb66874                   1         1         1       5m47s
replicaset.apps/ml-pipeline-scheduledworkflow-6c978b6b85                  1         1         1       5m46s
replicaset.apps/ml-pipeline-ui-6995b7bccf                                 1         1         1       5m47s
replicaset.apps/ml-pipeline-viewer-controller-deployment-8554dc7b9f       1         1         1       5m46s
replicaset.apps/mpi-operator-5bf8b566b7                                   1         1         1       5m45s
replicaset.apps/mysql-598bc897dc                                          1         1         1       5m47s
replicaset.apps/notebook-controller-deployment-7db57b9ccf                 1         1         1       5m49s
replicaset.apps/profiles-deployment-b45dbc6f                              1         1         1       5m46s
replicaset.apps/pytorch-operator-5fd5f94bdd                               1         1         1       5m49s
replicaset.apps/seldon-controller-manager-679fc777cd                      1         1         1       5m45s
replicaset.apps/spark-operatorsparkoperator-c7b64b87f                     1         1         1       5m50s
replicaset.apps/spartakus-volunteer-5b7d86d9cd                            1         1         1       5m49s
replicaset.apps/tensorboard-6544748d94                                    1         1         1       5m48s
replicaset.apps/tf-job-operator-7d7c8fb8bb                                1         1         1       5m48s
replicaset.apps/workflow-controller-945c84565                             1         1         1       5m51s

NAME                                                        READY   AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set   1/1     5m50s
statefulset.apps/application-controller-stateful-set        1/1     6m32s
statefulset.apps/metacontroller                             1/1     5m51s

NAME                                  COMPLETIONS   DURATION   AGE
job.batch/spark-operatorcrd-cleanup   1/1           42s        5m50s
Kubeflow provides multi-tenancy support; users cannot create notebooks in Kubeflow's default namespace.
On your first visit to Kubeflow you can use an anonymous namespace. If you want separate Jupyter user namespaces, create a Profile like the one below and run kubectl apply -f profile.yaml (see the apply step after the manifest). The Kubeflow profile controller will create the new namespace and a service account that is allowed to create notebooks in that namespace.
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: aws-sample-user
spec:
  owner:
    kind: User
    name: aws-sample-user
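To use it, save the manifest to a file (profile.yaml here is just an illustrative filename), apply it, and confirm that the Profile and its namespace were created:
kubectl apply -f profile.yaml
kubectl get profile aws-sample-user
kubectl get namespace aws-sample-user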