Kubeflow Operator

Kubeflow Operator helps deploy, monitor, and manage the lifecycle of Kubeflow. It is built using the Operator Framework, an open source toolkit for building, testing, and packaging operators and managing their lifecycle.

The Operator is currently in the incubation phase and is based on this design doc. It is built on top of the kfdef CR and uses kfctl as the nucleus of its controller. The current roadmap for this Operator is listed here.

Deployment Instructions

  1. Clone this repository and deploy the CRD and controller
# git clone https://github.com/kubeflow/kfctl.git && cd kfctl
OPERATOR_NAMESPACE=operators
kubectl create ns ${OPERATOR_NAMESPACE}
kubectl create -f deploy/crds/kfdef.apps.kubeflow.org_kfdefs_crd.yaml
kubectl create -f deploy/service_account.yaml -n ${OPERATOR_NAMESPACE}
kubectl create clusterrolebinding kubeflow-operator --clusterrole cluster-admin --serviceaccount=${OPERATOR_NAMESPACE}:kubeflow-operator
kubectl create -f deploy/operator.yaml -n ${OPERATOR_NAMESPACE}
  2. Deploy KfDef. If your Kubernetes version is 1.15+, you can optionally apply a ResourceQuota that allows only one kfdef instance within the Kubeflow namespace, so that only one deployment of Kubeflow exists on this cluster. This follows the singleton model.
KUBEFLOW_NAMESPACE=kubeflow
kubectl create ns ${KUBEFLOW_NAMESPACE}
# kubectl create -f deploy/crds/kfdef_quota.yaml -n ${KUBEFLOW_NAMESPACE} # only deploy this if the k8s cluster is 1.15+ and has resource quota support
kubectl create -f <kfdef> -n ${KUBEFLOW_NAMESPACE}

The <kfdef> above can point to a remote URL or to a local kfdef file. For example, for IBM Cloud the command is:

kubectl create -f https://raw.githubusercontent.com/kubeflow/manifests/master/kfdef/kfctl_ibm.yaml -n ${KUBEFLOW_NAMESPACE}
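The optional ResourceQuota described in the KfDef step can be sketched as follows. This is a minimal illustration, not necessarily the content of deploy/crds/kfdef_quota.yaml: the function name `kfdef_quota_manifest` and the quota name `kf-objects` are hypothetical, and the resource key assumes the Kubernetes `count/<resource>.<group>` object-count convention with the group inferred from the CRD file name used above.

```shell
# Print a ResourceQuota manifest that caps the number of kfdef objects
# allowed in a namespace (default: 1, matching the singleton model).
# Assumptions: the quota name "kf-objects" is made up for this sketch, and
# the group "kfdef.apps.kubeflow.org" is inferred from the CRD file name.
kfdef_quota_manifest() {
  max="${1:-1}"
  cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: kf-objects
spec:
  hard:
    count/kfdefs.kfdef.apps.kubeflow.org: "${max}"
EOF
}
```

To apply it, the output could be piped to kubectl, for example `kfdef_quota_manifest | kubectl create -f - -n ${KUBEFLOW_NAMESPACE}`. With such a quota in place, creating a second kfdef in the same namespace would be expected to fail with a quota-exceeded error.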

Testing Watcher and Reconciler

One of the major benefits of running kfctl as an Operator is the ability to watch and reconcile your Kubeflow deployments. The Operator watches all resources with the kfctl label. If one of these resources is deleted, the reconciler is triggered and re-applies the kfdef to the Kubernetes cluster.

  1. Check that the tf-job-operator deployment is running.
kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
# tf-job-operator                               1/1     1            1           7m15s
  2. Delete the tf-job-operator deployment.
kubectl delete deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# deployment.extensions "tf-job-operator" deleted
  3. Wait 10 to 15 seconds, then check the tf-job-operator deployment again. You should see that the deployment is being recreated by the Operator's reconciliation logic.
kubectl get deploy -n ${KUBEFLOW_NAMESPACE} tf-job-operator
# NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
# tf-job-operator                               0/1     0            0           10s
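Instead of waiting a fixed 10 to 15 seconds, the check can be scripted as a polling loop. The sketch below is one possible approach, not part of the Operator itself: the function name `wait_for_recreated` and the `POLL_INTERVAL` variable are hypothetical, while `kubectl get` and `kubectl rollout status --timeout` are standard kubectl commands.

```shell
# Poll until the named deployment exists again, then wait for its rollout.
# Arguments: deployment name, namespace, timeout in seconds (default 120).
# POLL_INTERVAL (seconds between polls, default 5) is a name made up here.
wait_for_recreated() {
  name="$1"
  ns="$2"
  timeout="${3:-120}"
  interval="${POLL_INTERVAL:-5}"
  elapsed=0
  # Loop while "kubectl get" fails, i.e. the deployment does not exist yet.
  until kubectl get deploy "$name" -n "$ns" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $name to be recreated" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  # The deployment object is back; block until its pods are available.
  kubectl rollout status "deploy/$name" -n "$ns" --timeout="${timeout}s"
}
```

After deleting the deployment, it could be called as `wait_for_recreated tf-job-operator ${KUBEFLOW_NAMESPACE}`. A non-zero exit status would suggest that the reconciler did not recreate the resource within the timeout.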

Delete Kubeflow

Delete Kubeflow deployment

kubectl delete kfdef -n ${KUBEFLOW_NAMESPACE} --all

Delete Kubeflow Operator

kubectl delete -f deploy/operator.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete clusterrolebinding kubeflow-operator
kubectl delete -f deploy/service_account.yaml -n ${OPERATOR_NAMESPACE}
kubectl delete -f deploy/crds/kfdef.apps.kubeflow.org_kfdefs_crd.yaml
kubectl delete ns ${OPERATOR_NAMESPACE}

Optional: Registering the Operator to OLM Catalog

If you use OLM to install and manage the Operator, follow the instructions here to register it with the OLM catalog.

Troubleshooting

  • Deleting the Kubeflow deployment runs kfctl delete in the background, which only deletes the deployment namespace. Because mutatingwebhookconfigurations are cluster-wide resources and some of the webhooks watch every pod deployment, this can leave later pod deployments hanging. To avoid this, remove all of the mutatingwebhookconfigurations after deleting Kubeflow:
kubectl delete mutatingwebhookconfigurations admission-webhook-mutating-webhook-configuration
kubectl delete mutatingwebhookconfigurations inferenceservice.serving.kubeflow.org
kubectl delete mutatingwebhookconfigurations istio-sidecar-injector
kubectl delete mutatingwebhookconfigurations katib-mutating-webhook-config
kubectl delete mutatingwebhookconfigurations mutating-webhook-configurations
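The cleanup above can also be wrapped in a small loop so that it is safe to re-run. This is a sketch under the assumption that the five names listed above are the only webhook configurations to remove; the function name `delete_kubeflow_webhooks` is hypothetical, and `--ignore-not-found` is a standard `kubectl delete` flag.

```shell
# Delete the cluster-wide mutating webhook configurations listed above.
# --ignore-not-found keeps the command from failing when a configuration
# has already been removed, so the function can be run repeatedly.
delete_kubeflow_webhooks() {
  for name in \
    admission-webhook-mutating-webhook-configuration \
    inferenceservice.serving.kubeflow.org \
    istio-sidecar-injector \
    katib-mutating-webhook-config \
    mutating-webhook-configurations
  do
    kubectl delete mutatingwebhookconfigurations "$name" --ignore-not-found
  done
}
```

If the cluster runs an Istio installation that is not managed by Kubeflow, it may be worth listing the existing entries with `kubectl get mutatingwebhookconfigurations` before running this, since these resources are shared across the whole cluster.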

Development Instructions

Prerequisites

  1. Install operator-sdk
  2. Install golang

Build Instructions

These steps are based on the operator-sdk workflow, with modifications specific to this Kubeflow operator.

  1. Clone this repository under your $GOPATH (e.g. ~/go/src/github.com/kubeflow/).
git clone https://github.com/kubeflow/kfctl
cd kfctl
  2. Build and push the operator.
export OPERATOR_IMG=<docker_username>/kubeflow-operator
make build-operator
make push-operator
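Because the build step depends on OPERATOR_IMG being set to a real image name, a small guard can catch a forgotten or unedited placeholder before anything is built. The wrapper below is only a convenience sketch; the function name `build_and_push_operator` is hypothetical, and it assumes the two make targets shown above.

```shell
# Build and push the operator image, refusing to run if OPERATOR_IMG is
# unset, empty, or still contains the <docker_username> placeholder brackets.
build_and_push_operator() {
  case "${OPERATOR_IMG:-}" in
    "" | *"<"* | *">"*)
      echo "OPERATOR_IMG must be set to a real image name" >&2
      return 1
      ;;
  esac
  # Push only if the build succeeded.
  make build-operator && make push-operator
}
```

OPERATOR_IMG still needs to be exported, as in the step above, so that make can see it.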