From d41f8e85c81f056cef3af5c6cfad250a598ac5db Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Wed, 16 Jan 2019 07:30:01 -0800
Subject: [PATCH] Add information how to run TFjob and Pytorch examples in Katib (#321)

* Add doc for tfjob and pytorch examples in Katib
* Add contents
* Fix README
* Fix link to examples in README
* Fix README
* Add information about Katib UI and status of StudyJob
* Add Ambassador information
---
 README.md | 301 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 301 insertions(+)

diff --git a/README.md b/README.md
index 9c21b4e7ad2..6c462cb4e20 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,31 @@ Hyperparameter Tuning on Kubernetes.
 This project is inspired by [Google vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf).
 Katib is a scalable and flexible hyperparameter tuning framework and is tightly integrated with kubernetes.
 Also it does not depend on a specific Deep Learning framework (e.g. TensorFlow, MXNet, and PyTorch).
+
+
+
+**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*
+
+- [Name](#name)
+- [Concepts in Google Vizier](#concepts-in-google-vizier)
+  - [Study](#study)
+  - [Trial](#trial)
+  - [Suggestion](#suggestion)
+- [Components in Katib](#components-in-katib)
+- [Getting Started](#getting-started)
+- [Web UI](#web-ui)
+- [API Documentation](#api-documentation)
+- [Quickstart to run tfjob and pytorch operator jobs in Katib](#quickstart-to-run-tfjob-and-pytorch-operator-jobs-in-katib)
+  - [TFjob operator](#tfjob-operator)
+  - [Pytorch operator](#pytorch-operator)
+  - [Katib](#katib)
+  - [Running examples](#running-examples)
+  - [Cleanups](#cleanups)
+- [CONTRIBUTING](#contributing)
+- [TODOs](#todos)
+
+
+
 ## Name
 
 Katib stands for `secretary` in Arabic. As `Vizier` stands for a high official or a prime minister in Arabic, this project Katib is named in the honor of Vizier.
@@ -65,6 +90,282 @@ You can visualize general trend of Hyper parameter space and each training histo
 
 Please refer to [api.md](./pkg/api/gen-doc/api.md).
 
+## Quickstart to run tfjob and pytorch operator jobs in Katib
+
+To run TFJob and PyTorch operator jobs in Katib, you first have to install their packages.
+
+In your Ksonnet app root, run the following:
+
+```
+export KF_ENV=default
+ks env set ${KF_ENV} --namespace=kubeflow
+ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
+```
+
+### TFjob operator
+
+To install the TFJob operator, run the following:
+
+```
+ks pkg install kubeflow/tf-training
+ks pkg install kubeflow/common
+ks generate tf-job-operator tf-job-operator
+ks apply ${KF_ENV} -c tf-job-operator
+```
+
+### Pytorch operator
+To install the PyTorch operator, run the following:
+
+```
+ks pkg install kubeflow/pytorch-job
+ks generate pytorch-operator pytorch-operator
+ks apply ${KF_ENV} -c pytorch-operator
+```
+
+### Katib
+
+Finally, install Katib:
+
+```
+ks pkg install kubeflow/katib
+ks generate katib katib
+ks apply ${KF_ENV} -c katib
+```
+
+If you run Katib outside GKE and your cluster has no StorageClass for dynamic volume provisioning, you have to create a persistent volume to bind your persistent volume claim.
+
+This is the YAML file for the persistent volume:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+  name: katib-mysql
+  labels:
+    type: local
+    app: katib
+spec:
+  capacity:
+    storage: 10Gi
+  accessModes:
+    - ReadWriteOnce
+  hostPath:
+    path: /data/katib
+```
+
+Create this PV after deploying the Katib package:
+
+```
+kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml
+```
+
+### Running examples
+
+After deploying everything, you can run the examples.
+
+To run the TFJob operator example, you have to create a volume for it.
+
+If you are using GKE and the default StorageClass, you have to create this PVC:
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: tfevent-volume
+  namespace: kubeflow
+  labels:
+    type: local
+    app: tfjob
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
+```
+
+If you are not using GKE and your cluster has no StorageClass for dynamic volume provisioning, you have to create both a PVC and a PV:
+
+```
+kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pvc.yaml
+
+kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfevent-volume/tfevent-pv.yaml
+```
+
+This is the example for the TFJob operator:
+
+```
+kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/tfjob-example.yaml
+```
+
+This is the example for the PyTorch operator:
+
+```
+kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/pytorchjob-example.yaml
+```
+
+You can check the status of the StudyJob:
+
+```yaml
+$ kubectl describe studyjob pytorchjob-example -n kubeflow
+
+Name:         pytorchjob-example
+Namespace:    kubeflow
+Labels:       controller-tools.k8s.io=1.0
+Annotations:
+API Version:  kubeflow.org/v1alpha1
+Kind:         StudyJob
+Metadata:
+  Cluster Name:
+  Creation Timestamp:  2019-01-15T18:35:20Z
+  Generation:          1
+  Resource Version:    1058135
+  Self Link:           /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/pytorchjob-example
+  UID:                 4fc7ad83-18f4-11e9-a6de-42010a8e0225
+Spec:
+  Metricsnames:
+    accuracy
+  Objectivevaluename:  accuracy
+  Optimizationgoal:    0.99
+  Optimizationtype:    maximize
+  Owner:               crd
+  Parameterconfigs:
+    Feasible:
+      Max:          0.05
+      Min:          0.01
+    Name:           --lr
+    Parametertype:  double
+    Feasible:
+      Max:          0.9
+      Min:          0.5
+    Name:           --momentum
+    Parametertype:  double
+  Requestcount:     4
+  Study Name:       pytorchjob-example
+  Suggestion Spec:
+    Request Number:        3
+    Suggestion Algorithm:  random
+    Suggestion Parameters:
+      Name:   SuggestionCount
+      Value:  0
+  Worker Spec:
+    Go Template:
+      Raw Template:  apiVersion: "kubeflow.org/v1beta1"
+kind: PyTorchJob
+metadata:
+  name: {{.WorkerID}}
+  namespace: kubeflow
+spec:
+  pytorchReplicaSpecs:
+    Master:
+      replicas: 1
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: gcr.io/kubeflow-ci/pytorch-mnist-with-summary:1.0
+              imagePullPolicy: Always
+              command:
+                - "python"
+                - "/opt/pytorch_dist_mnist/dist_mnist_with_summary.py"
+                {{- with .HyperParameters}}
+                {{- range .}}
+                - "{{.Name}}={{.Value}}"
+                {{- end}}
+                {{- end}}
+    Worker:
+      replicas: 2
+      restartPolicy: OnFailure
+      template:
+        spec:
+          containers:
+            - name: pytorch
+              image: gcr.io/kubeflow-ci/pytorch-mnist-with-summary:1.0
+              imagePullPolicy: Always
+              command:
+                - "python"
+                - "/opt/pytorch_dist_mnist/dist_mnist_with_summary.py"
+                {{- with .HyperParameters}}
+                {{- range .}}
+                - "{{.Name}}={{.Value}}"
+                {{- end}}
+                {{- end}}
+    Retain:  true
+Status:
+  Conditon:                     Running
+  Early Stopping Parameter Id:
+  Last Reconcile Time:          2019-01-15T18:35:20Z
+  Start Time:                   2019-01-15T18:35:20Z
+  Studyid:                      k291b444a0b68631
+  Suggestion Count:             1
+  Suggestion Parameter Id:      n6f17dd9ff466a2b
+  Trials:
+    Trialid:  o104235328003ad9
+    Workeridlist:
+      Completion Time:
+      Conditon:         Running
+      Kind:             PyTorchJob
+      Start Time:       2019-01-15T18:35:20Z
+      Workerid:         b3b371c89144727f
+    Trialid:  ca207b2432231de3
+    Workeridlist:
+      Completion Time:
+      Conditon:         Running
+      Kind:             PyTorchJob
+      Start Time:       2019-01-15T18:35:20Z
+      Workerid:         f291b04fb27ece3c
+    Trialid:  ddff69212e826432
+    Workeridlist:
+      Completion Time:
+      Conditon:         Running
+      Kind:             PyTorchJob
+      Start Time:       2019-01-15T18:35:20Z
+      Workerid:         ncbed67bbcd4a8ed
+Events:
+```
+
+When the StudyJob's top-level condition under `Status` (printed as `Conditon` in the `kubectl describe` output) becomes `Completed`, the StudyJob is finished.
+
+You can monitor your results in the Katib UI. To access the Katib UI, you have to install Ambassador.
+
+In your Ksonnet app root, run the following:
+
+```
+ks generate ambassador ambassador
+ks apply ${KF_ENV} -c ambassador
+```
+
+After this, port-forward the Ambassador service:
+
+```
+kubectl port-forward svc/ambassador -n kubeflow 8080:80
+```
+
+Finally, you can access the Katib UI at this URL: `http://localhost:8080/katib/`.
+
+### Cleanups
+
+Delete the installed components:
+
+```
+ks delete ${KF_ENV} -c katib
+ks delete ${KF_ENV} -c pytorch-operator
+ks delete ${KF_ENV} -c tf-job-operator
+```
+
+If you created a PV for Katib, delete it:
+
+```
+kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/pv/pv.yaml
+```
+
+If you deployed Ambassador, delete it:
+
+```
+ks delete ${KF_ENV} -c ambassador
+```
+
 ## CONTRIBUTING
 
 Please feel free to test the system! [developer-guide.md](./docs/developer-guide.md) is a good starting point for developers.