Add Tekton Pipeline example #1339

Merged
merged 4 commits into from
Oct 27, 2020
40 changes: 40 additions & 0 deletions examples/v1beta1/tekton/README.md
@@ -0,0 +1,40 @@
# Katib examples with Tekton integration

Here you can find examples of using Katib with [Tekton](https://github.com/tektoncd/pipeline).

See the [Tekton installation guide](https://github.com/tektoncd/pipeline/blob/master/docs/install.md#installing-tekton-pipelines-on-kubernetes)
to install Tekton on your cluster.

**Note** that you must modify the Tekton [`nop`](https://github.com/tektoncd/pipeline/tree/master/cmd/nop)
image to run Tekton pipelines with Katib. Tekton uses the `nop` image to stop sidecar containers after the
main container completes. The metrics collector, however, must not be stopped as soon as the training
container finishes. To avoid this problem, set the `nop` image to the metrics collector sidecar image.

For example, if you are using the
[StdOut](https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector) metrics collector,
the `nop` image must be `gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector`.

After deploying Tekton on your cluster, run the command below to modify the `nop` image:

```bash
kubectl patch deploy tekton-pipelines-controller -n tekton-pipelines --type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/9", "value": "gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector"}]'
```
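
The index `9` in the patch path depends on the order of the controller's arguments, which may differ between Tekton releases. The sketch below looks the position up instead of hard-coding it. It assumes that the controller receives the image as a separate argument immediately after a `-nop-image` flag; the helper names `nop_image_index` and `nop_image_patch` are illustrative and are not part of Katib or Tekton.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. Assumes the controller args
# contain a "-nop-image" flag followed by the image as a separate argument.

# Reads container args from stdin (one per line) and prints the zero-based
# index of the argument that follows "-nop-image". Fails if the flag is absent.
nop_image_index() {
  # NR is the 1-based line number of the flag, which equals the
  # zero-based index of the value on the next line.
  awk '$0 == "-nop-image" { print NR; found = 1; exit }
       END { if (!found) exit 1 }'
}

# Prints a JSON patch that replaces the argument at the given index.
nop_image_patch() {
  local index="$1" image="$2"
  printf '[{"op": "replace", "path": "/spec/template/spec/containers/0/args/%s", "value": "%s"}]' \
    "$index" "$image"
}

# Usage against a live cluster (not executed here):
#   index=$(kubectl get deploy tekton-pipelines-controller -n tekton-pipelines \
#     -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' | nop_image_index)
#   kubectl patch deploy tekton-pipelines-controller -n tekton-pipelines --type='json' \
#     -p="$(nop_image_patch "$index" gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector)"
```

If the flag is not found, `nop_image_index` exits with a non-zero status, so the patch can be skipped rather than applied to the wrong argument.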

Check that the Tekton controller's pod was restarted:

```bash
$ kubectl get pods -n tekton-pipelines

NAME                                           READY   STATUS    RESTARTS   AGE
tekton-pipelines-controller-7fcb6c6cd4-p8zf2   1/1     Running   0          2m2s
tekton-pipelines-webhook-7f9888f9b-7d6mr       1/1     Running   0          12h
```

Check that the `nop` image was modified:

```bash
$ kubectl get pod <tekton-controller-pod-name> -n tekton-pipelines -o yaml | grep katib/v1beta1/file-metrics-collector

- gcr.io/kubeflow-images-public/katib/v1beta1/file-metrics-collector
```
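
`grep` matches the image name anywhere in the pod manifest. For a stricter check, the value passed to the flag can be compared directly. This is a minimal sketch, assuming the controller container receives the image as the argument immediately after a `-nop-image` flag; `nop_image_value` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch only: the helper name is hypothetical. Assumes the args contain a
# "-nop-image" flag followed by the image as a separate argument.

# Reads container args from stdin (one per line) and prints the argument
# that follows "-nop-image". Fails if the flag is absent.
nop_image_value() {
  awk 'prev == "-nop-image" { print; found = 1; exit }
       { prev = $0 }
       END { if (!found) exit 1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pod <tekton-controller-pod-name> -n tekton-pipelines \
#     -o jsonpath='{range .spec.containers[0].args[*]}{@}{"\n"}{end}' | nop_image_value
```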
96 changes: 96 additions & 0 deletions examples/v1beta1/tekton/pipeline-run.yaml
@@ -0,0 +1,96 @@
# This example shows how to use Tekton Pipelines in Katib: it transfers parameters from one Task to another and runs a hyperparameter (HP) tuning job.
# It uses the random search algorithm and tunes only the learning rate.
# The Pipeline contains two Tasks: data-preprocessing and model-training.
# The first Task shows how to prepare training data (here, simply dividing the number of training examples) before running the HP job.
# The number of training examples is passed to the second Task.
# The second Task is the actual training, into which the metrics collector sidecar is injected.
# Note that for this example the Tekton controller's nop image must be set to the StdOut metrics collector image.
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tekton-pipeline-run
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 2
  maxTrialCount: 4
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
  trialTemplate:
    retain: true
    primaryPodLabels:
      tekton.dev/pipelineTask: model-training
    primaryContainerName: step-model-training
    successCondition: status.conditions.#(type=="Succeeded")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Succeeded")#|#(status=="False")#
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
    trialSpec:
      apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      spec:
        params:
          - name: lr
            value: ${trialParameters.learningRate}
          - name: num-examples-init
            value: "60000"
        pipelineSpec:
          params:
            - name: lr
              description: Learning rate for the training model
            - name: num-examples-init
              description: Initial value for number of training examples
          tasks:
            - name: data-preprocessing
              params:
                - name: num-examples-pre
                  value: $(params.num-examples-init)
              taskSpec:
                params:
                  - name: num-examples-pre
                    description: Number of training examples before optimization
                results:
                  - name: num-examples-post
                    description: Number of training examples after optimization
                steps:
                  - name: num-examples-optimize
                    image: python:alpine3.6
                    command:
                      - sh
                      - -c
                    args:
                      - python3 -c "import random; print($(params.num-examples-pre)//random.randint(10,100),end='')" | tee $(results.num-examples-post.path)
            - name: model-training
              params:
                - name: lr
                  value: $(params.lr)
                - name: num-examples
                  value: $(tasks.data-preprocessing.results.num-examples-post)
              taskSpec:
                params:
                  - name: lr
                    description: Learning rate for the training model
                  - name: num-examples
                    description: Number of training examples
                steps:
                  - name: model-training
                    image: docker.io/kubeflowkatib/mxnet-mnist
                    command:
                      - "python3"
                      - "/opt/mxnet-mnist/mnist.py"
                      - "--num-examples=$(params.num-examples)"
                      - "--lr=$(params.lr)"
1 change: 1 addition & 0 deletions manifests/v1beta1/katib-controller/katib-controller.yaml
@@ -29,6 +29,7 @@ spec:
        - "--trial-resources=TFJob.v1.kubeflow.org"
        - "--trial-resources=PyTorchJob.v1.kubeflow.org"
        - "--trial-resources=MPIJob.v1.kubeflow.org"
        - "--trial-resources=PipelineRun.v1beta1.tekton.dev"
        ports:
        - containerPort: 8443
          name: webhook
6 changes: 6 additions & 0 deletions manifests/v1beta1/katib-controller/rbac.yaml
@@ -73,6 +73,12 @@ rules:
  - mpijobs
  verbs:
  - "*"
- apiGroups:
  - tekton.dev
  resources:
  - pipelineruns
  verbs:
  - "*"
---
apiVersion: v1
kind: ServiceAccount