Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow (#42)

* Add awscli tools container.

* Add initial readme.

* Add argo skeleton.

* Run an argo job.

* Artifact support and argo test

* Use built container (#3)

* Fix artifacts and secrets

* Add work in progress tfflow (#14)

* Add kvc deployment to workflow.

* Switch aws repo.

* wip.

* Add working tfflow job.

* Add sidecar that waits for MASTER completion

* Pass in job-name

* Add volumemanager info step

* Add input parameters to step

* Adds nodeaffinity and hostpath

* Add fixes for workflow (#17)

- Use correct images for worker and ps
- Use correct aws keys
- Change volumemanager to mnist
- Comment unused steps
- Fix volume mount to correct containers

* Fix hostpath for tfjob

* Download all mnist files

* Added GCS-stored artifacts compatibility to Argo

* Add initial inference workflow. (#30)

* Initial serving step (#31)

* Adds fixes to initial serving step

* Ready for rough demo: Workflow in working state

* Move conflicting readme.

* Initial commit, everything boots without crashing.

* Working, with some python errors.

* Adding explicit flags

* Working with ins-outs

* Letting training job exit on success

* Adding documentation skeleton

* trying to properly save model

* Almost working

* Working

* Adding export script, refactored to make the model more reusable

* Starting documentation

* little further on docs

* More doc updates, fixing sleep logic

* adding urls for mnist data

* Removing download logic; it's too tied in with built-in tf examples.

* Added argo workflow instructions, minor cleanups.

* Adding mnist client.

* Fixing typos

* Adding instructions for installing components.

* Added ksonnet container

* Adding new entrypoint.

* Added helm install instructions for kvc

* doing things with variables

* Typos.

* Added better namespace support

* S3 refactor.

* Added missing region variables.

* Adding tensorboard support.

* Adding container for Tensorboard.

* Added temporary flag, added install instructions for CLI.

* Removing invalid ksonnet environment.

* Updating readme

* Cleanup currently unused pieces

* Add missing cluster-role

* Minor cleanup.

* Adding more parameters.

* added changes to allow model to train on multiple workers and fixed some doc typos

* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.

* Adding separate deployer workflow.

* Split serving working.

* Adding split workflow.

* More parameters.

* Updates per Elson's comments

* Revert "added changes to allow model to train on multiple workers and fixed s…"

* Initial working pure-s3 workflow.

* Removed wait sidecars.

* Remove unused flag.

* Added part two, minor doc fixes

* Inverted links...

* Adding diff.

* Fix url syntax

* Documentation updates.

* Added AWS Cli

* Parameterized export.

* Fixing image in s3 version.

* Fixed documentation issues.

* KVC snippet changes, need to find last working helm chart.

* Temporarily pinning kvc version.

* Working master model and some doc typo fixes (#13)

* added changes to allow model to train on multiple workers and fixed some doc typos

* Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step.

* Adding separate deployer workflow.

* Split serving working.

* Adding split workflow.

* More parameters.

* Updates per Elson's comments

* working master model and some doc typos

* Fixes per Elson's comments

* Removing whitespace differences

* updating diff

* Changing parameters.

* Undoing whitespace.

* Changing termination policy on s3 version due to unknown issue.

* Updating mnist diff.

* Changing train steps.

* Syncing Demo changes.

* Update README.md

* Going S3-native for initial example. Getting rid of Master.

* Minor documentation tweaks, adding params, swapping aws cli for minio.

* Updating KVC version.

* Switching ksonnet repo, removing model name from client.

* Updating git url.

* Adding certificate hack to avoid RBAC errors.

* Pinning KVC to commit while working on PR.

* Updating version.

* Updates README with additional details (#14)

* Updates README with additional details

* Adding clarity to kubectl config commands

* Fixed comma placement

* Refactoring notes for github and kubernetes credentials.

* Forgot to add an overview of the argo template.

* Updating example based on feedback.

- Removed superfluous images
- Clarified use of KVC
- Added unaltered model
- Variable cleanup

* Refactored grpc image into generic base image.

* minor cleanup of resubmitting section.

* Switching Argo deployment to ksonnet, consolidating install instructions.

* Removing old cruft, clarifying cluster requirements.

* [WIP] Switching out model (#15)

* Switching to new mnist example.

* Parameterized model, testing export.

* Got CNN model exporting.

* Attempting to do distributed training with Estimator, removed separate export.

* Adding master back, otherwise Estimator complains about not having a chief.

* Switching to tf.estimator.train_and_evaluate.

* Minor path/var name refactor.

* Adding test data and new client.

* Fixed documentation to reflect new client.

* Getting rid of tf job shim.

* Removing KVC from example, renaming directory

* Modifying parent README

* Removed reference to export.

* Adding reference to export.

* Removing unused Dockerfile.

* Removing unneeded files, simplifying how to get status, refactor model serving workflow step.

* Renaming directory

* Minor doc improvements, removed extra clis.

* Making SSL configurable for clusters without secured s3 endpoints.

* Added a tf-user account for workflow. Fixed serving bug.

* Updating gke version.

* Re-ran through instructions, fixed errata.

* Fixing lint issues

* Pylint errors

* Pylint errors

* Adding parenthesis back.

* pylint Hacks

* Disabling argument filter, model bombs without empty arg.

* Removing unneeded lambdas
elsonrodriguez authored and k8s-ci-robot committed Apr 6, 2018
1 parent 063c9a5 commit 1be7ccb
Showing 21 changed files with 1,261 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -1,3 +1,4 @@
## A repository to host extended examples and tutorials for kubeflow.

1. [Github issue summarization using sequence-to-sequence learning](./github_issue_summarization) by [Hamel Husain](https://github.com/hamelsmu)
1. [MNIST example using S3 for Training, Serving, and Tensorboard monitoring. Automated using Argo and Kubeflow](./mnist) by [Elson Rodriguez](https://github.com/elsonrodriguez)
31 changes: 31 additions & 0 deletions mnist/Dockerfile.ksonnet
@@ -0,0 +1,31 @@
# This container is for running ksonnet within Kubernetes
FROM ubuntu:16.04

ENV KUBECTL_VERSION v1.9.2
ENV KSONNET_VERSION 0.8.0

RUN apt-get update
RUN apt-get -y install curl
#RUN apk add --update ca-certificates openssl && update-ca-certificates

RUN curl -O -L https://github.com/ksonnet/ksonnet/releases/download/v${KSONNET_VERSION}/ks_${KSONNET_VERSION}_linux_amd64.tar.gz
RUN tar -zxvf ks_${KSONNET_VERSION}_linux_amd64.tar.gz -C /usr/bin/ --strip-components=1 ks_${KSONNET_VERSION}_linux_amd64/ks
RUN chmod +x /usr/bin/ks

RUN curl -L https://storage.googleapis.com/kubernetes-release/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/bin/kubectl
RUN chmod +x /usr/bin/kubectl

# ksonnet doesn't work without a kubeconfig; the following just adds a utility to generate a kubeconfig from a service account.
ADD https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/cb265de1d4c4dc4ad0f15f4aaaf5b936dcf639a5/create-kubeconfig /usr/bin/
ADD https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/cb265de1d4c4dc4ad0f15f4aaaf5b936dcf639a5/LICENSE.txt /usr/bin/create-kubeconfig.LICENSE
RUN chmod +x /usr/bin/create-kubeconfig

RUN kubectl config set-context default --cluster=default
RUN kubectl config use-context default

ENV USER root

ADD ksonnet-entrypoint.sh /
RUN chmod +x /ksonnet-entrypoint.sh

ENTRYPOINT ["/ksonnet-entrypoint.sh"]
8 changes: 8 additions & 0 deletions mnist/Dockerfile.model
@@ -0,0 +1,8 @@
#This container contains your model and any helper scripts specific to your model.
FROM tensorflow/tensorflow:1.5.1

ADD model.py /opt/model.py
RUN chmod +x /opt/model.py

ENTRYPOINT ["/usr/bin/python"]
CMD ["/opt/model.py"]
352 changes: 352 additions & 0 deletions mnist/README.md
@@ -0,0 +1,352 @@
# Training MNIST using Kubeflow, S3, and Argo.

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model. We will be using Argo to manage the workflow, Tensorflow's S3 support for saving model training info, Tensorboard to visualize the training, and Kubeflow to deploy the Tensorflow operator and serve the model.

## Prerequisites

Before we get started, there are a few requirements.

### Kubernetes Cluster Environment

Your cluster must:

- Be at least version 1.9
- Have access to an S3-compatible object store ([Amazon S3](https://aws.amazon.com/s3/), [Google Storage](https://cloud.google.com/storage/docs/interoperability), [Minio](https://www.minio.io/kubernetes.html))
- Contain 3 nodes of at least 8 cores and 16 GB of RAM.

If using GKE, the following will provision a cluster with the required features:

```
export CLOUDSDK_CONTAINER_USE_CLIENT_CERTIFICATE=True
gcloud alpha container clusters create ${USER} --enable-kubernetes-alpha --machine-type=n1-standard-8 --num-nodes=3 --disk-size=200 --zone=us-west1-a --cluster-version=1.9.3-gke.0 --image-type=UBUNTU
```

NOTE: You must be a Kubernetes admin to follow this guide. If you are not an admin, please contact your local cluster administrator for a client cert, or credentials to pass into the following commands:

```
$ kubectl config set-credentials <username> --username=<admin_username> --password=<admin_password>
$ kubectl config set-context <context_name> --cluster=<cluster_name> --user=<username> --namespace=<namespace>
$ kubectl config use-context <context_name>
```

### Local Setup

You also need the following command line tools:

- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
- [argo](https://github.com/argoproj/argo/blob/master/demo.md#1-download-argo)
- [ksonnet](https://ksonnet.io/#get-started)

To run the client at the end of the example, you must have the packages in [requirements.txt](requirements.txt) installed in your active python environment.

```
pip install -r requirements.txt
```

NOTE: These instructions rely on Github, and you may hit rate limits if you are behind a firewall shared by many Github users. Make sure you [generate and export this token](https://help.github.com/articles/creating-a-personal-access-token-for-the-command-line/):

```
export GITHUB_TOKEN=xxxxxxxx
```

## Modifying existing examples

Many examples online use models that are unconfigurable, or don't work well in distributed mode. We will modify one of these [examples](https://github.com/tensorflow/tensorflow/blob/9a24e8acfcd8c9046e1abaac9dbf5e146186f4c2/tensorflow/examples/learn/mnist.py) to be better suited for distributed training and model serving.

### Prepare model

There is a delta between existing distributed mnist examples and what's needed to run well as a TFJob.

Basically, we must:

1. Add options in order to make the model configurable.
1. Use `tf.estimator.train_and_evaluate` to enable model exporting and serving.
1. Define serving signatures for model serving.

The resulting model is [model.py](model.py).
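
If you're curious how these three changes fit together, here is a minimal, self-contained sketch of the pattern (not the actual [model.py](model.py)); the toy linear classifier and random input data are stand-ins for the real convolutional model:

```
import argparse
import numpy as np
import tensorflow as tf

parser = argparse.ArgumentParser()
# 1. Options make the model configurable from the workflow.
parser.add_argument("--tf-model-dir", type=str, default="/tmp/mnist-sketch")
parser.add_argument("--tf-train-steps", type=int, default=200)
args, _ = parser.parse_known_args()

def model_fn(features, labels, mode):
    # Toy linear classifier standing in for the real convolutional model.
    logits = tf.layers.dense(tf.reshape(features["x"], [-1, 784]), 10)
    if mode == tf.estimator.ModeKeys.PREDICT:
        outputs = {"classes": tf.argmax(logits, axis=1),
                   "predictions": tf.nn.softmax(logits)}
        return tf.estimator.EstimatorSpec(
            mode, predictions=outputs,
            export_outputs={"serving_default":
                            tf.estimator.export.PredictOutput(outputs)})
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Random stand-in data; the real model reads the MNIST dataset.
    images = np.random.rand(64, 28, 28).astype(np.float32)
    labels = np.random.randint(0, 10, size=64).astype(np.int32)
    return {"x": tf.constant(images)}, tf.constant(labels)

def serving_input_fn():
    # 3. Serving signature: TF Serving feeds raw 28x28 float images here.
    inputs = {"x": tf.placeholder(tf.float32, [None, 28, 28])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=args.tf_model_dir)
train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=args.tf_train_steps)
eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn,
    # Export a SavedModel for TF Serving once training finishes.
    exporters=[tf.estimator.FinalExporter("mnist", serving_input_fn)])

# 2. train_and_evaluate runs distributed training when TF_CONFIG is set
# (as it is under a TFJob) and plain local training otherwise.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

The `model_dir` can also be an `s3://` URL thanks to Tensorflow's S3 support, which is how the workflow persists checkpoints and the exported model.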

### Build and push model image.

With our code ready, we will now build/push the docker image.

```
DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here
docker build . --no-cache -f Dockerfile.model -t ${DOCKER_BASE_URL}/mytfmodel:1.0
docker push ${DOCKER_BASE_URL}/mytfmodel:1.0
```

## Preparing your Kubernetes Cluster

With our model image built and pushed, the cluster must now be prepared. We will be deploying the TF Operator and Argo to help manage our training job.

In the following instructions we will install the required components to a single namespace; we will assume the chosen namespace is `tfworkflow`:

### Deploying Tensorflow Operator and Argo.

We are using the Tensorflow operator to automate the deployment of our distributed model training, and Argo to create the overall training pipeline. The easiest way to install these components on your Kubernetes cluster is by using Kubeflow's ksonnet prototypes.

```
NAMESPACE=tfworkflow
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core@1a6fc9d0e19e456b784ba1c23c03ec47648819d0
ks pkg install kubeflow/argo@8d617d68b707d52a5906d38b235e04e540f2fcf7
# Deploy TF Operator and Argo
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks generate argo kubeflow-argo --name=kubeflow-argo --namespace=${NAMESPACE}
ks apply default -c kubeflow-core
ks apply default -c kubeflow-argo
# Switch context for the rest of the example
kubectl config set-context $(kubectl config current-context) --namespace=${NAMESPACE}
cd -
# Create a user for our workflow
kubectl apply -f tf-user.yaml
```

You can check to make sure the components have deployed:

```
$ kubectl get pods -l name=tf-job-operator
NAME READY STATUS RESTARTS AGE
tf-job-operator-78757955b-2glvj 1/1 Running 0 1m
$ kubectl get pods -l app=workflow-controller
NAME READY STATUS RESTARTS AGE
workflow-controller-7d8f4bc5df-4zltg 1/1 Running 0 1m
$ kubectl get crd
NAME AGE
tfjobs.kubeflow.org 1m
workflows.argoproj.io 1m
$ argo list
NAME STATUS AGE DURATION
```

### Creating secrets for our workflow
For fetching and uploading data, our workflow requires S3 credentials. These credentials will be provided as Kubernetes secrets:

```
export S3_ENDPOINT=s3.us-west-2.amazonaws.com
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT}
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx
export BUCKET_NAME=mybucket
kubectl create secret generic aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} \
--from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
```
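
These same variable names matter to Tensorflow itself: its S3 filesystem is configured entirely through the environment, which is why the workflow injects the secret values into the training pods. A small sketch of that behavior (the bucket and path below are illustrative):

```
# Tensorflow's S3 filesystem reads its configuration from the environment.
import os
import tensorflow as tf

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxx"      # injected from aws-creds
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxx"  # injected from aws-creds
os.environ["S3_ENDPOINT"] = "s3.us-west-2.amazonaws.com"
os.environ["AWS_REGION"] = "us-west-2"

# Once set, s3:// paths work anywhere a file path does, including
# an Estimator's model_dir.
print(tf.gfile.Exists("s3://mybucket/models"))
```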

## Defining your training workflow

This is the bulk of the work; let's walk through what is needed:

1. Train the model
1. Export the model
1. Serve the model

Now let's look at how this is represented in our [example workflow](model-train.yaml).

The Argo workflow can be daunting, but our steps above basically expand as follows:

1. `get-workflow-info`: Generate and set variables for consumption in the rest of the pipeline.
1. `tensorboard`: Tensorboard is spawned, configured to watch the S3 URL for the training output.
1. `train-model`: A TFJob is spawned, taking in variables such as the number of workers, the dataset path, and the model container image. The model is exported at the end.
1. `serve-model`: Optionally, the model is served.

With our workflow defined, we can now execute it.

## Submitting your training workflow

First we need to set a few variables in our workflow. Make sure to set your docker registry or remove the `IMAGE` parameters in order to use our defaults:

```
DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here
export S3_DATA_URL=s3://${BUCKET_NAME}/data/mnist/
export S3_TRAIN_BASE_URL=s3://${BUCKET_NAME}/models
export AWS_REGION=us-west-2
export JOB_NAME=myjob-$(uuidgen | cut -c -5 | tr '[:upper:]' '[:lower:]')
export TF_MODEL_IMAGE=${DOCKER_BASE_URL}/mytfmodel:1.0
export TF_WORKER=3
export MODEL_TRAIN_STEPS=200
```

Next, submit your workflow.

```
argo submit model-train.yaml -n ${NAMESPACE} --serviceaccount tf-user \
-p aws-endpoint-url=${AWS_ENDPOINT_URL} \
-p s3-endpoint=${S3_ENDPOINT} \
-p aws-region=${AWS_REGION} \
-p tf-model-image=${TF_MODEL_IMAGE} \
-p s3-data-url=${S3_DATA_URL} \
-p s3-train-base-url=${S3_TRAIN_BASE_URL} \
-p job-name=${JOB_NAME} \
-p tf-worker=${TF_WORKER} \
-p model-train-steps=${MODEL_TRAIN_STEPS} \
-p namespace=${NAMESPACE}
```

Your training workflow should now be executing.

You can verify and keep track of your workflow using the argo commands:
```
$ argo list
NAME STATUS AGE DURATION
tf-workflow-h7hwh Running 1h 1h
$ argo get tf-workflow-h7hwh
```

## Monitoring

There are various ways to monitor the workflow and training job. In addition to using `kubectl` to query the status of `pods`, some basic dashboards are also available.

### Argo UI

The Argo UI is useful for seeing what stage your workflow is in:

```
PODNAME=$(kubectl get pod -l app=argo-ui -n ${NAMESPACE} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 8001:8001
```

You should now be able to visit [http://127.0.0.1:8001](http://127.0.0.1:8001) to see the status of your workflows.

### Tensorboard

Tensorboard is deployed just before training starts. To connect:

```
PODNAME=$(kubectl get pod -l app=tensorboard-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006
```

Tensorboard can now be accessed at [http://127.0.0.1:6006](http://127.0.0.1:6006).

## Using Tensorflow serving

By default the workflow deploys our model via Tensorflow Serving. Included in this example is a client that can query your model and provide results:

```
POD_NAME=$(kubectl get pod -l=app=mnist-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${POD_NAME} 9000:9000 &
TF_MNIST_IMAGE_PATH=data/7.png python mnist_client.py
```

This should result in output similar to this, depending on how well your model was trained:
```
outputs {
key: "classes"
value {
dtype: DT_UINT8
tensor_shape {
dim {
size: 1
}
}
int_val: 7
}
}
outputs {
key: "predictions"
value {
dtype: DT_FLOAT
tensor_shape {
dim {
size: 1
}
dim {
size: 10
}
}
float_val: 0.0
float_val: 0.0
float_val: 0.0
float_val: 0.0
float_val: 0.0
float_val: 0.0
float_val: 0.0
float_val: 1.0
float_val: 0.0
float_val: 0.0
}
}
............................
............................
............................
............................
............................
............................
............................
..............@@@@@@........
..........@@@@@@@@@@........
........@@@@@@@@@@@@........
........@@@@@@@@.@@@........
........@@@@....@@@@........
................@@@@........
...............@@@@.........
...............@@@@.........
...............@@@..........
..............@@@@..........
..............@@@...........
.............@@@@...........
.............@@@............
............@@@@............
............@@@.............
............@@@.............
...........@@@..............
..........@@@@..............
..........@@@@..............
..........@@................
............................
Your model says the above number is... 7!
```

You can also omit `TF_MNIST_IMAGE_PATH`, and the client will pick a random number from the mnist test data. Run it repeatedly and see how your model fares!
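
For the curious, [mnist_client.py](mnist_client.py) is essentially a small gRPC client for Tensorflow Serving's `PredictionService`. A simplified sketch of the request it makes (the model name and input key are assumptions matching the serving signature, and the zero-filled image is a stand-in for a decoded PNG):

```
import numpy as np
from grpc.beta import implementations
from tensorflow.contrib.util import make_tensor_proto
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

# Connect to the port-forwarded serving pod.
channel = implementations.insecure_channel("127.0.0.1", 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "mnist"  # assumed to match the served model name
image = np.zeros([1, 28, 28], dtype=np.float32)  # stand-in for a real digit
request.inputs["x"].CopyFrom(make_tensor_proto(image))

result = stub.Predict(request, 10.0)  # 10-second timeout
print(result.outputs["classes"])
```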

### Disabling Serving

Model serving can be turned off by passing `-p model-serving=false` to the `model-train.yaml` workflow. If you then wish to serve your model after training, use the `model-deploy.yaml` workflow, simply passing in the name of the finished training workflow as an argument:

```
WORKFLOW=<the workflowname>
argo submit model-deploy.yaml -n ${NAMESPACE} -p workflow=${WORKFLOW} --serviceaccount=tf-user
```

## Submitting new argo jobs

If you want to rerun your workflow from scratch, then you will need to provide a new `job-name` to the argo workflow. For example:

```
#We're re-using previous variables except JOB_NAME
export JOB_NAME=myawesomejob
argo submit model-train.yaml -n ${NAMESPACE} --serviceaccount tf-user \
-p aws-endpoint-url=${AWS_ENDPOINT_URL} \
-p s3-endpoint=${S3_ENDPOINT} \
-p aws-region=${AWS_REGION} \
-p tf-model-image=${TF_MODEL_IMAGE} \
-p s3-data-url=${S3_DATA_URL} \
-p s3-train-base-url=${S3_TRAIN_BASE_URL} \
-p job-name=${JOB_NAME} \
-p tf-worker=${TF_WORKER} \
-p model-train-steps=${MODEL_TRAIN_STEPS} \
-p namespace=${NAMESPACE}
```

## Conclusion and Next Steps

This is an example of what your machine learning pipeline can look like. Feel free to play with the tunables and see if you can increase your model's accuracy (increasing `model-train-steps` can go a long way).
Binary file added mnist/data/0.png
Binary file added mnist/data/1.png
Binary file added mnist/data/2.png
Binary file added mnist/data/3.png
Binary file added mnist/data/4.png
Binary file added mnist/data/5.png
Binary file added mnist/data/6.png
Binary file added mnist/data/7.png
Binary file added mnist/data/8.png
Binary file added mnist/data/9.png