Make the mnist example work with latest kubeflow #316

chenzhiwei · 2018-11-07T10:06:09Z

A better look of the changes: https://github.com/chenzhiwei/kubeflow-examples/tree/mnist-model/mnist

Update the ks version to fix the un-standard k8s version issue ksonnet + openshift picking up wrong k8s version number kubeflow#521
Update the tensorflow version to v1.11.0
Add minio as s3 alternative storage backend
Pre-build kubeflow tf-serving code to ksonnect image(frequently fetch from github will cause github api limit issue)

This change is

k8s-ci-robot · 2018-11-07T10:06:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: lluunn

If they are not already assigned, you can assign the PR to them by writing /assign @lluunn in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2018-11-07T10:06:23Z

Hi @chenzhiwei. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jlewi · 2018-11-07T22:05:39Z

Thanks for doing this please? Can you update the PR description to provide some information about the changes needed? This will make it easier to review the PR.

jlewi · 2018-11-07T22:06:15Z

mnist/README.md

@@ -1,170 +1,99 @@
-# Training MNIST using Kubeflow, S3, and Argo.
+# Training MNIST using Kubeflow, S3(Minio), and Argo.


Why is S3 qualified using Minio? Does it not use S3 directly?

in this example, I changed the s3 to minio since Minio is more easier for beginners and it does not depends on AWS account anymore.

jlewi · 2018-11-07T22:07:13Z

@elsonrodriguez Since you created the initial PR do you want to review this?

mnist/README.md

jlewi · 2018-11-07T22:11:18Z

mnist/README.md


 ```
-PODNAME=$(kubectl get pod -l app=tensorboard-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}')
-kubectl port-forward ${PODNAME} 6006:6006
+apt install python-pip python-setuptools --no-install-recommends


Why do you need to install things on the client?

because the cli and webapp depend on them.

Can you explain? Isn't the web app running in cluster? So if the user connects to the web app they should be able to send predictions using the web app.

The only things they would need to install locally to connect to the web app would be kubectl so they can port-forward.

mnist/README.md

jlewi · 2018-11-07T22:16:10Z

mnist/dockerfile/Dockerfile.ksonnet

@@ -0,0 +1,18 @@
+# This container is for running ksonnet within Kubernetes


Where in the example does running ksonnet in cluster come into play?

in the model-train.yaml.

So this is so that we can have an Argo workflow that creates a ksonnet app to deploy the model?

yes, the model-train.yaml is a workflow, the ks is run inside the workflow.

mnist/dockerfile/README.md

mnist/mnist-webapp/app.py

jlewi · 2018-11-08T14:59:46Z

Why add MinIO to the example? If the goal is to provide an alternative to S3/GCS why not use a Kubernetes PV or PVC?

mnist/README.md

jlewi · 2018-11-08T15:01:11Z

mnist/README.md

-$ kubectl config set-context <context_name> --cluster=<cluster_name> --user=<username> --namespace=<namespace>
-$ kubectl config use-context <context_name>
+git clone -b v0.3.2 https://github.com/kubeflow/kubeflow
+./kubeflow/scripts/kfctl.sh init kf-app --platform none


Why repeat these instructions as opposed to just referring to the getting started guide?

getting started guide always use the latest kubeflow, and may not compatible with this instruction a few versions later. so this is an alternative way to make this instruction work in future.

Our docs are versioned so you can link to a specific version of the docs;
For example here's the link to the 0.3. docs.
https://v0-3.kubeflow.org/docs/started/getting-started/

jlewi · 2018-11-08T15:01:33Z

mnist/README.md

-```
-pip install -r requirements.txt
-```
+Minio provides an S3 compatible API, we will use Minio in this example to act as S3.


Per other comment; if we want an alternative to S3 why not just use a PV?

Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.

How does Minio lead to better support for multiple mnist models?

Isn't the point of Minio to allow people to run an Object store on prem?

mnist/README.md

jlewi · 2018-11-08T15:07:59Z

mnist/README.md


-### Argo UI
+## 9.Submit serving workflow[optional]


What is the serving workflow? And why does model serving need to be disabled?

this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.

mnist/mnist-webapp/README.md

elsonrodriguez · 2018-11-08T17:58:55Z

mnist/README.md

@@ -6,165 +6,96 @@ This example guides you through the process of taking an example model, modifyin

 Before we get started there a few requirements.

-### Kubernetes Cluster Environment
+## 1.Kubernetes Cluster Environment


The numerical formatting makes the document difficult to read. It now looks like every step in the guide is a prerequisite.

this numerical format is very helpful for beginners, they just need to follow the steps one by one and finally get a good result.

Any suggestions on this? @jlewi

Can we just make installing the preqruisites step #1?

Also can you add a space between the "." and the text.

elsonrodriguez · 2018-11-08T18:00:39Z

mnist/README.md

-gcloud alpha container clusters create ${USER} --enable-kubernetes-alpha --machine-type=n1-standard-8 --num-nodes=3 --disk-size=200 --zone=us-west1-a --cluster-version=1.9.3-gke.0 --image-type=UBUNTU
-```
-
-If using Azure, the following will provision a cluster with the required features, [using the az cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest):


This section is compact enough that I feel there's some justification to keep it. Even though it repeats in essence some of the info in the official linked guide, it's more to the point for the purposes of the example, and it also includes Azure instructions, which are currently missing from the official getting started guide.

Hi, as @jlewi suggested, put things into right place. I'd like to put the azure instruction in kubeflow getting started document.

elsonrodriguez · 2018-11-08T18:05:16Z

mnist/README.md


-To run the client at the end of the example, you must have [requirements.txt](requirements.txt) intalled in your active python environment.
+## 4.Setup Minio


I feel that recommending minio (while it is an excellent option) as the default setup is not setting the right tone for pragmatic engineering. Unless there are specific requirements, or your environment has no S3-compatible object store, there isn't a need for Minio. It just becomes one more moving piece in an already complicated example.

for most beginners, they don't have anything, such as AWS account and other S3 compatible storage. It is very helpful for them to use Minio.

For skilled person, they already know how to tweak this to use their own solution.

The guide should be simple enough and easily for non-skilled person who can just follow the steps to complete this example.

+1 to @elsonrodriguez to not recommending minio for beginners. That seems like a distraction. If users don't have a Cloud account they are probably running on minikube/microk8s/docker. So I think the best option is to just use a PV even if that means running non-distributed.

elsonrodriguez · 2018-11-08T18:07:01Z

mnist/README.md

 ```

-## Modifying existing examples


It actually took a bit of work to get the upstream mnist example to work decently. There's educational value in highlighting some of the shortcomings of the canned examples floating around in regards to configurability and scalability.

elsonrodriguez · 2018-11-08T18:08:27Z

mnist/README.md


 ```
-DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here


Removing this step makes the origin of the model container opaque, as stated by Jeremy in one of his reviews.

There is already a dockerfile directory inside this example, and the steps are located there.

Later when I move the dockerfile to the location @jlewi suggested, I will update it to reference the link.

elsonrodriguez · 2018-11-08T18:09:45Z

mnist/README.md

-export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs
-export AWS_ACCESS_KEY_ID=xxxxx
-export AWS_SECRET_ACCESS_KEY=xxxxx
+export S3_ENDPOINT=192.168.100.155:9000


This makes a lot of assumptions about the network configuration.

in line 78, there is a description about this.

elsonrodriguez · 2018-11-08T18:15:29Z

mnist/model-train.yaml

@@ -36,10 +36,6 @@ spec:
      value: true
    - name: model-serving-servicetype
      value: ClusterIP
-    - name: model-serving-ks-url


This pins the version so that the examples continue to work despite any changes in master. Otherwise the example can break depending on code changes.

I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.

Build-in the code(54KB) to docker image can also helps in offline deploy mode.

elsonrodriguez · 2018-11-08T18:30:04Z

@chenzhiwei It looks like a lot of testing/validation went into creating this PR, so I hate to say that it needs to be reworked.

The scope is way beyond intend purpose (updating the example so it works with newer versions of kubeflow), and some of the changes are breaking (hard-coded IPs, unpinning kubeflow versions, removal of azure instructions) or questionable (defaulting to minio, adding a web client)

I think a new PR should be opened with just the version fixes in place, and the other suggestions (like the web client, or flow changes) should be tackled separately.

@jlewi Regarding the Argo bits, the intention originally was to demonstrate what an automated ML pipeline might look like. However while I think there's value in that, it is also pretty steep for a newcomer to this project to have to grok Argo before doing some mnist. I'd endorse an attempt at taking the argo bits out and letting the user do everything manually, while also preserving the argo bits somewhere to give automation folks something to work with.

dsdinter · 2018-11-08T22:16:34Z

Thanks again so much for taking the time to cleanup and fix this example!

A high level question for you; what do you think we should highlight and try to show in this example?

Should the goal of this example be to have a simple example that shows how to train and deploy a model using Kubeflow?

Should we keep the Argo bits? The Argo workflow right now is creating a ksonnet app and adding TFServing to it. What do you think of this pattern? I think the more common case would be people would have a ksonnet application checked in somewhere and then just apply the components in that app; rather than creating an app at run time.

I think the pattern we'd probably recommend is using GitOps
https://www.kubeflow.org/docs/guides/gitops-for-kubeflow/

If we remove the Argo bits; that might simplify the example and make it more suited as a getting started example for users.

@dsdinter WDYT? You are basically working on a similar example for PyTorch: #274 ?

/cc @texasmichelle

Thanks, I agree about keeping the examples without the Argo workflows to keep them simple (At least with less moving parts).
For my example I created a ksonnet application and the components, then removed my test environment to avoid keeping my own residual tests.
We can then have another folder that keeps all Argo workflows about the examples consolidated, which can then be used to run our E2E tests? Pretty sure we can reuse the workflows hence suggesting a common folder, that will enforce some consistency about workflows across all examples.

chenzhiwei

@elsonrodriguez @jlewi thank you for reviewing this PR, I will update it according to your comments.

mnist/README.md

chenzhiwei · 2018-11-09T02:39:35Z

mnist/README.md

-export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs
-export AWS_ACCESS_KEY_ID=xxxxx
-export AWS_SECRET_ACCESS_KEY=xxxxx
+export S3_ENDPOINT=192.168.100.155:9000


in line 78, there is a description about this.

chenzhiwei · 2018-11-09T02:41:08Z

mnist/README.md


-### Argo UI
+## 9.Submit serving workflow[optional]


this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.

mnist/README.md

chenzhiwei · 2018-11-09T02:45:45Z

mnist/dockerfile/Dockerfile.ksonnet

@@ -0,0 +1,18 @@
+# This container is for running ksonnet within Kubernetes


yes, the model-train.yaml is a workflow, the ks is run inside the workflow.

mnist/dockerfile/README.md

chenzhiwei · 2018-11-09T02:51:00Z

mnist/dockerfile/ksonnet-entrypoint.sh

+#!/bin/bash
+
+#FIXME this is set to tf-user by default in order to work around lack of Serviceaccount setting in argo.
+SERVICE_ACCOUNT=${SERVICE_ACCOUNT:-tf-user}


actually we don't need this. the env var is set in pod spec.

and this line does not use export, the actually command will not recognize this. I leave this unchanged in this PR.

mnist/mnist-webapp/README.md

mnist/model-train.yaml

chenzhiwei · 2018-11-09T02:57:20Z

mnist/model-train.yaml

@@ -36,10 +36,6 @@ spec:
      value: true
    - name: model-serving-servicetype
      value: ClusterIP
-    - name: model-serving-ks-url


I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.

Build-in the code(54KB) to docker image can also helps in offline deploy mode.

chenzhiwei

Submit new patchset to remove the webapp(will create separate PR for this) and addressed the comments.

mnist/README.md

jlewi · 2018-11-14T21:40:39Z

Sorry for the slow reply I'm at Kubecon this week so I haven't had much time. Will try to review this as soon as i can.

@dsdinter Any interest in looking at this in the meantime?

dsdinter

Reviewable status: 0 of 10 files reviewed, 23 unresolved discussions (waiting on @jlewi, @chenzhiwei, @elsonrodriguez, @cwbeitel, and @texasmichelle)

mnist/README.md, line 37 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.

Why not using PV compatible types rather object storage if you want to be more cloud agnostic?
You can then provide Ksonnet component to configure local single node with HostPath, NFS if on-prem K8S cluster and with whatever cloud provider you prefer (Filestore for GCP, EBS for AWS, etc...).
You could use the Cloud Filestore instructions as baseline.

mnist/README.md, line 42 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

Running in the cluster need PV to persist the data, usually in demo environment there is no out-of-box PV.

This guide is for newbies, I want to ensure they can follow the steps one by one to run this example. For skilled person, they already know how to tweak the steps to use their own solution, such as they may you Minio helm chart to deploy their own Minio.

As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.

mnist/README.md, line 81 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

in line 78, there is a description about this.

As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.

jlewi · 2018-11-15T10:56:55Z

@chenzhiwei I had a chance to go through all the comments again.

I'd really like to simplify the mnist example and make it suitable as a quick start for new users. For those reasons; I'd like to do the following

Find a simpler alternative to telling folks to run minio
Remove the Argo bits
- As @elsonrodriguez this makes things more complicated.
- Also now that we have kubeflow/pipelines I think we should consider using that rather than using Argo directly.

@chenzhiwei that is probably more work than you signed up for :) and I don't want to push all that onto you.

So you let me know what changes you want to make and we can scope the PR to those changes? Looking at the initial PR description; I'd be ok with a PR for items 1,2, and/or 4.

The Minio addition is the only major change I'm really unsure of. If you really want to keep the MinIO piece; can we make it optional? Maybe move it to an Appendix and just list it as an option for folks who don't have an object store already?

Thank you again so much for your patience.

1. Update the ks version to fix the un-standard k8s version issue 2. Update the tensorflow version to v1.11.0 3. Add minio as s3 alternative storage backend 4. Pre-build kubeflow tf-serving code to ksonnect image

chenzhiwei · 2018-11-20T07:52:41Z

Actually I want to create a new mnist example at first, but I don't want make duplicated work, so I chose a harder way which is creating this PR.

Do you agree to create a new mnist example with the new versions and remove the argo(I agree that argo makes things complex)?

jlewi · 2018-11-20T20:38:47Z

I'd like a single mnist example. Whether we get there by "refactoring" the existing sample; or deleting the current one and creating a new one; doesn't really matter to me.

I support the idea of removing Argo.

texasmichelle · 2019-06-14T22:58:28Z

There's a codelab that uses the existing example, but as long as that gets updated simultaneously I'm fine with replacing. I agree that a single mnist example makes the most sense.

@chenzhiwei could you either update or close this PR? Breaking it down into a number of smaller issues/PRs is always an option.

chenzhiwei · 2019-06-15T01:07:08Z

sorry, I have no time to work on this now.

k8s-ci-robot requested review from cwbeitel and jlewi November 7, 2018 10:06

k8s-ci-robot added the size/XL label Nov 7, 2018

k8s-ci-robot added the needs-ok-to-test label Nov 7, 2018