Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the mnist example work with latest kubeflow #316

Closed
wants to merge 2 commits into from

Conversation

chenzhiwei
Copy link

@chenzhiwei chenzhiwei commented Nov 7, 2018

A better look of the changes: https://github.com/chenzhiwei/kubeflow-examples/tree/mnist-model/mnist

  1. Update the ks version to fix the un-standard k8s version issue ksonnet + openshift picking up wrong k8s version number kubeflow#521
  2. Update the tensorflow version to v1.11.0
  3. Add minio as s3 alternative storage backend
  4. Pre-build kubeflow tf-serving code to ksonnect image(frequently fetch from github will cause github api limit issue)

This change is Reviewable

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: lluunn

If they are not already assigned, you can assign the PR to them by writing /assign @lluunn in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

Hi @chenzhiwei. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jlewi
Copy link
Contributor

jlewi commented Nov 7, 2018

Thanks for doing this please? Can you update the PR description to provide some information about the changes needed? This will make it easier to review the PR.

mnist/README.md Outdated
@@ -1,170 +1,99 @@
# Training MNIST using Kubeflow, S3, and Argo.
# Training MNIST using Kubeflow, S3(Minio), and Argo.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is S3 qualified using Minio? Does it not use S3 directly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this example, I changed the s3 to minio since Minio is more easier for beginners and it does not depends on AWS account anymore.

@jlewi
Copy link
Contributor

jlewi commented Nov 7, 2018

@elsonrodriguez Since you created the initial PR do you want to review this?

mnist/README.md Outdated Show resolved Hide resolved
mnist/README.md Outdated Show resolved Hide resolved
mnist/README.md Outdated Show resolved Hide resolved
mnist/README.md Show resolved Hide resolved
mnist/README.md Outdated

```
PODNAME=$(kubectl get pod -l app=tensorboard-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006
apt install python-pip python-setuptools --no-install-recommends
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to install things on the client?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the cli and webapp depend on them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain? Isn't the web app running in cluster? So if the user connects to the web app they should be able to send predictions using the web app.

The only things they would need to install locally to connect to the web app would be kubectl so they can port-forward.

mnist/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,18 @@
# This container is for running ksonnet within Kubernetes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where in the example does running ksonnet in cluster come into play?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the model-train.yaml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is so that we can have an Argo workflow that creates a ksonnet app to deploy the model?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the model-train.yaml is a workflow, the ks is run inside the workflow.

mnist/mnist-webapp/app.py Outdated Show resolved Hide resolved
mnist/mnist-webapp/app.py Outdated Show resolved Hide resolved
@chenzhiwei chenzhiwei force-pushed the mnist-model branch 3 times, most recently from be4437e to 64ba17e Compare November 8, 2018 11:47
@jlewi
Copy link
Contributor

jlewi commented Nov 8, 2018

Why add MinIO to the example? If the goal is to provide an alternative to S3/GCS why not use a Kubernetes PV or PVC?

mnist/README.md Show resolved Hide resolved
$ kubectl config set-context <context_name> --cluster=<cluster_name> --user=<username> --namespace=<namespace>
$ kubectl config use-context <context_name>
git clone -b v0.3.2 https://github.com/kubeflow/kubeflow
./kubeflow/scripts/kfctl.sh init kf-app --platform none
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why repeat these instructions as opposed to just referring to the getting started guide?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getting started guide always use the latest kubeflow, and may not compatible with this instruction a few versions later. so this is an alternative way to make this instruction work in future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our docs are versioned so you can link to a specific version of the docs;
For example here's the link to the 0.3. docs.
https://v0-3.kubeflow.org/docs/started/getting-started/

mnist/README.md Outdated
```
pip install -r requirements.txt
```
Minio provides an S3 compatible API, we will use Minio in this example to act as S3.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per other comment; if we want an alternative to S3 why not just use a PV?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does Minio lead to better support for multiple mnist models?

Isn't the point of Minio to allow people to run an Object store on prem?

mnist/README.md Show resolved Hide resolved
mnist/README.md Show resolved Hide resolved
mnist/README.md Outdated

### Argo UI
## 9.Submit serving workflow[optional]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the serving workflow? And why does model serving need to be disabled?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks.

mnist/mnist-webapp/README.md Outdated Show resolved Hide resolved
mnist/README.md Outdated
@@ -6,165 +6,96 @@ This example guides you through the process of taking an example model, modifyin

Before we get started there a few requirements.

### Kubernetes Cluster Environment
## 1.Kubernetes Cluster Environment
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numerical formatting makes the document difficult to read. It now looks like every step in the guide is a prerequisite.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this numerical format is very helpful for beginners, they just need to follow the steps one by one and finally get a good result.

Any suggestions on this? @jlewi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just make installing the preqruisites step #1?

Also can you add a space between the "." and the text.

gcloud alpha container clusters create ${USER} --enable-kubernetes-alpha --machine-type=n1-standard-8 --num-nodes=3 --disk-size=200 --zone=us-west1-a --cluster-version=1.9.3-gke.0 --image-type=UBUNTU
```

If using Azure, the following will provision a cluster with the required features, [using the az cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is compact enough that I feel there's some justification to keep it. Even though it repeats in essence some of the info in the official linked guide, it's more to the point for the purposes of the example, and it also includes Azure instructions, which are currently missing from the official getting started guide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, as @jlewi suggested, put things into right place. I'd like to put the azure instruction in kubeflow getting started document.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

mnist/README.md Outdated

To run the client at the end of the example, you must have [requirements.txt](requirements.txt) intalled in your active python environment.
## 4.Setup Minio
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that recommending minio (while it is an excellent option) as the default setup is not setting the right tone for pragmatic engineering. Unless there are specific requirements, or your environment has no S3-compatible object store, there isn't a need for Minio. It just becomes one more moving piece in an already complicated example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for most beginners, they don't have anything, such as AWS account and other S3 compatible storage. It is very helpful for them to use Minio.

For skilled person, they already know how to tweak this to use their own solution.

The guide should be simple enough and easily for non-skilled person who can just follow the steps to complete this example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @elsonrodriguez to not recommending minio for beginners. That seems like a distraction. If users don't have a Cloud account they are probably running on minikube/microk8s/docker. So I think the best option is to just use a PV even if that means running non-distributed.

```

## Modifying existing examples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually took a bit of work to get the upstream mnist example to work decently. There's educational value in highlighting some of the shortcomings of the canned examples floating around in regards to configurability and scalability.


```
DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this step makes the origin of the model container opaque, as stated by Jeremy in one of his reviews.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a dockerfile directory inside this example, and the steps are located there.

Later when I move the dockerfile to the location @jlewi suggested, I will update it to reference the link.

mnist/README.md Outdated
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx
export S3_ENDPOINT=192.168.100.155:9000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of assumptions about the network configuration.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in line 78, there is a description about this.

@@ -36,10 +36,6 @@ spec:
value: true
- name: model-serving-servicetype
value: ClusterIP
- name: model-serving-ks-url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pins the version so that the examples continue to work despite any changes in master. Otherwise the example can break depending on code changes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.

Build-in the code(54KB) to docker image can also helps in offline deploy mode.

@elsonrodriguez
Copy link
Contributor

@chenzhiwei It looks like a lot of testing/validation went into creating this PR, so I hate to say that it needs to be reworked.

The scope is way beyond intend purpose (updating the example so it works with newer versions of kubeflow), and some of the changes are breaking (hard-coded IPs, unpinning kubeflow versions, removal of azure instructions) or questionable (defaulting to minio, adding a web client)

I think a new PR should be opened with just the version fixes in place, and the other suggestions (like the web client, or flow changes) should be tackled separately.

@jlewi Regarding the Argo bits, the intention originally was to demonstrate what an automated ML pipeline might look like. However while I think there's value in that, it is also pretty steep for a newcomer to this project to have to grok Argo before doing some mnist. I'd endorse an attempt at taking the argo bits out and letting the user do everything manually, while also preserving the argo bits somewhere to give automation folks something to work with.

@dsdinter
Copy link
Contributor

dsdinter commented Nov 8, 2018

Thanks again so much for taking the time to cleanup and fix this example!

A high level question for you; what do you think we should highlight and try to show in this example?

Should the goal of this example be to have a simple example that shows how to train and deploy a model using Kubeflow?

Should we keep the Argo bits? The Argo workflow right now is creating a ksonnet app and adding TFServing to it. What do you think of this pattern? I think the more common case would be people would have a ksonnet application checked in somewhere and then just apply the components in that app; rather than creating an app at run time.

I think the pattern we'd probably recommend is using GitOps
https://www.kubeflow.org/docs/guides/gitops-for-kubeflow/

If we remove the Argo bits; that might simplify the example and make it more suited as a getting started example for users.

@dsdinter WDYT? You are basically working on a similar example for PyTorch: #274 ?

/cc @texasmichelle

Thanks, I agree about keeping the examples without the Argo workflows to keep them simple (At least with less moving parts).
For my example I created a ksonnet application and the components, then removed my test environment to avoid keeping my own residual tests.
We can then have another folder that keeps all Argo workflows about the examples consolidated, which can then be used to run our E2E tests? Pretty sure we can reuse the workflows hence suggesting a common folder, that will enforce some consistency about workflows across all examples.

Copy link
Author

@chenzhiwei chenzhiwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elsonrodriguez @jlewi thank you for reviewing this PR, I will update it according to your comments.

mnist/README.md Outdated Show resolved Hide resolved
mnist/README.md Outdated
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs
export AWS_ACCESS_KEY_ID=xxxxx
export AWS_SECRET_ACCESS_KEY=xxxxx
export S3_ENDPOINT=192.168.100.155:9000
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in line 78, there is a description about this.

mnist/README.md Outdated

### Argo UI
## 9.Submit serving workflow[optional]
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.

mnist/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,18 @@
# This container is for running ksonnet within Kubernetes
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the model-train.yaml is a workflow, the ks is run inside the workflow.

mnist/dockerfile/README.md Show resolved Hide resolved
#!/bin/bash

#FIXME this is set to tf-user by default in order to work around lack of Serviceaccount setting in argo.
SERVICE_ACCOUNT=${SERVICE_ACCOUNT:-tf-user}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we don't need this. the env var is set in pod spec.

and this line does not use export, the actually command will not recognize this. I leave this unchanged in this PR.

mnist/mnist-webapp/README.md Outdated Show resolved Hide resolved
mnist/model-train.yaml Show resolved Hide resolved
@@ -36,10 +36,6 @@ spec:
value: true
- name: model-serving-servicetype
value: ClusterIP
- name: model-serving-ks-url
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.

Build-in the code(54KB) to docker image can also helps in offline deploy mode.

Copy link
Author

@chenzhiwei chenzhiwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submit new patchset to remove the webapp(will create separate PR for this) and addressed the comments.

mnist/README.md Show resolved Hide resolved
mnist/README.md Outdated Show resolved Hide resolved
@jlewi
Copy link
Contributor

jlewi commented Nov 14, 2018

Sorry for the slow reply I'm at Kubecon this week so I haven't had much time. Will try to review this as soon as i can.

@dsdinter Any interest in looking at this in the meantime?

Copy link
Contributor

@dsdinter dsdinter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 23 unresolved discussions (waiting on @jlewi, @chenzhiwei, @elsonrodriguez, @cwbeitel, and @texasmichelle)


mnist/README.md, line 37 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.

Why not using PV compatible types rather object storage if you want to be more cloud agnostic?
You can then provide Ksonnet component to configure local single node with HostPath, NFS if on-prem K8S cluster and with whatever cloud provider you prefer (Filestore for GCP, EBS for AWS, etc...).
You could use the Cloud Filestore instructions as baseline.


mnist/README.md, line 42 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

Running in the cluster need PV to persist the data, usually in demo environment there is no out-of-box PV.

This guide is for newbies, I want to ensure they can follow the steps one by one to run this example. For skilled person, they already know how to tweak the steps to use their own solution, such as they may you Minio helm chart to deploy their own Minio.

As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.


mnist/README.md, line 81 at r2 (raw file):

Previously, chenzhiwei (Chen Zhiwei) wrote…

in line 78, there is a description about this.

As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.

@jlewi
Copy link
Contributor

jlewi commented Nov 15, 2018

@chenzhiwei I had a chance to go through all the comments again.

I'd really like to simplify the mnist example and make it suitable as a quick start for new users. For those reasons; I'd like to do the following

  1. Find a simpler alternative to telling folks to run minio
  2. Remove the Argo bits
    • As @elsonrodriguez this makes things more complicated.
    • Also now that we have kubeflow/pipelines I think we should consider using that rather than using Argo directly.

@chenzhiwei that is probably more work than you signed up for :) and I don't want to push all that onto you.

So you let me know what changes you want to make and we can scope the PR to those changes? Looking at the initial PR description; I'd be ok with a PR for items 1,2, and/or 4.

The Minio addition is the only major change I'm really unsure of. If you really want to keep the MinIO piece; can we make it optional? Maybe move it to an Appendix and just list it as an option for folks who don't have an object store already?

Thank you again so much for your patience.

1. Update the ks version to fix the un-standard k8s version issue
2. Update the tensorflow version to v1.11.0
3. Add minio as s3 alternative storage backend
4. Pre-build kubeflow tf-serving code to ksonnect image
@chenzhiwei chenzhiwei force-pushed the mnist-model branch 2 times, most recently from a1f0d3d to b3959c1 Compare November 20, 2018 07:43
@chenzhiwei
Copy link
Author

Actually I want to create a new mnist example at first, but I don't want make duplicated work, so I chose a harder way which is creating this PR.

Do you agree to create a new mnist example with the new versions and remove the argo(I agree that argo makes things complex)?

@jlewi
Copy link
Contributor

jlewi commented Nov 20, 2018

I'd like a single mnist example. Whether we get there by "refactoring" the existing sample; or deleting the current one and creating a new one; doesn't really matter to me.

I support the idea of removing Argo.

@texasmichelle
Copy link
Member

There's a codelab that uses the existing example, but as long as that gets updated simultaneously I'm fine with replacing. I agree that a single mnist example makes the most sense.

@chenzhiwei could you either update or close this PR? Breaking it down into a number of smaller issues/PRs is always an option.

@chenzhiwei
Copy link
Author

sorry, I have no time to work on this now.

@chenzhiwei chenzhiwei closed this Jun 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants