-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the mnist example work with latest kubeflow #316
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: If they are not already assigned, you can assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @chenzhiwei. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thanks for doing this please? Can you update the PR description to provide some information about the changes needed? This will make it easier to review the PR. |
mnist/README.md
Outdated
@@ -1,170 +1,99 @@ | |||
# Training MNIST using Kubeflow, S3, and Argo. | |||
# Training MNIST using Kubeflow, S3(Minio), and Argo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is S3 qualified using Minio? Does it not use S3 directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this example, I changed the s3 to minio since Minio is more easier for beginners and it does not depends on AWS account anymore.
@elsonrodriguez Since you created the initial PR do you want to review this? |
mnist/README.md
Outdated
|
||
``` | ||
PODNAME=$(kubectl get pod -l app=tensorboard-${JOB_NAME} -o jsonpath='{.items[0].metadata.name}') | ||
kubectl port-forward ${PODNAME} 6006:6006 | ||
apt install python-pip python-setuptools --no-install-recommends |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to install things on the client?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because the cli and webapp depend on them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain? Isn't the web app running in cluster? So if the user connects to the web app they should be able to send predictions using the web app.
The only things they would need to install locally to connect to the web app would be kubectl so they can port-forward.
@@ -0,0 +1,18 @@ | |||
# This container is for running ksonnet within Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where in the example does running ksonnet in cluster come into play?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the model-train.yaml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is so that we can have an Argo workflow that creates a ksonnet app to deploy the model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the model-train.yaml is a workflow, the ks is run inside the workflow.
be4437e
to
64ba17e
Compare
Why add MinIO to the example? If the goal is to provide an alternative to S3/GCS why not use a Kubernetes PV or PVC? |
$ kubectl config set-context <context_name> --cluster=<cluster_name> --user=<username> --namespace=<namespace> | ||
$ kubectl config use-context <context_name> | ||
git clone -b v0.3.2 https://github.com/kubeflow/kubeflow | ||
./kubeflow/scripts/kfctl.sh init kf-app --platform none |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why repeat these instructions as opposed to just referring to the getting started guide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getting started guide always use the latest kubeflow, and may not compatible with this instruction a few versions later. so this is an alternative way to make this instruction work in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our docs are versioned so you can link to a specific version of the docs;
For example here's the link to the 0.3. docs.
https://v0-3.kubeflow.org/docs/started/getting-started/
mnist/README.md
Outdated
``` | ||
pip install -r requirements.txt | ||
``` | ||
Minio provides an S3 compatible API, we will use Minio in this example to act as S3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per other comment; if we want an alternative to S3 why not just use a PV?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does Minio lead to better support for multiple mnist models?
Isn't the point of Minio to allow people to run an Object store on prem?
mnist/README.md
Outdated
|
||
### Argo UI | ||
## 9.Submit serving workflow[optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the serving workflow? And why does model serving need to be disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks.
mnist/README.md
Outdated
@@ -6,165 +6,96 @@ This example guides you through the process of taking an example model, modifyin | |||
|
|||
Before we get started there a few requirements. | |||
|
|||
### Kubernetes Cluster Environment | |||
## 1.Kubernetes Cluster Environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numerical formatting makes the document difficult to read. It now looks like every step in the guide is a prerequisite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this numerical format is very helpful for beginners, they just need to follow the steps one by one and finally get a good result.
Any suggestions on this? @jlewi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just make installing the preqruisites step #1?
Also can you add a space between the "." and the text.
gcloud alpha container clusters create ${USER} --enable-kubernetes-alpha --machine-type=n1-standard-8 --num-nodes=3 --disk-size=200 --zone=us-west1-a --cluster-version=1.9.3-gke.0 --image-type=UBUNTU | ||
``` | ||
|
||
If using Azure, the following will provision a cluster with the required features, [using the az cli](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is compact enough that I feel there's some justification to keep it. Even though it repeats in essence some of the info in the official linked guide, it's more to the point for the purposes of the example, and it also includes Azure instructions, which are currently missing from the official getting started guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, as @jlewi suggested, put things into right place. I'd like to put the azure instruction in kubeflow getting started document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
mnist/README.md
Outdated
|
||
To run the client at the end of the example, you must have [requirements.txt](requirements.txt) intalled in your active python environment. | ||
## 4.Setup Minio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that recommending minio (while it is an excellent option) as the default setup is not setting the right tone for pragmatic engineering. Unless there are specific requirements, or your environment has no S3-compatible object store, there isn't a need for Minio. It just becomes one more moving piece in an already complicated example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for most beginners, they don't have anything, such as AWS account and other S3 compatible storage. It is very helpful for them to use Minio.
For skilled person, they already know how to tweak this to use their own solution.
The guide should be simple enough and easily for non-skilled person who can just follow the steps to complete this example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @elsonrodriguez to not recommending minio for beginners. That seems like a distraction. If users don't have a Cloud account they are probably running on minikube/microk8s/docker. So I think the best option is to just use a PV even if that means running non-distributed.
``` | ||
|
||
## Modifying existing examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually took a bit of work to get the upstream mnist example to work decently. There's educational value in highlighting some of the shortcomings of the canned examples floating around in regards to configurability and scalability.
|
||
``` | ||
DOCKER_BASE_URL=docker.io/elsonrodriguez # Put your docker registry here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this step makes the origin of the model container opaque, as stated by Jeremy in one of his reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a dockerfile directory inside this example, and the steps are located there.
Later when I move the dockerfile to the location @jlewi suggested, I will update it to reference the link.
mnist/README.md
Outdated
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs | ||
export AWS_ACCESS_KEY_ID=xxxxx | ||
export AWS_SECRET_ACCESS_KEY=xxxxx | ||
export S3_ENDPOINT=192.168.100.155:9000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a lot of assumptions about the network configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in line 78, there is a description about this.
@@ -36,10 +36,6 @@ spec: | |||
value: true | |||
- name: model-serving-servicetype | |||
value: ClusterIP | |||
- name: model-serving-ks-url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pins the version so that the examples continue to work despite any changes in master. Otherwise the example can break depending on code changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.
Build-in the code(54KB) to docker image can also helps in offline deploy mode.
@chenzhiwei It looks like a lot of testing/validation went into creating this PR, so I hate to say that it needs to be reworked. The scope is way beyond intend purpose (updating the example so it works with newer versions of kubeflow), and some of the changes are breaking (hard-coded IPs, unpinning kubeflow versions, removal of azure instructions) or questionable (defaulting to minio, adding a web client) I think a new PR should be opened with just the version fixes in place, and the other suggestions (like the web client, or flow changes) should be tackled separately. @jlewi Regarding the Argo bits, the intention originally was to demonstrate what an automated ML pipeline might look like. However while I think there's value in that, it is also pretty steep for a newcomer to this project to have to grok Argo before doing some mnist. I'd endorse an attempt at taking the argo bits out and letting the user do everything manually, while also preserving the argo bits somewhere to give automation folks something to work with. |
Thanks, I agree about keeping the examples without the Argo workflows to keep them simple (At least with less moving parts). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elsonrodriguez @jlewi thank you for reviewing this PR, I will update it according to your comments.
mnist/README.md
Outdated
export AWS_ENDPOINT_URL=https://${S3_ENDPOINT} #use http instead of https for default minio installs | ||
export AWS_ACCESS_KEY_ID=xxxxx | ||
export AWS_SECRET_ACCESS_KEY=xxxxx | ||
export S3_ENDPOINT=192.168.100.155:9000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in line 78, there is a description about this.
mnist/README.md
Outdated
|
||
### Argo UI | ||
## 9.Submit serving workflow[optional] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example already has this feature and I think it is reasonable to separate the train and serving model. I did not change it so there is not diff.
@@ -0,0 +1,18 @@ | |||
# This container is for running ksonnet within Kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, the model-train.yaml is a workflow, the ks is run inside the workflow.
#!/bin/bash | ||
|
||
#FIXME this is set to tf-user by default in order to work around lack of Serviceaccount setting in argo. | ||
SERVICE_ACCOUNT=${SERVICE_ACCOUNT:-tf-user} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we don't need this. the env var is set in pod spec.
and this line does not use export
, the actually command will not recognize this. I leave this unchanged in this PR.
@@ -36,10 +36,6 @@ spec: | |||
value: true | |||
- name: model-serving-servicetype | |||
value: ClusterIP | |||
- name: model-serving-ks-url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I build the example code to docker image, so when serving the model, it will not call github api to fetch code, this can solve the github api limit issue.
Build-in the code(54KB) to docker image can also helps in offline deploy mode.
64ba17e
to
99c2863
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submit new patchset to remove the webapp(will create separate PR for this) and addressed the comments.
Sorry for the slow reply I'm at Kubecon this week so I haven't had much time. Will try to review this as soon as i can. @dsdinter Any interest in looking at this in the meantime? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 10 files reviewed, 23 unresolved discussions (waiting on @jlewi, @chenzhiwei, @elsonrodriguez, @cwbeitel, and @texasmichelle)
mnist/README.md, line 37 at r2 (raw file):
Previously, chenzhiwei (Chen Zhiwei) wrote…
Firstly, the previous example is using S3. Secondly, creating PV seems more complex than using Minio especially when running multiple mnist models.
Why not using PV compatible types rather object storage if you want to be more cloud agnostic?
You can then provide Ksonnet component to configure local single node with HostPath, NFS if on-prem K8S cluster and with whatever cloud provider you prefer (Filestore for GCP, EBS for AWS, etc...).
You could use the Cloud Filestore instructions as baseline.
mnist/README.md, line 42 at r2 (raw file):
Previously, chenzhiwei (Chen Zhiwei) wrote…
Running in the cluster need PV to persist the data, usually in demo environment there is no out-of-box PV.
This guide is for newbies, I want to ensure they can follow the steps one by one to run this example. For skilled person, they already know how to tweak the steps to use their own solution, such as they may you Minio helm chart to deploy their own Minio.
As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.
mnist/README.md, line 81 at r2 (raw file):
Previously, chenzhiwei (Chen Zhiwei) wrote…
in line 78, there is a description about this.
As mentioned above I would rather use PV if running in a local single node with HostPath then suggesting how to use other supported PV types for other environments.
@chenzhiwei I had a chance to go through all the comments again. I'd really like to simplify the mnist example and make it suitable as a quick start for new users. For those reasons; I'd like to do the following
@chenzhiwei that is probably more work than you signed up for :) and I don't want to push all that onto you. So you let me know what changes you want to make and we can scope the PR to those changes? Looking at the initial PR description; I'd be ok with a PR for items 1,2, and/or 4. The Minio addition is the only major change I'm really unsure of. If you really want to keep the MinIO piece; can we make it optional? Maybe move it to an Appendix and just list it as an option for folks who don't have an object store already? Thank you again so much for your patience. |
818b7c5
to
c9e246c
Compare
1. Update the ks version to fix the un-standard k8s version issue 2. Update the tensorflow version to v1.11.0 3. Add minio as s3 alternative storage backend 4. Pre-build kubeflow tf-serving code to ksonnect image
a1f0d3d
to
b3959c1
Compare
b3959c1
to
a551d88
Compare
Actually I want to create a new mnist example at first, but I don't want make duplicated work, so I chose a harder way which is creating this PR. Do you agree to create a new mnist example with the new versions and remove the argo(I agree that argo makes things complex)? |
I'd like a single mnist example. Whether we get there by "refactoring" the existing sample; or deleting the current one and creating a new one; doesn't really matter to me. I support the idea of removing Argo. |
There's a codelab that uses the existing example, but as long as that gets updated simultaneously I'm fine with replacing. I agree that a single mnist example makes the most sense. @chenzhiwei could you either update or close this PR? Breaking it down into a number of smaller issues/PRs is always an option. |
sorry, I have no time to work on this now. |
A better look of the changes: https://github.com/chenzhiwei/kubeflow-examples/tree/mnist-model/mnist
This change is