
Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow #42

Merged
merged 126 commits into kubeflow:master
Apr 6, 2018

Conversation

elsonrodriguez
Contributor

@elsonrodriguez elsonrodriguez commented Mar 10, 2018

This is intended as a way to guide people from existing patterns into training on Kubernetes.

S3 is being used as a data store due to its ubiquity.

We're using the Kubeflow ksonnet code where we can, and intend to swap out more of the templates in argo as we modify ksonnet prototypes to support S3.
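As a rough sketch of the S3 wiring involved (the Secret and key names below are illustrative, not necessarily the ones this example ends up using), credentials live in a Kubernetes Secret and are surfaced to the training/serving containers as environment variables:

```yaml
# Hypothetical Secret holding the S3 credentials; names and keys are
# placeholders for illustration only.
apiVersion: v1
kind: Secret
metadata:
  name: aws-creds
type: Opaque
stringData:
  awsAccessKeyID: <your-access-key-id>
  awsSecretAccessKey: <your-secret-access-key>
```

The workflow's pods would then reference these keys via `secretKeyRef` env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), plus an `AWS_REGION`/`S3_ENDPOINT` pair for TensorFlow's S3 filesystem.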



nanliu and others added 30 commits February 13, 2018 13:07
Add awscli tools container.
* Add kvc deployment to workflow.

* Switch aws repo.

* wip.

* Add working tfflow job.
- Use correct images for worker and ps
- Use correct aws keys
- Change volumemanager to mnist
- Comment unused steps
- Fix volume mount to correct containers
* Adds fixes to initial serving step
@elsonrodriguez
Contributor Author

Review status: 0 of 21 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Is this needed? If our Argo component isn't correct, then I'd like to see an issue indicating the bug.
My suggestion would be to delete this for now. If there is in fact an issue with Argo, then we can open up an issue and figure out how to resolve it.

Yeah, it's needed so Argo can create tfjobs and services directly. However, I'm not sure I can classify this as an issue with the upstream Argo component, or that we'd want Argo to be able to create services/tfjobs by default.

Perhaps I can open an issue to make the cluster role user-configurable, but I don't know how clean the UX for that will be given the complexity of ClusterRoles.

This may be an appropriate solution given the context.
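For concreteness, a minimal sketch of the kind of extra rules under discussion (the ClusterRole name is illustrative, and it assumes the TFJob CRD is registered under the kubeflow.org API group):

```yaml
# Hypothetical additions to Argo's ClusterRole so the workflow controller can
# create TFJobs and Services directly; names and verbs are illustrative only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-extra-permissions
rules:
- apiGroups: ["kubeflow.org"]   # assumes this is the TFJob CRD's API group
  resources: ["tfjobs"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "list", "delete"]
```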


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why use the Argo CLI? Why not just use kubectl? If you just use kubectl then it's one less tool users have to install.

Argo is needed to submit the workflow; submitting the workflow directly to the k8s API is bad news. I did end up removing the installation links for the aws and minio clients since we're no longer showing how to upload data to S3. This should reduce the users' load a bit.


mnist-s3/README.md, line 240 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

FYI: when #297 is submitted, the recommended approach would be to launch TB and then connect to it via Ambassador.

Yeah, I plan to do a follow-up PR once I add S3 support to the tfjob ks prototype, so I can remove both the training yaml and the tensorboard yaml from the workflow.
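For reference, a rough sketch of a TensorBoard container pointed at an S3 log directory, roughly the role the tensorboard yaml plays in the workflow today (image, bucket path, and Secret name are placeholders, and it assumes a TensorFlow build with the S3 filesystem compiled in):

```yaml
# Hypothetical TensorBoard container reading event files from S3; everything
# here (image, bucket, Secret) is illustrative rather than taken from this PR.
containers:
- name: tensorboard
  image: tensorflow/tensorflow:1.7.0   # assumes S3 filesystem support is built in
  command: ["tensorboard", "--logdir=s3://mybucket/mnist/logs", "--port=6006"]
  env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef: {name: aws-creds, key: awsAccessKeyID}
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef: {name: aws-creds, key: awsSecretAccessKey}
  - name: AWS_REGION
    value: us-west-2
  - name: S3_ENDPOINT
    value: s3.us-west-2.amazonaws.com
  ports:
  - containerPort: 6006
```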



@jlewi
Contributor

jlewi commented Apr 3, 2018

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Yeah, it's needed so Argo can create tfjobs and services directly. However, I'm not sure I can classify this as an issue with the upstream Argo component, or that we'd want Argo to be able to create services/tfjobs by default.

Perhaps I can open an issue to make the cluster role user-configurable, but I don't know how clean the UX for that will be given the complexity of ClusterRoles.

This may be an appropriate solution given the context.

I meant fixing it here
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

In our ksonnet component for deploying Argo.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Argo is needed to submit the workflow; submitting the workflow directly to the k8s API is bad news. I did end up removing the installation links for the aws and minio clients since we're no longer showing how to upload data to S3. This should reduce the users' load a bit.

Why do you need the Argo CLI? Is this because you are using it for parameter substitution? Creating the resource via the K8s APIs/kubectl needs to work because it's just a CRD. All of our E2E test infrastructure uses the K8s APIs; we don't use the CLI.



@jlewi
Contributor

jlewi commented Apr 3, 2018

I think this is almost ready. Main feedback is that if we need to fix the Argo role, we should do it in our ksonnet component
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

and not as part of the sample.

@elsonrodriguez
Contributor Author

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I meant fixing it here
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

In our ksonnet component for deploying Argo.

Yeah if you're ok with expanding the default permissions to include all the objects needed for this demo, I can do that.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why do you need the Argo CLI? Is this because you are using it for parameter substitution? Creating the resource via the K8s APIs/kubectl needs to work because it's just a CRD. All of our E2E test infrastructure uses the K8s APIs; we don't use the CLI.

Whenever I try to submit this directly to the k8s API, the Argo UI and CLI bomb, and the workflow never completes. I might be using features/syntax in this Argo workflow that the E2E tests are not:

```
$ argo list
2018/04/03 08:47:17 v1alpha1.WorkflowList: Items: []v1alpha1.Workflow: v1alpha1.Workflow: Spec: v1alpha1.WorkflowSpec: Arguments: v1alpha1.Arguments: Parameters: []v1alpha1.Parameter: v1alpha1.Parameter: v1alpha1.Parameter: Value: ReadString: expects " or n, but found 1, error found in #10 byte of ...|,"value":1},{"name":|..., bigger context ...|ents":{"parameters":[{"name":"tf-worker","value":1},{"name":"tf-ps","value":2},{"name":"tf-model-ima|...
```
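For anyone hitting the same thing: that ReadString error is the decoder complaining that v1alpha1.Parameter.Value is a string while the manifest carries unquoted numbers. A hedged sketch of the fix, with the parameter names taken from the error output and everything else illustrative:

```yaml
# Hypothetical Workflow snippet: quoting the numeric parameter values keeps
# them as strings, which is what Argo's v1alpha1.Parameter expects, so the
# object can also be created straight through the K8s API / kubectl.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mnist-e2e-   # illustrative name
spec:
  entrypoint: main           # illustrative entrypoint
  arguments:
    parameters:
    - name: tf-worker
      value: "1"             # `value: 1` (unquoted) triggers the ReadString error
    - name: tf-ps
      value: "2"
```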


@jlewi
Contributor

jlewi commented Apr 3, 2018

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Yeah if you're ok with expanding the default permissions to include all the objects needed for this demo, I can do that.

Yes.



elsonrodriguez added a commit to elsonrodriguez/kubeflow that referenced this pull request Apr 4, 2018
If Argo's going to be used in ML workflows, it may need extra permissions.

These are just to get the mnist example over the hump:

kubeflow/examples#42 (comment)
@jlewi
Contributor

jlewi commented Apr 6, 2018

Review status: 0 of 21 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Whenever I try to submit this directly to the k8s API, the Argo UI and CLI bomb, and the workflow never completes. I might be using features/syntax in this Argo workflow that the E2E tests are not:

```
$ argo list
2018/04/03 08:47:17 v1alpha1.WorkflowList: Items: []v1alpha1.Workflow: v1alpha1.Workflow: Spec: v1alpha1.WorkflowSpec: Arguments: v1alpha1.Arguments: Parameters: []v1alpha1.Parameter: v1alpha1.Parameter: v1alpha1.Parameter: Value: ReadString: expects " or n, but found 1, error found in #10 byte of ...|,"value":1},{"name":|..., bigger context ...|ents":{"parameters":[{"name":"tf-worker","value":1},{"name":"tf-ps","value":2},{"name":"tf-model-ima|...
```

Maybe it's Argo's parameter substitution? Anyway, this is fine.



@jlewi
Contributor

jlewi commented Apr 6, 2018

Looks good except for the lint issues.

@elsonrodriguez
Contributor Author

The pylintrc in the repo is reporting 100% clean for mnist_client.py, but the CI is saying it failed.

Debugging.

@jlewi
Contributor

jlewi commented Apr 6, 2018

Woo Hoo!

@jlewi
Contributor

jlewi commented Apr 6, 2018

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1be7ccb into kubeflow:master Apr 6, 2018