
Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow #42

Merged
merged 126 commits into kubeflow:master
Apr 6, 2018

Conversation

elsonrodriguez
Contributor

@elsonrodriguez elsonrodriguez commented Mar 10, 2018

This is intended as a way to guide people from existing patterns into training on Kubernetes.

S3 is being used as a data store due to its ubiquity.

We're using the Kubeflow ksonnet code where we can, and intend to swap out more of the templates in argo as we modify ksonnet prototypes to support S3.
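As a rough sketch of the S3 wiring involved (the Secret and key names below are illustrative, not necessarily the ones this example ends up using), credentials live in a Kubernetes Secret and are surfaced to the training/serving containers as environment variables:

```yaml
# Hypothetical Secret holding the S3 credentials; names and keys are
# placeholders for illustration only.
apiVersion: v1
kind: Secret
metadata:
  name: aws-creds
type: Opaque
stringData:
  awsAccessKeyID: <your-access-key-id>
  awsSecretAccessKey: <your-secret-access-key>
```

The workflow's pods would then reference these keys via `secretKeyRef` env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`), plus an `AWS_REGION`/`S3_ENDPOINT` pair for TensorFlow's S3 filesystem.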



nanliu and others added 30 commits February 13, 2018 13:07
Add awscli tools container.
* Add kvc deployment to workflow.

* Switch aws repo.

* wip.

* Add working tfflow job.
- Use correct images for worker and ps
- Use correct aws keys
- Change volumemanager to mnist
- Comment unused steps
- Fix volume mount to correct containers
* Adds fixes to initial serving step
@elsonrodriguez
Contributor Author

Review status: 0 of 21 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Is this needed? If our Argo component isn't correct, then I'd like to see an issue indicating the bug.
My suggestion would be to delete this for now. If there is in fact an issue with Argo, then we can open up an issue and figure out how to resolve it.

Yeah, it's needed so Argo can create tfjobs and services directly. However, I'm not sure I can classify this as an issue with the upstream Argo component, or that we'd want Argo to be able to create services/tfjobs by default.

Perhaps I can open an issue to make the cluster role user-configurable, but I don't know how clean the UX for that will be given the complexity of ClusterRoles.

This may be an appropriate solution given the context.
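For concreteness, a minimal sketch of the kind of extra rules under discussion (the ClusterRole name is illustrative, and it assumes the TFJob CRD is registered under the kubeflow.org API group):

```yaml
# Hypothetical additions to Argo's ClusterRole so the workflow controller can
# create TFJobs and Services directly; names and verbs are illustrative only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-extra-permissions
rules:
- apiGroups: ["kubeflow.org"]   # assumes this is the TFJob CRD's API group
  resources: ["tfjobs"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["create", "get", "list", "delete"]
```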


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why use the Argo CLI? Why not just use kubectl? If you just use kubectl then it's one less tool users have to install.

Argo is needed to submit the workflow; submitting the workflow directly to the k8s API is bad news. I did end up removing the installation links for the aws and minio clients since we're no longer showing how to upload data to S3. This should reduce the users' load a bit.


mnist-s3/README.md, line 240 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

FYI: when #297 is submitted, the recommended approach would be to launch TB and then connect to it via Ambassador.

Yeah, I plan to do a follow-up PR once I add S3 support to the tfjob ks prototype, so I can remove both the training yaml and the tensorboard yaml from the workflow.
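For reference, a rough sketch of a TensorBoard container pointed at an S3 log directory, roughly the role the tensorboard yaml plays in the workflow today (image, bucket path, and Secret name are placeholders, and it assumes a TensorFlow build with the S3 filesystem compiled in):

```yaml
# Hypothetical TensorBoard container reading event files from S3; everything
# here (image, bucket, Secret) is illustrative rather than taken from this PR.
containers:
- name: tensorboard
  image: tensorflow/tensorflow:1.7.0   # assumes S3 filesystem support is built in
  command: ["tensorboard", "--logdir=s3://mybucket/mnist/logs", "--port=6006"]
  env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef: {name: aws-creds, key: awsAccessKeyID}
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef: {name: aws-creds, key: awsSecretAccessKey}
  - name: AWS_REGION
    value: us-west-2
  - name: S3_ENDPOINT
    value: s3.us-west-2.amazonaws.com
  ports:
  - containerPort: 6006
```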



@jlewi
Contributor

jlewi commented Apr 3, 2018

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Yeah, it's needed so Argo can create tfjobs and services directly. However, I'm not sure I can classify this as an issue with the upstream Argo component, or that we'd want Argo to be able to create services/tfjobs by default.

Perhaps I can open an issue to make the cluster role user-configurable, but I don't know how clean the UX for that will be given the complexity of ClusterRoles.

This may be an appropriate solution given the context.

I meant fixing it here
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

In our ksonnet component for deploying Argo.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Argo is needed to submit the workflow; submitting the workflow directly to the k8s API is bad news. I did end up removing the installation links for the aws and minio clients since we're no longer showing how to upload data to S3. This should reduce the users' load a bit.

Why do you need the Argo CLI? Is this because you are using it for parameter substitution? Creating the resource via the K8s APIs/kubectl needs to work because it's just a CRD. All of our E2E test infrastructure uses the K8s APIs; we don't use the CLI.



@jlewi
Contributor

jlewi commented Apr 3, 2018

I think this is almost ready. Main feedback is that if we need to fix the Argo role, we should do it in our ksonnet component
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

and not as part of the sample.

@elsonrodriguez
Contributor Author

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I meant fixing it here
https://github.com/kubeflow/kubeflow/blob/master/kubeflow/argo/argo.libsonnet#L248

In our ksonnet component for deploying Argo.

Yeah if you're ok with expanding the default permissions to include all the objects needed for this demo, I can do that.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Why do you need the Argo CLI? Is this because you are using it for parameter substitution? Creating the resource via the K8s APIs/kubectl needs to work because it's just a CRD. All of our E2E test infrastructure uses the K8s APIs; we don't use the CLI.

Whenever I try to submit this directly to the k8s API, the Argo UI and CLI bomb, and the workflow never completes. I might be using features/syntax in this Argo workflow that the E2E tests are not:

```
$ argo list
2018/04/03 08:47:17 v1alpha1.WorkflowList: Items: []v1alpha1.Workflow: v1alpha1.Workflow: Spec: v1alpha1.WorkflowSpec: Arguments: v1alpha1.Arguments: Parameters: []v1alpha1.Parameter: v1alpha1.Parameter: v1alpha1.Parameter: Value: ReadString: expects " or n, but found 1, error found in #10 byte of ...|,"value":1},{"name":|..., bigger context ...|ents":{"parameters":[{"name":"tf-worker","value":1},{"name":"tf-ps","value":2},{"name":"tf-model-ima|...
```
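For anyone hitting the same thing: that ReadString error is the decoder complaining that v1alpha1.Parameter.Value is a string while the manifest carries unquoted numbers. A hedged sketch of the fix, with the parameter names taken from the error output and everything else illustrative:

```yaml
# Hypothetical Workflow snippet: quoting the numeric parameter values keeps
# them as strings, which is what Argo's v1alpha1.Parameter expects, so the
# object can also be created straight through the K8s API / kubectl.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: mnist-e2e-   # illustrative name
spec:
  entrypoint: main           # illustrative entrypoint
  arguments:
    parameters:
    - name: tf-worker
      value: "1"             # `value: 1` (unquoted) triggers the ReadString error
    - name: tf-ps
      value: "2"
```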


@jlewi
Contributor

jlewi commented Apr 3, 2018

Review status: 0 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


e2e/argo-cluster-role.yaml, line 1 at r8 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Yeah if you're ok with expanding the default permissions to include all the objects needed for this demo, I can do that.

Yes.



elsonrodriguez added a commit to elsonrodriguez/kubeflow that referenced this pull request Apr 4, 2018
If Argo's going to be used in ML workflows, it may need extra permissions.

These are just to get the mnist example over the hump:

kubeflow/examples#42 (comment)
@jlewi
Contributor

jlewi commented Apr 6, 2018

Review status: 0 of 21 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


mnist-s3/README.md, line 199 at r12 (raw file):

Previously, elsonrodriguez (Elson Rodriguez) wrote…

Whenever I try to submit this directly to the k8s API, the Argo UI and CLI bomb, and the workflow never completes. I might be using features/syntax in this Argo workflow that the E2E tests are not:

```
$ argo list
2018/04/03 08:47:17 v1alpha1.WorkflowList: Items: []v1alpha1.Workflow: v1alpha1.Workflow: Spec: v1alpha1.WorkflowSpec: Arguments: v1alpha1.Arguments: Parameters: []v1alpha1.Parameter: v1alpha1.Parameter: v1alpha1.Parameter: Value: ReadString: expects " or n, but found 1, error found in #10 byte of ...|,"value":1},{"name":|..., bigger context ...|ents":{"parameters":[{"name":"tf-worker","value":1},{"name":"tf-ps","value":2},{"name":"tf-model-ima|...
```

Maybe it's Argo's parameter substitution? Anyway, this is fine.



@jlewi
Contributor

jlewi commented Apr 6, 2018

Looks good except for the lint issues.

@elsonrodriguez
Contributor Author

The pylintrc in the repo is reporting 100% clean for mnist_client.py, but the CI is saying it failed.

Debugging.

@jlewi
Contributor

jlewi commented Apr 6, 2018

Woo Hoo!

@jlewi
Contributor

jlewi commented Apr 6, 2018

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1be7ccb into kubeflow:master Apr 6, 2018