Fixes #2: End to end model training/serving example using S3, Argo, and Kubeflow #42

Merged
merged 126 commits into from
Apr 6, 2018
Changes from 104 commits
26a13a4
Add awscli tools container.
nanliu Feb 13, 2018
83cf370
Merge pull request #1 from nanliu/nan/awscli
elsonrodriguez Feb 13, 2018
68bdb62
Add initial readme.
nanliu Feb 9, 2018
c9c7fd9
Add argo skeleton.
nanliu Feb 9, 2018
135cfd8
Run a an argo job.
nanliu Feb 9, 2018
e3bb844
Use built container (#3)
jose5918 Feb 10, 2018
6305eca
Artifact support and argo test
jose5918 Feb 10, 2018
747488f
Fix artifacts and secrets
jose5918 Feb 10, 2018
2910012
Add work in progress tfflow (#14)
nanliu Feb 22, 2018
c795661
Add sidecar that waits for MASTER completion
jose5918 Feb 22, 2018
5c6fdce
Pass in job-name
jose5918 Feb 22, 2018
55cea99
Add volumemanager info step
jose5918 Feb 23, 2018
b81752a
Add input parameters to step
jose5918 Feb 23, 2018
d760478
Adds nodeaffinity and hostpath
jose5918 Feb 23, 2018
5b6f6b2
Add fixes for workflow (#17)
jose5918 Feb 26, 2018
1e6a77e
Fix hostpath for tfjob
jose5918 Feb 26, 2018
067f237
Download all mnist files
jose5918 Feb 27, 2018
c25d1e5
added GCS stored artifacts comptability to Argo
Feb 28, 2018
718441f
Add initial inference workflow. (#30)
nanliu Mar 1, 2018
767a45c
Initial serving step (#31)
jose5918 Mar 1, 2018
2bb262a
Ready for rough demo: Workflow in working state
jose5918 Mar 1, 2018
5df6c33
Move conflicting readme.
nanliu Mar 2, 2018
e044a87
Initial commit, everything boots without crashing.
elsonrodriguez Feb 13, 2018
c7763ea
Working, with some python errors.
elsonrodriguez Feb 13, 2018
c519b46
Adding explicit flags
elsonrodriguez Feb 13, 2018
a052a3d
Working with ins-outs
elsonrodriguez Feb 13, 2018
6f9ff59
Letting training job exit on success
elsonrodriguez Feb 14, 2018
1d69ec0
Adding documentation skeletion
elsonrodriguez Feb 14, 2018
cc5d3ec
trying to properly save model
elsonrodriguez Feb 16, 2018
3bc0c65
Almost working
elsonrodriguez Feb 20, 2018
7163997
Working
elsonrodriguez Feb 20, 2018
b1b2085
Adding export script, refactored to allow model more reusability
elsonrodriguez Feb 21, 2018
3c6f72a
Starting documentation
elsonrodriguez Feb 21, 2018
8a10f70
little further on docs
elsonrodriguez Feb 22, 2018
e734d3d
More doc updates, fixing sleep logic
elsonrodriguez Feb 22, 2018
9b10134
adding urls for mnist data
elsonrodriguez Feb 23, 2018
4bec712
Removing download logic, it's to tied in with build-in tf examples.
elsonrodriguez Feb 23, 2018
b9aee49
Added argo workflow instructions, minor cleanups.
elsonrodriguez Feb 23, 2018
af960ac
Adding mnist client.
elsonrodriguez Feb 23, 2018
c7b9a81
Fixing typos
elsonrodriguez Feb 24, 2018
bcb8c58
Adding instructions for installing components.
elsonrodriguez Feb 26, 2018
ea6bd32
Added ksonnet container
elsonrodriguez Feb 27, 2018
7a2f2b3
Adding new entrypoint.
elsonrodriguez Feb 27, 2018
536d64d
Added helm install instructions for kvc
elsonrodriguez Feb 27, 2018
b75b7fe
doing things with variables
elsonrodriguez Feb 27, 2018
46d811d
Typos.
elsonrodriguez Mar 2, 2018
0241f41
Added better namespace support
elsonrodriguez Mar 3, 2018
98c99ae
S3 refactor.
elsonrodriguez Mar 4, 2018
1487dad
Added missing region variables.
elsonrodriguez Mar 4, 2018
d3834f7
Adding tensorboard support.
elsonrodriguez Mar 5, 2018
ca13b34
Addding Container for Tensorboard.
elsonrodriguez Mar 5, 2018
ac7475a
Added temporary flag, added install instructions for CLI.
elsonrodriguez Mar 5, 2018
e05dbc4
Removing invalid ksonnet environment.
elsonrodriguez Mar 5, 2018
91353cd
Updating readme
jose5918 Mar 6, 2018
de2d2ac
Merge pull request #2 from elsonrodriguez/jose/doc-updates
elsonrodriguez Mar 6, 2018
10a50f0
Cleanup currently unused pieces
jose5918 Mar 6, 2018
d33cdc8
Add missint cluster-role
jose5918 Mar 6, 2018
f263a0f
Merge pull request #3 from elsonrodriguez/jose/cleanup
elsonrodriguez Mar 6, 2018
1d41e1a
Minor cleanup.
elsonrodriguez Mar 7, 2018
9998b3f
Adding more parameters.
elsonrodriguez Mar 7, 2018
bc22f94
added changes to allow model to train on multiple workers and fixed s…
Mar 7, 2018
8f1d8a1
Adding flag to enable/disable model serving. Adding s3 urls as output…
elsonrodriguez Mar 8, 2018
799363e
Merge pull request #8 from elsonrodriguez/split-poc
elsonrodriguez Mar 8, 2018
72296f0
Adding seperate deployer workflow.
elsonrodriguez Mar 8, 2018
f89f554
Split serving working.
elsonrodriguez Mar 8, 2018
9e65e44
Adding split workflow.
elsonrodriguez Mar 8, 2018
5b45e1a
Merge pull request #9 from elsonrodriguez/split-poc
elsonrodriguez Mar 8, 2018
66f7bd5
More parameters.
elsonrodriguez Mar 8, 2018
987b1e1
updates as to elson comments
Mar 8, 2018
22625e8
Merge branch 'e2e' of https://github.com/elsonrodriguez/examples into…
Mar 8, 2018
783e370
Merge pull request #7 from elsonrodriguez/model_changes
elsonrodriguez Mar 8, 2018
74c60f0
Revert "added changes to allow model to train on multiple workers and…
elsonrodriguez Mar 8, 2018
72254a3
Merge pull request #10 from elsonrodriguez/revert-7-model_changes
elsonrodriguez Mar 8, 2018
4a3da68
Initial working pure-s3 workflow.
elsonrodriguez Mar 9, 2018
6efdef2
Removed wait sidecars.
elsonrodriguez Mar 9, 2018
d41e090
Merge pull request #11 from elsonrodriguez/full-s3
elsonrodriguez Mar 9, 2018
22360e0
Remove unused flag.
elsonrodriguez Mar 9, 2018
522e75f
Added part two, minor doc fixes
elsonrodriguez Mar 9, 2018
f0e05d3
Inverted links...
elsonrodriguez Mar 9, 2018
c0de628
Adding diff.
elsonrodriguez Mar 9, 2018
b12ab37
Fix url syntax
elsonrodriguez Mar 9, 2018
065921f
Documentation updates.
elsonrodriguez Mar 9, 2018
bd232a1
Added AWS Cli
elsonrodriguez Mar 9, 2018
4fc2e11
Parameterized export.
elsonrodriguez Mar 9, 2018
609eea3
Fixing image in s3 version.
elsonrodriguez Mar 9, 2018
3c4770c
Fixed documentation issues.
elsonrodriguez Mar 9, 2018
c2e87bf
KVC snippet changes, need to find last working helm chart.
elsonrodriguez Mar 10, 2018
5662edb
Temporarily pinning kvc version.
elsonrodriguez Mar 10, 2018
76075d3
working master model and some doc typos fixes (#13)
raddaoui Mar 11, 2018
d466411
Syncing Demo changes.
elsonrodriguez Mar 12, 2018
fb169ed
Update README.md
aronchick Mar 13, 2018
2854dc5
Going S3-native for initial example. Getting rid of Master.
elsonrodriguez Mar 14, 2018
18d467f
Merge branch 'e2e' of github.com:elsonrodriguez/examples into e2e
elsonrodriguez Mar 14, 2018
de4aede
Minor documentation tweaks, adding params, swapping aws cli for minio.
elsonrodriguez Mar 15, 2018
e5989e1
Updating KVC version.
elsonrodriguez Mar 16, 2018
7cf5508
Switching ksonnet repo, removing model name from client.
elsonrodriguez Mar 16, 2018
13d26a9
Updating git url.
elsonrodriguez Mar 16, 2018
1bc5611
Adding certificate hack to avoid RBAC errors.
elsonrodriguez Mar 16, 2018
8900d92
Pinning KVC to commit while working on PR.
elsonrodriguez Mar 16, 2018
8be1415
Updating version.
elsonrodriguez Mar 17, 2018
60ee9f2
Updates README with additional details (#14)
mhbuehler Mar 17, 2018
2f8a32c
Merge branch 'e2e' of github.com:elsonrodriguez/examples into e2e
elsonrodriguez Mar 17, 2018
914835d
Refactoring notes for github and kubernetes credentials.
elsonrodriguez Mar 17, 2018
0f7e9e5
Forgot to add an overview of the argo template.
elsonrodriguez Mar 19, 2018
d0c3608
Updating example based on feedback.
elsonrodriguez Mar 20, 2018
7596396
Refactored grpc image into generic base image.
elsonrodriguez Mar 21, 2018
8388749
minor cleanup of resubmitting section.
elsonrodriguez Mar 21, 2018
32a3596
Switching Argo deployment to ksonnet, conslidating install instructions.
elsonrodriguez Mar 22, 2018
8650e2a
Removing old cruft, clarifying cluster requirements.
elsonrodriguez Mar 23, 2018
137d2d7
[WIP] Switching out model (#15)
elsonrodriguez Mar 31, 2018
3b60c7d
Merge remote-tracking branch 'upstream/master' into e2e
elsonrodriguez Mar 31, 2018
1225714
Removing unused Dockerfile.
elsonrodriguez Mar 31, 2018
9f14541
Removing uneeded files, simplifying how to get status, refactor model…
elsonrodriguez Apr 2, 2018
d357846
Renaming directory
elsonrodriguez Apr 2, 2018
460f494
Minor doc improvements, removed extra clis.
elsonrodriguez Apr 2, 2018
b462ae2
Making SSL configurable for clusters without secured s3 endpoints.
elsonrodriguez Apr 4, 2018
7498396
Added a tf-user account for workflow. Fixed serving bug.
elsonrodriguez Apr 6, 2018
7b312bf
Updating gke version.
elsonrodriguez Apr 6, 2018
9773c6c
Re-ran through instructions, fixed errata.
elsonrodriguez Apr 6, 2018
63cd9d5
Fixing lint issues
elsonrodriguez Apr 6, 2018
4d3909f
Pylint errors
elsonrodriguez Apr 6, 2018
7c5bc37
Pylint errors
elsonrodriguez Apr 6, 2018
2589394
Adding parenthesis back.
elsonrodriguez Apr 6, 2018
b1e9ad7
pylint Hacks
elsonrodriguez Apr 6, 2018
50c94a2
Disabling argument filter, model bombs without empty arg.
elsonrodriguez Apr 6, 2018
3465825
Removing unneeded lambdas
elsonrodriguez Apr 6, 2018
30 changes: 30 additions & 0 deletions e2e/Dockerfile.ksonnet
@@ -0,0 +1,30 @@
FROM ubuntu:16.04
Contributor

What is this Dockerfile for? Is it for bootstrapping?
Could you add a comment explaining that?

Contributor Author

It's to run ksonnet in a container in the workflow. I'll add a comment.


ENV KUBECTL_VERSION v1.9.2
ENV KSONNET_VERSION 0.8.0

RUN apt-get update
RUN apt-get -y install curl
#RUN apk add --update ca-certificates openssl && update-ca-certificates

RUN curl -O -L https://github.com/ksonnet/ksonnet/releases/download/v${KSONNET_VERSION}/ks_${KSONNET_VERSION}_linux_amd64.tar.gz
RUN tar -zxvf ks_${KSONNET_VERSION}_linux_amd64.tar.gz -C /usr/bin/ --strip-components=1 ks_${KSONNET_VERSION}_linux_amd64/ks
RUN chmod +x /usr/bin/ks

RUN curl -L https://storage.googleapis.com/kubernetes-release/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/bin/kubectl
RUN chmod +x /usr/bin/kubectl

#ksonnet doesn't work without a kubeconfig, the following is just to add a utility to generate a kubeconfig from a service account.
ADD https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/cb265de1d4c4dc4ad0f15f4aaaf5b936dcf639a5/create-kubeconfig /usr/bin/
ADD https://raw.githubusercontent.com/zlabjp/kubernetes-scripts/cb265de1d4c4dc4ad0f15f4aaaf5b936dcf639a5/LICENSE.txt /usr/bin/create-kubeconfig.LICENSE
RUN chmod +x /usr/bin/create-kubeconfig

RUN kubectl config set-context default --cluster=default
RUN kubectl config use-context default

ENV USER root

ADD ksonnet-entrypoint.sh /
RUN chmod +x /ksonnet-entrypoint.sh

ENTRYPOINT ["/ksonnet-entrypoint.sh"]
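
The comment in this Dockerfile notes that ksonnet needs a kubeconfig, and that the image ships a helper to generate one from a service account. As a rough illustration of that idea (not the `create-kubeconfig` script itself, whose internals are not shown in this PR), the sketch below builds a kubeconfig from the credentials Kubernetes mounts into a pod. The function names are hypothetical; the mount path and the `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` variables are the standard in-cluster ones.

```python
import json

# Standard location where Kubernetes mounts service account credentials in a pod.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def build_kubeconfig(server, token, ca_path, namespace="default", name="default"):
    """Return a kubeconfig as a plain dict (hypothetical helper, for illustration)."""
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [
            {"name": name,
             "cluster": {"server": server, "certificate-authority": ca_path}}
        ],
        "users": [{"name": name, "user": {"token": token}}],
        "contexts": [
            {"name": name,
             "context": {"cluster": name, "user": name, "namespace": namespace}}
        ],
        "current-context": name,
    }


def kubeconfig_from_service_account(environ, sa_dir=SA_DIR):
    """Read the mounted token and namespace and assemble a kubeconfig dict."""
    host = environ["KUBERNETES_SERVICE_HOST"]
    port = environ["KUBERNETES_SERVICE_PORT"]
    with open(f"{sa_dir}/token") as f:
        token = f.read().strip()
    with open(f"{sa_dir}/namespace") as f:
        namespace = f.read().strip()
    return build_kubeconfig(f"https://{host}:{port}", token,
                            f"{sa_dir}/ca.crt", namespace)


def dump_kubeconfig(config):
    """Serialize as JSON, which YAML parsers generally accept."""
    return json.dumps(config, indent=2)
```

Inside a pod, something like `dump_kubeconfig(kubeconfig_from_service_account(os.environ))` (after `import os`) could be written to `~/.kube/config` before invoking `ks`; whether the real entrypoint does this is an assumption.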
9 changes: 9 additions & 0 deletions e2e/Dockerfile.model
@@ -0,0 +1,9 @@
FROM elsonrodriguez/mytfserver:1.0
Contributor

What is this base image?
Long term it would be preferable not to depend on personal images.

Contributor Author

@elsonrodriguez Mar 20, 2018

It's a base image containing the shim for TF_CONFIG and the stock gRPC TF server. I ended up not using it in the example; however, it helps reduce build time and helps protect the model container from upstream changes (the official TF tags sometimes get overwritten).

I can make it an un-namespaced local image (and also add more comments).


ADD model.py /opt/model.py
ADD export.py /opt/export.py

RUN chmod +x /opt/model.py
RUN chmod +x /opt/export.py

CMD ["python", "/opt/model.py"]
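
For context on the "shim for TF_CONFIG" mentioned in the thread above: `TF_CONFIG` is a JSON document describing the cluster and the current task. The sketch below shows one way a process could read it; it is not the PR's `tf_job_shim.py`, and the function names, the `master` default, and the `mnist-*` host names are all assumptions made for illustration.

```python
import json


def parse_tf_config(raw):
    """Return (task_type, task_index, cluster) from a TF_CONFIG JSON string."""
    config = json.loads(raw or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    # Assumption: fall back to a single "master" task when TF_CONFIG is empty.
    task_type = task.get("type", "master")
    task_index = int(task.get("index", 0))
    return task_type, task_index, cluster


def own_address(raw):
    """Look up this task's host:port in the cluster spec, or None if absent."""
    task_type, task_index, cluster = parse_tf_config(raw)
    hosts = cluster.get(task_type, [])
    return hosts[task_index] if task_index < len(hosts) else None


# Hypothetical TF_CONFIG value for a parameter-server replica.
example = json.dumps({
    "cluster": {
        "master": ["mnist-master-0:2222"],
        "worker": ["mnist-worker-0:2222", "mnist-worker-1:2222"],
        "ps": ["mnist-ps-0:2222"],
    },
    "task": {"type": "ps", "index": 0},
})
```

In a container, the raw string would typically come from `os.environ.get("TF_CONFIG", "")`.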
6 changes: 6 additions & 0 deletions e2e/Dockerfile.tensorboard
@@ -0,0 +1,6 @@
FROM gcr.io/kubeflow/jupyterhub-k8s:1.0.1
Contributor

What do you need this for? TensorBoard should be installed in the standard TensorFlow Docker image. So if all you need is a Docker image with TensorBoard, can we just use the stock TensorFlow Docker image?

Contributor Author

@elsonrodriguez Mar 20, 2018

I couldn't get it to work with the stock TF image. It might be a TensorBoard 1.5 issue or the way the image is built, but I kept getting "No dashboards are active for the current data set." The combination of TF 1.5 and TensorBoard 1.6 works fine.

EDIT: Never mind, I don't know what I was seeing, but 1.5.1 looks fine. Will remove this image from the guide.


RUN pip install tensorboard==1.6.0 tensorflow==1.5.0


ENTRYPOINT ["/usr/local/bin/tensorboard", "--logdir"]
8 changes: 8 additions & 0 deletions e2e/Dockerfile.tfserver
@@ -0,0 +1,8 @@
#FROM tensorflow/tf_grpc_test_server:ccbc039fbe5a
Contributor

What's this for?

Contributor Author

The "1.5" image in TF was broken when I was writing the guide, and they didn't have any tag with 1.5 other than "latest". So I just made a note of the sha1.

There's a new 1.5.1 image that might work, I'll try that.

FROM tensorflow/tf_grpc_test_server:latest

Contributor

Do we need our own custom gRPC server for parameter servers? If the code uses the tf.estimator API and calls train_and_evaluate
https://www.tensorflow.org/api_docs/python/tf/estimator/train_and_evaluate

I think that will automatically start the PS as needed, based on TF_CONFIG.
See this line, which calls run_ps.

Contributor Author

Yeah, this is confusing, and I ended up not using the standard gRPC server. I think I'm going to strip out references to it other than as an initial base image for the model.

ADD tf_job_shim.py /tf_job_shim.py
RUN chmod +x /tf_job_shim.py

ENTRYPOINT ["/tf_job_shim.py"]
CMD ["python", "/var/tf-k8s/server/grpc_tensorflow_server.py"]
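
This Dockerfile pairs an exec-form `ENTRYPOINT` with an exec-form `CMD`, in which case Docker appends the `CMD` elements as arguments to the entrypoint. A wrapper entrypoint can therefore do some setup and then hand control to the command it was given. The sketch below illustrates only that general pattern; what `tf_job_shim.py` actually does is not shown in this PR, and the function names here are hypothetical.

```python
import os


def split_command(argv):
    """Return the wrapped command: everything after the shim's own path.

    With ENTRYPOINT ["/shim.py"] and CMD ["python", "server.py"], the shim is
    started with argv == ["/shim.py", "python", "server.py"].
    """
    if len(argv) < 2:
        raise ValueError("no command given after the entrypoint")
    return argv[1:]


def run(argv, environ):
    """Do any setup, then replace this process with the wrapped command."""
    command = split_command(argv)
    # A setup step would go here, for example reading TF_CONFIG from environ.
    # execvpe replaces the current process, so the command receives signals
    # directly instead of through a lingering parent.
    os.execvpe(command[0], command, dict(environ))
```

A script used as the entrypoint would end with `run(sys.argv, os.environ)` (after `import sys`); it is left out here so the functions can be imported without executing anything.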