Reinforcement learning example with TensorFlow Agents #30

cwbeitel · 2018-03-05T00:44:41Z

See also #1, https://github.com/cwbeitel/examples-old for previous PR

This change is

see also https://github.com/cwbeitel/examples-old for previous PR

cwbeitel · 2018-03-05T00:51:36Z

Assigning original reviewers:
/assign jlewi
/assign nkashy1
/assign danijar

k8s-ci-robot · 2018-03-05T00:51:37Z

@cwbeitel: GitHub didn't allow me to assign the following users: danijar.

Note that only kubeflow members and repo collaborators can be assigned.

In response to this:

Assigning original reviewers:
/assign jlewi
/assign nkashy1
/assign danijar

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

zomglings

@cwbeitel : My only lingering concern is that we aren't explaining the kubeflow components (in the app/ subdirectory) sufficiently.

However, the rest of this looks great, so I say we merge it and then raise issues as necessary. :)

LGTM

cwbeitel · 2018-03-06T18:16:13Z

@nkashy1 Awesome I can definitely add that to a subsequent version, thanks for the feedback.

jlewi · 2018-03-07T06:15:10Z

Reviewed 11 of 27 files at r1.
Review status: 11 of 27 files reviewed at latest revision, all discussions resolved.

agents/README.md, line 3 at r1 (raw file):

# Reinforcement Learning with [tensorflow/agents](https://github.com/tensorflow/agents)

[![Docker Repository on Quay](https://quay.io/repository/cwbeitel/agents-demo/status "Docker Repository on Quay")](https://quay.io/repository/cwbeitel/agents-demo)

Lets not reference a personal container image. Please delete this and file an issue to publish a container if need be. If you want you can put a link to your Docker image in the container.

For the instructions you could just tell users to build the container and push it to the registry of their choice.

agents/README.md, line 50 at r1 (raw file):

We might want to include various additional dependencies in our notebook container to aid in development of our model. In that case we might be interested in modifying the demo Dockerfile and re-building the container.

The demo container can be built using the standard workflow:

Just move these instructions to the top

agents/doc/Dockerfile, line 13 at r1 (raw file):

# limitations under the License.

FROM gcr.io/kubeflow/tensorflow-notebook-cpu

What's this container for?

Comments from Reviewable

- Previously instructed users to build demo container via doc/Dockerfile. - Since rendering isn't working in the notebook and the rest of the installed dependencies are now available in the base tensorflow-notebook container building a container isn't necessary. - Users are now instructed to run the base tensorflow-notebook-cpu container and clone the example code with git. - The git clone command refers to https://github.com/kubeflow/examples instead of the URL of this fork so the docs will be incorrect in that regard until this is merged into master. Optionally until then we can add an instruction to switch branch.

cwbeitel · 2018-03-08T04:30:44Z

Review status: 10 of 27 files reviewed at latest revision, 3 unresolved discussions.

agents/doc/Dockerfile, line 13 at r1 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

What's this container for?

Not necessary (now/given) that gcloud and ks are available through the base container and renders aren't working in-notebook so the ffmpeg dependency doesn't justify the burden of working with a secondary container. See also commit notes.

Comments from Reviewable

jlewi · 2018-03-09T22:59:10Z

Woo Hoo! Great Work
/lgtm
/approve

k8s-ci-robot · 2018-03-09T22:59:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jlewi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

danijar · 2018-03-11T16:25:46Z

Congrats to getting this first step into the code base!

danijar · 2018-03-11T16:27:03Z

(Not sure why Github shows me "danijar unassigned nkashy1" but it seems like I can't change it back.)

…nd Kubeflow (#42) * Add awscli tools container. * Add initial readme. * Add argo skeleton. * Run a an argo job. * Artifact support and argo test * Use built container (#3) * Fix artifacts and secrets * Add work in progress tfflow (#14) * Add kvc deployment to workflow. * Switch aws repo. * wip. * Add working tfflow job. * Add sidecar that waits for MASTER completion * Pass in job-name * Add volumemanager info step * Add input parameters to step * Adds nodeaffinity and hostpath * Add fixes for workflow (#17) - Use correct images for worker and ps - Use correct aws keys - Change volumemanager to mnist - Comment unused steps - Fix volume mount to correct containers * Fix hostpath for tfjob * Download all mnist files * added GCS stored artifacts comptability to Argo * Add initial inference workflow. (#30) * Initial serving step (#31) * Adds fixes to initial serving step * Ready for rough demo: Workflow in working state * Move conflicting readme. * Initial commit, everything boots without crashing. * Working, with some python errors. * Adding explicit flags * Working with ins-outs * Letting training job exit on success * Adding documentation skeletion * trying to properly save model * Almost working * Working * Adding export script, refactored to allow model more reusability * Starting documentation * little further on docs * More doc updates, fixing sleep logic * adding urls for mnist data * Removing download logic, it's to tied in with build-in tf examples. * Added argo workflow instructions, minor cleanups. * Adding mnist client. * Fixing typos * Adding instructions for installing components. * Added ksonnet container * Adding new entrypoint. * Added helm install instructions for kvc * doing things with variables * Typos. * Added better namespace support * S3 refactor. * Added missing region variables. * Adding tensorboard support. * Addding Container for Tensorboard. * Added temporary flag, added install instructions for CLI. * Removing invalid ksonnet environment. * Updating readme * Cleanup currently unused pieces * Add missint cluster-role * Minor cleanup. * Adding more parameters. * added changes to allow model to train on multiple workers and fixed some doc typos * Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step. * Adding seperate deployer workflow. * Split serving working. * Adding split workflow. * More parameters. * updates as to elson comments * Revert "added changes to allow model to train on multiple workers and fixed s…" * Initial working pure-s3 workflow. * Removed wait sidecars. * Remove unused flag. * Added part two, minor doc fixes * Inverted links... * Adding diff. * Fix url syntax * Documentation updates. * Added AWS Cli * Parameterized export. * Fixing image in s3 version. * Fixed documentation issues. * KVC snippet changes, need to find last working helm chart. * Temporarily pinning kvc version. * working master model and some doc typos fixes (#13) * added changes to allow model to train on multiple workers and fixed some doc typos * Adding flag to enable/disable model serving. Adding s3 urls as outputs for future querying, renaming info step. * Adding seperate deployer workflow. * Split serving working. * Adding split workflow. * More parameters. * updates as to elson comments * working master model and some doc typos * fixes as to Elson * Removign whitespace differences * updating diff * Changing parameters. * Undoing whitespace. * Changing termination policy on s3 version due to unknown issue. * Updating mnist diff. * Changing train steps. * Syncing Demo changes. * Update README.md * Going S3-native for initial example. Getting rid of Master. * Minor documentation tweaks, adding params, swapping aws cli for minio. * Updating KVC version. * Switching ksonnet repo, removing model name from client. * Updating git url. * Adding certificate hack to avoid RBAC errors. * Pinning KVC to commit while working on PR. * Updating version. * Updates README with additional details (#14) * Updates README with additional details * Adding clarity to kubectl config commands * Fixed comma placement * Refactoring notes for github and kubernetes credentials. * Forgot to add an overview of the argo template. * Updating example based on feedback. - Removed superflous images - Clarified use of KVC - Added unaltered model - Variable cleanup * Refactored grpc image into generic base image. * minor cleanup of resubmitting section. * Switching Argo deployment to ksonnet, conslidating install instructions. * Removing old cruft, clarifying cluster requirements. * [WIP] Switching out model (#15) * Switching to new mnist example. * Parameterized model, testing export. * Got CNN model exporting. * Attempting to do distributed training with Estimator, removed seperate export. * Adding master back, otherwise Estimator complains about not having a chief. * Switching to tf.estimator.train_and_evaluate. * Minor path/var name refactor. * Adding test data and new client. * Fixed documentation to reflect new client. * Getting rid of tf job shim. * Removing KVC from example, renaming directory * Modifying parent README * Removed reference to export. * Adding reference to export. * Removing unused Dockerfile. * Removing uneeded files, simplifying how to get status, refactor model serving workflow step. * Renaming directory * Minor doc improvements, removed extra clis. * Making SSL configurable for clusters without secured s3 endpoints. * Added a tf-user account for workflow. Fixed serving bug. * Updating gke version. * Re-ran through instructions, fixed errata. * Fixing lint issues * Pylint errors * Pylint errors * Adding parenthesis back. * pylint Hacks * Disabling argument filter, model bombs without empty arg. * Removing unneeded lambdas

reinforcement learning example with TensorFlow Agents

4e1e42a

see also https://github.com/cwbeitel/examples-old for previous PR

k8s-ci-robot requested review from wbuchwalter and zjj2wry March 5, 2018 00:44

k8s-ci-robot added the size/XXL label Mar 5, 2018

k8s-ci-robot assigned jlewi and zomglings Mar 5, 2018

cwbeitel mentioned this pull request Mar 5, 2018

[Agents RL] Demonstrate Kubeflow with an E2E RL example #5

Closed

9 tasks

zomglings approved these changes Mar 6, 2018

View reviewed changes

cwbeitel added 2 commits March 8, 2018 04:00

Remove demo container refs, assume users build

fbf4356

Remove unused Dockerfile

64de15f

k8s-ci-robot added the lgtm label Mar 9, 2018

k8s-ci-robot added the approved label Mar 9, 2018

jlewi merged commit 0837557 into kubeflow:master Mar 9, 2018

danijar unassigned zomglings Mar 11, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reinforcement learning example with TensorFlow Agents #30

Reinforcement learning example with TensorFlow Agents #30

cwbeitel commented Mar 5, 2018 •

edited by jlewi

Loading

cwbeitel commented Mar 5, 2018

k8s-ci-robot commented Mar 5, 2018

zomglings left a comment

cwbeitel commented Mar 6, 2018

jlewi commented Mar 7, 2018

cwbeitel commented Mar 8, 2018

jlewi commented Mar 9, 2018

k8s-ci-robot commented Mar 9, 2018

danijar commented Mar 11, 2018

danijar commented Mar 11, 2018

Reinforcement learning example with TensorFlow Agents #30

Reinforcement learning example with TensorFlow Agents #30

Conversation

cwbeitel commented Mar 5, 2018 • edited by jlewi Loading

cwbeitel commented Mar 5, 2018

k8s-ci-robot commented Mar 5, 2018

zomglings left a comment

Choose a reason for hiding this comment

cwbeitel commented Mar 6, 2018

jlewi commented Mar 7, 2018

cwbeitel commented Mar 8, 2018

jlewi commented Mar 9, 2018

k8s-ci-robot commented Mar 9, 2018

danijar commented Mar 11, 2018

danijar commented Mar 11, 2018

cwbeitel commented Mar 5, 2018 •

edited by jlewi

Loading