Reinforcement learning example with tensorflow/agents #1
Status:
- Short run functions without issue (see notebook); long run in progress.
- Needs an E2E test.
- Need GCP project permissions to modify service account roles (or someone with them to do so) in order to test-run the TensorBoard setup.
A long training run with the most recent Agents version (https://github.com/tensorflow/agents/tree/14669421f6a1699b9100cc42071a79c4dca2f1c3) produced a low-quality result, evident in the low (negative) mean_score and the renders (below). Reverting to the last version (https://github.com/tensorflow/agents/tree/459c4f88ece996eac3489e6e97a6ee0b30bdd6b3) that was used to produce good results for this environment.

TensorBoard: http://127.0.0.1:8001/api/v1/proxy/namespaces/rl/services/tboard-0204-0911-f303-tb:80/#scalars&run=eval

The run configuration was:

```python
{'algorithm': <class 'agents.algorithms.ppo.ppo.PPO'>,
 'debug': True,
 'discount': 0.995,
 'dump_dependency_versions': False,
 'env': 'KukaBulletEnv-v0',
 'env_processes': True,
 'eval_episodes': 25,
 'hparam_set_id': 'pybullet_kuka_ff',
 'init_logstd': -1,
 'init_mean_factor': 0.1,
 'init_output_factor': 0.1,
 'init_std': 0.35,
 'kl_cutoff_coef': 1000,
 'kl_cutoff_factor': 2,
 'kl_init_penalty': 1,
 'kl_target': 0.01,
 'learning_rate': 0.0001,
 'log_device_placement': False,
 'logdir': 'gs://kubeflow-rl-kf/jobs/pybullet-kuka-0202-1053-3a98',
 'max_length': 1000,
 'network': <function feed_forward_gaussian at 0x7ff0334ae938>,
 'normalize_ranges': True,
 'num_agents': 30,
 'num_gpus': 0,
 'optimizer': <class 'tensorflow.python.training.adam.AdamOptimizer'>,
 'policy_layers': (200, 100),
 'render_out_dir': None,
 'render_secs': 600,
 'run_base_tag': '0e90193e',
 'run_mode': 'train',
 'save_checkpoint_secs': 600,
 'steps': 40000000,
 'sync_replicas': False,
 'update_epochs': 25,
 'update_every': 30,
 'use_gpu': False,
 'use_monitored_training_session': True,
 'value_layers': (200, 100),
 'weight_summaries': {'all': '.*',
                      'policy': '.*/policy/.*',
                      'value': '.*/value/.*'}}
```
Thanks for this. Is this PR ready for review? The PR description indicates there's work TBD. If it's still in progress and not ready for review, please prefix the title with [WIP] and remove that when it's ready for review.
There will be multiple examples in this repository; every example should be in its own top-level directory.
@@ -0,0 +1,60 @@
local params = std.extVar("__ksonnet/params").components["kubeflow-core"];
Should we prune the ksonnet app so that it doesn't contain core Kubeflow components?
i.e. should the app assume you already have Kubeflow deployed?
tools/base/build-base
Outdated
@@ -0,0 +1,70 @@
#!/usr/bin/env bash
What is this script for?
If it's a bash file it should have a .sh extension.
tools/kf-build-app
Outdated
# See the License for the specific language governing permissions and
# limitations under the License.

get_project_id() {
Isn't this function defined in the other file as well?
Sure, I'll make those changes. Thanks for the feedback.
@nkashy1 Do you want to be a reviewer of this PR when it's ready?
Did some experiments to find:

Here there seems to be both instability of good solutions and variability in the overall outcome (two runs are clear failures). There may well be hparam configurations that would improve this in the individual case; it may be ironed out with the right distributed training / shared-parameters scheme, or it might be best solved by early stopping of failing parameter sets and replication of working ones. @danijar, I don't know if this is interesting to you or whether you have thoughts.

The following versions were used in the above:

```
tensorflow==1.4.1
pybullet==1.7.5
gym==0.9.4
agents@459c4f88ece996eac3489e6e97a6ee0b30bdd6b3
```

Also, here's a full screen grab from TensorBoard for various other scalars. Renders are in progress and will be landing in subdirectories of gs://kubeflow-rl-kf/jobs/agents that correspond to the replica jobs.
Thanks for the update! What task is this? Gym recently had a few API changes; updating all libraries should work. I recently added a config for AntBulletEnv-v0 to the main repository that sets update_every. It would still be super cool to get the SyncReplicaOptimizer working, but I'm not sure how feasible that is. What do you think?
Dude, it would be super cool!! I don't know the feasibility but am interested to explore it.
Also, it's worth noting that it may be more feasible than I (or we) previously thought, back when other errors were masking visibility into synchronous-distribution issues. Those errors [1] were caused by either conflicts in checkpoint saving, Saver not supporting MonitoredTrainingSession (it expects a Session), or successful jobs "success-failing" (for lack of a better label: a successful job would exit with an error code, then crash for a different reason after restarting). The task is 'KukaBulletEnv-v0'. And that's great to hear about update_every!! I'll try that out.
Ready for review. Training performance seems less variable with update_every=60, but this only used 4 replicas. Played around with managing a set of experiments, from submission through comparison with TensorBoard to managing renders. The latter could be a lot better; it might be really useful to be able to manage renders via a tab in TensorBoard.
@cwbeitel, @danijar: Did a quick skim of the demo notebook. I didn't understand the reason for running the same job over four different workers. In general, though, it looks nice. I do have plenty of suggestions and will probably have more once I actually run the code. This did get me wondering, as far as tensorflow/agents and kubeflow go: how difficult would it be to deploy each simulation process as a different pod and collect all the results into the same training pod? Just to make sure I understand, this isn't what is happening now, right? Could we implement this with a common GCS bucket? I need to dig into the tensorflow/agents code more.
Now reachable on the Slack. I can clarify that or change the approach. The reason for running four replicas of the same job was that I have been seeing variability in performance from one job to the next; see #1 (comment). This seems less pronounced since switching to update_every=60, but my sample size in each case is only n=4, and it may not occur after updating to the most recent agents version, which you might want for this PR. Currently using the combination of versions of agents and dependencies last known to produce a functioning result; some combination of modifications to these yielded consistently poor results, see #1 (comment).

Yeah, that's interesting. With OpenAI Universe you can automatically launch local or remote simulations, e.g.

```python
# Local
import gym
import universe  # register the universe environments

env = gym.make('flashgames.DuskDrive-v0')
env.configure(remotes=1)  # automatically creates a local docker container
observation_n = env.reset()

# Remote
env = gym.make('flashgames.DuskDrive-v0')
# Manually specify the addresses of the remotes
env.configure(remotes='vnc://localhost:5900+15900,vnc://localhost:5901+15901')
observation_n = env.reset()
```

The quay.io/openai/universe.flashgames entrypoint starts up a VNC server which I think listens for actions and sends rewards on 5900 and 15900 by default. So accessing a bunch of simulator pods from the training pod in this approach would just involve the training pod knowing the addresses of all the simulator pods, which could be accomplished in a similar way to, or by extending, the TF_CONFIG cluster spec.
@cwbeitel : First pass, without having run a thing.
I did learn a lot - the content is really good and it would take me a long time and a lot of documentation digging to figure a lot of it out if your example didn't exist.
Most of my comments in this pass are about polishing up the presentation and adding more exposition on what kubeflow is doing under the hood.
Next pass (tomorrow) will be more about the actual process of running this example.
agents/build.sh
Outdated
SALT=`python -c 'import datetime; import uuid; now=datetime.datetime.now(); print(now.strftime("%m%d-%H%M") + "-" + uuid.uuid4().hex[0:4])'`

VERSION_TAG=tf${TENSORFLOW_IMAGE_TAG}-${SALT}
IMAGE_TAG=gcr.io/${PROJECT_ID}/${IMAGE_BASE_NAME}:${VERSION_TAG}
@cwbeitel , @jlewi : Can we make images publicly available under a kubeflow project? Would save users having to go through the trouble of this build step.
We should be publishing to DockerHub (re: kubeflow/kubeflow#211) as the primary source of images (as well as GCR, if we have a project handy).
True it would be simpler to use if the example were contained in a publicly-accessible image. People would be able to either run the example from that image or make modifications and build their own.
What is the image for?
In general users have to build their own Docker images, so we should figure out how to make that CUJ work within the cluster.
I think the general solution is going to be by building docker images in cluster using Docker in Docker using Argo.
You can look at the Seldon example
https://github.com/SeldonIO/kubeflow-seldon-example
That could probably be done in a follow up PR though.
agents/build.sh
Outdated
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

IMAGE_BASE_NAME=agents
TENSORFLOW_IMAGE_TAG=1.4.1
Let's parametrize this:
TENSORFLOW_IMAGE_TAG=${TENSORFLOW_IMAGE_TAG:-1.4.1}
tf.logging.info("Preparing local files for upload:\n %s" % local_files)

# Initialize the GCS API client
storage_client = storage.Client()
Would it be possible to use tf.gfile here instead? (Even if the workers are writing to local disk, you can use tf.gfile.Copy.)
agents/trainer/task.py
Outdated
FLAGS.hparam_set_id, log_dir, run_config.is_chief)

if FLAGS.run_mode == 'train':
  if run_config.is_chief:
This is only intended to run on a single node, right?
Right.
@@ -0,0 +1,76 @@
# tf-job |
Should anything under vendor/kubeflow get checked in?
No, now that I think of it; thank you. It could either be submoduled in, or there could be a command that obtains the dependency at setup.
agents/demo/demo.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"Download and install ksonnet if needed"
Can we run this notebook in a Docker container with the right packages installed? (That same container could be used with JupyterLab)
agents/demo/demo.ipynb
Outdated
"output_type": "stream",
"text": [
"NAME READY STATUS RESTARTS AGE\r\n",
"kuka-0208-1010-10df-master-znly-0-srftb 0/1 Completed 0 5h\r\n",
Noisy?
agents/demo/demo.ipynb
Outdated
"metadata": {},
"outputs": [],
"source": [
"KEY_FILE=\"/Users/cb/Downloads/kubeflow-rl-ec0f4f646339.json\"\n",
Messy to have your local path + filename in the notebook.
agents/demo/demo.ipynb
Outdated
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use [ksonnet](https://ksonnet.io/) to parameterize and apply a TFJob configuration (i.e. run a job). Here you can change the image to be a custom job image, such as one built and deployed with build.sh, or use the one provided here if you only want to change parameters. Below we'll display the templated job YAML for reference."
I would like more explanation about what kubeflow is doing at each step, if the objective is to teach readers about more fundamental kubeflow-related concepts. It would be good to link back to the kubeflow docs, maybe the user guide, as part of that.
agents/demo/demo.ipynb
Outdated
"\u001b[34mINFO \u001b[0mParameter 'name' successfully set to '\"render-0208-1335-6d81\"' for component 'agents_render'\n",
"\u001b[34mINFO \u001b[0mParameter 'log_dir' successfully set to '\"gs://kubeflow-rl-kf/studies/replicated-env-comparison/pendulum-0208-1010-7b7f\"' for component 'agents_render'\n",
"\u001b[34mINFO \u001b[0mParameter 'image' successfully set to '\"gcr.io/kubeflow-rl/agents:tf1.4.1-0208-0959-a276\"' for component 'agents_render'\n",
"\u001b[34mINFO \u001b[0mUpdating tfjobs rl.render-0208-1335-6d81\n",
All this output is distracting; was leaving it in the notebook intentional?
Running multiple training runs with the same hyperparameters is a good first step. In RL, we often use this to show how robust (or not) a method is to the random seed. After getting this merged, it'd be great to also be able to run different configurations in parallel in order to find good hyperparameters.

Regarding distributing a single agent, I've heard several people complain about the SyncReplicasOptimizer. How much does that actually help us? Do you think it would be an option to manually average the gradients? In pseudocode:

```python
all_gradients = []
for worker in workers:
  with tf.device(worker):
    experience = collect_experience()
    loss = compute_loss(experience)
    gradients = optimizer.compute_gradients(loss)
    all_gradients.append(gradients)
combined_gradients = tf.reduce_sum(all_gradients)
for worker in workers:
  with tf.device(worker):
    optimizer.apply_gradients(combined_gradients)
```
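As a framework-free sanity check of that pattern (plain Python, no TensorFlow; the quadratic per-worker loss and single shared parameter are made up for illustration), summing per-worker gradients and applying one combined update looks like:

```python
def worker_gradient(w, target):
    # Gradient of the per-worker loss 0.5 * (w - target)**2 w.r.t. w,
    # standing in for optimizer.compute_gradients(loss) on one worker.
    return w - target

def synchronous_step(w, targets, learning_rate=0.1):
    # Sum gradients across all workers, then apply a single update,
    # mirroring apply_gradients(combined_gradients).
    combined = sum(worker_gradient(w, t) for t in targets)
    return w - learning_rate * combined

w = 0.0
for _ in range(100):
    w = synchronous_step(w, targets=[1.0, 2.0, 3.0])
print(round(w, 3))  # 2.0, the mean of the per-worker targets
```

Because every worker applies the same combined gradient, all replicas stay in lockstep, which is the property SyncReplicasOptimizer is meant to provide.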
@danijar: As the different parameters all live in distinct tf-job processes, would it be easy to get access to them with something like this? If so, your pattern is great (tried and true). Out of curiosity, what are the complaints about SyncReplicasOptimizer?
FYI, steps related to managing GCP credentials, both for running the demo container locally and on JupyterHub, are a work in progress.
There's a question of whether to combine the training and demo containers, in the interest of smaller images to ship around, simpler development, and less redundancy. On the demo side it would be nice to be able to run renders in the notebook, so having the ffmpeg etc. dependencies from the training container seems to fit. But on the training-container side, the base image size would go up many times over:

```
gcr.io/kubeflow/tensorflow-notebook-cpu   latest   169f860f283c   6 days ago     5.6GB
tensorflow/tensorflow                     1.5.0    a2d1671e8a93   3 weeks ago    1.25GB
tensorflow/tensorflow                     1.4.1    3cac069c491d   3 weeks ago    1.28GB
gcr.io/kubeflow/tensorflow-notebook-gpu   latest   e68d36c67064   2 months ago   7.11GB
```

This is only relevant the first time someone runs a training job, and if jobs are initiated from the demo notebook, caches of the above will already exist in the cluster, so the size difference will be irrelevant. But when building containers with Argo a different docker daemon is used, and build time would go up significantly, at least on the first build and until builds are cached; it takes a rather long time to pull 7GB. So for the moment I have added dependencies from the trainer to the demo notebook but not combined the two.
Thinking more about it, it looks like all of the setup steps in the notebook could be encapsulated in a setup function or DevEnvironment object. In addition, if you could build images from the notebook dev environment, that would simplify things and be more familiar for developers, but I'm not sure there is a docker daemon accessible. One outstanding issue seems to be providing users credentials to perform operations within their own namespace on the underlying Kubernetes cluster without needing to obtain these through gsutil or other means. Also, with the way things are configured now, "!gcloud container clusters get-credentials kubeflow-dev --zone us-east1-d" just works by default without the need to upload a service account, but the idea is to make it vendor-agnostic...
So, simplified things significantly using Google Container Builder and the credentials that are already available. Not caching container builds / waiting on builds is a significant point of friction; I'll experiment with GCB caching, but this shouldn't block an everything-but review. Added Docker to the demo notebook, but there is no docker daemon available. One would make it a more complete dev environment and simplify builds of the training container. But you don't really need to re-build the training container for each run: if the dependencies haven't changed, you could just mount the code via a PVC shared with the notebook. Also, in my experience it's important to be able to install system packages when developing in the notebook; currently various pip install commands fail for lack of sufficient permissions.
Thinking it would make sense to re-work each step of the notebook as a Python command (instead of a system command) so the example could be tested end-to-end.
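For instance, a minimal sketch of what that re-work might look like (the `run` helper is generic; the ksonnet command strings are placeholders for whatever the notebook currently shells out to, not the actual component names):

```python
import subprocess

def run(cmd):
    """Run one notebook step as a subprocess, raising on failure so an
    end-to-end test can simply call the step functions in order."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError("step failed: %s\n%s" % (cmd, result.stderr))
    return result.stdout

# Hypothetical notebook steps wrapped as testable functions; the ks
# invocations below are illustrative placeholders.
def set_job_params(name, image):
    return (run("ks param set agents_ppo name %s" % name)
            + run("ks param set agents_ppo image %s" % image))

def submit_job(env="default"):
    return run("ks apply %s -c agents_ppo" % env)
```

An E2E test could then call these functions directly and assert on their output, instead of scraping notebook cell output.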
@cwbeitel It seems like you are continuing to make some big changes to this PR. Should we move those changes into a new PR/branch or should we move this PR back to the WIP stage? |
@jlewi Sure, I think it would be good to start turning this into a series of small PRs, so maybe we can let the current version be the stopping point for this PR; then I'll make individual issues and PRs for subsequent changes, discussing those at least briefly ahead of time. Does that sound good to you?
Sounds good
@cwbeitel What needs to happen for this PR to be ready for review?
I don't know; feel free to review it now and let me know what you think still needs to be done for this particular PR. Adding the triggering of automated tests might be a nice addition, or it might be a necessary requirement: if I were reviewing this, I'd say you need some basic tests that trigger automatically when you push a commit, though testing interactively may be sufficient initially. Being able to either (1) quickly build training containers or (2) not need to build training containers is needed, but not necessarily for this PR.
/assign @nkashy1

@nkashy1 Can you please review?
Not sure how to rebase given that the example has been on the master branch, so closing this, re-forking, creating a branch for the example, and adding the current version of the PR with tests.
Outstanding tasks for this PR: