Add estimator example for github issues #203
Conversation
This is code input for the doc about writing Keras for TFJob. There are a few TODOs: 1. bug in dataset injection, can't raise the number of steps; 2. instead of adding a hostPath for data, we should have a quick job + PVC for this.
/assign @jlewi
Can we replace existing code (e.g. delete some files) rather than creating yet another way to do training? Did you get distributed training working with K8s?
Please update the PR description to provide more information about what's different about this PR from what currently exists for the example. Also can you explain how it fits overall into the GH issue summarization example? Does serving work? Did you test the trained model to see what sort of quality of results you are getting?
Distributed training works; I still need to fix one bug in this PR to make it fully functional, though.
I think we could and should collapse training to a single method. I just don't want to throw away the t2t approach until we decide to do it together. For now, feel free to test out this approach.
There are two ways of doing training right now.
This PR is duplicating #1. So can we get rid of the existing Keras training code and just use this code?
print(session.run(hello))
def plot_model_training_history(history_object):
Can we not use TensorBoard?
I think this is leftover from the notebook; I'll clean these up from the code. seq2seq utils right now is just copy-paste.
What do you mean by "this code" instead of Keras? It is Keras code.
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/train.py is also using Keras, although I doubt it's using TF.Estimator. My question is: do we need both of these? Or can we get rid of the existing notebooks/train.py?
My goal is to provide the next version of it, really; I might replace the original if we're OK with it. The original doesn't use Estimator, and I don't think Keras natively supports distributed training. The Estimator approach allows for both single-node local and distributed training.
SGTM. My point is I don't think we want to keep maintaining multiple versions of the code/example. So the plan should be to make this a suitable replacement so that we can get rid of what currently exists.
Yes, please replace the previous version with this one.
@inc0 How's this coming?
Reviewable status: 0 of 14 files reviewed, 7 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):
Master should always have 1 replica. This is the main worker, which will show us the status of the overall job. PS, or parameter server, is a Pod that will hold all weights. It can have any number of replicas; recommended to have more than 1 for high availability.
More than 1 PS doesn't give high availability; so I'd suggest deleting that line. Each PS is storing different parameters. You generally want to use multiple PS when you become IO bound trying to update all parameters on a single server; so by distributing across multiple servers you can get more bandwidth.
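To make the reviewer's point concrete, here is a minimal, hypothetical sketch in plain Python (the variable names and the round-robin scheme are illustrative; in TF 1.x this placement is typically handled by tf.train.replica_device_setter) of how parameters are sharded across PS tasks rather than replicated:

```python
# Hypothetical sketch of parameter-server sharding: each variable is
# placed on exactly one PS task (round-robin), so adding PS replicas
# spreads update bandwidth across servers; it does not duplicate
# weights, and therefore does not provide high availability.

def assign_to_ps(variable_names, num_ps):
    """Map each variable to a PS device string, round-robin."""
    return {name: "/job:ps/task:%d" % (i % num_ps)
            for i, name in enumerate(variable_names)}

variables = ["embedding", "encoder_kernel", "decoder_kernel", "output_bias"]
for name, device in sorted(assign_to_ps(variables, 2).items()):
    print(name, "->", device)
```

With two PS tasks each holds half the variables, which is why the benefit is IO bandwidth, not redundancy.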
github_issue_summarization/02_distributed_training.md, line 56 at r4 (raw file):
There are a few things required for this approach to work. First we need to parse the clustering variable. This is required to run different logic per node role.
clustering?
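Presumably the "clustering variable" refers to TF_CONFIG, the JSON environment variable that TFJob sets on each replica to describe the cluster and this node's role. A standard-library sketch of parsing it (the fallback defaults for local, non-distributed runs are an assumption, not taken from the PR):

```python
import json
import os

def parse_tf_config():
    """Parse the TF_CONFIG env var set by TFJob to find this node's role.

    Returns (job_name, task_index, cluster_spec). Falls back to a
    single-node 'master' when TF_CONFIG is absent (local runs).
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    job_name = task.get("type", "master")
    task_index = task.get("index", 0)
    cluster_spec = tf_config.get("cluster", {})
    return job_name, task_index, cluster_spec

# Example: a worker replica as a TFJob controller might configure it.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"master": ["master-0:2222"],
                "ps": ["ps-0:2222"],
                "worker": ["worker-0:2222", "worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
print(parse_tf_config())  # prints the role, index, and cluster spec
```

Branching on job_name ("master", "ps", "worker") is what lets one script run different logic per node role.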
github_issue_summarization/distributed/download.sh, line 3 at r4 (raw file):
#!/usr/bin/env bash
export DATA_DIR="/data"
Why not use the existing script and job?
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/download_data.sh
github_issue_summarization/distributed/seq2seq_utils.py, line 1 at r4 (raw file):
import logging
Is this the same file as
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/seq2seq_utils.py?
github_issue_summarization/distributed/tfjob.yaml, line 5 at r4 (raw file):
---
kind: PersistentVolumeClaim
Why aren't these in the ksonnet app and parameterized?
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow
To make it easy for users to run them.
We already have an example here:
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-pvc.jsonnet
Why not follow what we currently do with PVC and allow the job to be run non-distributed using a ReadWriteOnce PV?
If you parameterize the claims in ksonnet you can easily set the accessMode to ReadWriteOnce or ReadWriteMany based on the mode.
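As an illustration of that suggestion (the claim name and size below are made up, not taken from the PR), the PVC a ksonnet parameter would switch might look like:

```yaml
# Illustrative PVC; accessModes would be the ksonnet parameter:
# ReadWriteOnce for single-node training, ReadWriteMany for distributed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
```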
github_issue_summarization/distributed/tfjob.yaml, line 28 at r4 (raw file):
storage: 50Gi
---
apiVersion: batch/v1
Do we need this?
Why can't we use
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-downloader.jsonnet
I did another pass. It seems like there's a fair bit of code duplication with the existing Keras example. Should that be fixed before submitting the PR?
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
## How to run it
Assuming you have already set up your Kubeflow cluster, all you need to do to try it out is:
Don't they need to setup a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?
On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
# You will need NFS storage class, or any other ReadWriteMany storageclass
I know you object to using ksonnet, so I'm not going to block this PR on using ksonnet. It seems much more convenient as a way of giving users a way to easily override different parameters.
Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?
This way people can reference the YAML if they want but when they want to start changing things they can use the ks template?
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
apiVersion: batch/v1
kind: Job
metadata:
Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime.
For example
doing
kubectl create -f tfjob.yaml would launch the download job and the training job at the same time.
At a minimum shouldn't there be 3 separate YAML files:
- one for the PV and PVC
- one for the download job
- one for the TFJob
I think there were some unaddressed comments on 02_distributed_training.md. I also left some comments on tfjob.yaml; I think putting all the manifests in one file is going to be problematic.
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
More than 1 PS doesn't give high availability; so I'd suggest deleting that line. Each PS is storing different parameters. You generally want to use multiple PS when you become IO bound trying to update all parameters on a single server; so by distributing across multiple servers you can get more bandwidth.
My bad, I fixed it on a different computer ;) let me push it.
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Don't they need to setup a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?
On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow
I've put that in the requirements at the top of the document; do you want me to repeat it here?
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I know you object to using ksonnet so I'm not going to block this PR on but using ksonnet. Seems much more convenient as a way of giving users a way to easily override different parameters.
Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?
This way people can reference the YAML if they want but when they want to start changing things they can use the ks template?
I'll see if it can output YAML; unless it can, I'd keep it as YAML to be able to comment it, and YAML (personal opinion) feels more readable.
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime.
For example
doing
kubectl create -f tfjob.yaml would launch the download job and the training job at the same time. At a minimum shouldn't there be 3 separate YAML files:
- one for the PV and PVC
- one for the download job
- one for the TFJob
Well, I did include race conditions in code, so this works even with creating everything at the same time, but I can split it. I'd split it into disks+download -> prep.yaml and tfjob.yaml, how about that?
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I've put that in requirements on top of document, do you want me to repeat it here?
Can you add a link at the top to the GCFS instructions please?
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I'll see if it can output yaml, unless it does, I'd keep it with yaml to be able to comment it and yaml (personal opinion) feels more readable
Do you mean "input" yaml?
ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).
But one solution would be to just have two different YAML components:
- one for a ReadWriteMany PV
- one for a ReadWriteOnce PV
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
Well, I did include race conditions in code, so this works even with creating everything at the same time, but I can split it. I'd split it to disks+download -> prep.yaml and tfjob.yaml, how about that?
Yes please split.
What does "include race conditions in code" mean?
Do you mean your code is waiting on things like the PV and data becoming available?
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Do you mean "input" yaml?
ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).
But one solution would be to just have two different YAML components:
- one for a ReadWriteMany PV
- one for a ReadWriteOnce PV
I meant "ks show default", but it already outputs yaml so I can use it.
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Yes please split.
What does "include race conditions in code" mean? Do you mean your code is waiting on things like the PV and data becoming available?
yeah, if you do kubectl apply it'll just work ootb
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I meant "ks show default", but it already outputs yaml so I can use it.
Re being open to ks: I think we'll need to do it in follow-up patchsets - https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/tfjob.libsonnet#L30 isn't really translatable to my code. Let's do it, but let's do it in follow-up PRs plz.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jlewi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Create a single train.py and trainer.py which use Keras inside TensorFlow. Provide options to train with either Keras or TF.Estimator. The code to train with TF.Estimator doesn't work; see kubeflow#196. The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting out certain layers in the model architecture, leading to a model that wouldn't generate meaningful predictions. We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further. We've unified the existing code so that we don't duplicate it just to train with TF.Estimator. We've added unit tests that can be used to verify training with TF.Estimator works; the test can also be used to reproduce the current errors with TF.Estimator. Add a Makefile to build the Docker image. Add an NFS PVC to our Kubeflow demo deployment. Create a tfjob-estimator component in our ksonnet app. Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard-coding them.
* Remove the code at the start of train.py that waits until the input data becomes available.
  * I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available.
  * That should be unnecessary since we can just run the preprocessing job as a separate job.
Fix notebooks/train.py (kubeflow#186): the code wasn't actually calling Model.fit. Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
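The "command line arguments to specify paths" item above can be sketched with argparse. The flag names below are hypothetical illustrations, not necessarily those used in the actual PR:

```python
import argparse

def parse_args(argv=None):
    """CLI for a train.py-style script; flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="Train the issue summarizer")
    parser.add_argument("--input_data", required=True,
                        help="Path to the preprocessed issues data")
    parser.add_argument("--output_model", default="model.h5",
                        help="Where to write the trained model")
    parser.add_argument("--mode", choices=["keras", "estimator"],
                        default="keras",
                        help="Train with plain Keras or TF.Estimator")
    return parser.parse_args(argv)

# Passing an explicit argv list makes the parser easy to unit-test.
args = parse_args(["--input_data", "/data/issues.csv", "--mode", "estimator"])
print(args.input_data, args.mode)  # /data/issues.csv estimator
```

Taking paths from flags rather than hard-coding them is what lets the same image run locally, in the download job, and in the TFJob.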
* Add estimator example for github issues: This is code input for the doc about writing Keras for TFJob. There are a few TODOs: 1. bug in dataset injection, can't raise the number of steps; 2. instead of adding a hostPath for data, we should have a quick job + PVC for this
* pylint
* wip
* confirmed working on minikube
* pylint
* remove t2t, add documentation
* add note about storageclass
* fix link
* remove code redundancy
* address review
* small language fix
Fixes #196