Add estimator example for github issues #203

Merged · 11 commits · Aug 25, 2018

Conversation

@inc0 commented Jul 27, 2018

This is the code input for a doc about writing Keras for TFJob.

There are a few TODOs:

  1. Bug in dataset injection; the number of steps can't be raised.
  2. Instead of adding a hostPath for data, we should have a quick job + PVC
    for this.

Fixes #196


@inc0 (Author) commented Jul 27, 2018

/assign @jlewi
/assign @ankushagarwal

@jlewi (Contributor) commented Jul 27, 2018

Can we replace existing code (e.g. delete some files) rather than creating yet another way to do training?

Did you get distributed training working with K8s?

@jlewi (Contributor) commented Jul 27, 2018

Please update the PR description to provide more information about what's different in this PR from what currently exists for the example. Also, can you explain how it fits into the overall GH issue summarization example? Does serving work? Did you test the trained model to see what quality of results you are getting?

@inc0 (Author) commented Jul 27, 2018

Distributed training works; I still need to fix one bug in this PR to make it fully functional, though.
I was thinking of writing a full-fledged guide for the website about Keras + TFJob, because the estimator approach should be 100% applicable to any tf.keras code out there, making working with TFJob much easier in general.
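
As a rough illustration of that claim, converting a compiled tf.keras model into an Estimator is essentially one call in the TF 1.x API. This is a minimal sketch; the model below is a hypothetical stand-in, not this example's actual architecture:

```python
import tensorflow as tf

# Hypothetical stand-in model; any compiled tf.keras model converts the same way.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Wrap the Keras model in a tf.estimator.Estimator; this is what makes the
# same training code runnable under TFJob, locally or distributed.
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/tmp/model')
```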

@inc0 (Author) commented Jul 27, 2018

Can we replace existing code (e.g. delete some files) rather than creating yet another way to do training?

I think we could and should collapse training to a single method. I just don't want to throw away the T2T approach until we decide to do that together. For now, feel free to test out this approach.

@inc0 changed the title "Add estimator example for github issues" → "[wip] Add estimator example for github issues" Jul 27, 2018
@jlewi (Contributor) commented Jul 29, 2018

There are two ways of doing training right now:

  1. Using Keras
  2. Using T2T

This PR is duplicating option 1. So can we get rid of the existing Keras training code and just use this code?

Review thread on the diff:

print(session.run(hello))

def plot_model_training_history(history_object):

@jlewi (Contributor) commented:

Can we not use TensorBoard?

@inc0 (Author) replied:

I think this is leftover from the notebook; I'll clean these out of the code. seq2seq_utils right now is just copy-paste.
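
For reference, the TensorBoard route jlewi is suggesting needs essentially no plotting code. A sketch, assuming tf.keras; the log directory is an arbitrary choice:

```python
import tensorflow as tf

# Instead of plotting history_object with matplotlib after the fact, stream
# metrics during training and view them with `tensorboard --logdir /tmp/logs`.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='/tmp/logs')
# model.fit(x_train, y_train, epochs=10, callbacks=[tensorboard_cb])

# The Estimator path needs even less: Estimators write summaries to their
# model_dir automatically, so TensorBoard can just point at model_dir.
```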

@inc0 (Author) commented Jul 30, 2018

This PR is duplicating option 1. So can we get rid of the existing Keras training code and just use this code?

What do you mean by "this code" instead of Keras? It is Keras code...

@jlewi (Contributor) commented Jul 31, 2018

https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/train.py

Is also using Keras, although I doubt it's using TF.Estimator. My question is: do we need both of these? Or can we get rid of the existing notebooks/train.py?

@inc0 (Author) commented Jul 31, 2018

Is also using Keras, although I doubt it's using TF.Estimator. My question is: do we need both of these? Or can we get rid of the existing notebooks/train.py?

My goal is really to provide the next version of it; I can replace the original if we're OK with it. The original doesn't use an estimator, and I don't think Keras natively supports distributed training. The estimator approach allows for both single-node local and distributed training.
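
A minimal sketch of why the estimator path covers both modes, assuming the standard TF 1.x train_and_evaluate API; the model and input_fn here are hypothetical placeholders, not this PR's code:

```python
import tensorflow as tf

def input_fn():
    # Hypothetical placeholder; the real code would read the preprocessed
    # GitHub-issues dataset here. The feature key matches the Keras input name.
    ds = tf.data.Dataset.from_tensor_slices(({'dense_input': [[0.0] * 10]}, [[0.0]]))
    return ds.repeat().batch(32)

model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer='adam', loss='mse')
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/tmp/model')

# train_and_evaluate runs single-node when TF_CONFIG is unset, and distributes
# across master/worker/PS pods when TFJob injects TF_CONFIG -- same code, both modes.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=1))
```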

@jlewi (Contributor) commented Aug 1, 2018

SGTM. My point is I don't think we want to keep maintaining multiple versions of the code/example. So the plan should be to make this a suitable replacement so that we can get rid of what currently exists.

@texasmichelle (Member) commented

Yes, please replace the previous version with this one

@jlewi (Contributor) commented Aug 10, 2018

@inc0 How's this coming?

@inc0 changed the title "[wip] Add estimator example for github issues" → "Add estimator example for github issues" Aug 20, 2018
@jlewi (Contributor) left a comment


github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):

Master should always have 1 replica. This is main worker which will show us status of overall job.

PS, or Parameter server, is Pod that will hold all weights. It can have any number of replicas, recommended to have more than 1 for high availability.

More than 1 PS doesn't give high availability, so I'd suggest deleting that line. Each PS stores different parameters. You generally want to use multiple PS replicas when you become I/O-bound trying to update all parameters on a single server; by distributing them across multiple servers you get more bandwidth.


github_issue_summarization/02_distributed_training.md, line 56 at r4 (raw file):

There are few things required for this approach to work.

First we need to parse clustering variable. This is required to run different logic per node role

clustering?
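
The "clustering variable" presumably refers to the TF_CONFIG environment variable that TFJob injects into each pod; a minimal parsing sketch under that assumption:

```python
import json
import os

# TFJob sets TF_CONFIG in every pod, roughly:
# {"cluster": {"master": [...], "worker": [...], "ps": [...]},
#  "task": {"type": "worker", "index": 0}}
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
job_name = task.get('type', '')    # 'master', 'worker', or 'ps'
task_index = task.get('index', 0)

# Run different logic per node role, e.g. only the master evaluates/exports.
if job_name == 'master':
    print('running as master, task index %d' % task_index)
```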


github_issue_summarization/distributed/download.sh, line 3 at r4 (raw file):

#!/usr/bin/env bash

export DATA_DIR="/data"

Why not use the existing script and job?
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/download_data.sh


github_issue_summarization/distributed/seq2seq_utils.py, line 1 at r4 (raw file):

import logging

Is this the same file as
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/seq2seq_utils.py?


github_issue_summarization/distributed/tfjob.yaml, line 5 at r4 (raw file):

---
kind: PersistentVolumeClaim

Why aren't these in the ksonnet app and parameterized?
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow

To make it easy for users to run them.

We already have an example here:
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-pvc.jsonnet

Why not follow what we currently do with PVC and allow the job to be run non-distributed using a ReadWriteOnce PV?

If you parameterize the claims in ksonnet you can easily set the accessMode to ReadWriteOnce or ReadWriteMany based on the mode.


github_issue_summarization/distributed/tfjob.yaml, line 28 at r4 (raw file):

      storage: 50Gi
---
apiVersion: batch/v1

Do we need this?
Why can't we use
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-downloader.jsonnet

@jlewi (Contributor) commented Aug 22, 2018

I did another pass. It seems like there's a fair bit of code duplication with the existing Keras example. Should that be fixed before submitting the PR?

@jlewi (Contributor) left a comment


github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):

## How to run it

Assuming you have already setup your Kubeflow cluster, all you need to do to try it out:

Don't they need to set up a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?

On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow


github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):

# You will need NFS storage class, or any other ReadWriteMany storageclass

I know you object to using ksonnet, so I'm not going to block this PR on using ksonnet. It seems much more convenient as a way of giving users an easy way to override different parameters.

Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?

That way people can reference the YAML if they want, but when they want to start changing things they can use the ks template.


github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):

apiVersion: batch/v1
kind: Job
metadata:

Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime. For example, doing kubectl create -f tfjob.yaml would launch the download job and the training job at the same time.

At a minimum, shouldn't there be 3 separate YAML files:

  1. One for the PV and PVC
  2. One for the download job
  3. One for the TFJob

@jlewi (Contributor) commented Aug 23, 2018

I think there were some unaddressed comments on 02_distributed_training.md.

I also left some comments on tfjob.yaml; I think putting all the manifests in one file is going to be problematic.

@inc0 (Author) left a comment


github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

More than 1 PS doesn't give high availability, so I'd suggest deleting that line. Each PS stores different parameters. You generally want to use multiple PS replicas when you become I/O-bound trying to update all parameters on a single server; by distributing them across multiple servers you get more bandwidth.

My bad, I fixed it on a different computer ;) Let me push it...


github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Don't they need to set up a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?

On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow

I've put that in the requirements at the top of the document; do you want me to repeat it here?

@inc0 (Author) left a comment


github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

I know you object to using ksonnet, so I'm not going to block this PR on using ksonnet. It seems much more convenient as a way of giving users an easy way to override different parameters.

Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?

That way people can reference the YAML if they want, but when they want to start changing things they can use the ks template.

I'll see if it can output YAML; unless it does, I'd keep it as YAML to be able to comment it, and YAML (personal opinion) feels more readable.


github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime. For example, doing kubectl create -f tfjob.yaml would launch the download job and the training job at the same time.

At a minimum, shouldn't there be 3 separate YAML files:

  1. One for the PV and PVC
  2. One for the download job
  3. One for the TFJob

Well, I did include race conditions in code, so this works even when creating everything at the same time, but I can split it. I'd split it into disks + download -> prep.yaml and tfjob.yaml; how about that?

@jlewi (Contributor) left a comment


github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I've put that in the requirements at the top of the document; do you want me to repeat it here?

Can you add a link at the top to the GCFS instructions please?


github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I'll see if it can output YAML; unless it does, I'd keep it as YAML to be able to comment it, and YAML (personal opinion) feels more readable.

Do you mean "input" yaml?

ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).

But one solution would be to just have two different YAML components
1 for ReadWriteMany PV
1 for ReadWriteOnce


github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

Well, I did include race conditions in code, so this works even when creating everything at the same time, but I can split it. I'd split it into disks + download -> prep.yaml and tfjob.yaml; how about that?

Yes, please split.
What does "include race conditions in code" mean?

Do you mean your code is waiting on things like the PV and data becoming available?

@inc0 (Author) left a comment


github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Do you mean "input" yaml?

ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).

But one solution would be to just have two different YAML components
1 for ReadWriteMany PV
1 for ReadWriteOnce

I meant "ks show default", but it already outputs yaml so I can use it.


github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Yes, please split.
What does "include race conditions in code" mean?

Do you mean your code is waiting on things like the PV and data becoming available?

Yeah; if you do kubectl apply, it'll just work out of the box.
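
For context, the race-condition handling being described is presumably a wait loop along these lines; the path and polling interval are hypothetical, not the PR's actual values:

```python
import os
import time

DATA_PATH = '/data/github_issues.csv'  # hypothetical path on the shared PVC

# Block until the download job has populated the shared volume, so that
# `kubectl apply` can create the PVC, the download job, and the TFJob together.
while not os.path.exists(DATA_PATH):
    print('waiting for %s to appear...' % DATA_PATH)
    time.sleep(10)
```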

@inc0 (Author) left a comment


github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):

Previously, inc0 (Michał Jastrzębski) wrote…

I meant "ks show default", but it already outputs yaml so I can use it.

Re: being open to ks, I think we'll need to do it in following patch sets - https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/tfjob.libsonnet#L30 - this isn't really translatable to my code. Let's do it, but let's do it in follow-up PRs, please.

@jlewi (Contributor) commented Aug 25, 2018

/lgtm
/approve

@k8s-ci-robot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 35786ed into kubeflow:master Aug 25, 2018
jlewi added a commit to jlewi/examples that referenced this pull request Nov 6, 2018
Create a single train.py and trainer.py which use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work.

See kubeflow#196
The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting out certain layers in the model architecture, leading to a model that wouldn't generate meaningful predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
We've added unittests that can be used to verify that training with TF.Estimator works. This test can also be used to reproduce the current errors with TF.Estimator.
Add a Makefile to build the Docker image.

Add an NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet app.

Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command-line arguments to specify paths rather than hard-coding them.
* Remove the code at the start of train.py that waits until the input data becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available.
* That should be unnecessary, since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (kubeflow#186)

The code wasn't actually calling Model.fit.
Add a unittest to verify we can invoke fit and evaluate without throwing exceptions.
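
A hedged sketch of the unified entry point this commit describes; the flag names and defaults are hypothetical, not the commit's actual interface:

```python
import argparse

# One trainer, two training paths, with paths passed in instead of hard-coded.
parser = argparse.ArgumentParser(description='Train the issue summarizer.')
parser.add_argument('--mode', choices=['keras', 'estimator'], default='keras',
                    help='Train with plain tf.keras or via TF.Estimator.')
parser.add_argument('--input_data', required=True,
                    help='Path to the preprocessed GitHub issues dataset.')
parser.add_argument('--output_model', required=True,
                    help='Directory to write the trained model to.')
args = parser.parse_args()

if args.mode == 'keras':
    print('would call model.fit(...) here')
else:
    print('would call tf.estimator.train_and_evaluate(...) here')
```
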
k8s-ci-robot pushed a commit that referenced this pull request Nov 8, 2018
Unify the code for training with Keras and TF.Estimator (#265)
yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018
* Add estimator example for github issues

* pylint

* wip

* confirmed working on minikube

* pylint

* remove t2t, add documentation

* add note about storageclass

* fix link

* remove code redundancy

* address review

* small language fix
yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018
Unify the code for training with Keras and TF.Estimator (kubeflow#265)
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Dec 6, 2018
Unify the code for training with Keras and TF.Estimator (kubeflow#265)
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Apr 1, 2019
Unify the code for training with Keras and TF.Estimator (kubeflow#265)