Add estimator example for github issues #203
Conversation
This is code input for the doc about writing Keras for TFJob. There are a few TODOs: 1. bug in dataset injection, can't raise the number of steps; 2. instead of adding a hostPath for data, we should have a quick job + PVC for this.
/assign @jlewi
Can we replace existing code (e.g. delete some files) rather than creating yet another way to do training? Did you get distributed training working with K8s?
Please update the PR description to provide more information about what's different about this PR from what currently exists for the example. Also can you explain how it fits overall into the GH issue summarization example? Does serving work? Did you test the trained model to see what sort of quality of results you are getting?
Distributed training works; I still need to fix one bug in this PR to make it fully functional, though.
I think we could and should collapse training to a single method. I just don't want to throw away the t2t approach until we decide to do it together. For now, feel free to test out this approach.
There are two ways of doing training right now.
This PR is duplicating #1. So can we get rid of the existing Keras training code and just use this code?
print(session.run(hello))
def plot_model_training_history(history_object):
Can we not use TensorBoard?
I think this is leftover from the notebook; I'll clean these up from the code. seq2seq utils right now is just copy-paste.
What do you mean by "this code" instead of Keras? It is Keras code.
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/train.py is also using Keras, although I doubt it's using TF.Estimator. My question is: do we need both of these? Or can we get rid of the existing notebooks/train.py?
My goal is to provide the next version of it, really; I might replace the original if we're OK with it. The original doesn't use Estimator, and I don't think Keras natively supports distributed training. The Estimator approach allows for both single-node local and distributed training.
SGTM. My point is I don't think we want to keep maintaining multiple versions of the code/example. So the plan should be to make this a suitable replacement so that we can get rid of what currently exists.
Yes, please replace the previous version with this one.
@inc0 How's this coming?
Reviewable status: 0 of 14 files reviewed, 7 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):
Master should always have 1 replica. This is the main worker, which will show us the status of the overall job. PS, or parameter server, is a Pod that will hold all weights. It can have any number of replicas; recommended to have more than 1 for high availability.
More than 1 PS doesn't give high availability; so I'd suggest deleting that line. Each PS is storing different parameters. You generally want to use multiple PS when you become IO bound trying to update all parameters on a single server; so by distributing across multiple servers you can get more bandwidth.
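To make the reviewer's point concrete, here is a minimal, hypothetical sketch in plain Python (the variable names and the round-robin scheme are illustrative; in TF 1.x this placement is typically handled by tf.train.replica_device_setter) of how parameters are sharded across PS tasks rather than replicated:

```python
# Hypothetical sketch of parameter-server sharding: each variable is
# placed on exactly one PS task (round-robin), so adding PS replicas
# spreads update bandwidth across servers; it does not duplicate
# weights, and therefore does not provide high availability.

def assign_to_ps(variable_names, num_ps):
    """Map each variable to a PS device string, round-robin."""
    return {name: "/job:ps/task:%d" % (i % num_ps)
            for i, name in enumerate(variable_names)}

variables = ["embedding", "encoder_kernel", "decoder_kernel", "output_bias"]
for name, device in sorted(assign_to_ps(variables, 2).items()):
    print(name, "->", device)
```

With two PS tasks each holds half the variables, which is why the benefit is IO bandwidth, not redundancy.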
github_issue_summarization/02_distributed_training.md, line 56 at r4 (raw file):
There are a few things required for this approach to work. First we need to parse the clustering variable. This is required to run different logic per node role.
clustering?
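Presumably the "clustering variable" refers to TF_CONFIG, the JSON environment variable that TFJob sets on each replica to describe the cluster and this node's role. A standard-library sketch of parsing it (the fallback defaults for local, non-distributed runs are an assumption, not taken from the PR):

```python
import json
import os

def parse_tf_config():
    """Parse the TF_CONFIG env var set by TFJob to find this node's role.

    Returns (job_name, task_index, cluster_spec). Falls back to a
    single-node 'master' when TF_CONFIG is absent (local runs).
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    job_name = task.get("type", "master")
    task_index = task.get("index", 0)
    cluster_spec = tf_config.get("cluster", {})
    return job_name, task_index, cluster_spec

# Example: a worker replica as a TFJob controller might configure it.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"master": ["master-0:2222"],
                "ps": ["ps-0:2222"],
                "worker": ["worker-0:2222", "worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
print(parse_tf_config())  # prints the role, index, and cluster spec
```

Branching on job_name ("master", "ps", "worker") is what lets one script run different logic per node role.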
github_issue_summarization/distributed/download.sh, line 3 at r4 (raw file):
#!/usr/bin/env bash
export DATA_DIR="/data"
Why not use the existing script and job?
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/download_data.sh
github_issue_summarization/distributed/seq2seq_utils.py, line 1 at r4 (raw file):
import logging
Is this the same file as
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/seq2seq_utils.py?
github_issue_summarization/distributed/tfjob.yaml, line 5 at r4 (raw file):
---
kind: PersistentVolumeClaim
Why aren't these in the ksonnet app and parameterized?
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow
To make it easy for users to run them.
We already have an example here:
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-pvc.jsonnet
Why not follow what we currently do with PVC and allow the job to be run non-distributed using a ReadWriteOnce PV?
If you parameterize the claims in ksonnet you can easily set the accessMode to ReadWriteOnce or ReadWriteMany based on the mode.
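As an illustration of that suggestion (the claim name and size below are made up, not taken from the PR), the PVC a ksonnet parameter would switch might look like:

```yaml
# Illustrative PVC; accessModes would be the ksonnet parameter:
# ReadWriteOnce for single-node training, ReadWriteMany for distributed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
```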
github_issue_summarization/distributed/tfjob.yaml, line 28 at r4 (raw file):
storage: 50Gi
---
apiVersion: batch/v1
Do we need this?
Why can't we use
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/data-downloader.jsonnet
I did another pass. It seems like there's a fair bit of code duplication with the existing Keras example. Should that be fixed before submitting the PR?
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
## How to run it
Assuming you have already set up your Kubeflow cluster, all you need to do to try it out is:
Don't they need to setup a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?
On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
# You will need NFS storage class, or any other ReadWriteMany storageclass
I know you object to using ksonnet, so I'm not going to block this PR on using ksonnet. It seems much more convenient as a way of giving users a way to easily override different parameters.
Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?
This way people can reference the YAML if they want but when they want to start changing things they can use the ks template?
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
apiVersion: batch/v1
kind: Job
metadata:
Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime.
For example
doing
kubectl create -f tfjob.yaml would launch the download job and the training job at the same time.
At a minimum shouldn't there be 3 separate YAML files:
- one for the PV and PVC
- one for the download job
- one for the TFJob
I think there were some unaddressed comments on 02_distributed_training.md. I also left some comments on tfjob.yaml; I think putting all the manifests in one file is going to be problematic.
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 44 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
More than 1 PS doesn't give high availability; so I'd suggest deleting that line. Each PS is storing different parameters. You generally want to use multiple PS when you become IO bound trying to update all parameters on a single server; so by distributing across multiple servers you can get more bandwidth.
My bad, I fixed it on a different computer ;) let me push it.
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Don't they need to setup a storage class which is ReadWriteMany?
Can we change the defaults in the job to be ReadWriteOnce and then provide instructions for how to create a ReadWriteMany PV?
On GKE they can follow these instructions to use GCFS to create a ReadWriteMany PV
https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow
I've put that in the requirements at the top of the document; do you want me to repeat it here?
Reviewable status: 0 of 12 files reviewed, 10 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I know you object to using ksonnet so I'm not going to block this PR on but using ksonnet. Seems much more convenient as a way of giving users a way to easily override different parameters.
Would you be open to using ksonnet as the source of truth and checking in YAML manifests generated from the ksonnet app?
This way people can reference the YAML if they want but when they want to start changing things they can use the ks template?
I'll see if it can output YAML; unless it can, I'd keep it as YAML to be able to comment it, and YAML (personal opinion) feels more readable.
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Putting all the resources in the same YAML file seems problematic; they don't have the same lifetime.
For example
doing
kubectl create -f tfjob.yaml would launch the download job and the training job at the same time. At a minimum shouldn't there be 3 separate YAML files:
- one for the PV and PVC
- one for the download job
- one for the TFJob
Well, I did include race conditions in code, so this works even with creating everything at the same time, but I can split it. I'd split it into disks+download -> prep.yaml and tfjob.yaml, how about that?
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/02_distributed_training.md, line 21 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I've put that in requirements on top of document, do you want me to repeat it here?
Can you add a link at the top to the GCFS instructions please?
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I'll see if it can output yaml, unless it does, I'd keep it with yaml to be able to comment it and yaml (personal opinion) feels more readable
Do you mean "input" yaml?
ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).
But one solution would be to just have two different YAML components:
- one for a ReadWriteMany PV
- one for a ReadWriteOnce PV
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
Well, I did include race conditions in code, so this works even with creating everything at the same time, but I can split it. I'd split it to disks+download -> prep.yaml and tfjob.yaml, how about that?
Yes please split.
What does "include race conditions in code" mean?
Do you mean your code is waiting on things like the PV and data becoming available?
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Do you mean "input" yaml?
ksonnet allows you to use YAML but there's no easy way to add parameters (you can only do overlays).
But one solution would be to just have two different YAML components:
- one for a ReadWriteMany PV
- one for a ReadWriteOnce PV
I meant "ks show default", but it already outputs yaml so I can use it.
github_issue_summarization/distributed/tfjob.yaml, line 30 at r5 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Yes please split.
What does "include race conditions in code" mean? Do you mean your code is waiting on things like the PV and data becoming available?
yeah, if you do kubectl apply it'll just work ootb
Reviewable status: 0 of 12 files reviewed, 8 unresolved discussions (waiting on @jlewi, @inc0, @DjangoPeng, and @wbuchwalter)
github_issue_summarization/distributed/tfjob.yaml, line 1 at r5 (raw file):
Previously, inc0 (Michał Jastrzębski) wrote…
I meant "ks show default", but it already outputs yaml so I can use it.
Re being open to ks: I think we'll need to do it in follow-up patchsets - https://github.com/kubeflow/examples/blob/master/github_issue_summarization/ks-kubeflow/components/tfjob.libsonnet#L30 isn't really translatable to my code. Let's do it, but let's do it in follow-up PRs plz.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jlewi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Create a single train.py and trainer.py which use Keras inside TensorFlow. Provide options to train with either Keras or TF.Estimator. The code to train with TF.Estimator doesn't work; see kubeflow#196. The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting out certain layers in the model architecture, leading to a model that wouldn't generate meaningful predictions. We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further. We've unified the existing code so that we don't duplicate it just to train with TF.Estimator. We've added unit tests that can be used to verify training with TF.Estimator works; the test can also be used to reproduce the current errors with TF.Estimator. Add a Makefile to build the Docker image. Add an NFS PVC to our Kubeflow demo deployment. Create a tfjob-estimator component in our ksonnet app. Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard-coding them.
* Remove the code at the start of train.py that waits until the input data becomes available.
  * I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available.
  * That should be unnecessary since we can just run the preprocessing job as a separate job.
Fix notebooks/train.py (kubeflow#186): the code wasn't actually calling Model.fit. Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
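The "command line arguments to specify paths" item above can be sketched with argparse. The flag names below are hypothetical illustrations, not necessarily those used in the actual PR:

```python
import argparse

def parse_args(argv=None):
    """CLI for a train.py-style script; flag names are illustrative only."""
    parser = argparse.ArgumentParser(description="Train the issue summarizer")
    parser.add_argument("--input_data", required=True,
                        help="Path to the preprocessed issues data")
    parser.add_argument("--output_model", default="model.h5",
                        help="Where to write the trained model")
    parser.add_argument("--mode", choices=["keras", "estimator"],
                        default="keras",
                        help="Train with plain Keras or TF.Estimator")
    return parser.parse_args(argv)

# Passing an explicit argv list makes the parser easy to unit-test.
args = parse_args(["--input_data", "/data/issues.csv", "--mode", "estimator"])
print(args.input_data, args.mode)  # /data/issues.csv estimator
```

Taking paths from flags rather than hard-coding them is what lets the same image run locally, in the download job, and in the TFJob.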
* Add estimator example for github issues: This is code input for the doc about writing Keras for TFJob. There are a few TODOs: 1. bug in dataset injection, can't raise the number of steps; 2. instead of adding a hostPath for data, we should have a quick job + PVC for this
* pylint
* wip
* confirmed working on minikube
* pylint
* remove t2t, add documentation
* add note about storageclass
* fix link
* remove code redundancy
* address review
* small language fix
Fixes #196