
[gh_issue_summarization] distributed training using Keras #196

Closed · jlewi opened this issue Jul 24, 2018 · 6 comments

@jlewi (Contributor) commented Jul 24, 2018

When we previously tried distributed training with the Keras Seq2Seq model (#43), we ran into the error reported in this comment:

InvalidArgumentError (see above for traceback): Cannot colocate nodes 'Decoder-Word-Embedding/embeddings' and 'training/Nadam/gradients/Decoder-Word-Embedding/Gather_grad/Shape: Cannot merge devices with incompatible jobs: '/job:worker/task:0' and '/job:ps/task:0'
   [[Node: Decoder-Word-Embedding/embeddings = VariableV2[container="", dtype=DT_FLOAT, shape=[4278,300], shared_name="", _device="/job:ps/task:0"]()]]

using this code:
https://gist.github.com/ankushagarwal/a2dab3aaf664a4296292c4ff330fb5e6

This is supposedly fixed with TensorFlow 1.9 and tf.keras.

This would be worth retrying to see if we can make distributed training work with the Keras model.
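
For reference, a minimal sketch of what the retry could look like (none of this is code from the issue: the model, input pipeline, and cluster addresses are placeholders, and the real model lives in the gist above). It converts a compiled tf.keras model to an Estimator and runs train_and_evaluate under a TF_CONFIG with ps and worker jobs, which is the setup where the colocation error appeared:

```python
import json
import os

import numpy as np
import tensorflow as tf

# Example TF_CONFIG for a ps/worker cluster; under Kubeflow the TFJob controller injects this.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["ps-0:2222"], "worker": ["worker-0:2222", "worker-1:2222"]},
    "task": {"type": "worker", "index": 0},
})

VOCAB_SIZE, EMBED_DIM = 4278, 300  # shapes taken from the error message above

tokens = tf.keras.layers.Input(shape=(20,), dtype="int32", name="tokens")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, name="Decoder-Word-Embedding")(tokens)
x = tf.keras.layers.LSTM(128)(x)
outputs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(x)
model = tf.keras.Model(tokens, outputs)
model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")

# Convert the compiled tf.keras model to an Estimator; the claim is that with
# TF 1.9 + tf.keras the ps/worker variable placement no longer trips the
# "Cannot colocate nodes" error above.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)

def input_fn():
    # Placeholder data; the real pipeline reads the preprocessed GitHub issue data.
    feats = {"tokens": np.random.randint(0, VOCAB_SIZE, size=(64, 20)).astype(np.int32)}
    labels = np.random.randint(0, VOCAB_SIZE, size=(64,)).astype(np.int32)
    return tf.data.Dataset.from_tensor_slices((feats, labels)).repeat().batch(8)

# Meant to run inside the actual cluster (each replica runs the same script).
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10),
)
```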

@jlewi (Contributor, Author) commented Jul 24, 2018

@inc0 You mentioned you were looking at distributed training on this example. Any interest in trying this out to see if the problem is fixed?

@inc0 commented Jul 24, 2018

@jlewi yeah. My best bet is to use tf.keras.model_to_estimator and enable https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/multi_worker_strategy.py - a distribution strategy that takes a ClusterSpec as input and is capable of data-parallel distributed training. That, if I'm correct, would let us write boilerplate code that enables any tf.keras model to be trained with TFJob.
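
A rough sketch of that approach (not code from this example: it uses the strategy's later, non-contrib location, tf.distribute.experimental.MultiWorkerMirroredStrategy, rather than the contrib module linked above, and the model and model_dir are placeholders):

```python
import tensorflow as tf

def build_keras_model():
    # Placeholder model; the point is that any compiled tf.keras model can be dropped in.
    inputs = tf.keras.layers.Input(shape=(20,), dtype="int32", name="tokens")
    x = tf.keras.layers.Embedding(10000, 128)(inputs)
    x = tf.keras.layers.LSTM(64)(x)
    outputs = tf.keras.layers.Dense(10000, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# The strategy picks up the cluster definition from TF_CONFIG, which the TFJob
# controller sets on every replica, so no ClusterSpec has to be hard coded here.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=strategy,
                                    model_dir="gs://my-bucket/model")  # hypothetical path

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=build_keras_model(), config=run_config)
# estimator can now be driven by tf.estimator.train_and_evaluate as usual.
```

If this works, the same RunConfig boilerplate could wrap any compiled tf.keras model handed to it.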

@inc0 commented Jul 24, 2018

/assign inc0

@jlewi (Contributor, Author) commented Oct 24, 2018

#203 doesn't appear to properly train a model, so I'm reopening this bug.

If you look at the original PR, there was a FIXME
https://github.com/kubeflow/examples/pull/203/files#diff-400db61475ce06583dc398d366335851R143

indicating that the code is hitting
keras-team/keras#9761

#203 seems to have worked around that by dropping a layer and thus computing nonsensical predictions.

#265 attempts to fix that by matching the original code in terms of layer construction. #265 includes a test which reproduces the error.
#265 (comment)

In #265 we tried to train a valid model using Keras' fit function, but that turns out to have problems (#280) if we use the Keras that ships with TF.

So in #265 we reverted to not using the Keras that ships with TF so that we could train a model.
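
For clarity, a minimal sketch of that fallback (the flag name and structure are illustrative, not the PR's actual interface): keep a switch between standalone Keras and the Keras bundled with TensorFlow so the same trainer can be run either way.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--keras_backend", choices=["standalone", "tf"], default="standalone",
                    help="Which Keras implementation to build and train the model with.")
args = parser.parse_args()

if args.keras_backend == "tf":
    from tensorflow import keras  # Keras that ships with TF; fit() was problematic here (#280)
else:
    import keras  # standalone Keras; this is what #265 fell back to so training works

print("Using Keras version:", keras.__version__)
```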

jlewi added a commit to jlewi/examples that referenced this issue Nov 6, 2018
Create a single train.py and trainer.py which use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work.

See kubeflow#196
The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate code just to train with TF.Estimator.
We've added unit tests that can be used to verify that training with TF.Estimator works. The tests
can also be used to reproduce the current errors with TF.Estimator.
Add a Makefile to build the Docker image.

Add an NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet component.

Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py that waits until the input data becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available.
* That should be unnecessary since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (kubeflow#186)

The code wasn't actually calling model.fit().
Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
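
The fit/evaluate smoke test described above might look roughly like this (helper and class names are hypothetical, and the toy model stands in for the project's real model-building code, which the actual test would import from the trainer module):

```python
import unittest

import numpy as np
import tensorflow as tf

def build_model():
    # Stand-in for the project's model-building helper; shapes are placeholders.
    inputs = tf.keras.layers.Input(shape=(10,), dtype="int32", name="tokens")
    x = tf.keras.layers.Embedding(100, 16)(inputs)
    x = tf.keras.layers.LSTM(8)(x)
    outputs = tf.keras.layers.Dense(100, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

class TrainSmokeTest(unittest.TestCase):
    def test_fit_and_evaluate(self):
        model = build_model()
        x = np.random.randint(0, 100, size=(4, 10)).astype(np.int32)
        y = np.random.randint(0, 100, size=(4,)).astype(np.int32)
        # The test only asserts that fit/evaluate run without raising.
        model.fit(x, y, batch_size=2, epochs=1, verbose=0)
        model.evaluate(x, y, verbose=0)

if __name__ == "__main__":
    unittest.main()
```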
k8s-ci-robot pushed a commit that referenced this issue Nov 8, 2018
Unify the code for training with Keras and TF.Estimator (#265)

@amygdala (Collaborator) commented

Just a note that model_to_estimator() doesn't work b/c of passing encoder state to the decoder initial_state (that is, it doesn't work b/c initial_state is set).
I believe the underlying issue is that such a model isn't guaranteed to be serializable.
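
For illustration, a hedged sketch of the construct being described (layer names and sizes are made up, not copied from the gist): the encoder's final state is wired in as the decoder recurrent layer's initial_state, and that cross-layer wiring is what model_to_estimator can't handle.

```python
import tensorflow as tf

VOCAB, EMB, LATENT = 4278, 300, 256  # illustrative sizes

encoder_inputs = tf.keras.layers.Input(shape=(None,), name="encoder_tokens")
x = tf.keras.layers.Embedding(VOCAB, EMB, name="Encoder-Word-Embedding")(encoder_inputs)
_, encoder_state = tf.keras.layers.GRU(LATENT, return_state=True, name="Encoder-GRU")(x)

decoder_inputs = tf.keras.layers.Input(shape=(None,), name="decoder_tokens")
y = tf.keras.layers.Embedding(VOCAB, EMB, name="Decoder-Word-Embedding")(decoder_inputs)
# Passing initial_state here is the pattern that breaks model_to_estimator.
decoder_out, _ = tf.keras.layers.GRU(
    LATENT, return_state=True, return_sequences=True,
    name="Decoder-GRU")(y, initial_state=encoder_state)
outputs = tf.keras.layers.Dense(VOCAB, activation="softmax", name="Final-Output")(decoder_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")

# This is the conversion step that fails for this architecture:
# estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
```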

yixinshi pushed a commit to yixinshi/examples that referenced this issue Nov 30, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this issue Dec 6, 2018
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this issue Apr 1, 2019
stale bot commented Jun 27, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
