# Distributed training using Estimator

Distributed training with Keras currently does not work. Do not follow this guide until the upstream issues tracking this have been resolved.

This example requires TensorFlow 1.9 or later and a StorageClass capable of creating ReadWriteMany persistent volumes.

On GKE you can follow the GCFS documentation to enable this.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed to make building models easier. In our distributed training example, we will show how the two APIs work together to build models that can be trained both on a single node and in a distributed manner.

## Keras and Estimators

The code required to run this example can be found in the `distributed` directory.

You can read more about Estimators here. In our example we will leverage the `model_to_estimator` function, which turns an existing `tf.keras` model into an Estimator. The resulting model can be trained in a distributed manner and integrates seamlessly with the `TF_CONFIG` environment variable that is generated as part of a TFJob.
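
As a minimal sketch of the conversion (the toy model below is a placeholder for illustration, not the model from the example code):

```python
import tensorflow as tf

# A toy compiled tf.keras model, purely for illustration.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# model_to_estimator wraps the compiled Keras model in a tf.estimator.Estimator,
# which understands TF_CONFIG and can therefore take part in distributed training.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
```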

## How to run it

First, create the PVCs and download the data. At this point it is worth verifying that the PVCs use the correct StorageClass.

```
kubectl create -f distributed/storage.yaml
```

Once the download job finishes, you can start training with:

```
kubectl create -f distributed/tfjob.yaml
```

## Building the image

To build the image, run:

```
docker build . -f distributed/Dockerfile
```

## What just happened?

With the commands above we created an instance of TFJob, a Custom Resource that was defined during the Kubeflow installation.

If you look at `tfjob.yaml`, a few things are worth mentioning:

1. We create PVCs for the data and for the working directory of our TFJob, where models will be saved at the end.
2. Next, we run a download Job to fetch the dataset and save it on one of the PVCs.
3. Finally, we run our TFJob.

## Understanding TFJob

Each TFJob runs three types of Pods:

**Master** should always have exactly 1 replica. This is the main worker, which reports the status of the overall job.

**PS**, or parameter server, is a Pod that holds the model's weights. It can have any number of replicas; more than 1 is recommended, since the load is then spread across the replicas, which improves performance for I/O-bound training.

**Worker** is a Pod that runs the training. It can have any number of replicas.

Refer to the Pod definition documentation for details. A TFJob differs slightly from a regular Pod in that it generates a `TF_CONFIG` environment variable in each Pod. This variable is then consumed by the Estimator and used to orchestrate distributed training.
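
For illustration, the `TF_CONFIG` value injected into one worker Pod looks roughly like the structure below (the job name, host names, and ports are hypothetical placeholders):

```python
# Illustrative TF_CONFIG contents for one worker Pod; TFJob serializes
# this as a JSON string into the TF_CONFIG environment variable.
# Host names and ports below are hypothetical placeholders.
{
    "cluster": {
        "master": ["myjob-master-0:2222"],
        "ps": ["myjob-ps-0:2222", "myjob-ps-1:2222"],
        "worker": ["myjob-worker-0:2222", "myjob-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
}
```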

## Understanding the training code

A few things are required for this approach to work.

First, we need to parse the `TF_CONFIG` variable. This is required to run different logic depending on the node's role, as shown in the sketch after this list:

1. If the node is a PS, run `server.join()`.
2. If the node is the Master, run feature preparation and parse the input dataset.
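
A minimal sketch of this dispatch, assuming the `TF_CONFIG` layout shown earlier (the actual example code may structure this differently):

```python
import json
import os

import tensorflow as tf

# TF_CONFIG is injected by TFJob into every Pod of the job.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})

if task.get("type") == "ps":
    # Parameter servers only serve variables: start a server and block.
    cluster = tf.train.ClusterSpec(tf_config["cluster"])
    server = tf.train.Server(cluster,
                             job_name=task["type"],
                             task_index=task["index"])
    server.join()
elif task.get("type") == "master":
    # The master prepares features and parses the input dataset here,
    # before training starts (omitted in this sketch).
    pass
```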

After that, we define the Keras model. Please refer to the `tf.keras` documentation for details.

Finally, we use the `tf.keras.estimator.model_to_estimator` function to enable distributed training of this model.
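
Once converted, training can be driven with the standard Estimator API. A minimal sketch, assuming the `estimator` from the conversion shown earlier and input functions `train_input_fn` / `eval_input_fn` built as in the next section (the actual example code may drive training differently):

```python
import tensorflow as tf

# train_and_evaluate reads TF_CONFIG and coordinates the master, workers,
# and parameter servers; each Pod runs the behaviour matching its role.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```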

## Input function

Estimators use data parallelism as their default distribution model. For that reason we need to prepare a function that slices the input data into batches, which are then processed on each worker. TensorFlow provides several utility functions to help with this, and because we use a numpy array as our input, `tf.estimator.inputs.numpy_input_fn` is the perfect tool for us. Please refer to its documentation for more information.
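
A sketch of such an input function; the arrays and the feature key below are hypothetical, and the key must match your model's input layer name:

```python
import numpy as np
import tensorflow as tf

# Hypothetical NumPy data standing in for the real dataset.
x_train = np.random.rand(1000, 4).astype(np.float32)
y_train = np.random.rand(1000, 1).astype(np.float32)

# numpy_input_fn slices the arrays into shuffled batches; each worker
# then draws batches from it independently (data parallelism).
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"dense_input": x_train},  # key must match the Keras input layer name
    y=y_train,
    batch_size=32,
    num_epochs=None,  # repeat indefinitely; max_steps bounds training instead
    shuffle=True,
)
```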

## Model

After training is complete, our model can be found in the `model` PVC.

Next: Serving the model

Back: Setup a kubeflow cluster