Add estimator example for github issues #203

Merged
merged 11 commits into from
Aug 25, 2018
9 changes: 8 additions & 1 deletion .gitignore
@@ -49,4 +49,11 @@ github_issue_summarization/build/

# Don't check in the go vendor directory
# We can just use the dep tool to install them.
github_issue_summarization/hp-tune/vendor

# Vim
[._]*.s[a-v][a-z]
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]
74 changes: 74 additions & 0 deletions github_issue_summarization/02_distributed_training.md
@@ -0,0 +1,74 @@
# Distributed training using Estimator

Requires TensorFlow 1.9 or later.
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
to make building models easier. In our distributed training example we show how the two
APIs work together to build models that can be trained both on a single node and in a
distributed manner.

## Keras and Estimators

The code required to run this example can be found in the [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.

You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
In our example we leverage the `model_to_estimator` function, which turns an existing tf.keras model into an Estimator. The resulting Estimator can be
trained in a distributed manner and integrates with the `TF_CONFIG` environment variable that is generated as part of a TFJob.
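
As a minimal sketch of that conversion (the architecture and `model_dir` below are illustrative, not the exact model used in this example):

```
import tensorflow as tf

# Build and compile a regular tf.keras model (illustrative architecture).
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Convert it to an Estimator. model_dir should point at shared storage
# (the "model" PVC in this example) so every replica sees the checkpoints.
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/model')
```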

## How to run it

Assuming you have already set up your Kubeflow cluster, all you need to do to try it out is:

```
kubectl create -f distributed/tfjob.yaml
```

## What just happened?

With the command above we created an instance of a Custom Resource that was defined during the Kubeflow
installation, namely a `TFJob`.

If you look at [tfjob.yaml](https://github.com/kubeflow/examples/blob/master/github_issue_summarization/distributed/tfjob.yaml), a few things are worth mentioning.

1. We create PVCs for the dataset and for the working directory of our TFJob, where models will be saved at the end.
2. Next we run a download [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) to download the dataset and save it on one of the PVCs.
3. Finally we run our TFJob.

## Understanding TFJob

Each TFJob runs 3 types of Pods.

The Master should always have exactly 1 replica. This is the main worker, which reports the status of the overall job.

The PS, or Parameter Server, is a Pod that holds all of the model weights. It can have any number of replicas; more than 1 is recommended for high availability.

The Worker is a Pod that runs the training. It can have any number of replicas.

Refer to the [Pod definition](https://kubernetes.io/docs/concepts/workloads/pods/pod/) documentation for details.
A TFJob Pod differs slightly from a regular Pod in that TFJob generates a `TF_CONFIG` environment variable in each Pod.
This variable is then consumed by the Estimator and used to orchestrate distributed training.
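
For illustration, the generated `TF_CONFIG` is a JSON document shaped roughly like this (the host names and ports here are hypothetical):

```
import json
import os

# What a TFJob with 1 master, 1 PS and 2 workers might inject; only the
# "task" section differs between Pods.
example_tf_config = {
    "cluster": {
        "master": ["issues-master-0:2222"],
        "ps": ["issues-ps-0:2222"],
        "worker": ["issues-worker-0:2222", "issues-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
}

# Training code reads the real value from the environment:
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
```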

## Understanding training code

A few things are required for this approach to work.

First we need to parse the cluster configuration variable. This is required to run different logic depending on the node's role, as sketched after this list:

1. If the node is a PS, run `server.join()`.
2. If the node is the Master, run feature preparation and parse the input dataset.
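
A minimal sketch of that per-role dispatch, assuming `TF_CONFIG` has already been parsed into a `tf_config` dict as shown earlier:

```
import tensorflow as tf

cluster = tf.train.ClusterSpec(tf_config["cluster"])
task_type = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]

# Start this Pod's in-process TensorFlow server.
server = tf.train.Server(cluster, job_name=task_type, task_index=task_index)

if task_type == "ps":
    # Parameter servers only host variables; block here forever.
    server.join()
elif task_type == "master":
    # Only the master prepares features and parses the input dataset
    # (actual preprocessing omitted in this sketch).
    pass
```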

After that we define the Keras model. Please refer to the [tf.keras documentation](https://www.tensorflow.org/guide/keras) for details.

Finally we use the `tf.keras.estimator.model_to_estimator` function to enable distributed training on this model.
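
One common way to drive the distributed training loop for such an Estimator is `tf.estimator.train_and_evaluate`; a hedged sketch (this example's `train.py` may drive the Estimator differently, and `train_input_fn`/`eval_input_fn` are placeholders for input functions like the one described in the next section):

```
import tensorflow as tf

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/model')

# train_and_evaluate reads TF_CONFIG and runs the role appropriate to
# this Pod, coordinating the distributed session for us.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=eval_input_fn))
```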

## Input function

Estimators use [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) as their default distribution model.
For that reason we need to prepare a function that slices the input data into batches, which are then run on each worker.
TensorFlow provides several utility functions to help with this, and because we use numpy arrays as our input, `tf.estimator.inputs.numpy_input_fn` is the perfect
tool for us. Please refer to the [documentation](https://www.tensorflow.org/guide/premade_estimators#create_input_functions) for more information.
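
A small sketch, assuming the encoded issue data is already in numpy arrays; the feature key must match the Keras model's input layer name (`body_input` and the shapes here are hypothetical):

```
import numpy as np
import tensorflow as tf

# Placeholder arrays standing in for the encoded issue data.
x_train = np.random.rand(1000, 100).astype(np.float32)
y_train = np.random.rand(1000, 1).astype(np.float32)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"body_input": x_train},  # key must match the model's input name
    y=y_train,
    batch_size=32,
    num_epochs=None,  # cycle indefinitely; max_steps bounds training
    shuffle=True)
```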

## Model

After training is complete, our model can be found in the "model" PVC.
93 changes: 0 additions & 93 deletions github_issue_summarization/02_tensor2tensor_training.md

This file was deleted.

2 changes: 1 addition & 1 deletion github_issue_summarization/README.md
@@ -30,7 +30,7 @@ By the end of this tutorial, you should learn how to:
methods using Jupyter Notebook or using TFJob:
- [Training the model using a Jupyter Notebook](02_training_the_model.md)
- [Training the model using TFJob](02_training_the_model_tfjob.md)
- [Distributed Training using tensor2tensor and TFJob](02_tensor2tensor_training.md)
- [Distributed Training using estimator and TFJob](02_distributed_training.md)
1. [Serving the model](03_serving_the_model.md)
1. [Querying the model](04_querying_the_model.md)
1. [Teardown](05_teardown.md)
14 changes: 14 additions & 0 deletions github_issue_summarization/distributed/Dockerfile
@@ -0,0 +1,14 @@
FROM python:3.6

RUN pip install --upgrade ktext annoy sklearn nltk tensorflow
RUN pip install --upgrade matplotlib ipdb
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y unzip
RUN mkdir /issues
WORKDIR /issues
COPY . /issues
RUN chmod +x /issues/download.sh
RUN mkdir /model
RUN mkdir /data

CMD python train.py
9 changes: 9 additions & 0 deletions github_issue_summarization/distributed/download.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

export DATA_DIR="/data"

# Download the github-issues.zip training data to ${DATA_DIR}
wget --directory-prefix=${DATA_DIR} https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip

# Unzip the file into the ${DATA_DIR} directory
unzip ${DATA_DIR}/github-issues.zip -d ${DATA_DIR}