forked from kubeflow/examples
Add estimator example for github issues (kubeflow#203)
* Add estimator example for github issues

  This is code input for doc about writing Keras for tfjob. There are a few todos:

  1. bug in dataset injection, can't raise number of steps
  2. instead of adding hostpath for data, we should have a quick job + pvc for this

* pylint
* wip
* confirmed working on minikube
* pylint
* remove t2t, add documentation
* add note about storageclass
* fix link
* remove code redundancy
* address review
* small language fix
Michał Jastrzębski authored and Yixin Shi committed Nov 30, 2018
1 parent 21fe841 · commit 78cd5c5
Showing 13 changed files with 414 additions and 257 deletions.
distributed/README.md (new file, +90 lines)
# Distributed training using Estimator

Requires TensorFlow 1.9 or later.
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

On GKE you can follow the [GCFS documentation](https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow) to enable it.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
to make building models easier. In our distributed training example we will show how both
APIs work together to help build models that can be trained both on a single node and
in a distributed manner.

## Keras and Estimators

The code required to run this example can be found in the [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.

You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
In our example we will leverage the `model_to_estimator` function, which turns an existing tf.keras model into an Estimator. The resulting Estimator can be trained in a distributed manner and integrates seamlessly with the `TF_CONFIG` environment variable that is generated as part of a TFJob.

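As a minimal sketch of that conversion, using a toy stand-in model rather than the actual model from this example's train.py:

```python
import tensorflow as tf

# Toy stand-in model; the real example builds its model in train.py.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
# model_to_estimator requires a compiled model.
model.compile(optimizer="adam", loss="mse")

# The resulting Estimator reads TF_CONFIG from the environment, so the
# same training code runs single-node or distributed under a TFJob.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
```

The Estimator exposes the usual `train`/`evaluate` interface, so no further changes are needed to the training loop.
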
## How to run it

First, create the PVCs and download the data. At this point it is worth verifying that the PVCs use the correct StorageClass.

```
kubectl create -f distributed/storage.yaml
```

Once the download job finishes, you can start training with:

```
kubectl create -f distributed/tfjob.yaml
```

## Building the image

To build the image, run:

```
docker build . -f distributed/Dockerfile
```

## What just happened?

With the command above we created an instance of `TFJob`, a Custom Resource that was defined during the Kubeflow installation.

If you look at [tfjob.yaml](https://github.com/kubeflow/examples/blob/master/github_issue_summarization/distributed/tfjob.yaml), a few things are worth mentioning:

1. We create PVCs for the data and for the working directory of our TFJob, where models will be saved at the end.
2. Next we run a download [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) to fetch the dataset and save it on one of the PVCs.
3. Finally we run our TFJob.

## Understanding TFJob

Each TFJob runs 3 types of Pods:

* Master should always have 1 replica. This is the main worker, which reports the status of the overall job.
* PS, or parameter server, is a Pod that holds the model weights. It can have any number of replicas; more than 1 is recommended, since the load is then spread between replicas, which increases performance for I/O-bound training.
* Worker is a Pod that runs the training. It can have any number of replicas.

Refer to the [Pod definition](https://kubernetes.io/docs/concepts/workloads/pods/pod/) documentation for details.
A TFJob differs slightly from a regular Pod in that it generates a `TF_CONFIG` environment variable in each of its Pods.
This variable is then consumed by the Estimator and used to orchestrate distributed training.
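
The variable is plain JSON. An illustrative value (the host names and ports below are made up for this sketch, not taken from a real cluster) can be inspected with the standard json module:

```python
import json

# Illustrative TF_CONFIG, similar in shape to what a TFJob injects:
# a cluster spec listing each replica type, plus this Pod's own task.
tf_config = json.dumps({
    "cluster": {
        "master": ["github-master-0:2222"],
        "ps": ["github-ps-0:2222", "github-ps-1:2222", "github-ps-2:2222"],
        "worker": ["github-worker-0:2222", "github-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

config = json.loads(tf_config)
role, index = config["task"]["type"], config["task"]["index"]
print(role, index)  # which role this Pod plays in the cluster
```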

## Understanding the training code

A few things are required for this approach to work.

First we need to parse the `TF_CONFIG` variable. This is required to run different logic depending on the node's role:

1. If the node is a PS, run `server.join()`.
2. If the node is the Master, run feature preparation and parse the input dataset.
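
The dispatch above can be sketched as follows (a simplified illustration; the actual logic lives in this example's train.py):

```python
import json
import os

def node_role(environ=os.environ):
    """Parse TF_CONFIG and return (job_type, task_index).

    Defaults to a single-node master when the variable is absent."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return task.get("type", "master"), task.get("index", 0)

job_type, task_index = node_role(
    {"TF_CONFIG": json.dumps({"task": {"type": "ps", "index": 0}})})

if job_type == "ps":
    pass  # a parameter server would block here, e.g. server.join()
elif job_type == "master":
    pass  # the master prepares features and parses the input dataset
```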

After that we define the Keras model. Please refer to the [tf.keras documentation](https://www.tensorflow.org/guide/keras).

Finally we use the `tf.keras.estimator.model_to_estimator` function to enable distributed training of this model.

## Input function

Estimators use [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) as their default distribution model.
For that reason we need to provide an input function that slices the input data into batches, which are then processed on each worker.
TensorFlow provides several utility functions to help with that; because we use a numpy array as our input, `tf.estimator.inputs.numpy_input_fn` is the perfect tool for us. Please refer to the [documentation](https://www.tensorflow.org/guide/premade_estimators#create_input_functions) for more information.
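
The batching idea can be illustrated without TensorFlow; `numpy_input_fn` additionally handles shuffling, repeating, and queueing for you:

```python
def to_batches(data, batch_size):
    """Slice a sequence into consecutive fixed-size batches, mirroring
    the basic slicing an input function performs on the training data."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

batches = to_batches(list(range(10)), batch_size=4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```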

## Model

After training is complete, the model can be found in the "model" PVC.
distributed/Dockerfile (new file, +16 lines)
```
FROM python:3.6

RUN pip install --upgrade ktext annoy sklearn nltk tensorflow
RUN pip install --upgrade matplotlib ipdb
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y unzip
RUN mkdir /issues
WORKDIR /issues
COPY distributed /issues
COPY notebooks/seq2seq_utils.py /issues
COPY ks-kubeflow/components/download_data.sh /issues
RUN chmod +x /issues/download_data.sh
RUN mkdir /model
RUN mkdir /data

CMD python train.py
```
distributed/storage.yaml (new file, +53 lines)
```
# You will need NFS storage class, or any other ReadWriteMany storageclass
# Quick and easy way to get it is https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner
# For GKE you can use GCFS https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: models
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 49Gi

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 49Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: download
spec:
  template:
    spec:
      containers:
      - name: tensorflow
        image: inc0/issues
        command: ["/issues/download_data.sh", "https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip", "/data"]
        volumeMounts:
        - name: data
          mountPath: "/data"
        - name: models
          mountPath: "/model"
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data
      - name: models
        persistentVolumeClaim:
          claimName: models
      restartPolicy: Never
  backoffLimit: 3
```
distributed/tfjob.yaml (new file, +69 lines)
```
---
apiVersion: "kubeflow.org/v1alpha2"
kind: TFJob
metadata:
  name: github
spec:
  tfReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
    Worker:
      replicas: 5
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
    PS:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
            ports:
            - containerPort: 6006
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
```