Add estimator example for github issues #203

Merged
merged 11 commits into from
Aug 25, 2018
9 changes: 8 additions & 1 deletion .gitignore
@@ -49,4 +49,11 @@ github_issue_summarization/build/

# Don't check in the go vendor directory
# We can just use the dep tool to install them.
github_issue_summarization/hp-tune/vendor

# Vim
[._]*.s[a-v][a-z]
[._]*.sw[a-p]
[._]s[a-rt-v][a-z]
[._]ss[a-gi-z]
[._]sw[a-p]
74 changes: 74 additions & 0 deletions github_issue_summarization/02_distributed_training.md
@@ -0,0 +1,74 @@
# Distributed training using Estimator

Requires TensorFlow 1.9 or later.
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
to make building models easier. In our distributed training example we show how the two
APIs work together to build models that can be trained both on a single node and in a
distributed manner.

## Keras and Estimators

The code required to run this example can be found in the [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.

You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
In our example we leverage the `model_to_estimator` function, which turns an existing tf.keras model into an Estimator. The resulting Estimator can be
trained in a distributed manner and integrates with the `TF_CONFIG` environment variable that is generated as part of a TFJob.
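
As a minimal sketch of that conversion (the architecture and `model_dir` below are illustrative, not the exact model used in this example):

```
import tensorflow as tf

# Build and compile a regular tf.keras model (illustrative architecture).
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Convert it to an Estimator. model_dir should point at shared storage
# (the "model" PVC in this example) so every replica sees the checkpoints.
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/model')
```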

## How to run it

Assuming you have already set up your Kubeflow cluster, all you need to do to try it out is:

```
kubectl create -f distributed/tfjob.yaml
```

## What just happened?

With the command above we created an instance of a Custom Resource that was defined during the Kubeflow
installation, namely a `TFJob`.

If you look at [tfjob.yaml](https://github.com/kubeflow/examples/blob/master/github_issue_summarization/distributed/tfjob.yaml), a few things are worth mentioning.

1. We create PVCs for the dataset and for the working directory of our TFJob, where models will be saved at the end.
2. Next we run a download [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) to download the dataset and save it on one of the PVCs.
3. Finally we run our TFJob.

## Understanding TFJob

Each TFJob runs 3 types of Pods.

The Master should always have exactly 1 replica. This is the main worker, which reports the status of the overall job.

The PS, or Parameter Server, is a Pod that holds all of the model weights. It can have any number of replicas; more than 1 is recommended for high availability.

The Worker is a Pod that runs the training. It can have any number of replicas.

Refer to the [Pod definition](https://kubernetes.io/docs/concepts/workloads/pods/pod/) documentation for details.
A TFJob Pod differs slightly from a regular Pod in that TFJob generates a `TF_CONFIG` environment variable in each Pod.
This variable is then consumed by the Estimator and used to orchestrate distributed training.
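
For illustration, the generated `TF_CONFIG` is a JSON document shaped roughly like this (the host names and ports here are hypothetical):

```
import json
import os

# What a TFJob with 1 master, 1 PS and 2 workers might inject; only the
# "task" section differs between Pods.
example_tf_config = {
    "cluster": {
        "master": ["issues-master-0:2222"],
        "ps": ["issues-ps-0:2222"],
        "worker": ["issues-worker-0:2222", "issues-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
}

# Training code reads the real value from the environment:
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
```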

## Understanding training code

A few things are required for this approach to work.

First we need to parse the cluster configuration variable. This is required to run different logic depending on the node's role, as sketched after this list:

1. If the node is a PS, run `server.join()`.
2. If the node is the Master, run feature preparation and parse the input dataset.
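
A minimal sketch of that per-role dispatch, assuming `TF_CONFIG` has already been parsed into a `tf_config` dict as shown earlier:

```
import tensorflow as tf

cluster = tf.train.ClusterSpec(tf_config["cluster"])
task_type = tf_config["task"]["type"]
task_index = tf_config["task"]["index"]

# Start this Pod's in-process TensorFlow server.
server = tf.train.Server(cluster, job_name=task_type, task_index=task_index)

if task_type == "ps":
    # Parameter servers only host variables; block here forever.
    server.join()
elif task_type == "master":
    # Only the master prepares features and parses the input dataset
    # (actual preprocessing omitted in this sketch).
    pass
```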

After that we define the Keras model. Please refer to the [tf.keras documentation](https://www.tensorflow.org/guide/keras) for details.

Finally we use the `tf.keras.estimator.model_to_estimator` function to enable distributed training on this model.
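
One common way to drive the distributed training loop for such an Estimator is `tf.estimator.train_and_evaluate`; a hedged sketch (this example's `train.py` may drive the Estimator differently, and `train_input_fn`/`eval_input_fn` are placeholders for input functions like the one described in the next section):

```
import tensorflow as tf

estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/model')

# train_and_evaluate reads TF_CONFIG and runs the role appropriate to
# this Pod, coordinating the distributed session for us.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000),
    tf.estimator.EvalSpec(input_fn=eval_input_fn))
```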

## Input function

Estimators use [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) as their default distribution model.
For that reason we need to prepare a function that slices the input data into batches, which are then run on each worker.
TensorFlow provides several utility functions to help with this, and because we use numpy arrays as our input, `tf.estimator.inputs.numpy_input_fn` is the perfect
tool for us. Please refer to the [documentation](https://www.tensorflow.org/guide/premade_estimators#create_input_functions) for more information.
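
A small sketch, assuming the encoded issue data is already in numpy arrays; the feature key must match the Keras model's input layer name (`body_input` and the shapes here are hypothetical):

```
import numpy as np
import tensorflow as tf

# Placeholder arrays standing in for the encoded issue data.
x_train = np.random.rand(1000, 100).astype(np.float32)
y_train = np.random.rand(1000, 1).astype(np.float32)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"body_input": x_train},  # key must match the model's input name
    y=y_train,
    batch_size=32,
    num_epochs=None,  # cycle indefinitely; max_steps bounds training
    shuffle=True)
```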

## Model

After training is complete, our model can be found in the "model" PVC.
93 changes: 0 additions & 93 deletions github_issue_summarization/02_tensor2tensor_training.md

This file was deleted.

2 changes: 1 addition & 1 deletion github_issue_summarization/README.md
@@ -30,7 +30,7 @@ By the end of this tutorial, you should learn how to:
methods using Jupyter Notebook or using TFJob:
- [Training the model using a Jupyter Notebook](02_training_the_model.md)
- [Training the model using TFJob](02_training_the_model_tfjob.md)
- [Distributed Training using tensor2tensor and TFJob](02_tensor2tensor_training.md)
- [Distributed Training using estimator and TFJob](02_distributed_training.md)
1. [Serving the model](03_serving_the_model.md)
1. [Querying the model](04_querying_the_model.md)
1. [Teardown](05_teardown.md)
14 changes: 14 additions & 0 deletions github_issue_summarization/distributed/Dockerfile
@@ -0,0 +1,14 @@
FROM python:3.6

RUN pip install --upgrade ktext annoy sklearn nltk tensorflow
RUN pip install --upgrade matplotlib ipdb
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y unzip
RUN mkdir /issues
WORKDIR /issues
COPY . /issues
RUN chmod +x /issues/download.sh
RUN mkdir /model
RUN mkdir /data

CMD python train.py
9 changes: 9 additions & 0 deletions github_issue_summarization/distributed/download.sh
@@ -0,0 +1,9 @@
#!/usr/bin/env bash

export DATA_DIR="/data"

# Download the github-issues.zip training data to ${DATA_DIR}
wget --directory-prefix=${DATA_DIR} https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip

# Unzip the file into the ${DATA_DIR} directory
unzip ${DATA_DIR}/github-issues.zip -d ${DATA_DIR}