forked from kubeflow/examples
Add estimator example for github issues (kubeflow#203)
* Add estimator example for github issues

  This is code input for doc about writing Keras for tfjob. There are a few todos:

  1. bug in dataset injection, can't raise number of steps
  2. instead of adding hostpath for data, we should have a quick job + pvc for this

* pylint
* wip
* confirmed working on minikube
* pylint
* remove t2t, add documentation
* add note about storageclass
* fix link
* remove code redundancy
* address review
* small language fix
Michał Jastrzębski authored and Yixin Shi committed Nov 30, 2018
1 parent 21fe841 · commit 78cd5c5
Showing 13 changed files with 414 additions and 257 deletions.
distributed/README.md (new file, +90 lines)
# Distributed training using Estimator

Requires TensorFlow 1.9 or later.
Requires a [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.

On GKE you can follow the [GCFS documentation](https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow) to enable it.

Estimator and Keras are both part of TensorFlow. These high-level APIs are designed
to make building models easier. In our distributed training example we will show how both
APIs work together to help build models that can be trained both on a single node and
in a distributed manner.

## Keras and Estimators

The code required to run this example can be found in the [distributed](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/distributed) directory.

You can read more about Estimators [here](https://www.tensorflow.org/guide/estimators).
In our example we will leverage the `model_to_estimator` function, which turns an existing tf.keras model into an Estimator. The resulting Estimator can be trained in a distributed manner and integrates seamlessly with the `TF_CONFIG` environment variable that is generated as part of a TFJob.

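As a minimal sketch of that conversion, using a toy stand-in model rather than the actual model from this example's train.py:

```python
import tensorflow as tf

# Toy stand-in model; the real example builds its model in train.py.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
# model_to_estimator requires a compiled model.
model.compile(optimizer="adam", loss="mse")

# The resulting Estimator reads TF_CONFIG from the environment, so the
# same training code runs single-node or distributed under a TFJob.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
```

The Estimator exposes the usual `train`/`evaluate` interface, so no further changes are needed to the training loop.
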
## How to run it

First, create the PVCs and download the data. At this point it is worth verifying that the PVCs use the correct StorageClass.

```
kubectl create -f distributed/storage.yaml
```

Once the download job finishes, you can start training with:

```
kubectl create -f distributed/tfjob.yaml
```

## Building the image

To build the image, run:

```
docker build . -f distributed/Dockerfile
```

## What just happened?

With the command above we created an instance of `TFJob`, a Custom Resource that was defined during the Kubeflow installation.

If you look at [tfjob.yaml](https://github.com/kubeflow/examples/blob/master/github_issue_summarization/distributed/tfjob.yaml), a few things are worth mentioning:

1. We create PVCs for the data and for the working directory of our TFJob, where models will be saved at the end.
2. Next we run a download [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) to fetch the dataset and save it on one of the PVCs.
3. Finally we run our TFJob.

## Understanding TFJob

Each TFJob runs 3 types of Pods:

* Master should always have 1 replica. This is the main worker, which reports the status of the overall job.
* PS, or parameter server, is a Pod that holds the model weights. It can have any number of replicas; more than 1 is recommended, since the load is then spread between replicas, which increases performance for I/O-bound training.
* Worker is a Pod that runs the training. It can have any number of replicas.

Refer to the [Pod definition](https://kubernetes.io/docs/concepts/workloads/pods/pod/) documentation for details.
A TFJob differs slightly from a regular Pod in that it generates a `TF_CONFIG` environment variable in each of its Pods.
This variable is then consumed by the Estimator and used to orchestrate distributed training.
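
The variable is plain JSON. An illustrative value (the host names and ports below are made up for this sketch, not taken from a real cluster) can be inspected with the standard json module:

```python
import json

# Illustrative TF_CONFIG, similar in shape to what a TFJob injects:
# a cluster spec listing each replica type, plus this Pod's own task.
tf_config = json.dumps({
    "cluster": {
        "master": ["github-master-0:2222"],
        "ps": ["github-ps-0:2222", "github-ps-1:2222", "github-ps-2:2222"],
        "worker": ["github-worker-0:2222", "github-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

config = json.loads(tf_config)
role, index = config["task"]["type"], config["task"]["index"]
print(role, index)  # which role this Pod plays in the cluster
```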

## Understanding the training code

A few things are required for this approach to work.

First we need to parse the `TF_CONFIG` variable. This is required to run different logic depending on the node's role:

1. If the node is a PS, run `server.join()`.
2. If the node is the Master, run feature preparation and parse the input dataset.
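
The dispatch above can be sketched as follows (a simplified illustration; the actual logic lives in this example's train.py):

```python
import json
import os

def node_role(environ=os.environ):
    """Parse TF_CONFIG and return (job_type, task_index).

    Defaults to a single-node master when the variable is absent."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = tf_config.get("task", {})
    return task.get("type", "master"), task.get("index", 0)

job_type, task_index = node_role(
    {"TF_CONFIG": json.dumps({"task": {"type": "ps", "index": 0}})})

if job_type == "ps":
    pass  # a parameter server would block here, e.g. server.join()
elif job_type == "master":
    pass  # the master prepares features and parses the input dataset
```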

After that we define the Keras model. Please refer to the [tf.keras documentation](https://www.tensorflow.org/guide/keras).

Finally we use the `tf.keras.estimator.model_to_estimator` function to enable distributed training of this model.

## Input function

Estimators use [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism) as their default distribution model.
For that reason we need to provide an input function that slices the input data into batches, which are then processed on each worker.
TensorFlow provides several utility functions to help with that; because we use a numpy array as our input, `tf.estimator.inputs.numpy_input_fn` is the perfect tool for us. Please refer to the [documentation](https://www.tensorflow.org/guide/premade_estimators#create_input_functions) for more information.
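
The batching idea can be illustrated without TensorFlow; `numpy_input_fn` additionally handles shuffling, repeating, and queueing for you:

```python
def to_batches(data, batch_size):
    """Slice a sequence into consecutive fixed-size batches, mirroring
    the basic slicing an input function performs on the training data."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

batches = to_batches(list(range(10)), batch_size=4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```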

## Model

After training is complete, the model can be found in the "model" PVC.
distributed/Dockerfile (new file, +16 lines)
```
FROM python:3.6

RUN pip install --upgrade ktext annoy sklearn nltk tensorflow
RUN pip install --upgrade matplotlib ipdb
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y unzip
RUN mkdir /issues
WORKDIR /issues
COPY distributed /issues
COPY notebooks/seq2seq_utils.py /issues
COPY ks-kubeflow/components/download_data.sh /issues
RUN chmod +x /issues/download_data.sh
RUN mkdir /model
RUN mkdir /data

CMD python train.py
```
distributed/storage.yaml (new file, +53 lines)
```
# You will need NFS storage class, or any other ReadWriteMany storageclass
# Quick and easy way to get it is https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner
# For GKE you can use GCFS https://master.kubeflow.org/docs/started/getting-started-gke/#using-gcfs-with-kubeflow

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: models
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 49Gi

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 49Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: download
spec:
  template:
    spec:
      containers:
      - name: tensorflow
        image: inc0/issues
        command: ["/issues/download_data.sh", "https://storage.googleapis.com/kubeflow-examples/github-issue-summarization-data/github-issues.zip", "/data"]
        volumeMounts:
        - name: data
          mountPath: "/data"
        - name: models
          mountPath: "/model"
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data
      - name: models
        persistentVolumeClaim:
          claimName: models
      restartPolicy: Never
  backoffLimit: 3
```
distributed/tfjob.yaml (new file, +69 lines)
```
---
apiVersion: "kubeflow.org/v1alpha2"
kind: TFJob
metadata:
  name: github
spec:
  tfReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
    Worker:
      replicas: 5
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
    PS:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: inc0/issues
            command: ["python", "/issues/train.py"]
            volumeMounts:
            - name: data
              mountPath: "/data"
            - name: models
              mountPath: "/model"
            ports:
            - containerPort: 6006
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: data
          - name: models
            persistentVolumeClaim:
              claimName: models
```