[Sample] CI Sample: Kaggle (#3021)
* kaggle sample

* code path

* fix typo

* visualize table component

* visualize html

* train model step

* submit result

* real image

* fix typo

* push before use

* sed to replace image in component.yaml

* general instructions

* typos; more robust; better code style

* notice about gcp sa and workload identity choice
dldaisy authored Mar 27, 2020
1 parent 14a56ba commit 081ee74
Showing 19 changed files with 745 additions and 0 deletions.
File: README.md
# Kaggle Competition Pipeline Sample

## Pipeline Overview

This is a pipeline for [house price prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques), an entry-level competition on Kaggle. It demonstrates how to complete a Kaggle competition with a pipeline of steps that download the data, preprocess and visualize it, train a model, and submit the results to the Kaggle website.

* We refer to [the notebook by Raj Kumar Gupta](https://www.kaggle.com/rajgupta5/house-price-prediction) and [the notebook by Sergei Neviadomski](https://www.kaggle.com/neviadomski/how-to-get-to-top-25-with-simple-model-sklearn) for the model implementation and data visualization.

* We use the [Kaggle python API](https://github.com/Kaggle/kaggle-api) to interact with the Kaggle site, for example to download data and submit results. See its documentation for more usage.

* We use [Cloud Build](https://cloud.google.com/cloud-build/) for the CI process: a build and run are triggered automatically whenever code is pushed to the GitHub repo. You need to set up a Cloud Build trigger on your GitHub repo branch to enable this CI process.

## Notice
* You can authenticate to GCP services in either of two ways: create a "user-gcp-sa" secret by following the troubleshooting section of the [Kubeflow Pipelines repo](https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize), or configure Workload Identity as instructed in [this guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity). This sample uses the first method, but it will soon be deprecated; we recommend using Workload Identity to replace the "user-gcp-sa" service account going forward.

## Usage

* Substitute the constants under "substitutions" in cloudbuild.yaml.
* Fill in your kaggle_username and kaggle_key in the Dockerfiles (in the "download_dataset" and "submit_result" folders) to authenticate to Kaggle. You can get them from an API token created on your Kaggle "My Account" page.
* Set up a Cloud Build trigger for your GitHub repo to enable continuous integration.
* Replace CLOUDSDK_COMPUTE_ZONE and CLOUDSDK_CONTAINER_CLUSTER in cloudbuild.yaml with your own zone and cluster.
* Enable the "Kubernetes Engine Developer" role in the Cloud Build settings.
* Make your GCS bucket public, or grant Cloud Storage access to Cloud Build and Kubeflow Pipelines.
* Commit and push to your GitHub repo to trigger the build.
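The steps above can also be exercised without pushing to GitHub by submitting a build manually. This is a sketch, assuming a configured gcloud CLI; the registry path, bucket, and pipeline id are placeholders to replace, and COMMIT_SHA is supplied by hand because a manual build has no triggering commit:

```shell
# Hypothetical one-off build; normally the Cloud Build trigger fills in COMMIT_SHA.
gcloud builds submit . \
  --config=cloudbuild.yaml \
  --substitutions=COMMIT_SHA=manual-test,_GCR_PATH=gcr.io/my-project-id,_GS_BUCKET=gs://my-project-bucket,_PIPELINE_ID=<your-pipeline-id>
```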
File: cloudbuild.yaml
steps:
- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_download:latest",
      "${_CODE_PATH}/download_dataset",
      "-f",
      "${_CODE_PATH}/download_dataset/Dockerfile",
    ]
  id: "BuildDownloadDataImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_download:$COMMIT_SHA",
    ]
  id: "PushDownloadDataImage"
  waitFor: ["BuildDownloadDataImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_table:latest",
      "${_CODE_PATH}/visualize_table",
      "-f",
      "${_CODE_PATH}/visualize_table/Dockerfile",
    ]
  id: "BuildVisualizeTableImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA",
    ]
  id: "PushVisualizeTableImage"
  waitFor: ["BuildVisualizeTableImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_visualize_html:latest",
      "${_CODE_PATH}/visualize_html",
      "-f",
      "${_CODE_PATH}/visualize_html/Dockerfile",
    ]
  id: "BuildVisualizeHTMLImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA",
    ]
  id: "PushVisualizeHTMLImage"
  waitFor: ["BuildVisualizeHTMLImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_train:latest",
      "${_CODE_PATH}/train_model",
      "-f",
      "${_CODE_PATH}/train_model/Dockerfile",
    ]
  id: "BuildTrainImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_train:$COMMIT_SHA",
    ]
  id: "PushTrainImage"
  waitFor: ["BuildTrainImage"]

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "build",
      "-t",
      "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
      "-t",
      "${_GCR_PATH}/kaggle_submit:latest",
      "${_CODE_PATH}/submit_result",
      "-f",
      "${_CODE_PATH}/submit_result/Dockerfile",
    ]
  id: "BuildSubmitImage"

- name: "gcr.io/cloud-builders/docker"
  args:
    [
      "push",
      "${_GCR_PATH}/kaggle_submit:$COMMIT_SHA",
    ]
  id: "PushSubmitImage"
  waitFor: ["BuildSubmitImage"]

- name: "python:3.7-slim"
  entrypoint: "/bin/sh"
  args: [
    "-c",
    "set -ex;
    cd ${_CODE_PATH};
    pip3 install cffi==1.12.3 --upgrade;
    pip3 install kfp==0.1.38;
    sed -i 's|image: download_image_location|image: ${_GCR_PATH}/kaggle_download:$COMMIT_SHA|g' ./download_dataset/component.yaml;
    sed -i 's|image: visualizetable_image_location|image: ${_GCR_PATH}/kaggle_visualize_table:$COMMIT_SHA|g' ./visualize_table/component.yaml;
    sed -i 's|image: visualizehtml_image_location|image: ${_GCR_PATH}/kaggle_visualize_html:$COMMIT_SHA|g' ./visualize_html/component.yaml;
    sed -i 's|image: train_image_location|image: ${_GCR_PATH}/kaggle_train:$COMMIT_SHA|g' ./train_model/component.yaml;
    sed -i 's|image: submit_image_location|image: ${_GCR_PATH}/kaggle_submit:$COMMIT_SHA|g' ./submit_result/component.yaml;
    python pipeline.py
    --gcr_address ${_GCR_PATH};
    cp pipeline.py.zip /workspace/pipeline.zip",
  ]
  id: "KagglePackagePipeline"

- name: "gcr.io/cloud-builders/gsutil"
  args:
    [
      "cp",
      "/workspace/pipeline.zip",
      "${_GS_BUCKET}/$COMMIT_SHA/pipeline.zip"
    ]
  id: "KaggleUploadPipeline"
  waitFor: ["KagglePackagePipeline"]

- name: "gcr.io/cloud-builders/kubectl"
  entrypoint: "/bin/sh"
  args: [
    "-c",
    "cd ${_CODE_PATH};
    apt-get update;
    apt-get install -y python3-pip;
    apt-get install -y libssl-dev libffi-dev;
    /builder/kubectl.bash;
    pip3 install kfp;
    pip3 install kubernetes;
    python3 create_pipeline_version_and_run.py
    --pipeline_id ${_PIPELINE_ID}
    --commit_sha $COMMIT_SHA
    --bucket_name ${_GS_BUCKET}
    --gcr_address ${_GCR_PATH}"
  ]
  env:
    - "CLOUDSDK_COMPUTE_ZONE=[Your cluster zone, for example: us-central1-a]"
    - "CLOUDSDK_CONTAINER_CLUSTER=[Your cluster name, for example: my-cluster]"
  id: "KaggleCreatePipelineVersionAndRun"

images:
- "${_GCR_PATH}/kaggle_download:latest"
- "${_GCR_PATH}/kaggle_visualize_table:latest"
- "${_GCR_PATH}/kaggle_visualize_html:latest"
- "${_GCR_PATH}/kaggle_train:latest"
- "${_GCR_PATH}/kaggle_submit:latest"

substitutions:
  _CODE_PATH: /workspace/samples/contrib/versioned-pipeline-ci-samples/kaggle-ci-sample
  _NAMESPACE: kubeflow
  _GCR_PATH: [Your cloud registry path. For example, gcr.io/my-project-id]
  _GS_BUCKET: [Name of your cloud storage bucket. For example, gs://my-project-bucket]
  _PIPELINE_ID: [Your kubeflow pipeline id to create a version on. Get it from the Kubeflow Pipelines UI. For example, f6f8558a-6eec-4ef4-b343-a650473ee613]
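The "KagglePackagePipeline" step pins each component to the freshly built image by rewriting the `image:` placeholder in each component.yaml with sed before compiling the pipeline. The rewrite can be reproduced locally on a scratch file (the registry path and tag below are made-up values):

```shell
# Reproduce the sed rewrite from the packaging step against a throwaway component.yaml.
printf 'implementation:\n  container:\n    image: download_image_location\n' > /tmp/component.yaml
sed -i 's|image: download_image_location|image: gcr.io/my-project-id/kaggle_download:abc123|g' /tmp/component.yaml
cat /tmp/component.yaml
```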
File: create_pipeline_version_and_run.py
import argparse
import os

import kfp

parser = argparse.ArgumentParser()
parser.add_argument('--commit_sha', help='Required. Commit SHA, used as the version name. Must be unique.', type=str)
parser.add_argument('--pipeline_id', help='Required. Pipeline id.', type=str)
parser.add_argument('--bucket_name', help='Required. GCS bucket that stores the pipeline package.', type=str)
parser.add_argument('--gcr_address', help='Required. Cloud registry address. For example, gcr.io/my-project', type=str)
parser.add_argument('--host', help='Host address of kfp.Client. Obtained from the cluster automatically if omitted.', type=str, default='')
parser.add_argument('--run_name', help='Name of the new run.', type=str, default='')
parser.add_argument('--experiment_id', help='Experiment id.', type=str)
parser.add_argument('--code_source_url', help='URL of the source code.', type=str, default='')
args = parser.parse_args()

if args.host:
    client = kfp.Client(host=args.host)
else:
    client = kfp.Client()

# Create a pipeline version named after the commit SHA.
# Note: str.lstrip strips a set of characters, not a prefix, so remove the
# 'gs://' scheme explicitly instead.
bucket = args.bucket_name[len('gs://'):] if args.bucket_name.startswith('gs://') else args.bucket_name
package_url = os.path.join('https://storage.googleapis.com', bucket, args.commit_sha, 'pipeline.zip')
version_name = args.commit_sha
version_body = {"name": version_name,
                "code_source_url": args.code_source_url,
                "package_url": {"pipeline_url": package_url},
                "resource_references": [{"key": {"id": args.pipeline_id, "type": 3}, "relationship": 1}]}

response = client.pipelines.create_pipeline_version(version_body)
version_id = response.id

# Create a run from the new version.
run_name = args.run_name if args.run_name else 'run' + version_id
resource_references = [{"key": {"id": version_id, "type": 4}, "relationship": 2}]
if args.experiment_id:
    resource_references.append({"key": {"id": args.experiment_id, "type": 1}, "relationship": 1})
run_body = {"name": run_name,
            "pipeline_spec": {"parameters": [{"name": "bucket_name", "value": args.bucket_name},
                                             {"name": "commit_sha", "value": args.commit_sha}]},
            "resource_references": resource_references}
try:
    client.runs.create_run(run_body)
except Exception as e:
    print('Error creating run: %s' % e)
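One pitfall worth noting when building the package URL: `str.lstrip('gs://')` strips any leading run of the characters `g`, `s`, `:`, `/` rather than the literal prefix, which mangles bucket names that happen to start with those letters. A small sketch of the difference, with a hypothetical `strip_gs_prefix` helper:

```python
def strip_gs_prefix(bucket_name: str) -> str:
    """Remove a literal 'gs://' scheme prefix, if present."""
    return bucket_name[len('gs://'):] if bucket_name.startswith('gs://') else bucket_name

# lstrip treats its argument as a character set, not a prefix:
print('gs://groups-bucket'.lstrip('gs://'))   # -> 'roups-bucket' (the 'g' of 'groups' is eaten)
print(strip_gs_prefix('gs://groups-bucket'))  # -> 'groups-bucket'
```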




File: download_dataset/Dockerfile
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install google-cloud-storage
COPY ./download_data.py .
CMD ["python", "download_data.py"]
File: download_dataset/component.yaml
name: download dataset
description: download the competition dataset from kaggle and upload it to a GCS bucket
inputs:
- {name: bucket_name, type: GCSPath}
outputs:
- {name: train_dataset, type: string}
- {name: test_dataset, type: string}
implementation:
  container:
    image: download_image_location
    command: ['python', 'download_data.py']
    args: ['--bucket_name', {inputValue: bucket_name}]
    fileOutputs:
      train_dataset: /train.txt
      test_dataset: /test.txt
File: download_dataset/download_data.py
"""
step #1: download data from kaggle website, and push it to gs bucket
"""

def process_and_upload(
bucket_name
):
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name.lstrip('gs://'))
train_blob = bucket.blob('train.csv')
test_blob = bucket.blob('test.csv')
train_blob.upload_from_filename('train.csv')
test_blob.upload_from_filename('test.csv')

with open('train.txt', 'w') as f:
f.write(bucket_name+'/train.csv')
with open('test.txt', 'w') as f:
f.write(bucket_name+'/test.csv')

if __name__ == '__main__':
import os
os.system("kaggle competitions download -c house-prices-advanced-regression-techniques")
os.system("unzip house-prices-advanced-regression-techniques")
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--bucket_name', type=str)
args = parser.parse_args()

process_and_upload(args.bucket_name)
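The `train.txt`/`test.txt` files written here are what the component.yaml's `fileOutputs` section exposes: KFP reads each file after the container exits and hands its contents to downstream steps as the `train_dataset` and `test_dataset` outputs. A minimal sketch of that contract, using a /tmp path and a placeholder bucket name instead of the container root:

```python
import os

# The step writes each output value into a file; KFP reads the file afterwards.
bucket_name = 'gs://my-project-bucket'  # placeholder value
out_path = os.path.join('/tmp', 'train.txt')
with open(out_path, 'w') as f:
    f.write(bucket_name + '/train.csv')

# What KFP would pass to the next step as outputs['train_dataset']:
with open(out_path) as f:
    train_dataset = f.read()
print(train_dataset)  # gs://my-project-bucket/train.csv
```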

File: pipeline.py
import kfp.dsl as dsl
import kfp.components as components
from kfp.gcp import use_gcp_secret

@dsl.pipeline(
    name="kaggle pipeline",
    description="kaggle pipeline that downloads data, analyses the data, trains a model and submits the result"
)
def kaggle_houseprice(
    bucket_name: str,
    commit_sha: str
):
    downloadDataOp = components.load_component_from_file('./download_dataset/component.yaml')
    downloadDataStep = downloadDataOp(bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    visualizeTableOp = components.load_component_from_file('./visualize_table/component.yaml')
    visualizeTableStep = visualizeTableOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset']).apply(use_gcp_secret('user-gcp-sa'))

    visualizeHTMLOp = components.load_component_from_file('./visualize_html/component.yaml')
    visualizeHTMLStep = visualizeHTMLOp(train_file_path='%s' % downloadDataStep.outputs['train_dataset'],
                                        commit_sha=commit_sha,
                                        bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    trainModelOp = components.load_component_from_file('./train_model/component.yaml')
    trainModelStep = trainModelOp(train_file='%s' % downloadDataStep.outputs['train_dataset'],
                                  test_file='%s' % downloadDataStep.outputs['test_dataset'],
                                  bucket_name=bucket_name).apply(use_gcp_secret('user-gcp-sa'))

    submitResultOp = components.load_component_from_file('./submit_result/component.yaml')
    submitResultStep = submitResultOp(result_file='%s' % trainModelStep.outputs['result'],
                                      submit_message='submit').apply(use_gcp_secret('user-gcp-sa'))

if __name__ == '__main__':
    import kfp.compiler as compiler
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcr_address', type=str)
    args = parser.parse_args()
    compiler.Compiler().compile(kaggle_houseprice, __file__ + '.zip')
File: submit_result/Dockerfile
FROM python:3.7
ENV KAGGLE_USERNAME=[YOUR KAGGLE USERNAME] \
KAGGLE_KEY=[YOUR KAGGLE KEY]
RUN pip install kaggle
RUN pip install gcsfs
COPY ./submit_result.py .
CMD ["python", "submit_result.py"]
File: submit_result/component.yaml
name: submit result
description: submit prediction result to kaggle
inputs:
- {name: result_file, type: string}
- {name: submit_message, type: string}
implementation:
  container:
    image: submit_image_location
    command: ['python', 'submit_result.py']
    args: ['--result_file', {inputValue: result_file},
           '--submit_message', {inputValue: submit_message}]
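submit_result.py itself is not included in this excerpt. Judging from the component interface and the Kaggle CLI the README references, it presumably fetches the result file (the Dockerfile installs gcsfs for GCS access) and shells out to `kaggle competitions submit`. A hypothetical sketch of the command it would build:

```python
# Hypothetical reconstruction -- the real submit_result.py is not shown in this diff.
def build_submit_command(result_file: str, submit_message: str) -> str:
    # Kaggle CLI syntax: kaggle competitions submit -c <competition> -f <file> -m <message>
    return ('kaggle competitions submit '
            '-c house-prices-advanced-regression-techniques '
            '-f %s -m "%s"' % (result_file, submit_message))

print(build_submit_command('result.csv', 'submit'))
```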