
scikit-learn hyperparameter search with GKE #6

Merged · 44 commits · Jan 24, 2018

Conversation

@dizcology (Member) commented Sep 28, 2017

Run scikit-learn's GridSearchCV and RandomizedSearchCV on GKE.

return body


def _deploy_job(self, worker_id, X_uri, y_uri):
dizcology (Member, Author):

In one experiment with 2 nodes and 4 workers, 3 of the 4 jobs were deployed to the same node, so those three jobs took much longer to finish than the remaining one.

Contributor:

Why was this?

dizcology (Member, Author):

The logic of selecting which node to deploy a job to is managed by kubernetes.

Since there seems to be little benefit in deploying more than one pod to a single node, I will remove the per_node argument and default to one pod per node. I'll need to run some experiments to make sure that in a multi-node cluster no two pods are deployed to the same node.
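For reference, one way to guarantee the spread without relying on the default scheduler is a pod anti-affinity rule keyed on the node hostname. A minimal sketch of the relevant fragment of a pod spec, written as a Python dict; the `app` label value and helper name are hypothetical, not part of this PR:

```python
def one_pod_per_node_affinity(app_label):
    """Build a pod anti-affinity block that forbids two pods carrying
    the same 'app' label from being scheduled onto the same node."""
    return {
        'podAntiAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': [{
                'labelSelector': {'matchLabels': {'app': app_label}},
                # Every node carries a unique kubernetes.io/hostname
                # label, so "one pod per hostname" means one per node.
                'topologyKey': 'kubernetes.io/hostname',
            }]
        }
    }

# Merged into the pod template of each deployed job:
pod_spec_fragment = {
    'metadata': {'labels': {'app': 'hpsearch-worker'}},
    'spec': {'affinity': one_pod_per_node_affinity('hpsearch-worker')},
}
```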

@dizcology (Member, Author) commented Sep 30, 2017

TODO:

  • update gke_randomized_search.ipynb
  • restructure the files to have a helpers/ folder


## Introduction

This sample package helps you run `scikit-learn`'s `GridSearchCV` and `RandomizedSearchCV` on [Google Container Engine](https://cloud.google.com/container-engine/).
Contributor:

Links to scikit-learn, GridSearchCV, RandomizedSearchCV would be good here.

dizcology (Member, Author):

done
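For readers new to these estimators, a minimal local sketch of what `GridSearchCV` does before any GKE parallelization enters the picture; the dataset and parameter values here are arbitrary examples:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Exhaustively try every combination in param_grid with 3-fold
# cross-validation; this is the work the PR fans out across GKE nodes.
search = GridSearchCV(
    SVC(),
    param_grid={'C': [1, 10], 'gamma': [0.001, 0.0001]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```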


## Requirements

You will need a Google Cloud Platform project which has the following products enabled:
Contributor:

Link to documentation about enabling APIs: https://support.google.com/cloud/answer/6158841?hl=en ?

dizcology (Member, Author):

done


1. Install [Google Cloud Platform SDK](https://cloud.google.com/sdk/downloads).

1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).
Contributor:

What do you think about providing an image on Docker Hub that has all these things set up?

(No reflection on this PR, just spitballing.)

dizcology (Member, Author):

The docker images built from this workflow require the local modules (particularly gke_parallel.py) to be useful, so I am not sure it is valuable to publish the image.


1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).

1. `git clone https://github.com/GoogleCloudPlatform/ml-on-gcp.git`
Contributor:

Tell user to "Run the following commands"

dizcology (Member, Author):

done


1. `pip install -r requirements.txt`

1. Follow the steps in either `gke_grid_search.ipynb` or `gke_randomized_search.ipynb`.
Contributor:

Link to notebooks.


Contributor:

I would like an explanation of what kind of magic you are doing to make those notebooks work. Some of this content seems to be in the notebooks themselves, but I would appreciate a lower-level explanation in a Markdown file out here.


@@ -0,0 +1,288 @@
# Copyright 2017, Google Inc. All rights reserved.
Contributor:

This is the file that I want a lot more documentation for - docstrings especially.

Contributor:

I must say, though, I find the code very readable. Nice job!

Contributor:

docstrings required in all code files.

@nnegrey left a comment:

Context for my comments: I'm viewing this as a tutorial guide that I can then take and add to my own project.

  1. Notebooks: Since I'm not familiar with all these products, for some of the steps in the notebooks I would like a description of why the step is needed, so that I can take it and customize it for my own project.

Example section: "Refresh access token to the cluster". I'm not sure why or when that step is needed; I see that I need to execute it, but I don't feel I understand it well enough to adapt it to my own project.

But this could all just be that I'm unfamiliar with all the products.

  2. Helper files: For each helper file I would like a short description, either at the top of the file or in the README, to serve as a quick reference for its purpose.

All in all I think the guide is awesome and I think a high level overview of all the working parts could be helpful.

return self._cancelled


# TODO: allow getting only the best result to save time
Reviewer:

Same as above TODO, still needed?

dizcology (Member, Author):

Not implemented yet, might still be worth doing at some point.
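For the record, a rough sketch of how this could save time: have each worker upload its best score as a small separate pickle, then download only the winning worker's full fitted object. The path layout, attribute names, and the `download_and_unpickle` helper below are hypothetical:

```python
import pickle

from google.cloud import storage


def download_and_unpickle(bucket_name, path):
    """Hypothetical counterpart to pickle_and_upload."""
    blob = storage.Client().bucket(bucket_name).blob(path)
    return pickle.loads(blob.download_as_string())


def result(self):
    """Fetch every worker's (small) best score, but only the single
    (large) fitted search object that achieved the best of them."""
    scores = {
        worker_id: download_and_unpickle(
            self.bucket_name,
            '{}/{}/best_score.pkl'.format(self.task_name, worker_id))
        for worker_id in self.worker_ids
    }
    best_worker = max(scores, key=scores.get)
    return download_and_unpickle(
        self.bucket_name,
        '{}/{}/fitted_search.pkl'.format(self.task_name, best_worker))
```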


# Implement part of the concurrent.future.Future interface.
def done(self):
# TODO: consider using kubernetes API to check if pod completed
Reviewer:

Is this TODO still correct?

dizcology (Member, Author):

Still worth considering; however, the kubernetes client does not seem to consistently refresh the access token, so using the kubernetes API to check job status seems unreliable.
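One token-free alternative is to treat the presence of the result object on GCS as the completion signal. A sketch of such a `done()`, assuming each worker uploads its fitted result to a known path as its final step; the path layout and attribute names are hypothetical:

```python
from google.cloud import storage


def done(self):
    """Return True once every worker's result pickle exists on GCS.

    Checking blob existence sidesteps the kubernetes access token
    problem entirely, at the cost of one GCS request per worker.
    """
    bucket = storage.Client().bucket(self.bucket_name)
    return all(
        bucket.blob('{}/{}/fitted_search.pkl'
                    .format(self.task_name, worker_id)).exists()
        for worker_id in self.worker_ids)
```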


pickle_and_upload(param_grid, self.bucket_name, '{}/{}/param_grid.pkl'.format(self.task_name, worker_id))

# TODO: Make sure that each job is deployed to a different node.
Reviewer:

TODO implemented?

dizcology (Member, Author):

Not yet - need to implement round-robin across the nodes.

dizcology (Member, Author) commented Jan 24, 2018:

Thinking about this again, I will move this (along with the other TODOs) into an issue so we can evaluate it later. I suspect most training tasks will not really benefit from multiple nodes, e.g. a single node with 64 cores is better than 2 nodes with 32 cores each.
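For the record, the round-robin idea could look roughly like the sketch below: pin each job body to a node by hostname before deploying it. The helper name and body layout are hypothetical:

```python
import itertools


def assign_round_robin(job_bodies, node_names):
    """Pin each Kubernetes job body to a node, cycling through the
    cluster's nodes so no node receives a second job before every
    node has received one."""
    nodes = itertools.cycle(node_names)
    for body in job_bodies:
        # kubernetes.io/hostname is a standard per-node label, so a
        # nodeSelector on it pins the pod to that one node.
        body['spec']['template']['spec']['nodeSelector'] = {
            'kubernetes.io/hostname': next(nodes),
        }
    return job_bodies
```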

# use it as is.
return search_spaces.values()
else:
# TODO: implement this
Reviewer:

This TODO looks done

dizcology (Member, Author):

You are correct. Comment removed.


# brew install python to get 2.7.13 which has updated openssl
# check openssl version with python -c "import ssl; print ssl.OPENSSL_VERSION"
# mkvirtualenv -p /usr/local/Cellar/python/2.7.13_1/bin/python2 hpsearch
Reviewer:

What's the context for these 3 steps? (Are they important to the guide?)

dizcology (Member, Author):

removed.

"\n",
"For illustration purposes we will use the MNIST dataset. The following code downloads the dataset and puts it in `./mnist_data`.\n",
"\n",
"The first 60000 images and targets are the original training set, while the last 10000 are the testing set. The training set is order by their labels so we shuffle them since we will use a very small portion of the data to shorten training time."
Reviewer:

nit: "The training set is ordered..."

dizcology (Member, Author):

done
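A sketch of the download-and-shuffle step described in that cell; `fetch_openml` is a stand-in for the notebook's own download code:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Stand-in loader: 70000 MNIST examples in the original order.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# The first 60000 examples are the (label-ordered) training set,
# the last 10000 are the testing set.
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

# Shuffle the training set so that a small slice of it still covers
# all ten digit classes.
perm = np.random.RandomState(0).permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]
```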

"cell_type": "markdown",
"metadata": {},
"source": [
"Change this only if you have customized the source."
Reviewer:

Is this referring to the source directory?

dizcology (Member, Author):

Yes - the user needs to change this only if they have changed the content of the source/ directory and renamed it to something else.

@happyhuman (Contributor) commented Jan 23, 2018

This sample code is supposed to run using both Python 2 and 3, right? Just wondering.

@dizcology (Member, Author) commented Jan 24, 2018

@happyhuman: I tested it only with Python 2. Note that the Dockerfile for the worker specifies Python 2.7.13, which should be updated to match whatever runtime you use locally.

@nnegrey:

  • Re: the refresh access token step - this seems related to RefreshError with config.load_kube_config() (kubernetes-client/python#339).

  • I'll add a brief explanation to each helper module. Typically something_helper.py hosts boilerplate code to access "something" from inside Python/Jupyter notebooks.

  • The tutorial is designed to remain entirely within a Jupyter notebook environment: a data scientist who wants to use it shouldn't take parts of the tutorial into their own project, but should instead bring their dataset and scikit-learn model into these notebooks and run the training tasks on GKE. I'll update the top-level README to make this clearer.

@nnegrey commented Jan 24, 2018

Oh nifty.

@happyhuman (Contributor) previously approved these changes Jan 24, 2018:

Looks good to me.

zomglings previously approved these changes Jan 24, 2018
@dizcology dizcology merged commit 51696b5 into GoogleCloudPlatform:master Jan 24, 2018