
scikit-learn hyperparameter search with GKE #6

Merged · 44 commits · Jan 24, 2018

Conversation

@dizcology (Member) commented Sep 28, 2017

Run scikit-learn's GridSearchCV and RandomizedSearchCV on GKE.

return body


def _deploy_job(self, worker_id, X_uri, y_uri):
dizcology (Member, Author):

In one experiment with 2 nodes and 4 workers, 3 of the 4 jobs were deployed to the same node, so those three jobs took much longer to finish than the remaining one.

Contributor:

Why was this?

dizcology (Member, Author):

The logic of selecting which node to deploy a job to is managed by kubernetes.

Since there seems to be little benefit in deploying more than one pod to a single node, I will remove the per_node argument and default to one pod per node. I'll need to run some experiments to make sure that in a multi-node cluster no two pods are deployed to the same node.
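For reference, one way to guarantee the spread without relying on the default scheduler is a pod anti-affinity rule keyed on the node hostname. A minimal sketch of the relevant fragment of a pod spec, written as a Python dict; the `app` label value and helper name are hypothetical, not part of this PR:

```python
def one_pod_per_node_affinity(app_label):
    """Build a pod anti-affinity block that forbids two pods carrying
    the same 'app' label from being scheduled onto the same node."""
    return {
        'podAntiAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': [{
                'labelSelector': {'matchLabels': {'app': app_label}},
                # Every node carries a unique kubernetes.io/hostname
                # label, so "one pod per hostname" means one per node.
                'topologyKey': 'kubernetes.io/hostname',
            }]
        }
    }

# Merged into the pod template of each deployed job:
pod_spec_fragment = {
    'metadata': {'labels': {'app': 'hpsearch-worker'}},
    'spec': {'affinity': one_pod_per_node_affinity('hpsearch-worker')},
}
```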

@dizcology (Member, Author) commented Sep 30, 2017

TODO:

  • update gke_randomized_search.ipynb
  • restructure the files to have a helpers/ folder


## Introduction

This sample package helps you run `scikit-learn`'s `GridSearchCV` and `RandomizedSearchCV` on [Google Container Engine](https://cloud.google.com/container-engine/).
Contributor:

Links to scikit-learn, GridSearchCV, RandomizedSearchCV would be good here.

dizcology (Member, Author):

done
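For readers new to these estimators, a minimal local sketch of what `GridSearchCV` does before any GKE parallelization enters the picture; the dataset and parameter values here are arbitrary examples:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Exhaustively try every combination in param_grid with 3-fold
# cross-validation; this is the work the PR fans out across GKE nodes.
search = GridSearchCV(
    SVC(),
    param_grid={'C': [1, 10], 'gamma': [0.001, 0.0001]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```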


## Requirements

You will need a Google Cloud Platform project which has the following products enabled:
Contributor:

Link to documentation about enabling APIs: https://support.google.com/cloud/answer/6158841?hl=en ?

dizcology (Member, Author):

done


1. Install [Google Cloud Platform SDK](https://cloud.google.com/sdk/downloads).

1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).
Contributor:

What do you think about providing an image on Docker Hub that has all these things set up?

(No reflection on this PR, just spitballing.)

dizcology (Member, Author):

The docker images built from this workflow require the local modules (particularly gke_parallel.py) to be useful, so I am not sure it is valuable to publish the image.


1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).

1. `git clone https://github.com/GoogleCloudPlatform/ml-on-gcp.git`
Contributor:

Tell user to "Run the following commands"

dizcology (Member, Author):

done


1. `pip install -r requirements.txt`

1. Follow the steps in either `gke_grid_search.ipynb` or `gke_randomized_search.ipynb`.
Contributor:

Link to notebooks.


Contributor:

I would like an explanation of what kind of magic you are doing to make those notebooks work. Some of this content seems to be in the notebooks themselves, but I would appreciate a lower-level explanation in a Markdown file out here.


@@ -0,0 +1,288 @@
# Copyright 2017, Google Inc. All rights reserved.
Contributor:

This is the file that I want a lot more documentation for - docstrings especially.

Contributor:

I must say, though, I find the code very readable. Nice job!

Contributor:

docstrings required in all code files.

@nnegrey left a comment:

Context for my comments: I'm viewing this as a tutorial guide that I can then take and add to my own project.

  1. Notebooks: Since I'm not familiar with all these products, for some of the steps in the notebooks I would like a description of why the step is needed, so that I can take it and customize it for my own project.

Example section: "Refresh access token to the cluster". I'm not sure why or when that step is needed; I see that I need to execute it, but I don't feel I understand it well enough to adapt it to my own project.

But this could all just be that I'm unfamiliar with all the products.

  2. Helper files: For each helper file I would like a short description, either at the top of the file or in the README, to serve as a quick reference for its purpose.

All in all I think the guide is awesome and I think a high level overview of all the working parts could be helpful.

return self._cancelled


# TODO: allow getting only the best result to save time
Reviewer:

Same as above TODO, still needed?

dizcology (Member, Author):

Not implemented yet, might still be worth doing at some point.
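For the record, a rough sketch of how this could save time: have each worker upload its best score as a small separate pickle, then download only the winning worker's full fitted object. The path layout, attribute names, and the `download_and_unpickle` helper below are hypothetical:

```python
import pickle

from google.cloud import storage


def download_and_unpickle(bucket_name, path):
    """Hypothetical counterpart to pickle_and_upload."""
    blob = storage.Client().bucket(bucket_name).blob(path)
    return pickle.loads(blob.download_as_string())


def result(self):
    """Fetch every worker's (small) best score, but only the single
    (large) fitted search object that achieved the best of them."""
    scores = {
        worker_id: download_and_unpickle(
            self.bucket_name,
            '{}/{}/best_score.pkl'.format(self.task_name, worker_id))
        for worker_id in self.worker_ids
    }
    best_worker = max(scores, key=scores.get)
    return download_and_unpickle(
        self.bucket_name,
        '{}/{}/fitted_search.pkl'.format(self.task_name, best_worker))
```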


# Implement part of the concurrent.future.Future interface.
def done(self):
# TODO: consider using kubernetes API to check if pod completed
Reviewer:

Is this TODO still correct?

dizcology (Member, Author):

Still worth considering; however, the kubernetes client does not seem to consistently refresh the access token, so using the kubernetes API to check job status seems unreliable.
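One token-free alternative is to treat the presence of the result object on GCS as the completion signal. A sketch of such a `done()`, assuming each worker uploads its fitted result to a known path as its final step; the path layout and attribute names are hypothetical:

```python
from google.cloud import storage


def done(self):
    """Return True once every worker's result pickle exists on GCS.

    Checking blob existence sidesteps the kubernetes access token
    problem entirely, at the cost of one GCS request per worker.
    """
    bucket = storage.Client().bucket(self.bucket_name)
    return all(
        bucket.blob('{}/{}/fitted_search.pkl'
                    .format(self.task_name, worker_id)).exists()
        for worker_id in self.worker_ids)
```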


pickle_and_upload(param_grid, self.bucket_name, '{}/{}/param_grid.pkl'.format(self.task_name, worker_id))

# TODO: Make sure that each job is deployed to a different node.
Reviewer:

TODO implemented?

dizcology (Member, Author):

Not yet - need to implement round-robin across the nodes.

dizcology (Member, Author) commented Jan 24, 2018:

Thinking about this again, I will move this (along with the other TODOs) into an issue so we can evaluate it later. I suspect most training tasks will not really benefit from multiple nodes, e.g. a single node with 64 cores is better than 2 nodes with 32 cores each.
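For the record, the round-robin idea could look roughly like the sketch below: pin each job body to a node by hostname before deploying it. The helper name and body layout are hypothetical:

```python
import itertools


def assign_round_robin(job_bodies, node_names):
    """Pin each Kubernetes job body to a node, cycling through the
    cluster's nodes so no node receives a second job before every
    node has received one."""
    nodes = itertools.cycle(node_names)
    for body in job_bodies:
        # kubernetes.io/hostname is a standard per-node label, so a
        # nodeSelector on it pins the pod to that one node.
        body['spec']['template']['spec']['nodeSelector'] = {
            'kubernetes.io/hostname': next(nodes),
        }
    return job_bodies
```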

# use it as is.
return search_spaces.values()
else:
# TODO: implement this
Reviewer:

This TODO looks done

dizcology (Member, Author):

You are correct. Comment removed.


# brew install python to get 2.7.13 which has updated openssl
# check openssl version with python -c "import ssl; print ssl.OPENSSL_VERSION"
# mkvirtualenv -p /usr/local/Cellar/python/2.7.13_1/bin/python2 hpsearch
Reviewer:

What's the context for these 3 steps? (Are they important to the guide?)

dizcology (Member, Author):

removed.

"\n",
"For illustration purposes we will use the MNIST dataset. The following code downloads the dataset and puts it in `./mnist_data`.\n",
"\n",
"The first 60000 images and targets are the original training set, while the last 10000 are the testing set. The training set is order by their labels so we shuffle them since we will use a very small portion of the data to shorten training time."
Reviewer:

nit: "The training set is ordered..."

dizcology (Member, Author):

done
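A sketch of the download-and-shuffle step described in that cell; `fetch_openml` is a stand-in for the notebook's own download code:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Stand-in loader: 70000 MNIST examples in the original order.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# The first 60000 examples are the (label-ordered) training set,
# the last 10000 are the testing set.
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

# Shuffle the training set so that a small slice of it still covers
# all ten digit classes.
perm = np.random.RandomState(0).permutation(len(X_train))
X_train, y_train = X_train[perm], y_train[perm]
```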

"cell_type": "markdown",
"metadata": {},
"source": [
"Change this only if you have customized the source."
Reviewer:

Is this referring to the source directory?

dizcology (Member, Author):

Yes - the user needs to change this only if they have changed the content of the source/ directory and renamed it to something else.

@happyhuman (Contributor) commented Jan 23, 2018

This sample code is supposed to run using both Python 2 and 3, right? Just wondering.

@dizcology (Member, Author) commented Jan 24, 2018

@happyhuman: I tested it only with Python 2. Note that the Dockerfile for the worker specifies Python 2.7.13, which should be updated to match whatever runtime you use locally.

@nnegrey:

  • Re: the refresh access token step - this seems related to RefreshError with config.load_kube_config() (kubernetes-client/python#339).

  • I'll add a brief explanation to each helper module. Typically something_helper.py hosts boilerplate code to access "something" from inside Python/Jupyter notebooks.

  • The tutorial is designed to remain entirely within a Jupyter notebook environment: a data scientist who wants to use it shouldn't take parts of the tutorial into their own project, but should instead bring their dataset and scikit-learn model into these notebooks and run the training tasks on GKE. I'll update the top-level README to make this clearer.

@nnegrey commented Jan 24, 2018

Oh nifty.

@happyhuman (Contributor) previously approved these changes Jan 24, 2018:

Looks good to me.

zomglings previously approved these changes Jan 24, 2018
@dizcology dizcology merged commit 51696b5 into GoogleCloudPlatform:master Jan 24, 2018