scikit-learn hyperparameter search with GKE #6
Conversation
return body

def _deploy_job(self, worker_id, X_uri, y_uri):
In one experiment with 2 nodes and 4 workers, 3 of the 4 jobs were deployed to the same node, making those three jobs take much longer to finish than the remaining one.
Why was this?
The logic of selecting which node a job is deployed to is managed by Kubernetes. Since there seems to be little benefit in deploying more than one pod to a single node, I will remove the `per_node` argument and default to one pod per node. TODO: run some experiments to make sure that in a multi-node cluster no two pods are deployed to the same node.
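For reference, a minimal sketch (not part of this PR) of how the job body built by `_deploy_job` could request one pod per node via a `podAntiAffinity` rule; the helper name and the `app` label below are assumptions:

```python
# Hypothetical helper, not from this PR: add a hard anti-affinity rule to the
# Job body so that no two worker pods are scheduled onto the same node.
def add_one_pod_per_node(job_body, app_label='hpsearch-worker'):
    pod_template = job_body['spec']['template']
    # The pods must carry the label that the selector below matches on.
    pod_template.setdefault('metadata', {})['labels'] = {'app': app_label}
    pod_template['spec']['affinity'] = {
        'podAntiAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': [{
                'labelSelector': {'matchLabels': {'app': app_label}},
                # hostname topology => at most one matching pod per node
                'topologyKey': 'kubernetes.io/hostname',
            }]
        }
    }
    return job_body
```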
sklearn/hpsearch/README.md
## Introduction

This sample package helps you run `scikit-learn`'s `GridSearchCV` and `RandomizedSearchCV` on [Google Container Engine](https://cloud.google.com/container-engine/).
Links to `scikit-learn`, `GridSearchCV`, `RandomizedSearchCV` would be good here.
done
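For readers coming from the scikit-learn side, this is the kind of single-machine search the package distributes across GKE workers; a minimal local sketch, not code from this repository:

```python
# Plain, local grid search -- the workload this sample parallelizes on GKE.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
param_grid = {'n_estimators': [10, 50], 'max_depth': [4, 8]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```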
sklearn/hpsearch/README.md
## Requirements

You will need a Google Cloud Platform project which has the following products enabled:
Link to documentation about enabling APIs: https://support.google.com/cloud/answer/6158841?hl=en ?
done
1. Install [Google Cloud Platform SDK](https://cloud.google.com/sdk/downloads).
1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).
What do you think about providing an image on Docker Hub that has all these things set up?
(No reflection on this PR, just spitballing.)
The Docker images built from this workflow require the local modules (particularly `gke_parallel.py`) to be useful, so I am not sure it is valuable to publish the image.
sklearn/hpsearch/README.md
1. Install [kubectl](https://cloud.google.com/container-engine/docs/quickstart).
1. `git clone https://github.com/GoogleCloudPlatform/ml-on-gcp.git`
Tell user to "Run the following commands"
done
sklearn/hpsearch/README.md
1. `pip install -r requirements.txt`
1. Follow the steps in either `gke_grid_search.ipynb` or `gke_randomized_search.ipynb`.
Link to notebooks.
I would like an explanation of what kind of magic you are doing to make those notebooks work. Some of this content seems to be in the notebooks themselves, but I would appreciate a lower-level explanation in a Markdown file out here.
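From the notebooks and the code quoted in this review, the rough mechanism appears to be: the search space is split per worker, pickled, and uploaded to a Cloud Storage bucket; one Kubernetes Job per worker downloads its slice, fits, and uploads the fitted search object back; the client then polls the bucket for results. A minimal sketch of the upload step under those assumptions (the helper name, bucket, and object paths are made up):

```python
# Hypothetical sketch of the hand-off the notebooks rely on: pickle the objects
# the workers need and stage them in a GCS bucket. Names here are examples only.
import pickle
from google.cloud import storage

def upload_pickled(obj, bucket_name, blob_path):
    """Pickle `obj` and store it at gs://<bucket_name>/<blob_path>."""
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_path).upload_from_string(pickle.dumps(obj))

# Client side: ship one worker's slice of the search space...
upload_pickled({'C': [0.1, 1, 10]}, 'my-hpsearch-bucket', 'my-task/0/param_grid.pkl')
# ...a worker Job later downloads it, runs the fit, and uploads the fitted
# search object back to the same bucket, which the client polls.
```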
@@ -0,0 +1,288 @@
# Copyright 2017, Google Inc. All rights reserved.
This is the file that I want a lot more documentation for - docstrings especially.
I must say, though, I find the code very readable. Nice job!
docstrings required in all code files.
Context for my comments: I'm viewing this as a tutorial guide that I can then take and add to my own project.
- Notebooks: Since I'm not familiar with all these products, I would like a description of why some of the steps in the notebooks are needed, so that I can take them and customize them. Example section: "Refresh access token to the cluster" - I'm not sure why or when that step is needed; I see that I need to execute it, but I don't feel I understand it well enough to adapt it to my own project. This could all just be that I'm unfamiliar with the products.
- Helper files: For each of the helper files I would like a short description, either at the top of the file or in the README, that I can use as a quick reference to get the gist of each file's purpose.
All in all I think the guide is awesome and I think a high level overview of all the working parts could be helpful.
sklearn/hpsearch/gke_parallel.py
return self._cancelled

# TODO: allow getting only the best result to save time
Same as above TODO, still needed?
Not implemented yet, might still be worth doing at some point.
sklearn/hpsearch/gke_parallel.py
# Implement part of the concurrent.futures.Future interface.
def done(self):
    # TODO: consider using kubernetes API to check if pod completed
Is this TODO still correct?
Still worth considering; however, the kubernetes client does not seem to consistently refresh the access token, so using the kubernetes API to check job status seems unreliable.
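For the record, a sketch of what such a check could look like with the official `kubernetes` Python client; the job name and namespace are placeholders, and the token-refresh issue mentioned above is not addressed here:

```python
# Hypothetical sketch: ask the Kubernetes API whether a worker Job has finished,
# instead of polling GCS. Job name and namespace are placeholders.
from kubernetes import client, config

def job_succeeded(job_name, namespace='default'):
    config.load_kube_config()  # reads the local kubeconfig; its token can expire
    batch = client.BatchV1Api()
    job = batch.read_namespaced_job(name=job_name, namespace=namespace)
    return bool(job.status.succeeded)  # None until at least one pod succeeds

print(job_succeeded('hpsearch-worker-0'))
```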
sklearn/hpsearch/gke_parallel.py
pickle_and_upload(param_grid, self.bucket_name, '{}/{}/param_grid.pkl'.format(self.task_name, worker_id))

# TODO: Make sure that each job is deployed to a different node.
TODO implemented?
Not yet - need to implement round-robin across the nodes.
Thinking about this again, I will move this (along with the other TODOs) into an issue and we can evaluate it later. I suspect most training tasks will not really benefit from multiple nodes, e.g. a single node with 64 cores is better than 2 nodes with 32 cores each.
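If round-robin placement ever does become worthwhile, a sketch of the idea (placeholders only, not part of this PR): list the cluster's nodes and pin each Job's pod to one of them via `nodeName`:

```python
# Hypothetical round-robin placement: cycle through the cluster's nodes and set
# nodeName on each Job's pod template so successive jobs land on different nodes.
import itertools
from kubernetes import client, config

config.load_kube_config()
node_names = [n.metadata.name for n in client.CoreV1Api().list_node().items]
node_cycle = itertools.cycle(node_names)

def pin_to_next_node(job_body):
    """Assign the next node in the cycle to this job's pod."""
    job_body['spec']['template']['spec']['nodeName'] = next(node_cycle)
    return job_body
```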
sklearn/hpsearch/gke_parallel.py
    # use it as is.
    return search_spaces.values()
else:
    # TODO: implement this
This TODO looks done
you are correct. comment removed.
# brew install python to get 2.7.13 which has updated openssl
# check openssl version with python -c "import ssl; print ssl.OPENSSL_VERSION"
# mkvirtualenv -p /usr/local/Cellar/python/2.7.13_1/bin/python2 hpsearch
What's the context for these 3 steps? (Are they important to the guide?)
removed.

For illustration purposes we will use the MNIST dataset. The following code downloads the dataset and puts it in `./mnist_data`.

The first 60000 images and targets are the original training set, while the last 10000 are the testing set. The training set is order by their labels so we shuffle them since we will use a very small portion of the data to shorten training time.
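For context, a sketch of the loading and shuffling step this cell describes; the notebook's actual loader may differ (`fetch_openml` and the `data_home` path are assumptions here):

```python
# Sketch of the download-and-shuffle step described above; the notebook's exact
# loader may differ. fetch_openml and the data_home path are assumptions.
import numpy as np
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False, data_home='./mnist_data')
X, y = mnist.data, mnist.target

# First 60000 samples are the original training set, the last 10000 the test set.
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

# Shuffle the training set so a small subsample still covers all the labels.
idx = np.random.permutation(len(X_train))
X_train, y_train = X_train[idx], y_train[idx]
```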
nit: "The training set is ordered"
done

Change this only if you have customized the source.
Is this referring to the source directory?
Yes - the user needs to change this only if they have changed the content of the `source/` directory and renamed it to something else.
This sample code is supposed to run using both Python 2 and 3, right? Just wondering.
@happyhuman: I tested it only with Python 2. Note that the Dockerfile for the worker specifies Python 2.7.13, which should be updated to whatever runtime you use locally.
Oh nifty.
Looks good to me.
Run scikit-learn's GridSearchCV and RandomizedSearchCV on GKE.