Experimental bayes opt #18

Merged: 26 commits merged into master from experimental-bayes-opt on Jun 2, 2017

Changes from all commits (26 commits)
718498b
add AutomatedRun db model
reiinakano May 30, 2017
9ed0c26
add new function import_string_code_as_module
reiinakano May 30, 2017
9e490f3
add view start_automated_run
reiinakano May 30, 2017
4507710
add rqtask start_automated_run
reiinakano May 30, 2017
ec1807b
Merge branch 'master' into experimental-bayes-opt
reiinakano May 31, 2017
09626de
unfinished rq
reiinakano May 31, 2017
abed320
unfinished rq task push
reiinakano May 31, 2017
4b7394c
finish function to return function to be optimized
reiinakano May 31, 2017
3df59b2
I think it's finished???
reiinakano May 31, 2017
796731c
add view to get automated runs
reiinakano May 31, 2017
a3f2867
few more fixes
reiinakano May 31, 2017
6e43352
add view for getting or deleting specific automated run
reiinakano May 31, 2017
ff67a85
bugfixes
reiinakano May 31, 2017
3a72565
add UI for creating automated runs
reiinakano Jun 1, 2017
7332627
add Panel for showing automated runs
reiinakano Jun 1, 2017
26107db
add neat little collapsible panel and react-table
reiinakano Jun 1, 2017
c520df9
add info and delete columns that don't do anything yet
reiinakano Jun 1, 2017
a900d5b
add More Details modal
reiinakano Jun 1, 2017
ab80ee8
add delete functionality to automated run.
reiinakano Jun 1, 2017
cceb113
make sure list shows "Queued" instead of a loading spinner if just qu…
reiinakano Jun 1, 2017
7f42ddf
fix to AUC preset
reiinakano Jun 1, 2017
2b47e45
allow ensembling with only one base learner
reiinakano Jun 1, 2017
bdc4d26
add some unfinished docs for bayes opt
reiinakano Jun 2, 2017
4348ef2
unfinished docs
reiinakano Jun 2, 2017
86244bd
finish docs for Bayesian search
reiinakano Jun 2, 2017
8c142a0
improve docs
reiinakano Jun 2, 2017
1 change: 1 addition & 0 deletions README.md
@@ -30,6 +30,7 @@ Xcessiv holds your hand through all the implementation details of creating and o
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
112 changes: 112 additions & 0 deletions docs/advanced.rst
@@ -0,0 +1,112 @@
Automated Tuning
================

Bayesian Hyperparameter Search
------------------------------

Aside from grid search and random search that were covered in the previous chapter, Xcessiv offers another popular hyperparameter optimization method - `Bayesian optimization <https://en.wikipedia.org/wiki/Hyperparameter_optimization#Bayesian_optimization>`_.

Unlike grid search and random search, where hyperparameter combinations are explored independently of each other, Bayesian optimization records the results of previously explored hyperparameter combinations and uses them to decide which hyperparameters to try next. Theoretically, this should allow faster convergence to a local maximum and less time wasted exploring hyperparameters that are unlikely to produce good results.

Keep in mind that there are a few limitations to this method. First, since the hyperparameter combinations to explore are based on previously explored hyperparameters, the Bayesian hyperparameter search cannot take advantage of multiple Xcessiv workers in the same way as Grid Search and Random Search. All hyperparameter combinations are explored by a single worker.

Second, Bayesian optimization can only explore numerical hyperparameters. A hyperparameter that takes only strings (e.g. ``criterion`` in :class:`sklearn.ensemble.RandomForestClassifier`) cannot be tuned with Bayesian optimization. Instead, you must set its value or leave it at its default before the search begins.

The Bayesian optimization method used by Xcessiv is implemented through the open-source `BayesianOptimization <https://github.com/fmfn/BayesianOptimization>`_ Python package.

Let's begin.

Suppose you're exploring the hyperparameter space of a scikit-learn Random Forest classifier on some classification data. Your base learner setup will contain the following code.::

    from sklearn.ensemble import RandomForestClassifier

    base_learner = RandomForestClassifier(random_state=8)

Make sure you also use "Accuracy" as a metric.

You want to use Bayesian optimization to tune the hyperparameters ``max_depth``, ``min_samples_split``, and ``min_samples_leaf``. After verifying and finalizing the base learner, click the **Bayesian Optimization** button, enter the following configuration into the code block, and hit Go.::

    random_state = 8  # Random seed

    # Default parameters of base learner
    default_params = {
        'n_estimators': 200,
        'criterion': 'entropy'
    }

    # Min-max bounds of parameters to be searched
    pbounds = {
        'max_depth': (10, 300),
        'min_samples_split': (0.001, 0.5),
        'min_samples_leaf': (0.001, 0.5)
    }

    # List of hyperparameters that should be rounded off to integers
    integers = [
        'max_depth'
    ]

    metric_to_optimize = 'Accuracy'  # metric to optimize

    invert_metric = False  # Whether or not to invert metric e.g. optimizing a loss

    # Configuration to pass to maximize()
    maximize_config = {
        'init_points': 2,
        'n_iter': 10,
        'acq': 'ucb',
        'kappa': 5
    }

If everything goes well, you should see that an "Automated Run" has started. From here, you can just watch as the Base Learners list updates with a new entry every time the Bayesian search explores a new hyperparameter combination.

Let's review the code we used to configure the Bayesian search.

All variables shown need to be defined for Bayesian search to work properly.

First, the ``random_state`` variable is used to seed the NumPy random number generator used internally by the Bayesian search. You can set this to any integer you like.::

    random_state = 8

Next, define the default parameters of your base learner in the ``default_params`` dictionary. In our case, we don't want to search ``n_estimators`` or ``criterion``, but we don't want to leave them at their default values either. This dictionary sets ``n_estimators`` to 200 and ``criterion`` to "entropy" for every base learner produced by the Bayesian search. If ``default_params`` is an empty dictionary, the default values of all non-searched hyperparameters are used.::

    default_params = {
        'n_estimators': 200,
        'criterion': 'entropy'
    }

The ``pbounds`` variable is a dictionary that maps each hyperparameter to tune to its minimum and maximum allowed values. In our example, ``max_depth`` will be searched but kept between 10 and 300, while ``min_samples_split`` and ``min_samples_leaf`` will each be kept between 0.001 and 0.5.::

    # Min-max bounds of parameters to be searched
    pbounds = {
        'max_depth': (10, 300),
        'min_samples_split': (0.001, 0.5),
        'min_samples_leaf': (0.001, 0.5)
    }

``integers`` is a list of the hyperparameters whose values should be rounded off to integers before they are used to configure the base learner. In our example, ``max_depth`` only accepts integer values, so we add it to the list.::

    # List of hyperparameters that should be rounded off to integers
    integers = [
        'max_depth'
    ]

``metric_to_optimize`` defines the metric that the Bayesian search will use to determine the effectiveness of a single base learner. In our case, the search optimizes for higher accuracy.

``invert_metric`` must be set to ``True`` when the metric you are optimizing is "better" at a lower value. For example, metrics such as the Brier Score Loss and Mean Squared Error are better when they are smaller.::

    metric_to_optimize = 'Accuracy'  # metric to optimize

    invert_metric = False  # Whether or not to invert metric e.g. optimizing a loss

``maximize_config`` is a dictionary of parameters used by the actual Bayesian search to dictate behavior such as the number of points to explore and the algorithm's acquisition function. ``init_points`` sets the number of initial points to randomly explore before the actual Bayesian search takes over. ``n_iter`` sets the number of hyperparameter combinations the Bayesian search will explore. ``acq`` and ``kappa`` refer to the parameters of the acquisition function and determine the search's balance between exploration and exploitation. Keys included in ``maximize_config`` that are not directly used by the Bayesian search process are passed on to the underlying :class:`sklearn.gaussian_process.GaussianProcessRegressor` object.::

    # Configuration to pass to maximize()
    maximize_config = {
        'init_points': 2,
        'n_iter': 10,
        'acq': 'ucb',
        'kappa': 5
    }

For more info on setting ``maximize_config``, please see the :func:`maximize` method of the :class:`bayes_opt.BayesianOptimization` class in the `BayesianOptimization source code <https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py>`_. Seeing this `notebook example <https://github.com/fmfn/BayesianOptimization/blob/master/examples/exploitation%20vs%20exploration.ipynb>`_ will also give you some intuition on how the different acquisition function parameters ``acq``, ``kappa``, and ``xi`` affect the Bayesian search.
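
To make the mapping concrete, here is a minimal sketch (not Xcessiv's actual internals) of how the variables above plausibly feed into the ``BayesianOptimization`` 0.4.0 API. ``evaluate_learner`` is a hypothetical stand-in for Xcessiv's own train-and-score routine.::

    import numpy as np
    from bayes_opt import BayesianOptimization
    from sklearn.ensemble import RandomForestClassifier

    np.random.seed(random_state)  # seed the NumPy generator used internally

    def build_and_score(**params):
        # Round the designated hyperparameters to integers, then merge in defaults
        for key in integers:
            params[key] = int(round(params[key]))
        params.update(default_params)
        base_learner = RandomForestClassifier(random_state=8, **params)
        score = evaluate_learner(base_learner, metric_to_optimize)  # hypothetical helper
        # maximize() always maximizes, so "lower is better" metrics are negated
        return -score if invert_metric else score

    optimizer = BayesianOptimization(build_and_score, pbounds)
    optimizer.maximize(**maximize_config)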
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -16,6 +16,7 @@ Features
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
@@ -58,6 +59,7 @@ Contents

installation
walkthrough
advanced


Indices and tables
5 changes: 5 additions & 0 deletions docs/walkthrough.rst
@@ -384,6 +384,11 @@ At this point your list of base learners should look like this.
:align: center
:alt: List of base learners

Bayesian Search
~~~~~~~~~~~~~~~

As of ``v0.3.0``, Xcessiv includes experimental automated hyperparameter tuning based on Bayesian search. For the purposes of this initial walkthrough, we will skip this and move on to the next section. A detailed tutorial on using Bayesian optimization can be found in :ref:`Bayesian Hyperparameter Search`.

Creating a stacked ensemble
---------------------------

1 change: 1 addition & 0 deletions requirements.txt
@@ -1,3 +1,4 @@
bayesian-optimization==0.4.0
Flask>=0.11
gevent>=1.1
numpy>=1.12
1 change: 1 addition & 0 deletions setup.py
@@ -39,6 +39,7 @@ def run_tests(self):
author='Reiichiro Nakano',
tests_require=['pytest'],
install_requires=[
'bayesian-optimization==0.4.0',
'Flask>=0.11.0',
'gevent>=1.1.0',
'numpy>=1.12.0',
19 changes: 19 additions & 0 deletions xcessiv/functions.py
@@ -72,6 +72,25 @@ def import_object_from_string_code(code, object):
raise exceptions.UserError("{} not found in code".format(object))


def import_string_code_as_module(code):
"""Used to run arbitrary passed code as a module

Args:
code (string): Python code to import as module

Returns:
module: Python module
"""
sha256 = hashlib.sha256(code.encode('UTF-8')).hexdigest()
module = imp.new_module(sha256)
try:
exec_(code, module.__dict__)
except Exception as e:
raise exceptions.UserError('User code exception', exception_message=str(e))
sys.modules[sha256] = module
return module


def verify_dataset(X, y):
"""Verifies if a dataset is valid for use i.e. scikit-learn format

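A quick usage sketch for the new helper (a hypothetical snippet, assuming the helper lives at ``xcessiv.functions``): the code string is imported as a throwaway module named after its SHA-256 hash, and its contents become ordinary module attributes.::

    from xcessiv.functions import import_string_code_as_module

    source = (
        "def add(a, b):\n"
        "    return a + b\n"
    )

    module = import_string_code_as_module(source)
    assert module.add(2, 3) == 5  # user code is now a regular module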
33 changes: 33 additions & 0 deletions xcessiv/models.py
@@ -145,6 +145,8 @@ class BaseLearnerOrigin(Base):
description = Column(JsonEncodedDict)
base_learners = relationship('BaseLearner', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)
automated_runs = relationship('AutomatedRun', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)
stacked_ensembles = relationship('StackedEnsemble', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)

@@ -193,6 +195,37 @@ def cleanup(self, path):
learner.cleanup(path)


class AutomatedRun(Base):
"""This table contains initialized/completed automated hyperparameter searches"""
__tablename__ = 'automatedrun'

id = Column(Integer, primary_key=True)
source = Column(Text)
job_status = Column(Text)
job_id = Column(Text)
description = Column(JsonEncodedDict)
base_learner_origin_id = Column(Integer, ForeignKey('baselearnerorigin.id'))
base_learner_origin = relationship('BaseLearnerOrigin', back_populates='automated_runs')

def __init__(self, source, job_status, base_learner_origin):
self.source = source
self.job_status = job_status
self.job_id = None
self.description = dict()
self.base_learner_origin = base_learner_origin

@property
def serialize(self):
return dict(
id=self.id,
source=self.source,
job_status=self.job_status,
job_id=self.job_id,
description=self.description,
base_learner_origin_id=self.base_learner_origin_id
)


association_table = Table(
'association', Base.metadata,
Column('baselearner_id', Integer, ForeignKey('baselearner.id')),
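As a hedged illustration of how the new model might be used (assuming a SQLAlchemy ``session`` and an existing ``BaseLearnerOrigin`` row named ``origin``; ``bayes_config_code`` is a hypothetical string holding the configuration code shown in the docs), queueing a run could look like this.::

    run = AutomatedRun(
        source=bayes_config_code,
        job_status='queued',
        base_learner_origin=origin
    )
    session.add(run)
    session.commit()
    print(run.serialize)  # JSON-ready dict, presumably consumed by the new views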
1 change: 1 addition & 0 deletions xcessiv/presets/metricsetting.py
@@ -228,6 +228,7 @@ def metric_generator(y_true, y_probas):
binarized = label_binarize(y_true, classes_)
if len(classes_) == 2:
binarized = binarized.ravel()
y_probas = y_probas[:, 1]
return roc_auc_score(binarized, y_probas, average='weighted')
""",
'selection_name': 'ROC AUC Score from Scores/Probabilities'
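The one-line fix above makes the binary case hand 1-D positive-class scores to :func:`sklearn.metrics.roc_auc_score`, which is what it expects alongside 1-D binarized labels. A minimal illustration with made-up data.::

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import label_binarize

    y_true = np.array([0, 1, 1, 0])
    y_probas = np.array([[0.9, 0.1],   # columns: P(class 0), P(class 1)
                         [0.2, 0.8],
                         [0.3, 0.7],
                         [0.6, 0.4]])

    classes_ = [0, 1]
    binarized = label_binarize(y_true, classes=classes_)
    if len(classes_) == 2:
        binarized = binarized.ravel()  # shape (n, 1) -> (n,)
        y_probas = y_probas[:, 1]      # keep positive-class scores only
    print(roc_auc_score(binarized, y_probas, average='weighted'))  # 1.0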