Experimental bayes opt #18

Merged: 26 commits merged into master from experimental-bayes-opt on Jun 2, 2017

Changes from all commits (26 commits)
718498b
add AutomatedRun db model
reiinakano May 30, 2017
9ed0c26
add new function import_string_code_as_module
reiinakano May 30, 2017
9e490f3
add view start_automated_run
reiinakano May 30, 2017
4507710
add rqtask start_automated_run
reiinakano May 30, 2017
ec1807b
Merge branch 'master' into experimental-bayes-opt
reiinakano May 31, 2017
09626de
unfinished rq
reiinakano May 31, 2017
abed320
unfinished rq task push
reiinakano May 31, 2017
4b7394c
finish function to return function to be optimized
reiinakano May 31, 2017
3df59b2
I think it's finished???
reiinakano May 31, 2017
796731c
add view to get automated runs
reiinakano May 31, 2017
a3f2867
few more fixes
reiinakano May 31, 2017
6e43352
add view for getting or deleting specific automated run
reiinakano May 31, 2017
ff67a85
bugfixes
reiinakano May 31, 2017
3a72565
add UI for creating automated runs
reiinakano Jun 1, 2017
7332627
add Panel for showing automated runs
reiinakano Jun 1, 2017
26107db
add neat little collapsible panel and react-table
reiinakano Jun 1, 2017
c520df9
add info and delete columns that don't do anything yet
reiinakano Jun 1, 2017
a900d5b
add More Details modal
reiinakano Jun 1, 2017
ab80ee8
add delete functionality to automated run.
reiinakano Jun 1, 2017
cceb113
make sure list shows "Queued" instead of a loading spinner if just qu…
reiinakano Jun 1, 2017
7f42ddf
fix to AUC preset
reiinakano Jun 1, 2017
2b47e45
allow ensembling with only one base learner
reiinakano Jun 1, 2017
bdc4d26
add some unfinished docs for bayes opt
reiinakano Jun 2, 2017
4348ef2
unfinished docs
reiinakano Jun 2, 2017
86244bd
finish docs for Bayesian search
reiinakano Jun 2, 2017
8c142a0
improve docs
reiinakano Jun 2, 2017
1 change: 1 addition & 0 deletions README.md
@@ -30,6 +30,7 @@ Xcessiv holds your hand through all the implementation details of creating and o
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
112 changes: 112 additions & 0 deletions docs/advanced.rst
@@ -0,0 +1,112 @@
Automated Tuning
================

Bayesian Hyperparameter Search
------------------------------

Aside from grid search and random search that were covered in the previous chapter, Xcessiv offers another popular hyperparameter optimization method - `Bayesian optimization <https://en.wikipedia.org/wiki/Hyperparameter_optimization#Bayesian_optimization>`_.

Unlike grid search and random search, where hyperparameter combinations are explored independently of each other, Bayesian optimization records the results of previously explored hyperparameter combinations and uses them to decide which hyperparameters to try next. Theoretically, this should allow faster convergence to a local maximum and less time wasted exploring hyperparameters that are unlikely to produce good results.

Keep in mind that there are a few limitations to this method. First, since the hyperparameter combinations to explore are based on previously explored hyperparameters, the Bayesian hyperparameter search cannot take advantage of multiple Xcessiv workers in the same way as Grid Search and Random Search. All hyperparameter combinations are explored by a single worker.

Second, Bayesian optimization can only explore numerical hyperparameters. A hyperparameter that takes only strings (e.g. ``criterion`` in :class:`sklearn.ensemble.RandomForestClassifier`) cannot be tuned with Bayesian optimization. Instead, you must set its value or leave it at its default before the search begins.

The Bayesian optimization method used by Xcessiv is implemented through the open-source `BayesianOptimization <https://github.com/fmfn/BayesianOptimization>`_ Python package.

Let's begin.

Suppose you're exploring the hyperparameter space of a scikit-learn Random Forest classifier on some classification data. Your base learner setup will contain the following code.::

    from sklearn.ensemble import RandomForestClassifier

    base_learner = RandomForestClassifier(random_state=8)

Make sure you also use "Accuracy" as a metric.

You want to use Bayesian optimization to tune the hyperparameters ``max_depth``, ``min_samples_split``, and ``min_samples_leaf``. After verifying and finalizing the base learner, click the **Bayesian Optimization** button, enter the following configuration into the code block, and hit Go.::

    random_state = 8  # Random seed

    # Default parameters of base learner
    default_params = {
        'n_estimators': 200,
        'criterion': 'entropy'
    }

    # Min-max bounds of parameters to be searched
    pbounds = {
        'max_depth': (10, 300),
        'min_samples_split': (0.001, 0.5),
        'min_samples_leaf': (0.001, 0.5)
    }

    # List of hyperparameters that should be rounded off to integers
    integers = [
        'max_depth'
    ]

    metric_to_optimize = 'Accuracy'  # metric to optimize

    invert_metric = False  # Whether or not to invert metric e.g. optimizing a loss

    # Configuration to pass to maximize()
    maximize_config = {
        'init_points': 2,
        'n_iter': 10,
        'acq': 'ucb',
        'kappa': 5
    }

If everything goes well, you should see that an "Automated Run" has started. From here, you can just watch as the Base Learners list updates with a new entry every time the Bayesian search explores a new hyperparameter combination.

Let's review the code we used to configure the Bayesian search.

All variables shown need to be defined for Bayesian search to work properly.

First, the ``random_state`` variable is used to seed the NumPy random number generator used internally by the Bayesian search. You can set this to any integer you like.::

    random_state = 8

Next, define the default parameters of your base learner in the ``default_params`` dictionary. In our case, we don't want to search ``n_estimators`` or ``criterion``, but we don't want to leave them at their default values either. This dictionary sets ``n_estimators`` to 200 and ``criterion`` to "entropy" for every base learner produced by the Bayesian search. If ``default_params`` is an empty dictionary, the default values of all non-searched hyperparameters are used.::

    default_params = {
        'n_estimators': 200,
        'criterion': 'entropy'
    }

The ``pbounds`` variable is a dictionary that maps each hyperparameter to tune to its minimum and maximum allowed values. In our example, ``max_depth`` will be searched but kept between 10 and 300, while ``min_samples_split`` and ``min_samples_leaf`` will each be kept between 0.001 and 0.5.::

    # Min-max bounds of parameters to be searched
    pbounds = {
        'max_depth': (10, 300),
        'min_samples_split': (0.001, 0.5),
        'min_samples_leaf': (0.001, 0.5)
    }

``integers`` is a list of the hyperparameters whose values should be rounded off to integers before they are used to configure the base learner. In our example, ``max_depth`` only accepts integer values, so we add it to the list.::

    # List of hyperparameters that should be rounded off to integers
    integers = [
        'max_depth'
    ]

``metric_to_optimize`` defines the metric that the Bayesian search will use to determine the effectiveness of a single base learner. In our case, the search optimizes for higher accuracy.

``invert_metric`` must be set to ``True`` when the metric you are optimizing is "better" at a lower value. For example, metrics such as the Brier Score Loss and Mean Squared Error are better when they are smaller.::

    metric_to_optimize = 'Accuracy'  # metric to optimize

    invert_metric = False  # Whether or not to invert metric e.g. optimizing a loss

``maximize_config`` is a dictionary of parameters used by the actual Bayesian search to dictate behavior such as the number of points to explore and the algorithm's acquisition function. ``init_points`` sets the number of initial points to randomly explore before the actual Bayesian search takes over. ``n_iter`` sets the number of hyperparameter combinations the Bayesian search will explore. ``acq`` and ``kappa`` refer to the parameters of the acquisition function and determine the search's balance between exploration and exploitation. Keys included in ``maximize_config`` that are not directly used by the Bayesian search process are passed on to the underlying :class:`sklearn.gaussian_process.GaussianProcessRegressor` object.::

    # Configuration to pass to maximize()
    maximize_config = {
        'init_points': 2,
        'n_iter': 10,
        'acq': 'ucb',
        'kappa': 5
    }

For more info on setting ``maximize_config``, please see the :func:`maximize` method of the :class:`bayes_opt.BayesianOptimization` class in the `BayesianOptimization source code <https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py>`_. Seeing this `notebook example <https://github.com/fmfn/BayesianOptimization/blob/master/examples/exploitation%20vs%20exploration.ipynb>`_ will also give you some intuition on how the different acquisition function parameters ``acq``, ``kappa``, and ``xi`` affect the Bayesian search.
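
To make the mapping concrete, here is a minimal sketch (not Xcessiv's actual internals) of how the variables above plausibly feed into the ``BayesianOptimization`` 0.4.0 API. ``evaluate_learner`` is a hypothetical stand-in for Xcessiv's own train-and-score routine.::

    import numpy as np
    from bayes_opt import BayesianOptimization
    from sklearn.ensemble import RandomForestClassifier

    np.random.seed(random_state)  # seed the NumPy generator used internally

    def build_and_score(**params):
        # Round the designated hyperparameters to integers, then merge in defaults
        for key in integers:
            params[key] = int(round(params[key]))
        params.update(default_params)
        base_learner = RandomForestClassifier(random_state=8, **params)
        score = evaluate_learner(base_learner, metric_to_optimize)  # hypothetical helper
        # maximize() always maximizes, so "lower is better" metrics are negated
        return -score if invert_metric else score

    optimizer = BayesianOptimization(build_and_score, pbounds)
    optimizer.maximize(**maximize_config)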
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -16,6 +16,7 @@ Features
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
* Stacked ensemble creation in a few clicks
@@ -58,6 +59,7 @@ Contents

installation
walkthrough
advanced


Indices and tables
5 changes: 5 additions & 0 deletions docs/walkthrough.rst
@@ -384,6 +384,11 @@ At this point your list of base learners should look like this.
:align: center
:alt: List of base learners

Bayesian Search
~~~~~~~~~~~~~~~

As of ``v0.3.0``, Xcessiv includes experimental automated hyperparameter tuning based on Bayesian search. For the purposes of this initial walkthrough, we will skip this and move on to the next section. A detailed tutorial on using Bayesian optimization can be found in :ref:`Bayesian Hyperparameter Search`.

Creating a stacked ensemble
---------------------------

1 change: 1 addition & 0 deletions requirements.txt
@@ -1,3 +1,4 @@
bayesian-optimization==0.4.0
Flask>=0.11
gevent>=1.1
numpy>=1.12
1 change: 1 addition & 0 deletions setup.py
@@ -39,6 +39,7 @@ def run_tests(self):
author='Reiichiro Nakano',
tests_require=['pytest'],
install_requires=[
'bayesian-optimization==0.4.0',
'Flask>=0.11.0',
'gevent>=1.1.0',
'numpy>=1.12.0',
19 changes: 19 additions & 0 deletions xcessiv/functions.py
@@ -72,6 +72,25 @@ def import_object_from_string_code(code, object):
raise exceptions.UserError("{} not found in code".format(object))


def import_string_code_as_module(code):
"""Used to run arbitrary passed code as a module

Args:
code (string): Python code to import as module

Returns:
module: Python module
"""
sha256 = hashlib.sha256(code.encode('UTF-8')).hexdigest()
module = imp.new_module(sha256)
try:
exec_(code, module.__dict__)
except Exception as e:
raise exceptions.UserError('User code exception', exception_message=str(e))
sys.modules[sha256] = module
return module


def verify_dataset(X, y):
"""Verifies if a dataset is valid for use i.e. scikit-learn format

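A quick usage sketch for the new helper (a hypothetical snippet, assuming the helper lives at ``xcessiv.functions``): the code string is imported as a throwaway module named after its SHA-256 hash, and its contents become ordinary module attributes.::

    from xcessiv.functions import import_string_code_as_module

    source = (
        "def add(a, b):\n"
        "    return a + b\n"
    )

    module = import_string_code_as_module(source)
    assert module.add(2, 3) == 5  # user code is now a regular module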
33 changes: 33 additions & 0 deletions xcessiv/models.py
@@ -145,6 +145,8 @@ class BaseLearnerOrigin(Base):
description = Column(JsonEncodedDict)
base_learners = relationship('BaseLearner', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)
automated_runs = relationship('AutomatedRun', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)
stacked_ensembles = relationship('StackedEnsemble', back_populates='base_learner_origin',
cascade='all, delete-orphan', single_parent=True)

@@ -193,6 +195,37 @@ def cleanup(self, path):
learner.cleanup(path)


class AutomatedRun(Base):
"""This table contains initialized/completed automated hyperparameter searches"""
__tablename__ = 'automatedrun'

id = Column(Integer, primary_key=True)
source = Column(Text)
job_status = Column(Text)
job_id = Column(Text)
description = Column(JsonEncodedDict)
base_learner_origin_id = Column(Integer, ForeignKey('baselearnerorigin.id'))
base_learner_origin = relationship('BaseLearnerOrigin', back_populates='automated_runs')

def __init__(self, source, job_status, base_learner_origin):
self.source = source
self.job_status = job_status
self.job_id = None
self.description = dict()
self.base_learner_origin = base_learner_origin

@property
def serialize(self):
return dict(
id=self.id,
source=self.source,
job_status=self.job_status,
job_id=self.job_id,
description=self.description,
base_learner_origin_id=self.base_learner_origin_id
)


association_table = Table(
'association', Base.metadata,
Column('baselearner_id', Integer, ForeignKey('baselearner.id')),
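As a hedged illustration of how the new model might be used (assuming a SQLAlchemy ``session`` and an existing ``BaseLearnerOrigin`` row named ``origin``; ``bayes_config_code`` is a hypothetical string holding the configuration code shown in the docs), queueing a run could look like this.::

    run = AutomatedRun(
        source=bayes_config_code,
        job_status='queued',
        base_learner_origin=origin
    )
    session.add(run)
    session.commit()
    print(run.serialize)  # JSON-ready dict, presumably consumed by the new views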
1 change: 1 addition & 0 deletions xcessiv/presets/metricsetting.py
@@ -228,6 +228,7 @@ def metric_generator(y_true, y_probas):
binarized = label_binarize(y_true, classes_)
if len(classes_) == 2:
binarized = binarized.ravel()
y_probas = y_probas[:, 1]
return roc_auc_score(binarized, y_probas, average='weighted')
""",
'selection_name': 'ROC AUC Score from Scores/Probabilities'
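The one-line fix above makes the binary case hand 1-D positive-class scores to :func:`sklearn.metrics.roc_auc_score`, which is what it expects alongside 1-D binarized labels. A minimal illustration with made-up data.::

    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.preprocessing import label_binarize

    y_true = np.array([0, 1, 1, 0])
    y_probas = np.array([[0.9, 0.1],   # columns: P(class 0), P(class 1)
                         [0.2, 0.8],
                         [0.3, 0.7],
                         [0.6, 0.4]])

    classes_ = [0, 1]
    binarized = label_binarize(y_true, classes=classes_)
    if len(classes_) == 2:
        binarized = binarized.ravel()  # shape (n, 1) -> (n,)
        y_probas = y_probas[:, 1]      # keep positive-class scores only
    print(roc_auc_score(binarized, y_probas, average='weighted'))  # 1.0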