Add tpot integration #37

Merged
merged 13 commits into from
Jun 15, 2017
1 change: 1 addition & 0 deletions README.md
@@ -30,6 +30,7 @@ Xcessiv holds your hand through all the implementation details of creating and o
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Direct integration with [TPOT](https://github.com/rhiever/tpot) for automated pipeline construction
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
68 changes: 66 additions & 2 deletions docs/advanced.rst
@@ -1,5 +1,69 @@
-Automated Tuning
-================
+Automated Runs
+==============

Xcessiv includes support for various algorithms that aim to provide automation for things such as hyperparameter optimization and base learner/pipeline construction.

Once you begin an automated run, Xcessiv will take care of updating your base learner setups/base learners for you while you go do something else.

As of v0.4.0, Xcessiv supports two types of automated runs: Bayesian Hyperparameter Search and TPOT base learner construction.

TPOT base learner construction
------------------------------

Xcessiv is great for tuning different pipelines/base learners and stacking them together, but with so many possible pipeline combinations, it helps to have a tool that can build the pipeline for you automatically.

This is exactly what `TPOT <http://rhiever.github.io/tpot/>`_ promises to do for you.

As of v0.4, Xcessiv has built-in support for directly exporting the pipeline code generated by TPOT as a base learner setup in Xcessiv.

Right next to the **Add new base learner origin** button, click on the **Automated base learner generation with TPOT** button. In the modal that pops up, enter the following code::

    from tpot import TPOTClassifier

    tpot_learner = TPOTClassifier(generations=5, population_size=50, verbosity=2)

To use TPOT, simply define a :class:`TPOTClassifier` or :class:`TPOTRegressor` and assign it to the variable ``tpot_learner``. The arguments for :class:`TPOTClassifier` and :class:`TPOTRegressor` can be found in the `TPOT API documentation <http://rhiever.github.io/tpot/api/>`_.

When you click **Go**, a new automated run will be created that runs ``tpot_learner`` on your training data then creates a new base learner setup containing the code for the best pipeline found by TPOT.
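As a rough illustration of what happens after **Go** is clicked, the sketch below fits a learner, exports its pipeline to a temporary file, and reads the source back. ``StubTPOTLearner`` and ``export_pipeline_source`` are hypothetical names used only for this example; the stub stands in for a real ``TPOTClassifier`` (which exposes ``fit`` and ``export``) so the snippet runs without TPOT installed. It is a sketch under those assumptions, not Xcessiv's actual implementation.

```python
import os
import tempfile


class StubTPOTLearner(object):
    """Hypothetical stand-in for a TPOT estimator; only mimics fit() and export()."""

    def fit(self, X, y):
        # A real TPOT learner would run its pipeline search here.
        self.fitted_ = True
        return self

    def export(self, output_file_name):
        # A real TPOT learner writes the best pipeline's Python source to this file.
        with open(output_file_name, "w") as f:
            f.write("exported_pipeline = 'best pipeline goes here'\n")


def export_pipeline_source(tpot_learner, X, y, header=""):
    """Fit the learner, export its pipeline to a temp file, and return the source."""
    tpot_learner.fit(X, y)
    fd, temp_filename = tempfile.mkstemp(prefix="tpot-temp-export-", suffix=".py")
    os.close(fd)
    try:
        tpot_learner.export(temp_filename)
        with open(temp_filename) as f:
            source = f.read()
    finally:
        # Always clean up the temporary file, even if the export fails.
        os.remove(temp_filename)
    return header + source


source = export_pipeline_source(
    StubTPOTLearner(), [[0.0], [1.0]], [0, 1], header="# generated by TPOT\n"
)
```

With a real ``TPOTClassifier`` in place of the stub, ``source`` would hold the exported pipeline code that ends up in the new base learner setup.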

Once TPOT is finished, you'll likely end up with something like this in your newly generated base learner::

    import numpy as np

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    # NOTE: Make sure that the class is labeled 'class' in the data file
    tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
    features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
    training_features, testing_features, training_classes, testing_classes = \
        train_test_split(features, tpot_data['class'], random_state=42)

    exported_pipeline = make_pipeline(
        Normalizer(norm="max"),
        ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.15, min_samples_leaf=7, min_samples_split=13, n_estimators=100)
    )

    exported_pipeline.fit(training_features, training_classes)
    results = exported_pipeline.predict(testing_features)

To convert it to an Xcessiv-compatible base learner, remove all the unneeded parts and modify the code to this::

    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    base_learner = make_pipeline(
        Normalizer(norm="max"),
        ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=0.15, min_samples_leaf=7, min_samples_split=13, n_estimators=100, random_state=8)
    )

Notice two changes: we renamed ``exported_pipeline`` to ``base_learner`` to follow the Xcessiv format, and set the ``random_state`` parameter in the :class:`sklearn.ensemble.ExtraTreesClassifier` object to 8 for determinism.

Set the name, meta-feature generator, and metrics for your base learner setup as usual, then verify and confirm. You will now be able to use your curated pipeline as any other base learner in your Xcessiv workflow.
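The manual clean-up described above could also be approximated programmatically. The following is a minimal sketch, assuming Python 3.8+ (for ``ast.get_source_segment``) and assuming the export always assigns the pipeline to ``exported_pipeline``; the helper name ``tpot_export_to_base_learner`` and the shortened sample export are hypothetical. It keeps the ``sklearn`` imports, drops the data-loading and fitting code, and renames the pipeline to ``base_learner``. It does not set ``random_state``, which would still need to be added by hand.

```python
import ast

# Shortened, illustrative TPOT export (not real TPOT output).
TPOT_EXPORT = '''import numpy as np

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)

exported_pipeline = make_pipeline(
    Normalizer(norm="max"),
    ExtraTreesClassifier(n_estimators=100)
)

exported_pipeline.fit(training_features, training_classes)
'''


def tpot_export_to_base_learner(source):
    """Keep sklearn imports and the pipeline definition, renamed to base_learner."""
    kept = []
    for node in ast.parse(source).body:
        is_sklearn_import = (
            isinstance(node, ast.ImportFrom)
            and node.module is not None
            and node.module.startswith("sklearn")
        )
        is_pipeline_assignment = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == "exported_pipeline"
        )
        if is_sklearn_import:
            kept.append(ast.get_source_segment(source, node))
        elif is_pipeline_assignment:
            # Re-emit only the right-hand side under the name Xcessiv expects.
            kept.append("base_learner = " + ast.get_source_segment(source, node.value))
    return "\n".join(kept) + "\n"


converted = tpot_export_to_base_learner(TPOT_EXPORT)
print(converted)
```

Working on the parsed syntax tree rather than on raw text lines avoids accidentally cutting a multi-line ``make_pipeline(...)`` call in half.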

Bayesian Hyperparameter Search
------------------------------
1 change: 1 addition & 0 deletions docs/index.rst
@@ -16,6 +16,7 @@ Features
* Fully define your data source, cross-validation process, relevant metrics, and base learners with Python code
* Any model following the Scikit-learn API can be used as a base learner
* Task queue based architecture lets you take full advantage of multiple cores and embarrassingly parallel hyperparameter searches
* Direct integration with `TPOT <https://github.com/rhiever/tpot>`_ for automated pipeline construction
* Automated hyperparameter search through Bayesian optimization
* Easy management and comparison of hundreds of different model-hyperparameter combinations
* Automatic saving of generated secondary meta-features
4 changes: 4 additions & 0 deletions docs/thirdparty.rst
@@ -10,6 +10,10 @@ Here are a few example workflows using third party libraries that work well with
Xcessiv with TPOT
-----------------

.. admonition:: Note

   As of v0.4, Xcessiv provides direct integration with TPOT. See :ref:`TPOT base learner construction` for details. This section is kept to demonstrate the power of stacking different TPOT pipelines together.

Xcessiv is a great tool for tuning different models and pipelines and stacking them into one big ensemble, but with all the possible combinations of pipelines, where would you even begin?

Enter TPOT.
1 change: 1 addition & 0 deletions requirements.txt
@@ -8,3 +8,4 @@ scikit-learn>=0.18
scipy>=0.18
six>=1.10
SQLAlchemy>=1.1
TPOT>=0.8
5 changes: 3 additions & 2 deletions setup.py
@@ -33,7 +33,7 @@ def run_tests(self):

 setup(
     name='xcessiv',
-    version='0.3.8',
+    version='0.4.0',
     url='https://github.com/reiinakano/xcessiv',
     license='Apache License 2.0',
     author='Reiichiro Nakano',
@@ -48,7 +48,8 @@ def run_tests(self):
         'scikit-learn>=0.18.0',
         'scipy>=0.18.0',
         'six>=1.10.0',
-        'SQLAlchemy>=1.1.0'
+        'SQLAlchemy>=1.1.0',
+        'TPOT>=0.8'
     ],
     cmdclass={'test': PyTest},
     author_email='reiichiro.s.nakano@gmail.com',
2 changes: 1 addition & 1 deletion xcessiv/__init__.py
@@ -2,7 +2,7 @@
 from flask import Flask


-__version__ = '0.3.8'
+__version__ = '0.4.0'


 app = Flask(__name__, static_url_path='/static', static_folder='ui/build/static')
255 changes: 255 additions & 0 deletions xcessiv/automatedruns.py
@@ -0,0 +1,255 @@
"""This module contains functions for the automated runs"""
from __future__ import absolute_import, print_function, division, unicode_literals
from rq import get_current_job
from xcessiv import functions
from xcessiv import models
from xcessiv import constants
import numpy as np
import os
import sys
import traceback
from six import iteritems
import numbers
from bayes_opt import BayesianOptimization


def return_func_to_optimize(path, session, base_learner_origin, default_params,
                            metric_to_optimize, invert_metric, integers):
    """Creates the function to be optimized by Bayes Optimization.

    The function automatically handles the case of already existing base learners, and if
    no base learner for the hyperparameters exists yet, creates one and updates it in the
    usual way.

    Args:
        path (str): Path to Xcessiv notebook

        session: Database session passed down

        base_learner_origin: BaseLearnerOrigin object

        default_params (dict): Dictionary containing default params of estimator

        metric_to_optimize (str, unicode): String containing name of metric to optimize

        invert_metric (bool): Specifies whether metric should be inverted e.g. losses

        integers (set): Set of strings that specify which hyperparameters are integers

    Returns:
        func_to_optimize (function): Function to be optimized
    """
    def func_to_optimize(**params):
        base_estimator = base_learner_origin.return_estimator()
        base_estimator.set_params(**default_params)
        # For integer hyperparameters, make sure they are rounded off
        params = dict((key, val) if key not in integers else (key, int(val))
                      for key, val in iteritems(params))
        base_estimator.set_params(**params)
        hyperparameters = functions.make_serializable(base_estimator.get_params())

        # Look if base learner already exists
        base_learner = session.query(models.BaseLearner).\
            filter_by(base_learner_origin_id=base_learner_origin.id,
                      hyperparameters=hyperparameters).first()

        calculate_only = False

        # If base learner exists and has finished, just return its result
        if base_learner and base_learner.job_status == 'finished':
            if invert_metric:
                return -base_learner.individual_score[metric_to_optimize]
            else:
                return base_learner.individual_score[metric_to_optimize]

        # else if base learner exists but is unfinished, just calculate the result without storing
        elif base_learner and base_learner.job_status != 'finished':
            calculate_only = True

        # else if base learner does not exist, create it
        else:
            base_learner = models.BaseLearner(hyperparameters,
                                              'started',
                                              base_learner_origin)
            base_learner.job_id = get_current_job().id
            session.add(base_learner)
            session.commit()

        try:
            est = base_learner.return_estimator()
            extraction = session.query(models.Extraction).first()
            X, y = extraction.return_train_dataset()
            return_splits_iterable = functions.import_object_from_string_code(
                extraction.meta_feature_generation['source'],
                'return_splits_iterable'
            )

            meta_features_list = []
            trues_list = []
            for train_index, test_index in return_splits_iterable(X, y):
                X_train, X_test = X[train_index], X[test_index]
                y_train, y_test = y[train_index], y[test_index]
                est = est.fit(X_train, y_train)
                meta_features_list.append(
                    getattr(est, base_learner.base_learner_origin.
                            meta_feature_generator)(X_test)
                )
                trues_list.append(y_test)
            meta_features = np.concatenate(meta_features_list, axis=0)
            y_true = np.concatenate(trues_list)

            for key in base_learner.base_learner_origin.metric_generators:
                metric_generator = functions.import_object_from_string_code(
                    base_learner.base_learner_origin.metric_generators[key],
                    'metric_generator'
                )
                base_learner.individual_score[key] = metric_generator(y_true, meta_features)

            # Only do this if you want to save things
            if not calculate_only:
                meta_features_path = base_learner.meta_features_path(path)

                if not os.path.exists(os.path.dirname(meta_features_path)):
                    os.makedirs(os.path.dirname(meta_features_path))

                np.save(meta_features_path, meta_features, allow_pickle=False)
                base_learner.job_status = 'finished'
                base_learner.meta_features_exists = True
                session.add(base_learner)
                session.commit()

            if invert_metric:
                return -base_learner.individual_score[metric_to_optimize]
            else:
                return base_learner.individual_score[metric_to_optimize]

        except:
            session.rollback()
            base_learner.job_status = 'errored'
            base_learner.description['error_type'] = repr(sys.exc_info()[0])
            base_learner.description['error_value'] = repr(sys.exc_info()[1])
            base_learner.description['error_traceback'] = \
                traceback.format_exception(*sys.exc_info())
            session.add(base_learner)
            session.commit()
            raise
    return func_to_optimize


def start_naive_bayes(automated_run, session, path):
    """Starts naive bayes automated run

    Args:
        automated_run (xcessiv.models.AutomatedRun): Automated run object

        session: Valid SQLAlchemy session

        path (str, unicode): Path to project folder
    """
    module = functions.import_string_code_as_module(automated_run.source)
    random_state = 8 if not hasattr(module, 'random_state') else module.random_state
    assert module.metric_to_optimize in automated_run.base_learner_origin.metric_generators

    # get non-searchable parameters
    base_estimator = automated_run.base_learner_origin.return_estimator()
    base_estimator.set_params(**module.default_params)
    default_params = functions.make_serializable(base_estimator.get_params())
    non_searchable_params = dict((key, val) for key, val in iteritems(default_params)
                                 if key not in module.pbounds)

    # get already calculated base learners in search space
    existing_base_learners = []
    for base_learner in automated_run.base_learner_origin.base_learners:
        if not base_learner.job_status == 'finished':
            continue
        in_search_space = True
        for key, val in iteritems(non_searchable_params):
            if base_learner.hyperparameters[key] != val:
                in_search_space = False
                break  # If no match, move on to the next base learner
        if in_search_space:
            existing_base_learners.append(base_learner)

    # build initialize dictionary
    target = []
    initialization_dict = dict((key, list()) for key in module.pbounds.keys())
    for base_learner in existing_base_learners:
        # check if base learner's searchable hyperparameters are all numerical
        all_numerical = True
        for key in module.pbounds.keys():
            if not isinstance(base_learner.hyperparameters[key], numbers.Number):
                all_numerical = False
                break
        if not all_numerical:
            continue  # if there is a non-numerical hyperparameter, skip this.

        for key in module.pbounds.keys():
            initialization_dict[key].append(base_learner.hyperparameters[key])
        target.append(base_learner.individual_score[module.metric_to_optimize])
    initialization_dict['target'] = target if not module.invert_metric \
        else list(map(lambda x: -x, target))
    print('{} existing in initialization dictionary'.
          format(len(initialization_dict['target'])))

    # Create function to be optimized
    func_to_optimize = return_func_to_optimize(
        path, session, automated_run.base_learner_origin, module.default_params,
        module.metric_to_optimize, module.invert_metric, set(module.integers)
    )

    # Create Bayes object
    bo = BayesianOptimization(func_to_optimize, module.pbounds)

    bo.initialize(initialization_dict)

    np.random.seed(random_state)

    bo.maximize(**module.maximize_config)

    automated_run.job_status = 'finished'
    session.add(automated_run)
    session.commit()


def start_tpot(automated_run, session, path):
    """Starts a TPOT automated run that exports directly to base learner setup

    Args:
        automated_run (xcessiv.models.AutomatedRun): Automated run object

        session: Valid SQLAlchemy session

        path (str, unicode): Path to project folder
    """
    module = functions.import_string_code_as_module(automated_run.source)
    extraction = session.query(models.Extraction).first()
    X, y = extraction.return_train_dataset()

    tpot_learner = module.tpot_learner

    tpot_learner.fit(X, y)

    temp_filename = os.path.join(path, 'tpot-temp-export-{}'.format(os.getpid()))
    tpot_learner.export(temp_filename)

    with open(temp_filename) as f:
        base_learner_source = f.read()

    base_learner_source = constants.tpot_learner_docstring + base_learner_source

    try:
        os.remove(temp_filename)
    except OSError:
        pass

    blo = models.BaseLearnerOrigin(
        source=base_learner_source,
        name='TPOT Learner',
        meta_feature_generator='predict'
    )

    automated_run.job_status = 'finished'

    session.add(blo)
    session.add(automated_run)
    session.commit()
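
The warm-start step in `start_naive_bayes` above (collecting already-finished base learners into an initialization dictionary) can be illustrated in isolation. This is a simplified sketch, not the module's code: `build_initialization_dict` and the `(hyperparameters, score)` record format are hypothetical stand-ins for the SQLAlchemy `BaseLearner` objects, and `search_keys` plays the role of `module.pbounds.keys()`.

```python
import numbers


def build_initialization_dict(records, search_keys, invert_metric=False):
    """Collect prior results into per-hyperparameter lists plus a 'target' list.

    records: list of (hyperparameters dict, score) pairs for finished learners.
    search_keys: names of the hyperparameters being searched.
    """
    init = {key: [] for key in search_keys}
    target = []
    for hyperparameters, score in records:
        # Skip learners whose searchable hyperparameters are not all numeric,
        # since the optimizer can only work with numeric coordinates.
        if not all(isinstance(hyperparameters[k], numbers.Number) for k in search_keys):
            continue
        for key in search_keys:
            init[key].append(hyperparameters[key])
        # Losses are negated so that "maximize" still means "better".
        target.append(-score if invert_metric else score)
    init["target"] = target
    return init


records = [
    ({"C": 1.0, "gamma": 0.1}, 0.9),
    ({"C": "auto", "gamma": 0.2}, 0.8),   # non-numeric C, so this record is skipped
    ({"C": 10, "gamma": 0.01}, 0.95),
]
init = build_initialization_dict(records, ["C", "gamma"], invert_metric=True)
```

Seeding the optimizer with these points means earlier manual experiments in the same search space are not recomputed.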