Machine learning for internet of things related to water systems.
A collaborative open source project between Data to AI Lab at MIT and Xylem Inc.
- Documentation: https://D3-AI.github.io/water
- Homepage: https://github.com/D3-AI/water
The Water project is a collection of end-to-end solutions for machine learning tasks commonly found in monitoring water distribution and delivery networks. Most tasks use sensor data emanating from monitoring systems. We build on the foundational innovations developed for the automation of machine learning at the Data to AI Lab at MIT. This project is developed in close collaboration with Xylem Inc.
The salient aspects of this customized project are:
- A set of ready to use, well tested pipelines for different machine learning tasks. These are vetted through testing across multiple publicly available datasets for the same task.
- An easy interface to specify the task and pipeline, generate results, and summarize them.
- A production ready, deployable pipeline.
- An easy interface to tune pipelines using Bayesian Tuning and Bandits library.
- A community oriented infrastructure to incorporate new pipelines.
- A robust continuous integration and testing infrastructure.
- A learning database recording all past outcomes: tasks, pipelines, outcomes.
Before diving into the software usage, we briefly explain some concepts and terminology.
We call the smallest computational blocks used in a Machine Learning process primitives, which:
- Can be either classes or functions.
- Have some initialization arguments, which MLBlocks calls init_params.
- Have some tunable hyperparameters, which have types and a list or range of valid values.
Primitives can be combined to form what we call Templates, which:
- Have a list of primitives.
- Have some initialization arguments, which correspond to the initialization arguments of their primitives.
- Have some tunable hyperparameters, which correspond to the tunable hyperparameters of their primitives.
Templates can be used to build Pipelines by taking and fixing a set of valid hyperparameters for a Template. Hence, Pipelines:
- Have a list of primitives, which corresponds to the list of primitives of their template.
- Have some initialization arguments, which correspond to the initialization arguments of their template.
- Have some hyperparameter values, which fall within the ranges of valid tunable hyperparameters of their template.
A pipeline can be fitted and evaluated using the MLPipeline API in MLBlocks.
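The relationship between templates and pipelines can be sketched in plain Python. This is only an illustration of the concepts above, not the MLBlocks implementation: the Template and Pipeline classes, the build_pipeline helper, and the hyperparameter ranges are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """Hypothetical stand-in for a template: primitives plus tunable ranges."""
    primitives: list
    init_params: dict = field(default_factory=dict)
    # Maps (primitive, hyperparameter) to an inclusive (low, high) range.
    tunable: dict = field(default_factory=dict)


@dataclass
class Pipeline:
    """Hypothetical stand-in for a pipeline: a template with fixed values."""
    primitives: list
    init_params: dict
    hyperparameters: dict


def build_pipeline(template, hyperparameters):
    """Fix one valid value for every tunable hyperparameter of the template."""
    if set(hyperparameters) != set(template.tunable):
        raise ValueError('a value is required for every tunable hyperparameter')
    for key, value in hyperparameters.items():
        low, high = template.tunable[key]
        if not low <= value <= high:
            raise ValueError('{} = {} is outside [{}, {}]'.format(key, value, low, high))
    return Pipeline(
        list(template.primitives),
        dict(template.init_params),
        dict(hyperparameters),
    )


# The ranges below are made up for illustration.
template = Template(
    primitives=['featuretools.dfs', 'xgboost.XGBClassifier'],
    init_params={'featuretools.dfs': {'encode': True}},
    tunable={
        ('featuretools.dfs', 'max_depth'): (1, 3),
        ('xgboost.XGBClassifier', 'n_estimators'): (10, 1000),
    },
)
pipeline = build_pipeline(template, {
    ('featuretools.dfs', 'max_depth'): 2,
    ('xgboost.XGBClassifier', 'n_estimators'): 409,
})
```

The pipeline keeps the primitives and initialization arguments of its template and only adds the fixed hyperparameter values, which mirrors the three properties listed above.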
In our current phase, we are addressing two tasks: time series classification and time series regression. To provide solutions for these two tasks we have two components.
The TimeSeriesEstimator is the class in charge of learning from the data and making predictions by building MLBlocks pipelines and later tuning them using BTB.
This class comes in two flavours in the form of subclasses, the TimeSeriesClassifier and the TimeSeriesRegressor, to be used in the corresponding problem types.
The TimeSeriesLoader is the class responsible for loading the time series data from CSV files and returning it in a format ready to be used by the TimeSeriesEstimator.
A dataset is a folder that contains time series data and information about a Machine Learning problem in the form of CSV and JSON files.
The expected contents of the dataset folder are:
- A metadata.json with information about all the tables found in the dataset. This file follows the Metadata.json schema with three small modifications:
  - The root document has a name entry, with the name of the dataset.
  - The foreign key columns are of type id and subtype foreign.
  - The datetime columns that are time indexes need to have the time_index subtype.
- A CSV file containing the training samples with, at least, the following columns:
  - A unique index
  - A foreign key to at least one timeseries table
  - A time index that works as the cutoff time for the training example
  - If the problem is supervised, a target column.
Then, for each type of timeseries that exists in the dataset, there will be:
- A CSV file containing the id of each timeseries and any additional information associated with it
- A CSV file containing the timeseries data with the following columns:
  - A unique index
  - A foreign key to the timeseries table
  - A time index
  - At least one value column
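The column requirements above can be checked with a few lines of standard-library Python before handing a dataset to the loader. This is a sketch only: check_table is a hypothetical helper, not part of Water, and the column names of the timeseries table (id, timestamp, value) are assumptions chosen for illustration.

```python
import csv
import io


def check_table(csv_text, index, required):
    """Check that a CSV table has the required columns and a unique index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = [name for name in [index] + list(required) if name not in columns]
    if missing:
        raise ValueError('missing columns: {}'.format(missing))
    rows = list(reader)
    ids = [row[index] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError('index column {!r} is not unique'.format(index))
    return len(rows)


# Training samples: index, foreign key, cutoff time and target column.
training_csv = """demand_id,timeseries_id,cutoff_time,target
0,0,2010-01-25,1
1,1,2010-01-25,1
"""

# Timeseries data: index, foreign key, time index and one value column.
# These column names and values are made up for the example.
timeseries_csv = """id,timeseries_id,timestamp,value
0,0,2010-01-24,0.5
1,0,2010-01-25,0.7
2,1,2010-01-24,0.1
"""

n_samples = check_table(training_csv, 'demand_id',
                        ['timeseries_id', 'cutoff_time', 'target'])
n_points = check_table(timeseries_csv, 'id',
                       ['timeseries_id', 'timestamp', 'value'])
```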
We call tuning the process of, given a dataset and a template, finding the pipeline derived from the given template that gets the best possible score on the given dataset.
This process usually involves fitting and evaluating multiple pipelines with different hyperparameter values on the same data while using optimization algorithms to deduce which hyperparameters are more likely to get the best results in the next iterations.
We call each one of these tries a tuning iteration.
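The shape of a tuning loop can be sketched as follows. Note that this uses plain random search as a simple stand-in, not the Bayesian optimization that BTB provides, and that the tune function, the toy_score function, and the learning_rate range are all hypothetical.

```python
import random


def tune(score_function, ranges, iterations, seed=0):
    """Random search: try `iterations` candidates and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float('-inf')
    for _ in range(iterations):
        # One tuning iteration: propose values, then fit and evaluate.
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in ranges.items()
        }
        score = score_function(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up score standing in for cross validation; it peaks at 0.3."""
    return 1.0 - (params['learning_rate'] - 0.3) ** 2


best_params, best_score = tune(
    toy_score,
    {'learning_rate': (0.0, 1.0)},
    iterations=20,
)
```

A real tuner replaces the random proposal with a model of past scores, so that later iterations concentrate on the most promising regions of the hyperparameter space.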
The simplest and recommended way to install Water is using pip:
pip install ml-water
For development, you can also clone the repository and install it from sources:
git clone git@github.com:D3-AI/water.git
cd water
make install-develop
In this example we will load some demo data using the TimeSeriesLoader and feed it to the TimeSeriesClassifier so that it finds the best possible pipeline, fits it using the given data, and then makes predictions with it.
We first create a loader instance, passing the path to the dataset, the name of the column that we want to predict, and the name of the table where this column can be taken from.
from water.loader import TimeSeriesLoader

loader = TimeSeriesLoader(
    dataset_path='examples/datasets/ItalyPowerDemand',
    target_table='demand',
    target_column='target'
)
Then we call the loader.load method, which will return three elements:
- X: The contents of the target table, where the training examples can be found, without the target column.
- y: The target column, as extracted from the target table.
- data: A dictionary containing the additional elements that the Pipeline will need to run, including the actual time series data.
X, y, data = loader.load()
X.head(5)
| | demand_id | timeseries_id | cutoff_time |
|---|---|---|---|
| 0 | 0 | 0 | 2010-01-25 |
| 1 | 1 | 1 | 2010-01-25 |
| 2 | 2 | 2 | 2010-01-25 |
| 3 | 3 | 3 | 2010-01-25 |
| 4 | 4 | 4 | 2010-01-25 |
y.head(5)
0 1
1 1
2 2
3 2
4 1
Name: target, dtype: int64
data.keys()
dict_keys(['entities', 'relationships', 'target_entity', 'target_column', 'dataset_name'])
If we want to split the data into train and test subsets, we can do so by splitting the X and y variables.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
Once we have loaded the data, we create a TimeSeriesClassifier instance and call its tune method to find the best possible pipeline for our data.
We start by importing the TimeSeriesClassifier and creating an instance.
from water.estimators import TimeSeriesClassifier
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
tsc = TimeSeriesClassifier()
We then pass the X and y partitions and the data dictionary, along with the number of tuning iterations that we want to perform, to the tune method, which will run the indicated number of iterations trying to figure out the best possible hyperparameters.
tsc.tune(X_train, y_train, data, iterations=5)
After the tuning process has finished, the hyperparameters have already been set in the classifier.
We can see the hyperparameters that were found by calling the get_hyperparameters method:
tsc.get_hyperparameters()
{'mlprimitives.preprocessing.ClassEncoder#1': {},
'featuretools.dfs#1': {'encode': True,
'max_depth': 2,
'remove_low_information': False},
'sklearn.preprocessing.Imputer#1': {'missing_values': 'NaN',
'axis': 0,
'copy': True,
'strategy': 'mean'},
'xgboost.XGBClassifier#1': {'n_jobs': -1,
'n_estimators': 409,
'max_depth': 8,
'learning_rate': 0.45727735286952875,
'gamma': 0.7661016859076536,
'min_child_weight': 2},
'mlprimitives.preprocessing.ClassDecoder#1': {}}
The cross validation score obtained is available in the score attribute of the tsc object:
tsc.score
0.6212121212121213
Once we are satisfied with the obtained cross validation score, we can proceed to call the fit method, passing again the same data elements.
tsc.fit(X_train, y_train, data)
After this, we are ready to make predictions on new data:
predictions = tsc.predict(X_test, data)
predictions[0:5]
array([2, 1, 2, 2, 2])
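To judge the predictions, they can be compared with the held-out labels. The following is a minimal sketch that assumes plain accuracy is the metric of interest (the metric behind tsc.score is not stated here). The accuracy helper and the label lists are illustrative and not part of Water; scikit-learn's accuracy_score computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if not y_true:
        raise ValueError('cannot score an empty set of predictions')
    matches = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return matches / len(y_true)


# Made-up labels for illustration; in the example above these would be
# y_test and predictions.
example_true = [2, 1, 1, 2, 2]
example_pred = [2, 1, 2, 2, 2]
example_accuracy = accuracy(example_true, example_pred)
```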
The TimeSeriesClassifier and TimeSeriesRegressor have a default template for each problem.
This template can be overridden by passing a new template dictionary when the instance is created.
template = {
    'primitives': [
        'featuretools.dfs',
        'sklearn.preprocessing.Imputer',
        'sklearn.preprocessing.StandardScaler',
        'sklearn.ensemble.RandomForestClassifier',
    ],
    'init_params': {
        'featuretools.dfs#1': {
            'encode': True
        }
    }
}
tsc = TimeSeriesClassifier(template=template)
tsc.tune(X_train, y_train, data, iterations=5)
tsc.score
0.5404040404040403
tsc.fit(X_train, y_train, data)
predictions = tsc.predict(X_test, data)
predictions[0:5]
array([1, 1, 1, 1, 1])
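The init_params keys in the template above use the primitive name followed by #1, the same form as the keys in the get_hyperparameters output. The sketch below reproduces that naming, assuming the counter simply numbers repeated occurrences of a primitive within the list. Both block_names and unmatched_init_params are hypothetical helpers, not part of Water or MLBlocks, and the library may accept other key forms that this check would flag.

```python
from collections import Counter


def block_names(primitives):
    """Append '#<n>' to each primitive, counting repeated occurrences."""
    seen = Counter()
    names = []
    for primitive in primitives:
        seen[primitive] += 1
        names.append('{}#{}'.format(primitive, seen[primitive]))
    return names


def unmatched_init_params(template):
    """Return the init_params keys that match no block name (assumed form)."""
    valid = set(block_names(template['primitives']))
    return sorted(set(template.get('init_params', {})) - valid)


example_template = {
    'primitives': [
        'featuretools.dfs',
        'sklearn.preprocessing.Imputer',
        'sklearn.preprocessing.StandardScaler',
        'sklearn.ensemble.RandomForestClassifier',
    ],
    'init_params': {
        'featuretools.dfs#1': {'encode': True},
    },
}

names = block_names(example_template['primitives'])
unmatched = unmatched_init_params(example_template)
```

A check like this can catch a misspelled key before a tuning run is started with it.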