Most data science projects involve similar ingredients (loading data, defining some evaluation metrics, splitting data into different train/validation/test sets, etc.). Modev's goal is to ease these repetitive steps, without constraining the freedom data scientists need to develop models.
The easiest way is to install it from pip:
```
pip install modev
```
Otherwise you can clone the latest release and install it manually:
```
git clone git@github.com:pabloarosado/modev.git
cd modev
python setup.py install
```
Alternatively, you can install it from conda:
```
conda install -c pablorosado modev
```
The quickest way to get started with modev is to run a pipeline with the default settings:
```python
import modev
pipe = modev.Pipeline()
pipe.run()
```
This runs a pipeline on some example data, and returns a dataframe with a ranking of approaches that perform best (given some metrics) on the data.
To get the data used in the pipeline:
```python
pipe.get_data()
```
By default, modev splits the data into a playground and a test set. The test set is omitted (unless the parameter `validation_inputs['test_mode']` is set to True), and the playground is split into k train/dev folds, to do k-fold cross-validation. To get the indexes of the train/dev/test sets:
```python
pipe.get_indexes()
```
The pipeline will load two dummy approaches (which can be accessed via `pipe.approaches_function`) with some parameters (which can be accessed via `pipe.approaches_pars`). For each fold, these approaches will be fitted to the train set and predict the 'color' of the examples in the dev sets. The metrics used to evaluate the performance of the approaches are listed in `pipe.evaluation_pars['metrics']`.
An exhaustive grid search is performed, to get all possible combinations of the parameters of each of the approaches. The performance of each of these combinations on each fold can be accessed with:
```python
pipe.get_results()
```
To plot these results per fold for each of the metrics:
```python
pipe.plot_results()
```
To plot only certain metrics, pass the list of metric names as an argument to this function.
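For example (assuming the metric names are among those in `pipe.evaluation_pars['metrics']`):
```python
pipe.plot_results(['precision', 'recall'])
```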
To get the final ranking of best approaches (after combining the results of different folds):
```python
pipe.get_selected_models()
```
The inputs accepted by `modev.Pipeline` refer to the usual ingredients in a data science project (data loading, evaluation metrics, model selection method, etc.). We define an experiment as a combination of all these ingredients. An experiment is defined by a dictionary with the following keys:
- `load_inputs`: Dictionary of inputs related to data loading.
  - Using the default function.
    If `function` is not given, `modev.etl.load_local_file` will be used. This function loads a local (.csv) file. It uses the `pandas.read_csv` function and accepts all its arguments, as well as some additional arguments.
    - Arguments that must be defined in `load_inputs`:
      - `data_file`: str
        Path to local (.csv) file.
    - Arguments that can optionally be defined in `load_inputs`:
      - `selection`: str or None
        Selection to perform on the data. For example, if selection is `"(data['height'] > 3) & (data['width'] < 2)"`, that selection will be evaluated and applied to the data; None to apply no selection.
        Default: None
      - `sample_nrows`: int or None
        Number of random rows to sample from the data (without repeating rows); None to load all rows.
        Default: None
      - `random_state`: int or None
        Random state (relevant only when sampling from the data, i.e. when `sample_nrows` is not None).
        Default: None
  - Using a custom function.
    If the `function` key is contained in the `load_inputs` dictionary, its value must be a valid function.
    - Arguments that this custom function must accept:
      This function can have an arbitrary number of mandatory arguments (or none), to be specified in `load_inputs`. Additionally, it can have an arbitrary number of optional arguments (or none), also to be specified in `load_inputs`.
    - Outputs this custom function must return:
      - `data`: pd.DataFrame
        Relevant data.
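  For illustration, a `load_inputs` dictionary using the default loader might look as follows (a minimal sketch; the file path and values are hypothetical):

  ```python
  load_inputs = {
      'data_file': 'data/example.csv',                            # path to a local csv file
      'selection': "(data['height'] > 3) & (data['width'] < 2)",  # optional row selection
      'sample_nrows': 1000,                                       # optional random subsample
      'random_state': 42,                                         # seed for the sampling
  }
  ```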
- `validation_inputs`: Dictionary of inputs related to the validation method (e.g. k-fold or temporal-fold cross-validation).
  - Using the default function.
    If `function` is not given, `modev.validation.k_fold_playground_n_tests_split` will be used. This function generates indexes that split the data into a playground (with k folds) and n test sets. There is only one playground, which contains train and dev sets, and has no overlap with test sets. The playground is split into k folds, namely k non-overlapping dev sets and k overlapping train sets. Each of the folds contains all the data in the playground (part of it in train, and the rest in dev); hence the train and dev sets of the same fold do not overlap.
    - Arguments that must be defined in `validation_inputs`:
      None (all arguments will be taken from defaults if not explicitly given).
    - Arguments that can optionally be defined in `validation_inputs`:
      - `playground_n_folds`: int
        Number of folds to split the playground into (also called `k`), so that there will be k train sets and k dev sets.
        Default: 4
      - `test_fraction`: float
        Fraction of data to use for test sets.
        Default: 0.2
      - `test_n_sets`: int
        Number of test sets.
        Default: 2
      - `labels`: list or None
        Labels used to stratify the data according to their distribution; None to not stratify the data.
        Default: None
      - `shuffle`: bool
        True to shuffle the data before splitting; False to keep them sorted as they are before splitting.
        Default: True
      - `random_state`: int or None
        Random state for shuffling; ignored if `shuffle` is False (in which case `random_state` can be set to None).
        Default: None
      - `test_mode`: bool
        True to return indexes of the test set; False to return indexes of the dev set.
        Default: False
  - Using a custom function.
    If the `function` key is contained in the `validation_inputs` dictionary, its value must be a valid function.
    - Arguments that this custom function must accept:
      - `data`: pd.DataFrame
        Indexed data (e.g. a dataframe whose index can be accessed with `data.index`).
      Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in the `validation_inputs` dictionary.
    - Outputs this custom function must return:
      - `train_indexes`: dict
        Indexes to use for training on the different k folds, e.g. for 10 folds: `{0: np.array([...]), 1: np.array([...]), ..., 9: np.array([...])}`
      - `test_indexes`: dict
        Indexes to use for evaluating (either dev or test) on the different k folds, e.g. for 10 folds and if `test_mode` is False: `{0: np.array([...]), 1: np.array([...]), ..., 9: np.array([...])}`
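  For illustration, a `validation_inputs` dictionary overriding some of the defaults (a minimal sketch; all values are illustrative):

  ```python
  validation_inputs = {
      'playground_n_folds': 5,  # split the playground into 5 train/dev folds
      'test_fraction': 0.2,     # keep 20% of the data for test sets
      'test_n_sets': 1,         # create a single test set
      'shuffle': True,          # shuffle the data before splitting
      'random_state': 0,        # make the shuffling reproducible
  }
  ```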
- `execution_inputs`: Dictionary of inputs related to the execution of approaches.
  - Using the default function.
    If `function` is not given, `modev.execution.execute_model` will be used. This function defines the execution method (including training and prediction, and any possible preprocessing) for an approach. It takes an approach `approach_function` with parameters `approach_pars`, a train set (with predictors `train_x` and targets `train_y`) and the predictors of a test set `test_x`, and returns the predicted targets of the test set.
    Note: Here, `test` refers to either a dev or a test set, indistinctly.
    - Arguments that must be defined in `execution_inputs`:
      - `target`: str
        Name of the target column in both train_set and test_set.
    - Arguments that can optionally be defined in `execution_inputs`:
      None (this function does not accept any other optional arguments).
  - Using a custom function.
    If the `function` key is contained in the `execution_inputs` dictionary, its value must be a valid function.
    - Arguments that this custom function must accept:
      - `model`: model object
        Instantiated approach.
      - `data`: pd.DataFrame
        Data, as returned by the load inputs function.
      - `fold_train_indexes`: np.array
        Indexes of the train set (or playground set) for the current fold.
      - `fold_test_indexes`: np.array
        Indexes of the dev set (or test set) for the current fold.
      - `target`: str
        Name of the target column in both train_set and test_set.
      Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in the `execution_inputs` dictionary.
    - Outputs this custom function must return:
      - `execution_results`: dict
        Execution results, containing:
        - `truth`: np.array of true values of the target in the dev (or test) set.
        - `prediction`: np.array of predicted values of the target in the dev (or test) set.
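  As a sketch of a custom execution function (hypothetical, but following the arguments and outputs described above; it assumes all non-target columns are predictors):

  ```python
  import numpy as np

  def execute_plain(model, data, fold_train_indexes, fold_test_indexes, target):
      # Select the train and dev/test parts of the current fold.
      train = data.loc[fold_train_indexes]
      test = data.loc[fold_test_indexes]
      # Fit the instantiated approach and predict the target of the dev/test set.
      model.fit(train.drop(columns=target), train[target])
      prediction = model.predict(test.drop(columns=target))
      return {'truth': test[target].values, 'prediction': np.asarray(prediction)}

  execution_inputs = {'function': execute_plain, 'target': 'color'}
  ```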
- `evaluation_inputs`: Dictionary of inputs related to evaluation metrics.
  - Using the default function.
    If `function` is not given, `modev.evaluation.evaluate_predictions` will be used. This function evaluates predictions, given a ground truth, using a list of metrics.
    - Arguments that must be defined in `evaluation_inputs`:
      - `metrics`: list
        Metrics to use for evaluation. Implemented methods include:
        - `precision`: usual precision in classification problems.
        - `recall`: usual recall in classification problems.
        - `f1`: usual f1-score in classification problems.
        - `accuracy`: usual accuracy in classification problems.
        - `precision_at_*`: precision at k (e.g. 'precision_at_10') or at k percent (e.g. 'precision_at_5_pct').
        - `recall_at_*`: recall at k (e.g. 'recall_at_10') or at k percent (e.g. 'recall_at_5_pct').
        - `threshold_at_*`: threshold at k (e.g. 'threshold_at_10') or at k percent (e.g. 'threshold_at_5_pct').
        Note: For the time being, all metrics have to return only one number; in the case of a multi-class classification, a micro-average precision is returned.
  - Using a custom function.
    If the `function` key is contained in the `evaluation_inputs` dictionary, its value must be a valid function.
    - Arguments that this custom function must accept:
      - `execution_results`: dict
        Execution results as returned by the execution inputs function. It must contain a 'truth' and a 'prediction' key.
      Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in the `evaluation_inputs` dictionary.
    - Outputs this custom function must return:
      - `results`: dict
        Results of evaluation. Each element in the dictionary corresponds to one of the metrics.
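  For illustration, using the default evaluation function with some of the implemented metrics:

  ```python
  evaluation_inputs = {'metrics': ['precision', 'recall', 'f1', 'accuracy']}
  ```

  A custom metric could be sketched as follows (hypothetical, but following the inputs and outputs described above):

  ```python
  import numpy as np

  def evaluate_error_rate(execution_results):
      truth = np.asarray(execution_results['truth'])
      prediction = np.asarray(execution_results['prediction'])
      # Return one entry per metric.
      return {'error_rate': float(np.mean(truth != prediction))}

  evaluation_inputs = {'function': evaluate_error_rate}
  ```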
- `exploration_inputs`: Dictionary of inputs related to the method used to explore the parameter space (e.g. grid search or random search).
  - Using the default class.
    If `function` is not given, `modev.exploration.GridSearch` will be used. This class allows for a grid-search exploration of the parameter space.
  - Using a custom class.
    If the `function` key is contained in the `exploration_inputs` dictionary, its value must be a valid class.
    - Arguments that this custom class must accept:
      - `approaches_pars`: dict
        Dictionaries of approaches. Each key corresponds to one approach name, and the value is a dictionary. This inner dictionary of an individual approach has one key per parameter, and the value is a list of parameter values to explore.
      - `folds`: list
        List of folds (e.g. `[0, 1, 2, 3]`).
      - `results`: pd.DataFrame or None
        Existing results to load; None to initialise results from scratch.
      Additionally, this class can have an arbitrary number of optional arguments (or none), to be specified in the `exploration_inputs` dictionary.
    - Methods this custom class must implement:
      - `initialise_results`: function
        Initialise the results dataframe and return it.
      - `select_executions_left`: function
        Select the rows of results left to be executed and return the number of rows.
      - `get_next_point`: function
        Return the next point of the parameter space to be explored.
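  A custom exploration class could be sketched as follows (hypothetical; only the documented interface is shown, with the method bodies elided):

  ```python
  class MyParameterSearch:
      def __init__(self, approaches_pars, folds, results=None):
          self.approaches_pars = approaches_pars  # {approach_name: {parameter: [values to explore]}}
          self.folds = folds                      # e.g. [0, 1, 2, 3]
          self.results = results                  # existing results to load, or None

      def initialise_results(self):
          # Initialise the results dataframe and return it.
          ...

      def select_executions_left(self):
          # Select the rows of results left to be executed and return the number of rows.
          ...

      def get_next_point(self):
          # Return the next point of the parameter space to be explored.
          ...

  exploration_inputs = {'function': MyParameterSearch}
  ```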
- `selection_inputs`: Dictionary of inputs related to the model selection method.
  - Using the default function.
    If `function` is not given, `modev.selection.model_selection` will be used. This function takes the evaluation of approaches on some folds, and selects the best model.
    - Arguments that must be defined in `selection_inputs`:
      - `main_metric`: str
        Name of the main metric (the one that has to be maximized).
    - Arguments that can optionally be defined in `selection_inputs`:
      - `aggregation_method`: str
        Aggregation method used to combine the evaluations of different folds (e.g. 'mean').
        Default: 'mean'
      - `results_condition`: str or None
        Condition to be applied to the results dataframe before combining the results from different folds.
        Default: None
      - `combined_results_condition`: str or None
        Condition to be applied to the results dataframe after combining the results from different folds.
        Default: None
  - Using a custom function.
    If the `function` key is contained in the `selection_inputs` dictionary, its value must be a valid function.
    - Arguments that this custom function must accept:
      - `results`: pd.DataFrame
        Evaluations of the performance of approaches on different data folds (output of the function used in `evaluation_inputs`).
      Additionally, this function can have an arbitrary number of optional arguments (or none), to be specified in the `selection_inputs` dictionary.
    - Outputs this custom function must return:
      - `combine_results_sorted`: pd.DataFrame
        Ranking of results (sorted in descending value of 'main_metric') of approaches that fulfil the imposed conditions.
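  For illustration, a `selection_inputs` dictionary using the default function (values are illustrative):

  ```python
  selection_inputs = {
      'main_metric': 'f1',           # metric to maximize when ranking approaches
      'aggregation_method': 'mean',  # combine the evaluations of different folds by averaging
  }
  ```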
- `approaches_inputs`: List of dictionaries, one per approach to be used.
  - Definition of an approach.
    Each dictionary in the list has at least two keys:
    - `approach_name`: Name of the approach.
    - `function`: Actual approach (usually, a class with 'fit' and 'predict' methods).
    Any other key in the dictionary of an approach will be assumed to be an argument of that approach.
    To see some examples of simple approaches, see `modev.approaches.DummyPredictor` and `modev.approaches.RandomChoicePredictor`.
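  For illustration, a hypothetical approach and its entry in `approaches_inputs` (any key other than `approach_name` and `function` lists the values to explore for that parameter):

  ```python
  import numpy as np

  class MostFrequentPredictor:
      """Hypothetical approach: always predict the most frequent training value."""
      def __init__(self, fallback='unknown'):
          self.fallback = fallback
          self.most_frequent = None

      def fit(self, train_x, train_y):
          values, counts = np.unique(train_y, return_counts=True)
          self.most_frequent = values[np.argmax(counts)] if len(values) > 0 else self.fallback

      def predict(self, test_x):
          return np.full(len(test_x), self.most_frequent)

  approaches_inputs = [
      {'approach_name': 'most_frequent',
       'function': MostFrequentPredictor,
       'fallback': ['unknown', 'other']},  # values of the 'fallback' parameter to explore
  ]
  ```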
An experiment can be contained in a Python module. As an example, there is a template experiment in `modev.templates`, which is a small variation of the default experiment.
To start a pipeline on this experiment:
```python
from modev import Pipeline, templates

experiment = templates.experiment_01.experiment
pipe = Pipeline(**experiment)
```
And to run it, follow the example in the quick guide.
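For reference, a custom experiment could combine the ingredients described above into a single dictionary (a hypothetical sketch; the values are illustrative, and the approaches list is elided):
```python
from modev import Pipeline

experiment = {
    'load_inputs': {'data_file': 'data/example.csv'},
    'validation_inputs': {'playground_n_folds': 4, 'test_fraction': 0.2},
    'execution_inputs': {'target': 'color'},
    'evaluation_inputs': {'metrics': ['precision', 'recall', 'f1']},
    'exploration_inputs': {},  # no 'function' key, so the default grid search is used
    'selection_inputs': {'main_metric': 'f1'},
    'approaches_inputs': [...],  # see the approach example above
}

pipe = Pipeline(**experiment)
pipe.run()
```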