Update documentation.
isaksamsten committed Feb 2, 2024
1 parent 573341d commit 9ae4b45
Showing 7 changed files with 407 additions and 434 deletions.
253 changes: 250 additions & 3 deletions docs/guide.rst
@@ -1,9 +1,256 @@
.. currentmodule:: wildboar

##########
User guide
User Guide
##########

The user guide intends to introduce the estimators and methods Wildboar has to
offer.
Typically, a machine learning problem involves a collection of `n` data
samples, and the objective is to predict properties of previously unseen data.
Wildboar focuses on machine learning problems in which the data samples are
series, such as time series or other types of data that are ordered
chronologically or logically.

.. note::

   For solving general machine learning problems with Python, consider using
   `scikit-learn <https://scikit-learn.org>`__.

Similar to general machine learning problems, temporal machine learning is
concerned with problems that fall into different categories:

- Supervised learning, in which the data series are labeled with additional
  information. The additional information can be either numerical or nominal:

  - In classification, each time series belongs to one of two or more classes,
    and the goal is to learn a function that can label unlabeled time series.

  - In regression, each time series is labeled with a numerical attribute, and
    the task is to assign a numerical value to an unlabeled time series.

**************************
Loading an example dataset
**************************

To start exploring Wildboar and temporal machine learning, we first need some
data. Wildboar conveniently includes several standard datasets from the time
series community, which are available in the UCR time series repository.

In the following example, we load the ``synthetic_control`` and ``TwoLeadECG``
datasets.

.. code-block:: python

   from wildboar.datasets import load_synthetic_control, load_two_lead_ecg

   x, y = load_synthetic_control()
   x_train, x_test, y_train, y_test = load_two_lead_ecg(merge_train_test=False)

Preserving the original training and testing splits from the UCR repository can
be achieved by disabling the ``merge_train_test`` option.

A more robust and reliable method for splitting the datasets into training
and testing partitions is to use the model selection functions provided by
scikit-learn.

.. code-block:: python

   from sklearn.model_selection import train_test_split

   x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

The datasets are NumPy ``ndarray``\ s with :python:`x.ndim==2` and
:python:`y.ndim==1`. We can get the number of samples and time points.

.. code-block:: python

   n_samples, n_timestep = x.shape

.. note::

   Wildboar also supports multivariate time series using *3d*-arrays, i.e., we can
   get the shape of the dataset by :python:`n_samples, n_dims, n_timestep =
   x.shape`. Since operations are often performed over the temporal dimension, we
   opt for having that as the last dimension of the array. Since we prefer C-order
   arrays, this means that each time series is contiguous in memory. A robust
   approach for getting the number of samples and number of time steps
   irrespective of univariate (*2d*) or multivariate (*3d*) time series is:

   .. code-block:: python

      n_samples, n_timestep = x.shape[0], x.shape[-1]

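
The shape convention described in the note can be wrapped in a small helper.
The following is a minimal sketch that assumes only NumPy. The name
``series_shape`` is hypothetical and not part of Wildboar, and the arrays are
random stand-ins for real datasets.

.. code-block:: python

   import numpy as np


   def series_shape(x):
       """Return (n_samples, n_dims, n_timestep) for a 2d or 3d series array."""
       x = np.asarray(x)
       if x.ndim == 2:
           # Univariate: (n_samples, n_timestep), i.e., a single dimension.
           return x.shape[0], 1, x.shape[-1]
       if x.ndim == 3:
           # Multivariate: (n_samples, n_dims, n_timestep).
           return x.shape
       raise ValueError(f"expected a 2d or 3d array, got {x.ndim}d")


   rng = np.random.default_rng(0)
   univariate = rng.normal(size=(10, 60))  # 10 series, 60 time steps
   multivariate = rng.normal(size=(10, 3, 60))  # 10 series, 3 dims, 60 time steps

   print(series_shape(univariate))  # (10, 1, 60)
   print(series_shape(multivariate))  # (10, 3, 60)

   # With time as the last axis of a C-order array, one series is contiguous.
   print(multivariate[0, 0].flags["C_CONTIGUOUS"])  # True
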
In the example, we use ``load_two_lead_ecg`` and ``load_synthetic_control`` to
load the datasets. A more general approach is to use the
:func:`~datasets.load_dataset` function from the same module.

.. code-block:: python

   from wildboar.datasets import load_dataset

   x, y = load_dataset("synthetic_control")

:func:`~datasets.load_dataset` accepts multiple parameters for specifying where
to load data from and how to preprocess the data. By default, Wildboar loads
datasets from the ``wildboar/ucr`` repository, which includes datasets from the
UCR time series repository. The user can specify a different repository using
the ``repository`` argument. For example, we can load the regression task
``FloodModeling1`` from the UEA & UCR Time Series Extrinsic Regression
Repository, standardizing each time series to zero mean and unit variance,
using the following snippet:

.. code-block:: python

   x, y = load_dataset(
       "FloodModeling1", repository="wildboar/tsereg", preprocess="standardize"
   )

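
To make the effect of standardization concrete, the sketch below scales each
series to zero mean and unit variance with plain NumPy. It illustrates the
arithmetic only and is not Wildboar's implementation. The helper
``standardize_series`` is hypothetical, and mapping constant series to zeros is
an assumption made here to avoid division by zero.

.. code-block:: python

   import numpy as np


   def standardize_series(x):
       """Scale each series (last axis) to zero mean and unit variance."""
       x = np.asarray(x, dtype=float)
       mean = x.mean(axis=-1, keepdims=True)
       std = x.std(axis=-1, keepdims=True)
       # Assumption: a constant series has zero variance; divide by 1 instead.
       std = np.where(std == 0, 1.0, std)
       return (x - mean) / std


   x_demo = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]])
   z = standardize_series(x_demo)
   print(z.mean(axis=-1))  # close to zero for every series
   print(z[0].std())  # close to one for the non-constant series
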
***********************
Learning and predicting
***********************

Estimators in Wildboar implement the same interface as scikit-learn
estimators. We can ``fit`` an estimator to an input dataset and ``predict``
the label of a new sample.

An example of a temporal estimator is the
:class:`ensemble.ShapeletForestClassifier`, which implements the random
shapelet forest classifier.

.. code-block:: python

   from wildboar.ensemble import ShapeletForestClassifier

   clf = ShapeletForestClassifier()
   clf.fit(x_train, y_train)

We fit the classifier (:python:`clf`) using the training samples, and use
the same object to predict the label of a previously unseen sample.

.. code-block:: python

   clf.predict(x_test[-1:, :])  # outputs array([6.])

.. note::

   The predict function expects an ``ndarray`` of shape :python:`(n_samples,
   n_timestep)`, where ``n_timestep`` is the number of time steps in the
   training data.

Wildboar also simplifies experimentation over multiple datasets by allowing the
user to repeatedly load several datasets from a repository.

.. code-block:: python

   from sklearn import clone
   from wildboar.datasets import load_datasets

   for name, (x_train, x_test, y_train, y_test) in load_datasets(
       "wildboar/ucr",
       collection="bake-off",
       merge_train_test=False,
       filter="n_samples<=300",
   ):
       clf = clone(clf)  # unfitted copy with the same parameters
       clf.fit(x_train, y_train)
       print(f"{name}: {clf.score(x_test, y_test)}")

In the example, we load all datasets in the ``bake-off`` collection from the
``wildboar/ucr`` repository, keeping only datasets with at most 300 samples. For
each dataset we load, we clone (to reuse the same random seed) and fit the
estimator on the training split. Then we print the dataset name and the
predictive performance on the test split to the screen. You can :doc:`read more
about datasets in the API-documentation <api/wildboar/datasets/index>`.

****************************************
Transforming time series to tabular data
****************************************

Despite the numerous estimators specialized for temporal data, an even larger
collection of methods exists for tabular data (e.g., as implemented by
`scikit-learn <https://scikit-learn.org>`__). For this purpose, Wildboar
implements several `transformers` that can be used to transform temporal data
to tabular data. Wildboar transformers follow the same convention as
scikit-learn and implement a ``fit`` method that learns the representation and
a ``transform`` method that outputs a new tabular representation of each input
sample.

.. code-block:: python

   from wildboar.transform import RocketTransform

   rocket = RocketTransform()
   rocket.fit(x)
   tabular_x = rocket.transform(x)

One of Wildboar's main design goals is to seamlessly interoperate with
scikit-learn. As such, we can use Wildboar transformers to build
:doc:`scikit-learn pipelines
<sklearn:modules/generated/sklearn.pipeline.Pipeline>`.

.. code-block:: python

   from sklearn.pipeline import make_pipeline
   from sklearn.linear_model import LogisticRegression

   clf = make_pipeline(
       RocketTransform(),
       LogisticRegression(),
   )
   clf.fit(x, y)
   clf.score(x, y)

.. warning::

   In the above example, we train and evaluate the model on the same data. This
   is bad practice. Instead, we should use a proper hold-out set when
   estimating the pipeline's performance.

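
A hold-out evaluation needs disjoint sets of training and testing samples. The
sketch below builds such index sets with plain NumPy to show the idea; in
practice, ``train_test_split`` from scikit-learn, used earlier in this guide,
does the same job. The helper ``holdout_indices`` is hypothetical.

.. code-block:: python

   import numpy as np


   def holdout_indices(n_samples, test_size=0.2, seed=0):
       """Return disjoint (train, test) index arrays from a random shuffle."""
       rng = np.random.default_rng(seed)
       order = rng.permutation(n_samples)
       n_test = max(1, int(round(test_size * n_samples)))
       return order[n_test:], order[:n_test]


   train_idx, test_idx = holdout_indices(100, test_size=0.2)
   print(len(train_idx), len(test_idx))  # 80 20

The pipeline would then be fitted on ``x[train_idx], y[train_idx]`` and scored
on ``x[test_idx], y[test_idx]``.
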

***************************
Exploring model performance
***************************

Wildboar implements several methods for explaining classifiers, e.g., using
counterfactual reasoning or input dependencies.

.. code-block:: python

   from wildboar.explain import IntervalImportance

   i = IntervalImportance()
   i.fit(clf, x, y)
   i.plot(x, y=y)

The :class:`wildboar.explain.IntervalImportance` class identifies temporal
regions that are responsible for the classifier's performance. It does so by
breaking the dependency between continuous intervals and the label, and then
re-evaluating the predictive performance of the classifier on the sample-wise
shuffled intervals. In the example, we evaluate the in-sample importance, which
captures the reliance of the model on a particular interval.
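
The idea of shuffling an interval across samples can be sketched with NumPy
alone. This is one plausible reading of the description above and not
Wildboar's implementation. The function ``interval_permutation_importance``,
the toy data, and the fixed scoring rule ``toy_score`` are all hypothetical.

.. code-block:: python

   import numpy as np


   def interval_permutation_importance(score, x, y, n_intervals=4, seed=0):
       """Drop in score after shuffling each interval across the samples."""
       rng = np.random.default_rng(seed)
       baseline = score(x, y)
       bounds = np.linspace(0, x.shape[-1], n_intervals + 1).astype(int)
       importances = np.empty(n_intervals)
       for k, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
           x_perm = x.copy()
           # Shuffle the interval sample-wise, breaking its link to the labels.
           x_perm[:, start:end] = x[rng.permutation(x.shape[0]), start:end]
           importances[k] = baseline - score(x_perm, y)
       return importances


   # Toy data: the label depends only on the first 10 of 40 time steps.
   rng = np.random.default_rng(1)
   x_toy = rng.normal(size=(200, 40))
   y_toy = (x_toy[:, :10].mean(axis=1) > 0).astype(int)


   def toy_score(x, y):
       # Accuracy of a fixed rule that only looks at the first 10 time steps.
       return float(np.mean((x[:, :10].mean(axis=1) > 0).astype(int) == y))


   importances = interval_permutation_importance(toy_score, x_toy, y_toy)
   print(importances)  # only the first interval has a clearly positive value

Intervals that the scoring rule never reads keep a score drop of zero, while
shuffling the first interval reduces the accuracy toward chance level.
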

.. ldimage:: /_static/fig/getting-started/interval.svg
   :align: center

The :meth:`explain.IntervalImportance.plot` method can be used to visualize the
interval importance, or we can return the full importance matrix.

.. code-block:: python

   >>> i.importance_.mean()
   [..., 0.31, 0.30, 0.34, ...]

*****************
Model persistence
*****************

All Wildboar models can be persisted to disk using
`pickle <https://docs.python.org/3/library/pickle.html>`__.

.. code-block:: python

   import pickle

   data = pickle.dumps(clf)  # clf fitted earlier
   clf_ = pickle.loads(data)
   clf_.predict(x_test[-1:, :])  # outputs array([6.])

Models persisted using an older version of Wildboar are not guaranteed to
work when using a newer version (or vice versa).
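
Because compatibility across versions is not guaranteed, one option is to
store the library version next to the pickled model and check it on load. The
sketch below uses only the standard library. The helper names are
hypothetical, a plain dictionary stands in for a fitted model, and the version
string is supplied by hand.

.. code-block:: python

   import pickle


   def dump_with_version(model, version):
       """Pickle the model together with the version that produced it."""
       return pickle.dumps({"version": version, "model": model})


   def load_with_version(payload, expected_version):
       """Unpickle and refuse models written by a different version."""
       record = pickle.loads(payload)  # only unpickle data you trust
       if record["version"] != expected_version:
           raise RuntimeError(
               f"model was saved with version {record['version']}, "
               f"but version {expected_version} is expected"
           )
       return record["model"]


   model = {"weights": [0.1, 0.2]}  # stand-in for a fitted estimator
   payload = dump_with_version(model, version="1.2.0")
   restored = load_with_version(payload, expected_version="1.2.0")

In practice, the version string could be obtained with
``importlib.metadata.version("wildboar")``.
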

.. warning::

   The pickle module is not secure. Only unpickle data you trust. `Read more in
   the Python documentation <https://docs.python.org/3/library/pickle.html>`__.

.. toctree::
   :maxdepth: 2
4 changes: 2 additions & 2 deletions docs/index.rst
@@ -8,10 +8,10 @@ This is the API reference manual for `Wildboar <https://wildboar.dev>`__.
:maxdepth: 3
:hidden:

quickstart
Install <install>
guide
API <api/wildboar/index>
examples
API Reference <api/wildboar/index>
more/whatsnew

.. include:: api/index.rst
77 changes: 77 additions & 0 deletions docs/install.rst
@@ -0,0 +1,77 @@
################
Install wildboar
################

There are a few options to install wildboar:

- Install the latest official distribution from `PyPI
  <https://pypi.org/project/wildboar>`_. This is the recommended approach for
  most users.

- Build and compile the package from source. This provides the fastest binaries
  targeted for the specific platform.

Binary distributions are automatically built for macOS, GNU/Linux and Windows.
The binaries can be installed through `PyPI <https://pypi.org/project/wildboar>`_.

.. code-block:: shell

   pip install wildboar

If you are on a system where users don't have write access to the location of
Python packages, the distribution can be installed in the user directory.

.. code-block:: shell

   pip install --user wildboar

You can also install a specific version by replacing `wildboar` with,
e.g., `wildboar==1.2.0`, where `1.2.0` is the version to install.
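
After installing, the version that was actually picked up can be checked from
Python. The following is a small sketch using only the standard library; the
helper ``installed_version`` is hypothetical and returns ``None`` when the
package is not installed in the current environment.

.. code-block:: python

   from importlib.metadata import PackageNotFoundError, version


   def installed_version(package):
       """Return the installed version of a package, or None if missing."""
       try:
           return version(package)
       except PackageNotFoundError:
           return None


   print(installed_version("wildboar"))
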

To avoid conflicts with already installed packages, it is strongly recommended
to install the package in a
`virtual environment <https://docs.python.org/3/tutorial/venv.html>`_. You can set
up a virtual environment using `venv`.

.. code-block:: shell

   python3 -m venv .venv  # create a virtual environment in the folder .venv
   source .venv/bin/activate
   pip install wildboar

.. note::

   Depending on your operating system, there are some possible caveats. While it's
   outside the scope of this documentation to enumerate all of these, we have
   collected a few common issues here.

.. tab:: Debian

   For Debian-based distributions, `python3-venv` must be installed for virtual
   environments to work.

   .. code-block:: shell

      apt install python3-venv

.. tab:: macOS

   For users of macOS, it is recommended to install Python using
   `Homebrew <https://brew.sh/>`_.

   .. code-block:: shell

      brew install python

.. tab:: Windows

   For users of Windows, it is recommended to use
   `Anaconda <https://docs.conda.io/en/latest>`_ or
   `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`_. ``wildboar`` is
   still installed using ``pip``.

.. toctree::
   :maxdepth: 3
   :hidden:

   install/build