Update documentation.

wildboar-foundation · Feb 2, 2024 · 9ae4b45
1 parent 573341d

Showing 7 changed files with 407 additions and 434 deletions.
diff --git a/docs/guide.rst b/docs/guide.rst
@@ -1,9 +1,256 @@
+.. currentmodule:: wildboar
+
 ##########
-User guide
+User Guide
 ##########
 
-The user guide intends to introduce the estimators and methods Wildboar has to
-offer.
+A machine learning problem typically involves a collection of `n` data samples,
+with the objective of predicting properties of previously unseen data. Wildboar
+focuses on machine learning problems where the data samples are series, such as
+time series or other types of data that are ordered chronologically or
+logically.
+
+.. note::
+   For solving general machine learning problems with Python, consider using
+   `scikit-learn <https://scikit-learn.org>`__.
+
+Similar to general machine learning problems, temporal machine learning is
+concerned with problems that fall into different categories:
+
+- Supervised learning, in which the data series are labeled with additional
+  information. The additional information can be either numerical or nominal.
+
+  - In classification, each time series belongs to one of two or more classes,
+    and the goal is to learn a function that can label unlabeled time series.
+
+  - In regression, each time series is labeled with a numerical attribute, and
+    the task is to assign a numerical value to an unlabeled time series.
+
+**************************
+Loading an example dataset
+**************************
+
+To start exploring Wildboar and temporal machine learning, we first need a
+dataset. Wildboar includes several standard datasets from the time series
+community, which are available in the UCR Time Series repository.
+
+In the following example, we load the ``synthetic_control`` and ``TwoLeadECG``
+datasets.
+
+.. code-block:: python
+
+   from wildboar.datasets import load_synthetic_control, load_two_lead_ecg
+   x, y = load_synthetic_control()
+   x_train, x_test, y_train, y_test = load_two_lead_ecg(merge_train_test=False)
+
+
+To preserve the original training and testing splits from the UCR repository,
+disable the ``merge_train_test`` option.
+
+A more robust and reliable method for splitting the datasets into training
+and testing partitions is to use the model selection functions provided by
+scikit-learn.
+
+.. code-block:: python
+
+   from sklearn.model_selection import train_test_split
+   x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
+
+The datasets are NumPy ``ndarray``\ s with :python:`x.ndim == 2` and
+:python:`y.ndim == 1`. We can get the number of samples and time steps from the
+shape of ``x``.
+
+.. code-block:: python
+
+   n_samples, n_timestep = x.shape
+
+.. note::
+
+   Wildboar also supports multivariate time series using *3d*-arrays, i.e., we can
+   get the shape of the dataset by :python:`n_samples, n_dims, n_timestep =
+   x.shape`. Since operations are often performed over the temporal dimension, we
+   keep it as the last dimension of the array. Since we prefer C-order arrays,
+   this means that the values of each series are contiguous in memory. A robust
+   approach for getting the number of samples and time steps, irrespective of
+   univariate (*2d*) or multivariate (*3d*) time series, is:
+
+   .. code-block:: python
+
+      n_samples, n_timestep = x.shape[0], x.shape[-1]
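The shape convention described in the note can be checked with plain NumPy. The following is a minimal sketch; the helper name `series_shape` and the array sizes are made up for illustration and are not part of Wildboar.

```python
import numpy as np


def series_shape(x):
    """Return (n_samples, n_timestep) for a 2d or 3d array of series."""
    x = np.asarray(x)
    if x.ndim not in (2, 3):
        raise ValueError(f"expected a 2d or 3d array, got {x.ndim}d")
    # The first axis indexes samples and the last axis indexes time steps,
    # regardless of whether a dimension axis sits in between.
    return x.shape[0], x.shape[-1]


univariate = np.zeros((10, 60))       # (n_samples, n_timestep)
multivariate = np.zeros((10, 3, 60))  # (n_samples, n_dims, n_timestep)

print(series_shape(univariate))    # (10, 60)
print(series_shape(multivariate))  # (10, 60)

# With C-order arrays, a single series (last axis) is one contiguous block.
print(multivariate[0, 0].flags["C_CONTIGUOUS"])  # True
```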
+
+In the example, we use ``load_two_lead_ecg`` and ``load_synthetic_control`` to
+load the datasets. A more general approach is to use the
+:func:`~datasets.load_dataset` function from the same module.
+
+.. code-block:: python
+
+   from wildboar.datasets import load_dataset
+   x, y = load_dataset("synthetic_control")
+
+:func:`~datasets.load_dataset` accepts multiple parameters for specifying where
+to load data from and how to preprocess the data. By default, Wildboar loads
+datasets from the ``wildboar/ucr`` repository, which includes datasets from the
+UCR Time Series repository. The user can specify a different repository using
+the ``repository`` argument. For example, we can load the regression task
+``FloodModeling1`` from the UEA & UCR Time Series Extrinsic Regression
+Repository and standardize each time series to zero mean and unit variance
+using the following snippet:
+
+.. code-block:: python
+
+   x, y = load_dataset(
+      "FloodModeling1", repository="wildboar/tsereg", preprocess="standardize"
+   )
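To make "zero mean and unit variance" concrete, here is a minimal NumPy sketch of per-series standardization. It is an illustration only: the function `standardize` and its handling of constant series are assumptions, and Wildboar's own ``preprocess="standardize"`` may differ in such details.

```python
import numpy as np


def standardize(x, eps=1e-12):
    """Scale every series (last axis) to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    # Assumption: constant series have zero standard deviation, so we
    # divide by 1 instead to avoid a division by zero.
    std = np.where(std < eps, 1.0, std)
    return (x - mean) / std


x = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]])
z = standardize(x)
print(z.mean(axis=-1))  # approximately zero for every series
print(z[0].std())       # approximately one for the non-constant series
```

Because the statistics are computed along the last axis, the same function works for univariate (*2d*) and multivariate (*3d*) arrays.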
+
+***********************
+Learning and predicting
+***********************
+
+Estimators in Wildboar implement the same interface as scikit-learn
+estimators. We can ``fit`` an estimator to an input dataset and ``predict``
+the label of a new sample.
+
+An example of a temporal estimator is the
+:class:`ensemble.ShapeletForestClassifier` which implements the random shapelet
+forest classifier.
+
+.. code-block:: python
+
+   from wildboar.ensemble import ShapeletForestClassifier
+   clf = ShapeletForestClassifier()
+   clf.fit(x_train, y_train)
+
+We fit the classifier (:python:`clf`) using the training samples, and use
+the same object to predict the label of a previously unseen sample.
+
+.. code-block:: python
+
+   clf.predict(x_test[-1:, :])  # outputs array([6.])
+
+.. note::
+   The predict function expects an ``ndarray`` of shape :python:`(n_samples,
+   n_timestep)`, where ``n_timestep`` is the number of time steps in the
+   training data.
+
+Wildboar also simplifies experimentation over multiple datasets by allowing the
+user to repeatedly load several datasets from a repository.
+
+.. code-block:: python
+
+   from sklearn.base import clone
+   from wildboar.datasets import load_datasets
+
+   for name, (x_train, x_test, y_train, y_test) in load_datasets(
+      "wildboar/ucr",
+      collection="bake-off",
+      merge_train_test=False,
+      filter="n_samples<=300",
+   ):
+      clf = clone(clf)  # unfitted copy with the same parameters
+      clf.fit(x_train, y_train)
+      print(f"{name}: {clf.score(x_test, y_test)}")
+
+
+In the example, we load all datasets in the ``bake-off`` collection from the
+``wildboar/ucr`` repository, keeping only datasets with at most 300 samples.
+For each dataset, we clone the estimator (to reuse the same random seed) and
+fit it. Then we print the dataset name and the predictive performance. You can
+:doc:`read more about datasets in the API documentation
+<api/wildboar/datasets/index>`.
+
+****************************************
+Transforming time series to tabular data
+****************************************
+
+Despite the numerous estimators specialized for temporal data, an even larger
+collection of methods exists for tabular data (e.g., as implemented by
+`scikit-learn <https://scikit-learn.org>`__). For this purpose, Wildboar
+implements several *transformers* that can be used to transform temporal data
+to tabular data. Wildboar transformers follow the same convention as
+scikit-learn and implement a ``fit`` method that learns the representation and
+a ``transform`` method that outputs a new tabular representation of each input
+sample.
+
+.. code-block:: python
+
+   from wildboar.transform import RocketTransform
+   rocket = RocketTransform()
+   rocket.fit(x)
+   tabular_x = rocket.transform(x)
+
+One of Wildboar's main design goals is to interoperate seamlessly with
+scikit-learn. As such, we can use Wildboar transformers to build
+:doc:`scikit-learn pipelines
+<sklearn:modules/generated/sklearn.pipeline.Pipeline>`.
+
+.. code-block:: python
+
+   from sklearn.pipeline import make_pipeline
+   from sklearn.linear_model import LogisticRegression
+
+   clf = make_pipeline(
+      RocketTransform(),
+      LogisticRegression(),
+   )
+   clf.fit(x, y)
+   clf.score(x, y)
+
+.. warning::
+
+   In the above example, we train and evaluate the model on the same data. This
+   is bad practice. Instead, we should use a proper hold-out set when
+   estimating the pipeline's performance.
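The hold-out idea in the warning can be sketched without any estimator. The helper `holdout_split` below is hypothetical and uses only NumPy; in practice, scikit-learn's ``train_test_split`` shown earlier does the same job and offers more options.

```python
import numpy as np


def holdout_split(x, y, test_size=0.2, seed=0):
    """Shuffle the samples and reserve a fraction of them for evaluation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(x.shape[0])
    n_test = int(round(test_size * x.shape[0]))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy data: 100 series with 30 time steps each, every row unique.
x = np.arange(100 * 30, dtype=float).reshape(100, 30)
y = np.arange(100) % 2

x_train, x_test, y_train, y_test = holdout_split(x, y)
print(x_train.shape, x_test.shape)  # (80, 30) (20, 30)
```

A model would then be fitted on ``x_train, y_train`` and scored on ``x_test, y_test``, so the reported performance reflects samples the model has not seen.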
+
+***************************
+Exploring model performance
+***************************
+
+Wildboar implements several methods for explaining classifiers, e.g., using
+counterfactual reasoning or input dependencies.
+
+.. code-block:: python
+
+   from wildboar.explain import IntervalImportance
+   i = IntervalImportance()
+   i.fit(clf, x, y)
+   i.plot(x, y=y)
+
+The :class:`wildboar.explain.IntervalImportance` class identifies temporal
+regions that are responsible for the classifier's performance. It does so by
+breaking the dependency between contiguous intervals and the label: the values
+in each interval are shuffled across samples, and the predictive performance of
+the classifier is reevaluated. In the example, we evaluate the in-sample
+importance, which captures the reliance of the model on a particular interval.
+
+.. ldimage:: /_static/fig/getting-started/interval.svg
+   :align: center
+
+The :meth:`explain.IntervalImportance.plot` method can be used to visualize the
+interval importance, or we can return the full importance matrix.
+
+.. code-block:: python
+
+   >>> i.importance_.mean()
+   [..., 0.31, 0.30, 0.34, ...]
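The shuffling idea behind interval importance can be sketched in a few lines of NumPy. This is not Wildboar's implementation: the function `interval_importance`, its parameters, the toy data and the toy scoring function are all assumptions made for illustration.

```python
import numpy as np


def interval_importance(score, x, y, n_intervals=4, n_repeat=5, seed=0):
    """Mean drop in score after shuffling each interval across samples."""
    rng = np.random.default_rng(seed)
    baseline = score(x, y)
    bounds = np.linspace(0, x.shape[-1], n_intervals + 1, dtype=int)
    importance = np.zeros(n_intervals)
    for i, (start, end) in enumerate(zip(bounds[:-1], bounds[1:])):
        drops = []
        for _ in range(n_repeat):
            x_perm = x.copy()
            # Shuffle the interval sample-wise, which breaks its dependency
            # on the label while leaving the other intervals untouched.
            perm = rng.permutation(x.shape[0])
            x_perm[:, start:end] = x[perm, start:end]
            drops.append(baseline - score(x_perm, y))
        importance[i] = np.mean(drops)
    return importance


# Toy data: only time steps 10..19 carry information about the label.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
x = rng.normal(size=(200, 40))
x[:, 10:20] += 3.0 * y[:, None]


def score(x, y):
    """Accuracy of a toy classifier that thresholds the mean of steps 10..19."""
    pred = (x[:, 10:20].mean(axis=1) > 1.5).astype(int)
    return np.mean(pred == y)


importance = interval_importance(score, x, y)
print(importance.argmax())  # 1, the interval covering time steps 10..19
```

Intervals that the toy classifier never looks at get an importance of exactly zero, whereas shuffling the informative interval reduces the accuracy to roughly chance level.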
+
+*****************
+Model persistence
+*****************
+
+All Wildboar models can be persisted to disk using
+`pickle <https://docs.python.org/3/library/pickle.html>`__.
+
+.. code-block:: python
+
+   import pickle
+   data = pickle.dumps(clf)  # clf fitted earlier
+   clf_ = pickle.loads(data)
+   clf_.predict(x_test[-1:, :])  # outputs array([6.])
+
+Models persisted using an older version of Wildboar are not guaranteed to
+work when using a newer version (or vice versa).
+
+.. warning::
+   The pickle module is not secure. Only unpickle data you trust. `Read more in
+   the Python documentation <https://docs.python.org/3/library/pickle.html>`__.
 
 .. toctree::
    :maxdepth: 2

diff --git a/docs/index.rst b/docs/index.rst
@@ -8,10 +8,10 @@ This is the API reference manual for `Wildboar <https://wildboar.dev>`__.
   :maxdepth: 3
   :hidden:
 
-  quickstart
+  Install <install>
   guide
+  API <api/wildboar/index>
   examples
-  API Reference <api/wildboar/index>
   more/whatsnew
 
 .. include:: api/index.rst

diff --git a/docs/install.rst b/docs/install.rst
@@ -0,0 +1,77 @@
+################
+Install wildboar
+################
+
+There are a few options to install wildboar:
+
+- Install the latest official distribution from `PyPI
+  <https://pypi.org/project/wildboar>`_. This is the recommended approach for
+  most users.
+
+- Build and compile the package from source. This provides the fastest binaries
+  targeted for the specific platform.
+
+Binary distributions are automatically built for macOS, GNU/Linux and Windows.
+The binaries can be installed through `PyPI <https://pypi.org/project/wildboar>`_.
+
+.. code-block:: shell
+
+   pip install wildboar
+
+If you are on a system where users don't have write access to the location of
+Python packages, the distribution can be installed in the user directory.
+
+.. code-block:: shell
+
+   pip install --user wildboar
+
+You can also install a specific version by replacing ``wildboar`` with,
+e.g., ``wildboar==1.2.0``, where ``1.2.0`` is the version to install.
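After installing, one way to confirm which version ended up in the environment is to query the package metadata from Python's standard library. The helper `installed_version` is a hypothetical convenience wrapper, not something Wildboar provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if wildboar is not installed.
print(installed_version("wildboar"))
```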
+
+To avoid conflicts with already installed packages, we strongly recommend
+installing the package in a
+`virtual environment <https://docs.python.org/3/tutorial/venv.html>`_. You can set
+up a virtual environment using ``venv``.
+
+.. code-block:: shell
+
+   python3 -m venv .venv # create a virtual environment in the folder .venv
+   source .venv/bin/activate
+   pip install wildboar
+
+.. note::
+
+   Depending on your operating system, there are some possible caveats. While it
+   is outside the scope of this documentation to enumerate all of these, we have
+   collected a few common issues here.
+
+.. tab:: Debian
+
+   For Debian-based distributions, ``python3-venv`` must be installed for virtual
+   environments to work.
+
+   .. code-block:: shell
+
+      apt install python3-venv
+
+.. tab:: macOS
+
+   For users of macOS, it is recommended to install Python using
+   `Homebrew <https://brew.sh/>`_.
+
+   .. code-block:: shell
+
+      brew install python
+
+.. tab:: Windows
+
+   For users of Windows, it is recommended to use
+   `Anaconda <https://docs.conda.io/en/latest>`_ or
+   `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`_. ``wildboar`` is
+   still installed using ``pip``.
+
+.. toctree::
+  :maxdepth: 3
+  :hidden:
+
+  install/build