[0.0.1] First commit
martinferianc committed Feb 12, 2024
1 parent e5a529b commit 5c05032
Showing 152 changed files with 33,782 additions and 2 deletions.
19 changes: 19 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,19 @@
version: 2

build:
  os: ubuntu-20.04
  tools:
    python: "3.9"

formats:
  - htmlzip

sphinx:
  builder: html
  configuration: docs/conf.py
  fail_on_warning: true

python:
  install:
    - requirements: docs/requirements.txt
    - requirements: requirements.txt
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
  "esbonio.sphinx.confDir": ""
}
2 changes: 0 additions & 2 deletions README.md

This file was deleted.

119 changes: 119 additions & 0 deletions README.rst
@@ -0,0 +1,119 @@
YAMLE (Yet Another Machine Learning Environment)
================================================

.. image:: https://img.shields.io/badge/License-MIT-blue.svg
    :target: https://opensource.org/licenses/MIT
    :alt: License: MIT
.. image:: https://img.shields.io/badge/Python-3.9-blue.svg
    :target: https://www.python.org/downloads/release/python-390/
    :alt: Python: 3.9
.. image:: https://img.shields.io/badge/PyTorch-2.0.0-blue.svg
    :target: https://pytorch.org/
    :alt: PyTorch: 2.0.0
.. image:: https://img.shields.io/badge/PyTorch%20Lightning-2.0.8-blue.svg
    :target: https://www.pytorchlightning.ai/
    :alt: PyTorch Lightning: 2.0.8

Overview
--------

YAMLE: Yet Another Machine Learning Environment is an open-source framework that facilitates rapid prototyping and experimentation with machine learning models and methods. The key motivation is to reduce repetitive work when implementing new approaches and improve reproducibility in ML research. YAMLE includes a command-line interface and integrations with popular and well-maintained PyTorch-based libraries to streamline training, hyperparameter optimisation, and logging. The ambition for YAMLE is to grow into a shared ecosystem where researchers and practitioners can quickly build on and compare existing implementations.

You can find the YAMLE repository on GitHub: `YAMLE Repository <https://github.com/martinferianc/yamle>`_

Table of Contents
-----------------

- `Introduction`_
- `Core Components and Modules`_
- `Use Cases and Applications`_
- `Future Development Roadmap`_
- `Citation`_

Introduction
------------

This repository introduces YAMLE -- an open-source, generalist, customisable experiment environment with boilerplate code already implemented for rapid prototyping with ML models and methods. The main features of the environment are summarised as follows:

- **Modular Design**: The environment is divided into three main components - data, models, and methods - which are infrastructurally connected but can be independently modified and extended. The goal is to write a model or a method once and then seamlessly reuse it: a method across different models, a model across different methods, and both across different datasets and tasks.

- **Command-line Interface**: The environment includes a command-line interface for easy configuration of all hyperparameters and training of models.

- **Hyperparameter Optimisation**: YAMLE is integrated with syne-tune for hyperparameter optimisation.

- **Logging**: YAMLE is integrated with TensorBoard for logging and visualisation of training, validation, and test metrics.

- **End-to-End Experiments**: YAMLE enables end-to-end experiments, covering data preprocessing, model training, and evaluation. All settings are recorded for reproducibility.

Core Components and Modules
---------------------------

YAMLE is built on PyTorch and PyTorch Lightning and relies on torchmetrics for evaluation metrics and syne-tune for hyperparameter optimisation. The framework is designed to provide an ecosystem for rapid prototyping and experimentation. The core components and modules of YAMLE include:

- :py:mod:`BaseDataModule <yamle.data.datamodule>`: Responsible for downloading, loading, and preprocessing data. It defines the task, data splitting, input/output dimensions, and more.

- :py:mod:`BaseModel <yamle.models.model>`: Defines the architecture of the model and its forward pass. It can be configured for different widths, depths, and activation functions.

- :py:mod:`BaseMethod <yamle.methods.method>`: Defines the training, validation, and test steps, as well as the loss function, optimiser, and regularisation. It can also incorporate pruning and quantisation during evaluation.

These components are orchestrated by the :py:mod:`BaseTrainer <yamle.trainers.trainer>` class, which is responsible for executing training and evaluation loops and running on a specific device platform. YAMLE facilitates end-to-end experiments, from data preprocessing to model training and evaluation, by allowing users to customise these components through command-line arguments.
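
For a flavour of the workflow, a new model can be sketched by subclassing the base class. The snippet below is only an illustration: it uses :code:`nn.Module` so it runs standalone, and the constructor arguments are assumptions rather than the exact YAMLE API.

.. code-block:: python

    import math

    import torch.nn as nn

    # In YAMLE the base class would be yamle.models.model.BaseModel; nn.Module
    # is used here so that the sketch runs on its own.


    class TwoLayerMLP(nn.Module):
        """Toy sketch of a model that could be adapted into a BaseModel subclass."""

        def __init__(self, inputs_dim=(1, 28, 28), outputs_dim=10, hidden_dim=32):
            super().__init__()
            self._net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(math.prod(inputs_dim), hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, outputs_dim),
            )

        def forward(self, x):
            return self._net(x)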

Use Cases and Applications
---------------------------

YAMLE is designed to serve as a template for your own projects, allowing researchers and practitioners to conduct experiments, compare their models and methods, and easily extend the framework. The typical workflow for using YAMLE includes:

1. Clone the YAMLE repository and install dependencies.
2. Experiment with new methods or models by subclassing the :py:mod:`BaseModel <yamle.models.model>` or :py:mod:`BaseMethod <yamle.methods.method>` on the chosen :py:mod:`BaseDataModule <yamle.data.datamodule>` or any other customisable component.
3. When satisfied with your additions, contribute them to the repository via a pull request.
4. New additions will be reviewed and categorised as staple or experimental features, and YAMLE will be updated accordingly.

YAMLE currently supports three primary use cases:

- **Training**: Initiate model training using the command-line interface, specifying hyperparameters, datasets, and other settings.

e.g. ``python3 yamle/cli/train.py --method base --trainer_devices "[0]" --datamodule mnist --datamodule_batch_size 256 --method_optimizer adam --method_learning_rate 3e-4 --regularizer l2 --method_regularizer_weight 1e-5 --loss crossentropy --save_path ./experiments --trainer_epochs 3 --model_hidden_dim 32 --model_depth 3 --datamodule_validation_portion 0.1 --datamodule_pad_to_32 1``

- **Testing**: Conduct testing to evaluate the performance of your models or methods.

e.g. ``python3 yamle/cli/test.py --method base --trainer_devices "[0]" --datamodule mnist --datamodule_batch_size 256 --loss crossentropy --save_path ./experiments --model_hidden_dim 32 --model_depth 3 --datamodule_validation_portion 0.1 --datamodule_pad_to_32 1 --load_path ./experiments/<FOLDER>``

- **Hyperparameter Optimisation**: Optimise hyperparameters using syne-tune, a framework integrated into YAMLE for this purpose.

e.g. ``python3 yamle/cli/tune.py --config_file <FILE_NAME> --optimiser "Grid Search" --save_path ./experiments/hpo/ --max_wallclock_time 420 --optimisation_metric "validation_nll"``

YAMLE thus allows users to quickly set up experiments and perform training, testing, and hyperparameter optimisation, covering the entire machine learning pipeline from data preprocessing to model evaluation.


Future Development Roadmap
---------------------------

YAMLE is an evolving project, and there are several areas for future development and improvement:

- **Documentation**: Prioritising the creation of comprehensive documentation to make YAMLE more accessible to users.

- **Additional Tasks**: Expanding the range of problems supported by YAMLE, including unsupervised, self-supervised, and reinforcement learning tasks.

- **Expanding the Model Zoo**: Increasing the collection of models and methods for easy comparison with existing implementations.

- **Testing**: Implementing unit tests to ensure the reliability of the framework.

- **Multi-device Runs**: Extending support for multi-device training and testing.

- **Other Hyperparameter Optimisation Methods**: Including support for additional hyperparameter optimisation methods like Optuna and Ray Tune.

These improvements and extensions will enhance YAMLE's capabilities and make it an even more valuable tool for machine learning researchers and practitioners.

Citation
--------

If you use YAMLE in your research, please cite the following paper:

.. code-block:: bibtex

    @article{ferianc2024yamle,
      title={YAMLE: Yet Another Machine Learning Environment},
      author={Ferianc, Martin and Rodrigues, Miguel},
      journal={arXiv preprint arXiv:2402.06268},
      year={2024}
    }
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
118 changes: 118 additions & 0 deletions docs/conf.py
@@ -0,0 +1,118 @@
import datetime
import os
import shutil
import sys

sys.path.insert(0, os.path.abspath("../yamle/"))

import yamle


def run_apidoc(app):
    """Generate doc stubs using sphinx-apidoc."""
    module_dir = os.path.join(app.srcdir, "../yamle/")
    output_dir = os.path.join(app.srcdir, "_apidoc")
    excludes = []

    # Ensure that any stale apidoc files are cleaned up first.
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)

    cmd = [
        "--separate",
        "--module-first",
        "--doc-project=API Reference",
        "-o",
        output_dir,
        module_dir,
    ]
    cmd.extend(excludes)

    try:
        from sphinx.ext import apidoc  # Sphinx >= 1.7

        apidoc.main(cmd)
    except ImportError:
        from sphinx import apidoc  # Sphinx < 1.7

        cmd.insert(0, apidoc.__file__)
        apidoc.main(cmd)


def setup(app):
    """Register our sphinx-apidoc hook."""
    app.connect("builder-inited", run_apidoc)


# Sphinx configuration below.
project = "YAMLE"
version = yamle.__version__
release = yamle.__version__
author = "Martin Ferianc"
copyright = f"2023-{datetime.datetime.now().year}, Martin"


extensions = [
    "sphinx.ext.autosectionlabel",
    "sphinx.ext.napoleon",
    "sphinx.ext.autodoc",
    "sphinx_autodoc_typehints",
    "sphinx.ext.doctest",
    "sphinx.ext.intersphinx",
    "sphinx.ext.todo",
    "sphinx.ext.viewcode",
    "sphinx.ext.coverage",
    "hoverxref.extension",
    "sphinx_copybutton",
    "sphinxext.opengraph",
    "sphinx_paramlinks",
]
coverage_show_missing_items = True

autosectionlabel_prefix_document = True

hoverxref_auto_ref = True
hoverxref_role_types = {"ref": "tooltip"}

source_suffix = [".rst", ".md"]

master_doc = "index"

autoclass_content = "class"
autodoc_member_order = "bysource"
default_role = "py:obj"

html_theme = "pydata_sphinx_theme"
html_sidebars = {"**": ["sidebar-nav-bs"]}
html_theme_options = {
    "primary_sidebar_end": [],
    "footer_start": ["copyright"],
    "footer_end": [],
    "icon_links": [
        {
            "name": "GitHub",
            "url": "https://github.com/martinferianc/YAMLE",
            "icon": "fa-brands fa-square-github",
            "type": "fontawesome",
        }
    ],
    "use_edit_page_button": True,
    "collapse_navigation": True,
}
html_context = {
    "github_user": "martinferianc",
    "github_repo": "YAMLE",
    "github_version": "main",
    "doc_path": "docs",
    "default_mode": "light",
}

htmlhelp_basename = "{}doc".format(project)

napoleon_use_rtype = False

rst_prolog = """
.. role:: python(code)
:language: python
:class: highlight
"""
60 changes: 60 additions & 0 deletions docs/extending_yamle/datamodule.rst
@@ -0,0 +1,60 @@
.. _extending_datamodule:

************************
Extending DataModule
************************

In this tutorial we will demonstrate how to extend the :py:mod:`BaseDataModule <yamle.data.datamodule.BaseDataModule>` class to create a custom DataModule.

We will look at how to add the MNIST dataset to YAMLE through a custom DataModule. MNIST is a dataset of handwritten digits and a popular benchmark for testing image classification models. The dataset is available through the `torchvision <https://pytorch.org/vision/stable/datasets.html#mnist>`_ package.

To start implementing any datamodule, we recommend looking at the :py:mod:`BaseDataModule <yamle.data.datamodule.BaseDataModule>` class. It has many arguments which can be used to customise the datamodule.

.. literalinclude:: ../../yamle/data/datamodule.py
    :language: python
    :lines: 36-60

This class already contains a lot of useful functionality, e.g. automatic splitting of the dataset into training, validation, and calibration portions through the :py:meth:`setup <yamle.data.datamodule.BaseDataModule.setup>` method.

.. literalinclude:: ../../yamle/data/datamodule.py
    :language: python
    :pyobject: BaseDataModule.setup

Note that the :py:meth:`setup <yamle.data.datamodule.BaseDataModule.setup>` method wraps the datasets into a :py:class:`SurrogateDataset <yamle.data.dataset_wrappers.SurrogateDataset>`, which is a wrapper around the `torch.utils.data.Dataset <https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset>`_ class. This wrapper allows manual control over the data and target transformations.
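
Conceptually, such a wrapper simply stores the underlying dataset together with mutable transform attributes. The following simplified sketch illustrates the idea; it is not the actual :py:class:`SurrogateDataset <yamle.data.dataset_wrappers.SurrogateDataset>` implementation:

.. code-block:: python

    import torch.utils.data


    class SurrogateDatasetSketch(torch.utils.data.Dataset):
        """Simplified stand-in illustrating the wrapper idea."""

        def __init__(self, dataset, transform=None, target_transform=None):
            self._dataset = dataset
            # Exposed as attributes so they can be swapped per split after creation.
            self.transform = transform
            self.target_transform = target_transform

        def __getitem__(self, index):
            x, y = self._dataset[index]
            if self.transform is not None:
                x = self.transform(x)
            if self.target_transform is not None:
                y = self.target_transform(y)
            return x, y

        def __len__(self):
            return len(self._dataset)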

The transformations are generally managed through the :py:meth:`get_transform <yamle.data.datamodule.BaseDataModule.get_transform>` method, which is called for each dataset split: training, validation, calibration and testing.

Then there is the :py:meth:`prepare_data <yamle.data.datamodule.BaseDataModule.prepare_data>` method, which is used to download the dataset. This method is called only once per machine, not once per GPU, so the dataset is not downloaded multiple times in multi-GPU runs. The :py:meth:`prepare_data <yamle.data.datamodule.BaseDataModule.prepare_data>` method is called before the :py:meth:`setup <yamle.data.datamodule.BaseDataModule.setup>` method.
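
As a rough sketch of this contract (the attribute and method names below are assumptions for illustration and may differ from the actual YAMLE source), a minimal MNIST-style datamodule could look like this:

.. code-block:: python

    import torchvision


    class SketchMNISTDataModule:  # in YAMLE this would subclass BaseDataModule
        def __init__(self, data_dir="./data"):
            self._data_dir = data_dir
            self._train_dataset = None
            self._test_dataset = None

        def prepare_data(self):
            # Called once per machine: download and cache both splits in _data_dir.
            torchvision.datasets.MNIST(self._data_dir, train=True, download=True)
            torchvision.datasets.MNIST(self._data_dir, train=False, download=True)

        def setup(self):
            # Called after prepare_data: load the cached data; the real base class
            # would wrap it in SurrogateDataset and split off validation/calibration.
            self._train_dataset = torchvision.datasets.MNIST(self._data_dir, train=True)
            self._test_dataset = torchvision.datasets.MNIST(self._data_dir, train=False)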

Now let's start with the implementation of the MNIST datamodule. In fact, many of the torchvision datasets can be processed in a similar way, hence we will create two classes: one for general torchvision classification datasets and one specifically for MNIST.

The torchvision classification datamodule is implemented in :py:mod:`TorchvisionClassificationDataModule <yamle.data.classification.TorchvisionClassificationDataModule>`.

.. literalinclude:: ../../yamle/data/classification.py
    :language: python
    :pyobject: TorchvisionClassificationDataModule

It inherits from a :py:mod:`VisionClassificationDataModule <yamle.data.datamodule.VisionClassificationDataModule>`, which implements useful methods for debugging and for plotting the predictions or the applied augmentations.

Any datamodule also allows the specification of custom arguments, e.g. the :code:`datamodule_pad_to_32` argument, through :py:meth:`add_specific_args <yamle.data.datamodule.BaseDataModule.add_specific_args>`.

.. literalinclude:: ../../yamle/data/classification.py
    :language: python
    :pyobject: TorchvisionClassificationDataModule.add_specific_args

Note the :code:`datamodule_` prefix which is used to avoid name clashes with other arguments and separate the datamodule arguments from any other arguments.
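
The real implementation is included above; as a purely illustrative sketch of the pattern (the exact signature in YAMLE may differ), such a method extends an :code:`argparse` parser with prefixed flags:

.. code-block:: python

    import argparse


    class PaddedDataModuleSketch:
        """Illustrative container for the argument-registration pattern."""

        @staticmethod
        def add_specific_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
            # The `datamodule_` prefix keeps these flags from clashing with
            # model or method arguments on the shared command line.
            parser.add_argument(
                "--datamodule_pad_to_32",
                type=int,
                default=0,
                help="Whether to zero-pad inputs to 32x32 pixels.",
            )
            return parser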

The module can accept custom arguments such as :code:`pad_to_32`, which pads the image to a size of 32x32 pixels. This is useful if a model requires a certain input size, or for applying out-of-distribution augmentations common in the field of out-of-distribution detection. Notice that, in practice, the user only needs to fill in the :py:meth:`prepare_data <yamle.data.datamodule.BaseDataModule.prepare_data>` method, which downloads the training and test datasets and places them at the :py:attr:`_data_dir <yamle.data.datamodule.BaseDataModule._data_dir>` location. The :py:meth:`setup <yamle.data.datamodule.BaseDataModule.setup>` method then wraps the datasets into a :py:class:`SurrogateDataset <yamle.data.dataset_wrappers.SurrogateDataset>` and splits the training dataset into training, validation and calibration portions.

Finally, we create a concrete MNIST datamodule :py:mod:`TorchvisionClassificationDataModuleMNIST <yamle.data.classification.TorchvisionClassificationDataModuleMNIST>` which inherits from the :py:mod:`TorchvisionClassificationDataModule <yamle.data.classification.TorchvisionClassificationDataModule>`:

.. literalinclude:: ../../yamle/data/classification.py
    :language: python
    :pyobject: TorchvisionClassificationDataModuleMNIST

Note that each concrete datamodule which implements a specific dataset needs to specify the :py:attr:`inputs_dim <yamle.data.datamodule.BaseDataModule.inputs_dim>`, :py:attr:`outputs_dim <yamle.data.datamodule.BaseDataModule.outputs_dim>`, :py:attr:`targets_dim <yamle.data.datamodule.BaseDataModule.targets_dim>` and optionally :py:attr:`mean <yamle.data.datamodule.BaseDataModule.mean>` and :py:attr:`std <yamle.data.datamodule.BaseDataModule.std>` attributes. These attributes are used to normalise the data and to determine the input and output dimensions of the model.
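
For MNIST, these declarations might look roughly as follows. The statistics are the widely used MNIST channel mean and standard deviation; treat the attribute placement as illustrative rather than copied from the actual source:

.. code-block:: python

    class MNISTAttributesSketch:  # in YAMLE: a TorchvisionClassificationDataModule subclass
        # 1x28x28 greyscale inputs, 10 digit classes, scalar class-index targets.
        inputs_dim = (1, 28, 28)
        outputs_dim = 10
        targets_dim = 1
        # Channel statistics used to normalise the inputs.
        mean = (0.1307,)
        std = (0.3081,)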

The last step is to register the new datamodule in the :py:mod:`__init__ <yamle.data.__init__>` module alongside all the other available datamodules.

.. literalinclude:: ../../yamle/data/__init__.py
    :language: python
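
The include above shows the real registration; conceptually it amounts to mapping a command-line name to the implementing class, roughly like this hypothetical sketch (the registry name is assumed for illustration):

.. code-block:: python

    from yamle.data.classification import TorchvisionClassificationDataModuleMNIST

    # Hypothetical registry shape: maps the ``--datamodule`` CLI value to a class.
    AVAILABLE_DATAMODULES = {
        "mnist": TorchvisionClassificationDataModuleMNIST,
    }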
