Merge pull request #25 from esm-tools/feat/pipelines
Runnable Pipelines
pgierz authored Aug 7, 2024
2 parents 3445e26 + d88c748 commit 6ef31a4
Showing 38 changed files with 1,267 additions and 432 deletions.
5 changes: 4 additions & 1 deletion doc/conf.py
@@ -75,13 +75,16 @@
"python": ("http://docs.python.org/", None),
"numpy": ("http://docs.scipy.org/doc/numpy/", None),
"scipy": ("http://docs.scipy.org/doc/scipy/reference/", None),
"matplotlib": ("http://matplotlib.sourceforge.net/", None),
"matplotlib": ("https://matplotlib.org/stable/", None),
"pandas": ("http://pandas.pydata.org/pandas-docs/stable/", None),
"xarray": ("http://xarray.pydata.org/en/stable/", None),
"chemicals": ("https://chemicals.readthedocs.io/", None),
}


# -- Custom directives --------------------------------------------------------
napoleon_custom_sections = [("Mutates", "params_style")]

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

277 changes: 247 additions & 30 deletions doc/developer_guide.rst
@@ -1,43 +1,260 @@
Developer Guide
===============

Thanks for helping develop ``pymorize``!. This document will guide you through
Thanks for helping develop ``pymorize``! This document will guide you through
the code structure and layout, and provide a few tips on how to contribute.

Code Layout
-----------
Getting Started
---------------
To get started, you should clone the repository and install the dependencies. We give
a few extra dependencies for testing and documentation, so you should install these as well::

We use a `src` layout, with all files living under `./src/pymorize`. The code is
    git clone https://github.com/esm-tools/pymorize.git
    cd pymorize
    pip install -e ".[dev,doc]"

This will install the package in "editable" mode, so you can make changes to the code. The
``dev`` and ``doc`` extras will install the dependencies needed for testing and documentation,
respectively. Before changing anything, make sure that the tests are passing::

    pytest

Next, you should familiarize yourself with the code layout and the main building blocks. These
are described in the next section.


Code Layout and Main Classes
----------------------------

We use a ``src`` layout, with all files living under ``./src/pymorize``. The code is
generally divided into several building blocks:

There are a few main modules and classes you should be aware of:
* :py:class:`~pymorize.rule.Rule` is the main class that defines a rule for processing a CMOR variable.

* :py:class:`~pymorize.pipeline.Pipeline` (or an object inherited from :py:class:`~pymorize.pipeline.Pipeline`) is a collection
of actions that can be applied to a set of files described by a :py:class:`~pymorize.rule.Rule`. A few default pipelines are
provided, and you can also define your own.

* :py:class:`~pymorize.cmorizer.CMORizer` is responsible for reading in the rules, and managing the various
objects.

* ``Rule`` is the main class that defines a rule for processing a CMOR variable. It is
defined in ``rule.py``. ``Rule`` objects have several attributes which are useful:
:py:class:`~pymorize.rule.Rule` Class
-------------------------------------

This is the main building block for handling a set of files produced by a model. It has the following attributes:

1. ``input_patterns``: A list of regular expressions that are used to match the
input file name. Note that this is **regex**, not globbing!
2. ``cmor_variable``: The ``CMOR`` name of the variable.
3. ``actions``: A list of actions to be performed on the input file. An ``action`` is
a partially-applied function, where the only remaining argument is the input file.

**Design Note**: The reason for this design is that it allows us to define a
list of actions that can be applied to a file, and then apply them all at once
when we find a match via a loop. It *has not* yet been determined what we will
return here. It is probably easiest to return a list of `xarray` objects, corresponding
to each action. This allows us to grab out the data we need at any step in the chain.

* ``Pipeline``: (**NOT YET IMPLEMENTED**) A ``Pipeline`` is a collection of rules that can
be applied to a file. It is (will be) defined in ``pipeline.py``.

* ``CMORizer``: This is the main class, if anything can be called that. It is defined in
``cmorizer.py``. It is responsible for reading in the rules, and managing the various
objects. **NOTE**: A ``CMORIZER`` object should **not** have any knowledge about files,
just about rules and actions. This is because we want to be able to test the rules
without having to worry about files (separation of concerns). This design is of course
subject to change. The ``CMORIZER`` object should have a method ``apply_rules`` that
takes a list of files, and applies the rules to them. (Alternatively, it should be a callable,
Paul hasn't decided yet)

``rule.py`` contains the main ``Rule`` class. It should be used to define a matching
between an output file and a CMOR name. d rules. This list is used to match the
3. ``pipelines``: A list of pipeline names that should be applied to the data.
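
Because ``input_patterns`` are regular expressions rather than shell globs, a pattern such as ``tas_*.nc`` does not mean what it would on the command line. The sketch below uses only the standard library to show the difference. The file names are invented for illustration, and using ``re.fullmatch`` is an assumption: this guide does not say which matching function ``pymorize`` applies internally.

```python
import fnmatch
import re

# Invented file names, for illustration only.
filenames = ["tas_2000.nc", "tas_2001.nc", "pr_2000.nc"]

glob_style = "tas_*.nc"       # what you would type in a shell
regex_style = r"tas_.*\.nc"   # the equivalent regular expression

# Shell-style matching: "*" stands for any run of characters.
glob_hits = [f for f in filenames if fnmatch.fnmatch(f, glob_style)]

# Regex matching: ".*" stands for any run of characters, "\." is a literal dot.
regex_hits = [f for f in filenames if re.fullmatch(regex_style, f)]

# The glob pattern read as a regex: "_*" means "zero or more underscores"
# and "." means "any single character", so none of the files match.
glob_as_regex_hits = [f for f in filenames if re.fullmatch(glob_style, f)]

print(glob_hits)
print(regex_hits)
print(glob_as_regex_hits)
```

If a rule unexpectedly matches no files, checking whether a glob was written where a regular expression is expected is a reasonable first step.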

Any other attributes can be added to the rule, and will appear in the ``rule_spec`` as attributes of the ``Rule`` object. In YAML, a minimal rule
looks like this:

.. code-block:: yaml

    input_patterns: [".*"]
    cmor_variable: tas
    pipelines: [My Pipeline]

:py:class:`~pymorize.pipeline.Pipeline` Class
---------------------------------------------

The :py:class:`~pymorize.pipeline.Pipeline` class is a collection of actions that can be applied to a set of files. It should have a
``name`` attribute that describes the pipeline. If not given during construction, a random one is generated. The actions are stored in a list, and
are applied in order. There are a few ways to construct a pipeline. You can create one directly from a list of actions (also called steps)::

    >>> pipeline = pymorize.pipeline.Pipeline([action1, action2], name="My Pipeline")
    >>> # Or use the class method:
    >>> pl = Pipeline.from_list([action1, action2], name="My Pipeline")

where ``action1`` and ``action2`` are functions that follow the pipeline step protocol. See :ref:`the guide on building actions <building-actions-for-pipelines>`
for more information.

Another way to build a pipeline is from a list of qualified function names. A class method is provided to do this easily::

    >>> my_pipeline = Pipeline.from_qualnames(["my_module.my_action1", "my_module.my_action2"], name="My Pipeline")
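
To make the idea of a "qualified name" concrete, here is a minimal sketch of how a dotted name can be turned into a callable with the standard library. The helper ``resolve_qualname`` is hypothetical and is not claimed to be how ``Pipeline.from_qualnames`` is implemented; ``math.sqrt`` and ``math.floor`` merely stand in for real actions.

```python
import importlib
from typing import Callable


def resolve_qualname(qualname: str) -> Callable:
    """Import ``package.module.function`` and return the function object."""
    # Split "my_module.my_action" into the module path and the attribute name.
    module_name, _, attr_name = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"{qualname!r} is not a dotted qualified name")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Standard-library functions stand in for real pipeline actions here.
steps = [resolve_qualname(name) for name in ["math.sqrt", "math.floor"]]
print([step.__name__ for step in steps])
```

A practical consequence of this style of lookup is that every module named in the list must be importable from the environment in which the pipeline is built.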



:py:class:`~pymorize.cmorizer.CMORizer` Class
---------------------------------------------

The :py:class:`~pymorize.cmorizer.CMORizer` class is responsible for managing the rules and pipelines. It holds four pieces of configuration:

1. ``pymorize_cfg``: This is the configuration for the ``pymorize`` package. It should contain a version number, and any other configuration
that is needed for the package to run. This is used to check that the configuration is correct for the specific version of ``pymorize``. You
can also specify certain features to be enabled or disabled here, as well as configure the logging.

2. ``global_cfg``: This is the global configuration for the rules and pipelines. This is used for configuration that is common to all rules and pipelines,
such as the path to the CMOR tables, or the path to the output directory. This is used to set up the environment for the rules and pipelines.

3. ``pipelines``: This is a list of :py:class:`~pymorize.pipeline.Pipeline` objects that are used to process the data. These are the pipelines that are
applied to the data, and are referenced by the rules. Each pipeline should have a unique name, and a series of steps to perform. You can also specify
"frozen" arguments and keyword arguments to apply to steps in the pipeline's configuration.

4. ``rules``: This is a list of :py:class:`~pymorize.rule.Rule` objects that are used to match the data. Each rule should have a unique name, and a series of
input patterns, a CMOR variable name, and a list of pipelines to apply to the data. You can also specify additional attributes that are used in the actions
in the pipelines.
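
One way to picture "frozen" arguments is partial application: a step that takes extra keyword arguments is bound to fixed values so that it again has the plain ``(data, rule_spec, cmorizer)`` call signature. The sketch below uses ``functools.partial`` for this. The ``scale`` step and its ``factor`` argument are hypothetical, and this guide does not state that ``pymorize`` freezes arguments this way internally.

```python
from functools import partial


def scale(data, rule_spec, cmorizer, factor=1.0):
    """Multiply the data by ``factor`` (an illustrative step, not part of pymorize)."""
    return data * factor


# Freeze the keyword argument. The resulting callable only needs the
# three standard step arguments.
scale_by_ten = partial(scale, factor=10.0)

# ``None`` stands in for the rule and the CMORizer, which this step ignores.
result = scale_by_ten(2.0, None, None)
print(result)
```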

.. _building-actions-for-pipelines:

Building Actions for Pipelines
------------------------------

When defining actions for a :py:class:`~pymorize.pipeline.Pipeline`, you should create functions
with the following signature::

    def my_action(data: Any,
                  rule_spec: pymorize.rule.Rule,
                  cmorizer: pymorize.cmorizer.CMORizer,
                  *args, **kwargs) -> Any:
        ...
        return data

The ``data`` argument is the data that is passed from one action to the next. The ``rule_spec`` is the
instance of the :py:class:`~pymorize.rule.Rule` class that is currently being evaluated. The ``cmorizer``
is the instance of the :py:class:`~pymorize.cmorizer.CMORizer` class that is managing the pipeline. You
can pass additional arguments to the action by using ``*args`` and ``**kwargs``, however most arguments or
keyword arguments should be extracted from the ``rule_spec``. The action should return the data that will be
passed to the next action in the pipeline. Note that the data can be any type, but it should be the same type
as what is expected in the next action in the pipeline.
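
The hand-off of ``data`` from one action to the next can be sketched as a simple loop. This is a simplified stand-in written for this guide, not the actual ``Pipeline.run`` implementation; ``run_steps`` and ``double`` are hypothetical names, and ``None`` replaces the real rule and CMORizer objects.

```python
from typing import Any, Callable, Sequence


def run_steps(steps: Sequence[Callable[..., Any]], data: Any, rule_spec: Any, cmorizer: Any) -> Any:
    """Apply each step in order, feeding every result into the next step."""
    for step in steps:
        data = step(data, rule_spec, cmorizer)
    return data


def add_one(data, rule_spec, cmorizer):
    """Add one to the data."""
    return data + 1


def double(data, rule_spec, cmorizer):
    """Multiply the data by two."""
    return data * 2


# Order matters: (1 + 1) * 2 differs from (1 * 2) + 1.
result = run_steps([add_one, double], 1, rule_spec=None, cmorizer=None)
reversed_result = run_steps([double, add_one], 1, rule_spec=None, cmorizer=None)
print(result, reversed_result)
```

Because each step only sees the return value of the previous one, a step that returns a different type than the next step expects will fail at that hand-off.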

.. note::

    If needed, you can construct "conversion" actions that will convert the data from one type to another and pass
    it to the next step.

When defining actions, you should also add a docstring that describes what the action does. This will be printed
when the user asks for help on the action. Note that whenever possible, you should use the ``rule_spec`` to pass
information into your action, rather than hardcoding it or passing in arguments. You can also use additional arguments
if needed, and these can be fixed to always use the same values for the entire pipeline the action belongs to, or,
alternatively, to the rule that the action is a part of. A few illustrative examples may make this clearer.

* Example 1: A simple action that adds 1 to the data::

      def add_one(data: Any, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> Any:
          """Add one to the data."""
          return data + 1

Using this in a pipeline would look like this in Python code::

    pipeline = pymorize.pipeline.Pipeline([add_one], name="Add One")
    rule_spec = pymorize.rule.Rule(input_patterns=[".*"], cmor_variable="tas", pipelines=["Add One"])
    cmorizer = pymorize.cmorizer.CMORizer(pymorize_cfg={"version": "unreleased"}, global_cfg={}, rules=[rule_spec], pipelines=[pipeline])
    initial_data = 1
    data = pipeline.run(initial_data, rule_spec, cmorizer)

In YAML, the same pipeline and configuration looks like this:

.. code-block:: yaml

    pymorize:
      version: unreleased

    general:

    pipelines:
      - name: Add One
        actions:
          - add_one
    rules:
      - input_patterns: [".*"]
        cmor_variable: tas
        pipelines: [Add One]

* Example 2: An action that sets an attribute on a :py:class:`~xarray.Dataset`, where this is specified in
  the rule specification::

      def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> xr.Dataset:
          """Set an attribute on the dataset."""
          data.attrs[rule_spec.attribute_name] = rule_spec.attribute_value
          return data

Using this in a pipeline would look like this in YAML:

.. code-block:: yaml

    pymorize:
      version: unreleased
    general:
    pipelines:
      - name: Set Attribute
        actions:
          - set_attribute
    rules:
      - input_patterns: [".*"]
        cmor_variable: tas
        pipelines: [Set Attribute]
        attribute_name: "my_attribute"
        attribute_value: "my_value"

* Example 3: An action that sets an attribute on a :py:class:`~xarray.Dataset`, where this is specified in the :py:class:`~pymorize.pipeline.Pipeline`.

It is the responsibility of the action developer to ensure arguments are passed correctly and have sensible values. This is a more complicated example. Here we check
if the rule has a specific attribute that matches the action's name, with "``_args``" appended. We use those values if that is the case. Otherwise, they can be obtained from
the pipeline, and default to empty strings. As an action developer, you need to ensure sensible logic here!

.. code-block:: python

    def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer, attribute_name: str = "", attribute_value: str = "", *args, **kwargs) -> xr.Dataset:
        """Set an attribute on the dataset."""
        # Rule-level overrides live in an attribute named after this action, plus "_args".
        # The name is spelled out because ``__name__`` is the module name, not the function name.
        rule_args = getattr(rule_spec, "set_attribute_args", None)
        if rule_args is not None:
            # Fall back to the values passed in by the pipeline (or the empty-string defaults).
            attribute_name = rule_args.get("attribute_name", attribute_name)
            attribute_value = rule_args.get("attribute_value", attribute_value)
        data.attrs[attribute_name] = attribute_value
        return data

Using this in a pipeline would look like this in YAML:

.. code-block:: yaml

    pymorize:
      version: unreleased
    general:
    pipelines:
      - name: Set Attribute
        actions:
          - set_attribute
        attribute_name: "my_attribute"
        attribute_value: "my_value"
    rules:
      - input_patterns: [".*"]
        cmor_variable: tas
        pipelines: [Set Attribute]

.. important::

    In the case of passing arguments that are *not* in the rule spec, you need to be careful about where you place the information. The :py:class:`~pymorize.rule.Rule` should win if
    there are conflicts between the rule and the pipeline. This is because the rule is the most specific, and the pipeline is the most general. So, to have a value specified in
    the rule, you should do:

    .. code-block:: yaml

        pymorize:
          version: unreleased
        general:
        pipelines:
          - name: Set Attribute
            actions:
              - set_attribute
            attribute_name: "my_attribute"
            attribute_value: "my_value"
        rules:
          - input_patterns: [".*"]
            cmor_variable: tas
            pipelines: [Set Attribute]
            set_attribute_args:
              attribute_name: "my_other_attribute"
              attribute_value: "my_other_value"

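
The "rule wins over pipeline" precedence can be sketched with plain dictionaries. The helper ``resolve_step_args`` is hypothetical and only illustrates the ordering described here; the values mirror the configuration used in this example.

```python
def resolve_step_args(pipeline_args: dict, rule_args: dict) -> dict:
    """Merge step arguments so that rule-level values override pipeline-level ones."""
    # In a dict merge the later mapping wins, so the rule goes last.
    return {**pipeline_args, **rule_args}


pipeline_args = {"attribute_name": "my_attribute", "attribute_value": "my_value"}
rule_args = {"attribute_name": "my_other_attribute", "attribute_value": "my_other_value"}

resolved = resolve_step_args(pipeline_args, rule_args)
print(resolved)

# With no rule-level overrides, the pipeline values are used unchanged.
fallback = resolve_step_args(pipeline_args, {})
print(fallback)
```
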
.. attention::

    If you want more examples in the handbook, please open an issue or a pull request!
1 change: 1 addition & 0 deletions doc/index.rst
@@ -13,6 +13,7 @@ Contents
:maxdepth: 2

installation
pymorize_building_blocks
pymorize_config_file
including_subcommand_plugins
developer_guide
72 changes: 72 additions & 0 deletions doc/pymorize_building_blocks.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
=====================================
Usage: ``pymorize``'s Building Blocks
=====================================

The ``pymorize`` CLI has a few basic concepts you should be familiar with before you start using it. This guide
will give you a brief overview of these concepts. Suggestions for improvements are always welcome!

Configuration
-------------

The configuration is the central piece of the ``pymorize`` CLI. It is a YAML file that specifies the behavior of
the CLI. The configuration file is divided into four sections:

1. ``pymorize``: This section contains the configuration for the CLI itself. It specifies the program version, log verbosity, the location of the user configuration file, and the location of the log file.
2. ``global``: This section contains information that will be passed to all pipelines. You will specify the location of the data directory (your model output files),
the output directory (where re-written data should be stored), the location of the CMOR tables, the location of your model's geometry description file (or files), and
any other information that may be needed by all pipelines.
3. ``pipelines``: This section contains the configuration for the pipelines. Each pipeline is a sequence of operations that will be applied to the data. You can specify the name of the pipeline, the class
that implements the pipeline, and any parameters that the pipeline needs.
4. ``rules``: This section contains the configuration for the rules. Each rule describes a set of files needed to produce a CMOR output variable. You must specify the CMOR variable of interest, the input
patterns to use to find the files, and the pipeline(s) to apply to the files.
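
As a quick sanity check, the four top-level sections can be verified once the YAML file has been parsed into a dictionary. The sketch below works on a hand-written dictionary rather than a real file. The section names are taken from the list above; treating all four as mandatory is an assumption of this sketch, and ``missing_sections`` is a hypothetical helper, not part of ``pymorize``.

```python
# Section names as listed above. Requiring all of them is an assumption
# made for this sketch.
EXPECTED_SECTIONS = ("pymorize", "global", "pipelines", "rules")


def missing_sections(config: dict) -> list:
    """Return the expected top-level sections that are absent from ``config``."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


# A hand-written stand-in for a parsed configuration file.
config = {
    "pymorize": {"version": "unreleased"},
    "pipelines": [{"name": "my_pipeline", "steps": []}],
    "rules": [{"name": "my_rule", "cmor_variable": "tas"}],
}

missing = missing_sections(config)
print(missing)
```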

Pipelines
---------

:py:class:`~pymorize.pipeline.Pipeline` objects come in two flavors. The first is a predefined pipeline, which is
attached to the configuration via a ``uses`` directive. In the user configuration file, you would specify it
like this:

.. code-block:: yaml

    # ... other configuration
    pipelines:
      - name: my_pipeline
        uses: pymorize.pipeline.DefaultPipeline
    # ... other configuration

Alternatively, you can define your own pipeline by specifying the steps it should take. Here is an example of a
custom pipeline:

.. code-block:: yaml

    # ... other configuration
    pipelines:
      - name: my_pipeline
        steps:
          - pymorize.generic.dummy_load_data
          - pymorize.generic.dummy_process_data
          - pymorize.generic.dummy_save_data
    # ... other configuration

Rules
-----

Rules are the heart of the ``pymorize`` CLI. They specify the files needed to produce a CMOR output variable. Each rule has a name, a CMOR variable, and a list of input patterns. The input patterns are used to find the files needed to produce the CMOR output variable. Here is an example of a rule:

.. code-block:: yaml

    # ... other configuration
    rules:
      - name: my_rule
        cmor_variable: tas
        patterns:
          - 'tas_*.nc'
        pipelines:
          - my_pipeline
    # ... other configuration

.. note::

    If you do not specify a pipeline, the default pipeline will be run!

1 change: 1 addition & 0 deletions doc/requirements.txt
@@ -3,3 +3,4 @@ sphinx-copybutton
sphinx-tabs
sphinx-toolbox
sphinx-rtd-theme
watchdog[watchmedo]