Merge pull request #25 from esm-tools/feat/pipelines
Runnable Pipelines
Showing 38 changed files with 1,267 additions and 432 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,260 @@ | ||
Developer Guide | ||
=============== | ||
|
||
Thanks for helping develop ``pymorize``!. This document will guide you through | ||
Thanks for helping develop ``pymorize``! This document will guide you through | ||
the code structure and layout, and provide a few tips on how to contribute. | ||
|
||
Code Layout | ||
----------- | ||
Getting Started | ||
--------------- | ||
To get started, you should clone the repository and install the dependencies. We give | ||
a few extra dependencies for testing and documentation, so you should install these as well:: | ||
|
||
We use a `src` layout, with all files living under `./src/pymorize`. The code is | ||
    git clone https://github.com/esm-tools/pymorize.git | ||
    cd pymorize | ||
    pip install -e ".[dev,doc]" | ||
|
||
This will install the package in "editable" mode, so you can make changes to the code. The | ||
``dev`` and ``doc`` extras will install the dependencies needed for testing and documentation, | ||
respectively. Before changing anything, make sure that the tests are passing:: | ||
|
||
    pytest | ||
|
||
Next, you should familiarize yourself with the code layout and the main building blocks. These | ||
are described in the next section. | ||
|
||
|
||
Code Layout and Main Classes | ||
---------------------------- | ||
|
||
We use a ``src`` layout, with all files living under ``./src/pymorize``. The code is | ||
generally divided into several building blocks: | ||
|
||
There are a few main modules and classes you should be aware of: | ||
* :py:class:`~pymorize.rule.Rule` is the main class that defines a rule for processing a CMOR variable. | ||
|
||
* :py:class:`~pymorize.pipeline.Pipeline` (or an object inherited from :py:class:`~pymorize.pipeline.Pipeline`) is a collection | ||
of actions that can be applied to a set of files described by a :py:class:`~pymorize.rule.Rule`. A few default pipelines are | ||
provided, and you can also define your own. | ||
|
||
* :py:class:`~pymorize.cmorizer.CMORizer` is responsible for reading in the rules, and managing the various | ||
objects. | ||
|
||
* ``Rule`` is the main class that defines a rule for processing a CMOR variable. It is | ||
defined in ``rule.py``. ``Rule`` objects have several attributes which are useful: | ||
:py:class:`~pymorize.rule.Rule` Class | ||
------------------------------------- | ||
|
||
This is the main building block for handling a set of files produced by a model. It has the following attributes: | ||
|
||
1. ``input_patterns``: A list of regular expressions that are used to match the | ||
input file name. Note that this is **regex**, not globbing! | ||
2. ``cmor_variable``: The ``CMOR`` name of the variable. | ||
3. ``actions``: A list of actions to be performed on the input file. An ``action`` is | ||
a partially-applied function, where the only remaining argument is the input file. | ||
|
||
**Design Note**: The reason for this design is that it allows us to define a | ||
list of actions that can be applied to a file, and then apply them all at once | ||
when we find a match via a loop. It *has not* yet been determined what we will | ||
return here. It is probably easiest to return a list of `xarray` objects, corresponding | ||
to each action. This allows us to grab out the data we need at any step in the chain. | ||
|
||
* ``Pipeline``: (**NOT YET IMPLEMENTED**) A ``Pipeline`` is a collection of rules that can | ||
be applied to a file. It is (will be) defined in ``pipeline.py``. | ||
|
||
* ``CMORizer``: This is the main class, if anything can be called that. It is defined in | ||
``cmorizer.py``. It is responsible for reading in the rules, and managing the various | ||
objects. **NOTE**: A ``CMORIZER`` object should **not** have any knowledge about files, | ||
just about rules and actions. This is because we want to be able to test the rules | ||
without having to worry about files (separation of concerns). This design is of course | ||
subject to change. The ``CMORIZER`` object should have a method ``apply_rules`` that | ||
takes a list of files, and applies the rules to them. (Alternatively, it should be a callable, | ||
Paul hasn't decided yet) | ||
|
||
``rule.py`` contains the main ``Rule`` class. It should be used to define a matching | ||
between an output file and a CMOR name. d rules. This list is used to match the | ||
3. ``pipelines``: A list of pipeline names that should be applied to the data. | ||
|
||
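Because ``input_patterns`` are regular expressions rather than globs, a pattern that looks right as a glob can silently match nothing. The sketch below uses only the standard library to show the difference. The file names are invented for illustration, and how ``pymorize`` applies the pattern (``fullmatch``, ``match`` or ``search``) is an assumption that this guide does not confirm.

```python
import fnmatch
import re

# Hypothetical file names, used only for this illustration.
filenames = ["tas_2000.nc", "tas_2001.nc", "pr_2000.nc", "xtas_2000.nc.bak"]

# Regular expression: ".*" means "any run of characters", "\." is a literal dot.
regex = re.compile(r"tas_.*\.nc")
regex_matches = [f for f in filenames if regex.fullmatch(f)]

# The equivalent glob, where "*" alone means "any run of characters".
glob_matches = fnmatch.filter(filenames, "tas_*.nc")

# Passing the glob string to the regex engine changes its meaning:
# "_*" now reads "zero or more underscores" and "." reads "any one character".
glob_misused = [f for f in filenames if re.search("tas_*.nc", f)]

print(regex_matches)
print(glob_matches)
print(glob_misused)
```

The last list is empty: the glob-style string is a valid regular expression, so no error is raised, it simply fails to match the intended files.
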
Any other attributes can be added to the rule, and will appear in the ``rule_spec`` as attributes of the ``Rule`` object. In YAML, a minimal rule | ||
looks like this: | ||
|
||
.. code-block:: yaml | ||
|
||
    input_patterns: [".*"] | ||
    cmor_variable: tas | ||
    pipelines: [My Pipeline] | ||
|
||
:py:class:`~pymorize.pipeline.Pipeline` Class | ||
--------------------------------------------- | ||
|
||
The :py:class:`~pymorize.pipeline.Pipeline` class is a collection of actions that can be applied to a set of files. It should have a | ||
``name`` attribute that describes the pipeline. If not given during construction, a random one is generated. The actions are stored in a list, and | ||
are applied in order. There are a few ways to construct a pipeline. You can either create one from a list of actions (also called steps):: | ||
|
||
    >>> pipeline = pymorize.pipeline.Pipeline([action1, action2], name="My Pipeline") | ||
    >>> # Or use the class method: | ||
    >>> pl = Pipeline.from_list([action1, action2], name="My Pipeline") | ||
|
||
where ``action1`` and ``action2`` are functions that follow the pipeline step protocol. See :ref:`the guide on building actions <building-actions-for-pipelines>` | ||
for more information. | ||
|
||
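To make "applied in order" concrete, here is a minimal sketch of a pipeline runner. ``MiniPipeline`` is a hypothetical stand-in written for this guide, not the ``pymorize`` implementation, and the ``uuid``-based fallback name is only one plausible way to generate a random name.

```python
import uuid
from typing import Any, Callable, Optional, Sequence


class MiniPipeline:
    """Hypothetical stand-in for a pipeline; NOT the pymorize implementation."""

    def __init__(self, steps: Sequence[Callable[..., Any]], name: Optional[str] = None):
        self.steps = list(steps)
        # Assumption: any unique string is good enough as a fallback name.
        self.name = name or f"pipeline-{uuid.uuid4().hex[:8]}"

    def run(self, data: Any, rule_spec: Any, cmorizer: Any) -> Any:
        # Each step receives the output of the previous one.
        for step in self.steps:
            data = step(data, rule_spec, cmorizer)
        return data


def double(data, rule_spec, cmorizer):
    return data * 2


def add_one(data, rule_spec, cmorizer):
    return data + 1


# The same steps in a different order give a different result.
result = MiniPipeline([double, add_one], name="Demo").run(3, None, None)  # (3 * 2) + 1
reversed_result = MiniPipeline([add_one, double]).run(3, None, None)      # (3 + 1) * 2
```

Since the order of the steps changes the result, the list passed to the constructor is part of the pipeline's meaning and not just a bag of functions.
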
Another way to build a pipeline is from a list of qualified names of functions. A class method is provided to do this easily:: | ||
|
||
    >>> my_pipeline = Pipeline.from_qualnames(["my_module.my_action1", "my_module.my_action2"], name="My Pipeline") | ||
|
||
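Turning a qualified name into a callable usually comes down to importing the module part and looking up the final attribute. The helper below, ``resolve_qualname``, is a hypothetical sketch of that idea using ``importlib``; it is not taken from ``pymorize``, and the standard-library functions stand in for real actions.

```python
import importlib
from typing import Any, Callable


def resolve_qualname(qualname: str) -> Callable[..., Any]:
    """Import ``package.module.function`` and return the function object."""
    module_name, _, attribute = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"{qualname!r} is not a qualified name (no module part)")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Standard-library functions stand in for real pipeline actions here.
steps = [resolve_qualname(name) for name in ["math.sqrt", "os.path.basename"]]
root = steps[0](16)
base = steps[1]("/data/tas_2000.nc")
```

A misspelled module raises ``ModuleNotFoundError`` and a misspelled function raises ``AttributeError``, so typos in a configuration file surface when the pipeline is built rather than halfway through a run.
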
|
||
|
||
:py:class:`~pymorize.cmorizer.CMORizer` Class | ||
--------------------------------------------- | ||
|
||
The :py:class:`~pymorize.cmorizer.CMORizer` class is responsible for managing the rules and pipelines. It contains four configuration dictionaries: | ||
|
||
1. ``pymorize_cfg``: This is the configuration for the ``pymorize`` package. It should contain a version number, and any other configuration | ||
that is needed for the package to run. This is used to check that the configuration is correct for the specific version of ``pymorize``. You | ||
can also specify certain features to be enabled or disabled here, as well as configure the logging. | ||
|
||
2. ``global_cfg``: This is the global configuration for the rules and pipelines. This is used for configuration that is common to all rules and pipelines, | ||
such as the path to the CMOR tables, or the path to the output directory. This is used to set up the environment for the rules and pipelines. | ||
|
||
3. ``pipelines``: This is a list of :py:class:`~pymorize.pipeline.Pipeline` objects that are used to process the data. These are the pipelines that are | ||
applied to the data, and are referenced by the rules. Each pipeline should have a unique name, and a series of steps to perform. You can also specify | ||
"frozen" arguments and key-word arguments to apply to steps in the pipeline's configuration. | ||
|
||
4. ``rules``: This is a list of :py:class:`~pymorize.rule.Rule` objects that are used to match the data. Each rule should have a unique name, a series of | ||
input patterns, a CMOR variable name, and a list of pipelines to apply to the data. You can also specify additional attributes that are used in the actions | ||
in the pipelines. | ||
|
||
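One common way to "freeze" arguments onto a step is partial application. The sketch below uses ``functools.partial`` for this; whether ``pymorize`` freezes arguments this way internally is an assumption, and ``scale`` is a hypothetical action written for the example.

```python
from functools import partial
from typing import Any


def scale(data: Any, rule_spec: Any, cmorizer: Any, factor: float = 1.0, offset: float = 0.0) -> Any:
    """Hypothetical action: apply a linear transformation to the data."""
    return data * factor + offset


# Freeze the keyword arguments once; the result still follows the
# (data, rule_spec, cmorizer) calling convention used by pipeline steps.
frozen_scale = partial(scale, factor=2.0, offset=1.0)

result = frozen_scale(10, None, None)  # 10 * 2.0 + 1.0
```

The frozen step can then be placed in a list of actions like any other function, because the remaining positional arguments are exactly the ones a pipeline passes in.
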
.. _building-actions-for-pipelines: | ||
|
||
Building Actions for Pipelines | ||
------------------------------ | ||
|
||
When defining actions for a :py:class:`~pymorize.pipeline.Pipeline`, you should create functions | ||
with the following signature:: | ||
|
||
    def my_action(data: Any, | ||
                  rule_spec: pymorize.rule.Rule, | ||
                  cmorizer: pymorize.cmorizer.CMORizer, | ||
                  *args, **kwargs) -> Any: | ||
        ... | ||
        return data | ||
|
||
The ``data`` argument is the data that is passed from one action to the next. The ``rule_spec`` is the | ||
instance of the :py:class:`~pymorize.rule.Rule` class that is currently being evaluated. The ``cmorizer`` | ||
is the instance of the :py:class:`~pymorize.cmorizer.CMORizer` class that is managing the pipeline. You | ||
can pass additional arguments to the action by using ``*args`` and ``**kwargs``, however most arguments or | ||
keyword arguments should be extracted from the ``rule_spec``. The action should return the data that will be | ||
passed to the next action in the pipeline. Note that the data can be any type, but it should be the same type | ||
as what is expected in the next action in the pipeline. | ||
|
||
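As a sketch of the signature described above, the action below reads its parameter from the ``rule_spec`` instead of taking it as an argument. ``types.SimpleNamespace`` stands in for a real ``Rule`` so the example runs without ``pymorize`` installed, and the ``scale_factor`` attribute is hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def scale_by_rule(data: Any, rule_spec: Any, cmorizer: Any, *args, **kwargs) -> Any:
    """Multiply the data by ``rule_spec.scale_factor`` (hypothetical attribute)."""
    # Fall back to a neutral factor when the rule does not define one.
    factor = getattr(rule_spec, "scale_factor", 1.0)
    return data * factor


# Stand-ins for a Rule with and without the extra attribute.
rule_with_factor = SimpleNamespace(cmor_variable="tas", scale_factor=0.5)
rule_without_factor = SimpleNamespace(cmor_variable="tas")

scaled = scale_by_rule(4.0, rule_with_factor, cmorizer=None)
unchanged = scale_by_rule(4.0, rule_without_factor, cmorizer=None)
```

Using ``getattr`` with a default keeps the action usable in rules that do not set the attribute.
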
.. note:: | ||
|
||
    If needed, you can construct "conversion" actions that will convert the data from one type to another and pass | ||
    it to the next step. | ||
|
||
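A conversion action can be as small as the sketch below: one step produces text, the next expects a list of numbers, and a dedicated step in between bridges the two types. All three functions are hypothetical and exist only to illustrate the idea.

```python
from typing import Any, List


def read_text(data: str, rule_spec: Any, cmorizer: Any) -> str:
    """Pretend loader: strips whitespace and returns raw text."""
    return data.strip()


def text_to_floats(data: str, rule_spec: Any, cmorizer: Any) -> List[float]:
    """Conversion action: comma-separated text -> list of floats."""
    return [float(item) for item in data.split(",")]


def total(data: List[float], rule_spec: Any, cmorizer: Any) -> float:
    """Expects a list of numbers, not text."""
    return sum(data)


# Chain the steps by hand; the conversion step sits between the two types.
data = " 1.5,2.5,3 "
for step in (read_text, text_to_floats, total):
    data = step(data, None, None)
```

Keeping the conversion in its own step means neither the loader nor the consumer has to know about the other's data type.
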
When defining actions, you should also add a docstring that describes what the action does. This will be printed | ||
when the user asks for help on the action. Note that whenever possible, you should use the ``rule_spec`` to pass | ||
information into your action, rather than hardcoding it or passing in arguments. You can also use additional arguments | ||
if needed, and these can be fixed to always use the same values for the entire pipeline the action belongs to, or, | ||
alternatively, to the rule that the action is a part of. A few illustrative examples may make this clearer. | ||
|
||
* Example 1: A simple action that adds 1 to the data:: | ||
|
||
      def add_one(data: Any, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> Any: | ||
          """Add one to the data.""" | ||
          return data + 1 | ||
|
||
Using this in a pipeline would look like this in Python code:: | ||
|
||
    pipeline = pymorize.pipeline.Pipeline([add_one], name="Add One") | ||
    rule_spec = pymorize.rule.Rule(input_patterns=[".*"], cmor_variable="tas", pipelines=["Add One"]) | ||
    cmorizer = pymorize.cmorizer.CMORizer(pymorize_cfg={"version": "unreleased"}, global_cfg={}, rules=[rule_spec], pipelines=[pipeline]) | ||
    initial_data = 1 | ||
    data = pipeline.run(initial_data, rule_spec, cmorizer) | ||
|
||
In yaml, the same pipeline and configuration looks like this: | ||
|
||
.. code-block:: yaml | ||
|
||
    pymorize: | ||
      version: unreleased | ||
|
||
    general: | ||
|
||
    pipelines: | ||
      - name: Add One | ||
        actions: | ||
          - add_one | ||
    rules: | ||
      - input_patterns: [".*"] | ||
        cmor_variable: tas | ||
        pipelines: [Add One] | ||
|
||
* Example 2: An action that sets an attribute on a :py:class:`xarray.Dataset`, where this is specified in | ||
the rule specification:: | ||
|
||
      def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> xr.Dataset: | ||
          """Set an attribute on the dataset.""" | ||
          data.attrs[rule_spec.attribute_name] = rule_spec.attribute_value | ||
          return data | ||
|
||
Using this in a pipeline would look like this in yaml: | ||
|
||
.. code-block:: yaml | ||
|
||
    pymorize: | ||
      version: unreleased | ||
    general: | ||
    pipelines: | ||
      - name: Set Attribute | ||
        actions: | ||
          - set_attribute | ||
    rules: | ||
      - input_patterns: [".*"] | ||
        cmor_variable: tas | ||
        pipelines: [Set Attribute] | ||
        attribute_name: "my_attribute" | ||
        attribute_value: "my_value" | ||
|
||
* Example 3: An action that sets an attribute on a :py:class:`~xarray.Dataset`, where this is specified in the :py:class:`~pymorize.pipeline.Pipeline`. | ||
|
||
It is the responsibility of the action developer to ensure arguments are passed correctly and have sensible values. This is a more complicated example. Here we check | ||
if the rule has a specific attribute that matches the action's name, with "``_args``" appended. We use those values if that is the case. Otherwise, they can be obtained from | ||
the pipeline, and default to empty strings. As an action developer, you need to ensure sensible logic here! | ||
|
||
.. code-block:: python | ||
|
||
    def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer, attribute_name: str = "", attribute_value: str = "", *args, **kwargs) -> xr.Dataset: | ||
        """Set an attribute on the dataset.""" | ||
        # ``__name__`` alone would be the module name, so use the function's own name. | ||
        args_key = f"{set_attribute.__name__}_args" | ||
        if hasattr(rule_spec, args_key): | ||
            # Values on the rule win; otherwise keep the values passed in by the pipeline. | ||
            rule_args = getattr(rule_spec, args_key) | ||
            attribute_name = rule_args.get("attribute_name", attribute_name) | ||
            attribute_value = rule_args.get("attribute_value", attribute_value) | ||
        data.attrs[attribute_name] = attribute_value | ||
        return data | ||
|
||
Using this in a pipeline would look like this in yaml: | ||
|
||
.. code-block:: yaml | ||
|
||
    pymorize: | ||
      version: unreleased | ||
    general: | ||
    pipelines: | ||
      - name: Set Attribute | ||
        actions: | ||
          - set_attribute | ||
        attribute_name: "my_attribute" | ||
        attribute_value: "my_value" | ||
    rules: | ||
      - input_patterns: [".*"] | ||
        cmor_variable: tas | ||
        pipelines: [Set Attribute] | ||
|
||
.. important:: | ||
|
||
    When passing arguments that are *not* in the rule spec, you need to be careful about where you place the information. The :py:class:`~pymorize.rule.Rule` should win if | ||
    there are conflicts between the rule and the pipeline. This is because the rule is the most specific, and the pipeline is the most general. So, to have a value specified in | ||
    the rule, you should do: | ||
|
||
    .. code-block:: yaml | ||
|
||
        pymorize: | ||
          version: unreleased | ||
        general: | ||
        pipelines: | ||
          - name: Set Attribute | ||
            actions: | ||
              - set_attribute | ||
            attribute_name: "my_attribute" | ||
            attribute_value: "my_value" | ||
        rules: | ||
          - input_patterns: [".*"] | ||
            cmor_variable: tas | ||
            pipelines: [Set Attribute] | ||
            set_attribute_args: | ||
              attribute_name: "my_other_attribute" | ||
              attribute_value: "my_other_value" | ||
|
||
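
The precedence described here (rule-level values override pipeline-level values) amounts to a dictionary merge in which the rule is applied last. The helper ``resolve_action_args`` below is a hypothetical sketch of that merge and is not a ``pymorize`` function; ``SimpleNamespace`` stands in for a real ``Rule``.

```python
from types import SimpleNamespace
from typing import Any, Dict


def resolve_action_args(action_name: str, rule_spec: Any, pipeline_args: Dict[str, Any]) -> Dict[str, Any]:
    """Merge pipeline-level and rule-level arguments; the rule wins on conflict."""
    merged = dict(pipeline_args)  # most general values first
    merged.update(getattr(rule_spec, f"{action_name}_args", {}))  # most specific last
    return merged


pipeline_args = {"attribute_name": "my_attribute", "attribute_value": "my_value"}

# A rule that overrides the pipeline, and one that does not.
overriding_rule = SimpleNamespace(
    set_attribute_args={"attribute_name": "my_other_attribute", "attribute_value": "my_other_value"}
)
plain_rule = SimpleNamespace()

from_rule = resolve_action_args("set_attribute", overriding_rule, pipeline_args)
from_pipeline = resolve_action_args("set_attribute", plain_rule, pipeline_args)
```

Copying ``pipeline_args`` before updating keeps the pipeline's own defaults untouched, so one rule's override cannot leak into the next rule.
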
.. attention:: | ||
|
||
    If you want more examples in the handbook, please open an issue or a pull request! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
===================================== | ||
Usage: ``pymorize``'s Building Blocks | ||
===================================== | ||
|
||
The ``pymorize`` CLI has a few basic concepts you should be familiar with before you start using it. This guide | ||
will give you a brief overview of these concepts. Suggestions for improvements are always welcome! | ||
|
||
Configuration | ||
------------- | ||
|
||
The configuration is the central piece of the ``pymorize`` CLI. It is a YAML file that specifies the behavior of | ||
the CLI. The configuration file is divided into four sections: | ||
|
||
1. ``pymorize``: This section contains the configuration for the CLI itself. It specifies the program version, log verbosity, the location of the user configuration file, and the location of the log file. | ||
2. ``global``: This section contains information that will be passed to all pipelines. You will specify the location of the data directory (your model output files), | ||
the output directory (where re-written data should be stored), the location of the CMOR tables, the location of your model's geometry description file (or files), and | ||
any other information that may be needed by all pipelines. | ||
3. ``pipelines``: This section contains the configuration for the pipelines. Each pipeline is a sequence of operations that will be applied to the data. You can specify the name of the pipeline, the class | ||
that implements the pipeline, and any parameters that the pipeline needs. | ||
4. ``rules``: This section contains the configuration for the rules. Each rule describes a set of files needed to produce a CMOR output variable. You must specify the CMOR variable of interest, the input | ||
patterns to use to find the files, and the pipeline(s) to apply to the files. | ||
|
||
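A quick sanity check on a loaded configuration could look like the sketch below. It works on a plain dictionary, such as one produced by a YAML loader, and uses the four section names listed above. Whether ``pymorize`` itself requires all four sections to be present is an assumption, and ``missing_sections`` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Section names as listed in this guide.
REQUIRED_SECTIONS = ("pymorize", "global", "pipelines", "rules")


def missing_sections(config: Dict[str, Any]) -> List[str]:
    """Return the names of the top-level sections that are absent."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A configuration that forgot its ``global`` section.
config = {
    "pymorize": {"version": "unreleased"},
    "pipelines": [{"name": "my_pipeline", "steps": []}],
    "rules": [{"name": "my_rule", "cmor_variable": "tas"}],
}

missing = missing_sections(config)
```
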
Pipelines | ||
--------- | ||
|
||
A :py:class:`~pymorize.pipeline.Pipeline` comes in one of two flavors. The first is a predefined pipeline, which is | ||
attached to the configuration via a ``uses`` directive. In the user configuration file, you would specify it | ||
like this: | ||
|
||
.. code-block:: yaml | ||
|
||
    # ... other configuration | ||
    pipelines: | ||
      - name: my_pipeline | ||
        uses: pymorize.pipeline.DefaultPipeline | ||
    # ... other configuration | ||
|
||
Alternatively you can define your own pipeline by specifying the steps it should take. Here is an example of a | ||
custom pipeline: | ||
|
||
.. code-block:: yaml | ||
|
||
    # ... other configuration | ||
    pipelines: | ||
      - name: my_pipeline | ||
        steps: | ||
          - pymorize.generic.dummy_load_data | ||
          - pymorize.generic.dummy_process_data | ||
          - pymorize.generic.dummy_save_data | ||
    # ... other configuration | ||
|
||
Rules | ||
----- | ||
|
||
Rules are the heart of the ``pymorize`` CLI. They specify the files needed to produce a CMOR output variable. Each rule has a name, a CMOR variable, and a list of input patterns. The input patterns are used to find the files needed to produce the CMOR output variable. Here is an example of a rule: | ||
|
||
.. code-block:: yaml | ||
|
||
    # ... other configuration | ||
    rules: | ||
      - name: my_rule | ||
        cmor_variable: tas | ||
        patterns: | ||
          - 'tas_*.nc' | ||
        pipelines: | ||
          - my_pipeline | ||
    # ... other configuration | ||
|
||
.. note:: | ||
|
||
    If you do not specify a pipeline, the default pipeline will be run! | ||
|
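The lookup from a rule to its pipelines, including the fallback mentioned in the note above, can be sketched as follows. The function ``pipelines_for_rule`` and the registry key ``"default"`` are hypothetical; the name under which ``pymorize`` registers its default pipeline is not stated in this guide.

```python
from typing import Any, Dict, List


def pipelines_for_rule(rule: Dict[str, Any], available: Dict[str, Any], default_name: str = "default") -> List[Any]:
    """Return the pipelines a rule refers to, or the default one if none are named."""
    names = rule.get("pipelines") or [default_name]
    unknown = [name for name in names if name not in available]
    if unknown:
        raise KeyError(f"rule {rule.get('name')!r} refers to unknown pipelines: {unknown}")
    return [available[name] for name in names]


# Strings stand in for real pipeline objects.
available = {"default": "<default pipeline>", "my_pipeline": "<my pipeline>"}

explicit = pipelines_for_rule({"name": "my_rule", "pipelines": ["my_pipeline"]}, available)
fallback = pipelines_for_rule({"name": "other_rule"}, available)
```

Raising on unknown names is a design choice in this sketch: a misspelled pipeline name fails loudly instead of silently falling back to the default.
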
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,3 +3,4 @@ sphinx-copybutton | |
sphinx-tabs | ||
sphinx-toolbox | ||
sphinx-rtd-theme | ||
watchdog[watchmedo] |