Merge pull request #25 from esm-tools/feat/pipelines

Runnable Pipelines
esm-tools · Aug 7, 2024 · 6ef31a4
2 parents 3445e26 + d88c748
commit 6ef31a4
Showing 38 changed files with 1,267 additions and 432 deletions.
diff --git a/doc/conf.py b/doc/conf.py
@@ -75,13 +75,16 @@
     "python": ("http://docs.python.org/", None),
     "numpy": ("http://docs.scipy.org/doc/numpy/", None),
     "scipy": ("http://docs.scipy.org/doc/scipy/reference/", None),
-    "matplotlib": ("http://matplotlib.sourceforge.net/", None),
+    "matplotlib": ("https://matplotlib.org/stable/", None),
     "pandas": ("http://pandas.pydata.org/pandas-docs/stable/", None),
     "xarray": ("http://xarray.pydata.org/en/stable/", None),
     "chemicals": ("https://chemicals.readthedocs.io/", None),
 }
 
 
+# -- Custom directives --------------------------------------------------------
+napoleon_custom_sections = [("Mutates", "params_style")]
+
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 

diff --git a/doc/developer_guide.rst b/doc/developer_guide.rst
@@ -1,43 +1,260 @@
 Developer Guide
 ===============
 
-Thanks for helping develop ``pymorize``!. This document will guide you through
+Thanks for helping develop ``pymorize``! This document will guide you through
 the code structure and layout, and provide a few tips on how to contribute.
 
-Code Layout
------------
+Getting Started
+---------------
+To get started, clone the repository and install the dependencies. The package defines
+extra dependency groups for testing and documentation, so install these as well::
 
-We use a `src` layout, with all files living under `./src/pymorize`. The code is
+    git clone https://github.com/esm-tools/pymorize.git
+    cd pymorize
+    pip install -e ".[dev,doc]"
+
+This will install the package in "editable" mode, so you can make changes to the code. The
+``dev`` and ``doc`` extras will install the dependencies needed for testing and documentation,
+respectively. Before changing anything, make sure that the tests are passing::
+
+    pytest
+
+Next, you should familiarize yourself with the code layout and the main building blocks. These
+are described in the next section.
+
+
+Code Layout and Main Classes
+----------------------------
+
+We use a ``src`` layout, with all files living under ``./src/pymorize``. The code is
 generally divided into several building blocks:
 
-There are a few main modules and classes you should be aware of:
+* :py:class:`~pymorize.rule.Rule` is the main class that defines a rule for processing a CMOR variable.
+
+* :py:class:`~pymorize.pipeline.Pipeline` (or a subclass of it) is an ordered collection
+  of actions that can be applied to a set of files described by a :py:class:`~pymorize.rule.Rule`. A few default pipelines are
+  provided, and you can also define your own.
+
+* :py:class:`~pymorize.cmorizer.CMORizer` is responsible for reading in the rules, and managing the various
+  objects. 
 
-* ``Rule`` is the main class that defines a rule for processing a CMOR variable. It is
-  defined in ``rule.py``. ``Rule`` objects have several attributes which are useful:
+:py:class:`~pymorize.rule.Rule` Class
+-------------------------------------
+
+This is the main building block for handling a set of files produced by a model. It has the following attributes:
 
   1. ``input_patterns``: A list of regular expressions that are used to match the
      input file name. Note that this is **regex**, not globbing!
   2. ``cmor_variable``: The ``CMOR`` name of the variable.
-  3. ``actions``: A list of actions to be performed on the input file. An ``action`` is
-       a partially-applied function, where the only remaining argument is the input file.
-
-      **Design Note**: The reason for this design is that it allows us to define a
-      list of actions that can be applied to a file, and then apply them all at once
-      when we find a match via a loop. It *has not* yet been determined if what we will
-      return here. It is probably easiest to return a list of `xarray` objects, corresponding
-      to each action. This allows us to grab out the data we need at any step in the chain.
-
-* ``Pipeline``: (**NOT YET IMPLEMENTED**) A ``Pipeline`` is a collection of rules that can
-  be applied to a file. It is (will be) defined in ``pipeline.py``.
-
-* ``CMORizer``: This is the main class, if anything can be called that. It is defined in
-  ``cmorizer.py``. It is responsible for reading in the rules, and managing the various
-  objects. **NOTE**: A ``CMORIZER`` object should **not** have any knowledge about files,
-  just about rules and actions. This is because we want to be able to test the rules
-  without having to worry about files (separation of concerns). This design is of course
-  subject to change. The ``CMORIZER`` object should have a method ``apply_rules`` that
-  takes a list of files, and applies the rules to them. (Alternatively, it should be a callable,
-  Paul hasn't decided yet)
-
-``rule.py`` contains the main ``Rule`` class. It should be used to define a matching
-between an output file and a CMOR name. d rules. This list is used to match the
+  3. ``pipelines``: A list of pipeline names that should be applied to the data.
+
+Any other attributes can be added to the rule, and will appear in the ``rule_spec`` as attributes of the ``Rule`` object. In YAML, a minimal rule
+looks like this:
+
+.. code-block:: yaml
+
+    input_patterns: [".*"]
+    cmor_variable: tas
+    pipelines: [My Pipeline]
+
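Since ``input_patterns`` are described above as regular expressions rather than globs, a small sketch of the difference may help. It uses only the standard library; the file names are made up for illustration, and whether ``pymorize`` applies ``re.match``, ``re.search`` or ``re.fullmatch`` internally is an assumption (``re.fullmatch`` is used here).

```python
import fnmatch
import re

# Hypothetical file names, for illustration only.
filenames = ["tas_2000.nc", "tas_2001.nc", "pr_2000.nc"]

# A glob-style pattern read as a regex means something different:
# "_*" is "zero or more underscores" and "." is "any single character".
glob_style = "tas_*.nc"
as_regex = [f for f in filenames if re.fullmatch(glob_style, f)]

# An explicit regex for the intended match.
regex = r"tas_.*\.nc"
with_regex = [f for f in filenames if re.fullmatch(regex, f)]

# fnmatch.translate converts a glob into an equivalent regex string.
translated = re.compile(fnmatch.translate(glob_style))
with_translated = [f for f in filenames if translated.match(f)]

print(as_regex)         # glob read as a regex
print(with_regex)       # explicit regex
print(with_translated)  # glob translated to a regex
```

If a pattern is easier to write as a glob, translating it once with ``fnmatch.translate`` is one way to stay within a regex-only field.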
+
+:py:class:`~pymorize.pipeline.Pipeline` Class
+---------------------------------------------
+
+The :py:class:`~pymorize.pipeline.Pipeline` class is a collection of actions that can be applied to a set of files. It should have a
+``name`` attribute that describes the pipeline. If not given during construction, a random one is generated. The actions are stored in a list, and
+are applied in order. There are a few ways to construct a pipeline. You can create one directly from a list of actions (also called steps)::
+
+    >>> pipeline = pymorize.pipeline.Pipeline([action1, action2], name="My Pipeline")
+    >>> # Or use the class method:
+    >>> pl = Pipeline.from_list([action1, action2], name="My Pipeline")
+
+where ``action1`` and ``action2`` are functions that follow the pipeline step protocol. See :ref:`the guide on building actions <building-actions-for-pipelines>`
+for more information.
+
+Another way to build a pipeline is from a list of qualified names of functions. A class method is provided to do this easily::
+
+    >>> my_pipeline = Pipeline.from_qualnames(["my_module.my_action1", "my_module.my_action2"], name="My Pipeline")
+
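To illustrate what "qualified names" means here, the following is a minimal sketch of how a dotted name could be turned into a callable with ``importlib``. The helper ``resolve_qualname`` is hypothetical and is not claimed to be how ``Pipeline.from_qualnames`` is implemented; standard-library functions stand in for real pipeline steps.

```python
import importlib
from typing import Callable


def resolve_qualname(qualname: str) -> Callable:
    """Hypothetical helper: import ``pkg.module.func`` and return ``func``."""
    # Split "pkg.module.func" into the module path and the attribute name.
    module_name, _, attr = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.function', got {qualname!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Standard-library functions used as stand-ins for pipeline steps.
steps = [resolve_qualname(name) for name in ["math.sqrt", "os.path.basename"]]
print([step.__name__ for step in steps])
```

A misspelled module raises ``ModuleNotFoundError`` and a misspelled function raises ``AttributeError``, so typos in a configuration file would surface when the pipeline is built rather than when it runs.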
+
+
+:py:class:`~pymorize.cmorizer.CMORizer` Class
+---------------------------------------------
+
+The :py:class:`~pymorize.cmorizer.CMORizer` class is responsible for managing the rules and pipelines. It holds four pieces of configuration:
+
+1. ``pymorize_cfg``: This is the configuration for the ``pymorize`` package. It should contain a version number, and any other configuration
+   that is needed for the package to run. This is used to check that the configuration is correct for the specific version of ``pymorize``. You
+   can also specify certain features to be enabled or disabled here, as well as configure the logging.
+
+2. ``global_cfg``: This is the global configuration for the rules and pipelines. This is used for configuration that is common to all rules and pipelines,
+   such as the path to the CMOR tables, or the path to the output directory. This is used to set up the environment for the rules and pipelines.
+
+3. ``pipelines``: This is a list of :py:class:`~pymorize.pipeline.Pipeline` objects that are used to process the data. These are the pipelines that are
+   applied to the data, and are referenced by the rules. Each pipeline should have a unique name, and a series of steps to perform. You can also specify 
+   "frozen" arguments and key-word arguments to apply to steps in the pipeline's configuration.
+
+4. ``rules``: This is a list of :py:class:`~pymorize.rule.Rule` objects that are used to match the data. Each rule should have a unique name, a series of
+   input patterns, a CMOR variable name, and a list of pipelines to apply to the data. You can also specify additional attributes that are used by the actions
+   in the pipelines.
+
+.. _building-actions-for-pipelines:
+
+Building Actions for Pipelines
+------------------------------
+
+When defining actions for a :py:class:`~pymorize.pipeline.Pipeline`, you should create functions
+with the following signature::
+
+    def my_action(data: Any, 
+                  rule_spec: pymorize.rule.Rule, 
+                  cmorizer: pymorize.cmorizer.CMORizer, 
+                  *args, **kwargs) -> Any:
+        ...
+        return data
+
+The ``data`` argument is the data that is passed from one action to the next. The ``rule_spec`` is the
+instance of the :py:class:`~pymorize.rule.Rule` class that is currently being evaluated. The ``cmorizer`` 
+is the instance of the :py:class:`~pymorize.cmorizer.CMORizer` class that is managing the pipeline. You 
+can pass additional arguments to the action by using ``*args`` and ``**kwargs``, however most arguments or 
+keyword arguments should be extracted from the ``rule_spec``. The action should return the data that will be
+passed to the next action in the pipeline. Note that the data can be any type, but it should be the same type
+as what is expected in the next action in the pipeline.
+
+.. note::
+
+   If needed, you can construct "conversion" actions that will convert the data from one type to another and pass
+   it to the next step.
+
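The step protocol described above (each action receives ``data``, ``rule_spec`` and ``cmorizer`` and returns the data for the next action) can be sketched without ``pymorize`` itself. The runner ``run_steps``, the ``scale`` action and the ``factor`` attribute are hypothetical, and a ``SimpleNamespace`` stands in for a real ``Rule``; this is an illustration of the calling convention, not the library's implementation.

```python
from types import SimpleNamespace
from typing import Any, Callable, Sequence


def run_steps(steps: Sequence[Callable], data: Any, rule_spec: Any, cmorizer: Any) -> Any:
    """Hypothetical runner: feed each step's return value into the next step."""
    for step in steps:
        data = step(data, rule_spec, cmorizer)
    return data


def add_one(data, rule_spec, cmorizer):
    """Add one to the data."""
    return data + 1


def scale(data, rule_spec, cmorizer):
    """Multiply by a factor read from the rule (hypothetical attribute)."""
    return data * rule_spec.factor


rule_spec = SimpleNamespace(factor=10)  # stand-in for pymorize.rule.Rule
result = run_steps([add_one, scale], 1, rule_spec, cmorizer=None)
print(result)  # (1 + 1) * 10 = 20
```

Because every step shares the same signature, steps can be reordered or swapped as long as each one accepts the type its predecessor returns.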
+When defining actions, you should also add a docstring that describes what the action does. This will be printed
+when the user asks for help on the action. Note that whenever possible, you should use the ``rule_spec`` to pass
+information into your action, rather than hardcoding it or passing in arguments. You can also use additional arguments
+if needed, and these can be fixed to always use the same values for the entire pipeline the action belongs to, or,
+alternatively, to the rule that the action is a part of. A few illustrative examples may make this clearer.
+
+* Example 1: A simple action that adds 1 to the data::
+  
+      def add_one(data: Any, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> Any:
+          """Add one to the data."""
+          return data + 1
+
+  Using this in a pipeline would look like this in Python code::
+
+      pipeline = pymorize.pipeline.Pipeline([add_one], name="Add One")
+      rule_spec = pymorize.rule.Rule(input_patterns=[".*"], cmor_variable="tas", pipelines=["Add One"])
+      cmorizer = pymorize.cmorizer.CMORizer(pymorize_cfg={"version": "unreleased"}, global_cfg={}, rules=[rule_spec], pipelines=[pipeline])
+      initial_data = 1
+      data = pipeline.run(initial_data, rule_spec, cmorizer)
+
+  In YAML, the same pipeline and configuration looks like this:
+
+  .. code-block:: yaml
+
+      pymorize:
+        version: unreleased
+
+      general:
+
+      pipelines:
+        - name: Add One
+          actions:
+            - add_one
+      rules:
+        - input_patterns: [".*"]
+          cmor_variable: tas
+          pipelines: [Add One]
+
+* Example 2: An action that sets an attribute on a :py:class:`xarray.Dataset`, where this is specified in
+  the rule specification::
+
+      def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer) -> xr.Dataset:
+          """Set an attribute on the dataset."""
+          data.attrs[rule_spec.attribute_name] = rule_spec.attribute_value
+          return data
+
+  Using this in a pipeline would look like this in YAML:
+
+  .. code-block:: yaml
+
+      pymorize:
+        version: unreleased
+
+      general:
+
+      pipelines:
+        - name: Set Attribute
+          actions:
+            - set_attribute
+      rules:
+        - input_patterns: [".*"]
+          cmor_variable: tas
+          pipelines: [Set Attribute]
+          attribute_name: "my_attribute"
+          attribute_value: "my_value"
+
+* Example 3: An action that sets an attribute on a :py:class:`~xarray.Dataset`, where this is specified in the :py:class:`~pymorize.pipeline.Pipeline`.
+
+  It is the responsibility of the action developer to ensure arguments are passed correctly and have sensible values. This is a more complicated example. Here we check
+  if the rule has a specific attribute that matches the action's name, with "``_args``" appended. We use those values if that is the case. Otherwise, they can be obtained from
+  the pipeline, and default to empty strings. As an action developer, you need to ensure sensible logic here!
+
+  .. code-block:: python
+
+      def set_attribute(data: xr.Dataset, rule_spec: pymorize.rule.Rule, cmorizer: pymorize.cmorizer.CMORizer, attribute_name: str = "", attribute_value: str = "", *args, **kwargs) -> xr.Dataset:
+          """Set an attribute on the dataset."""
+          # Rule-level overrides live under "<action name>_args", here ``set_attribute_args``.
+          # (``__name__`` inside a function is the *module* name, so the action name is spelled out.)
+          rule_args = getattr(rule_spec, "set_attribute_args", None)
+          if rule_args is not None:
+              # Fall back to the values passed in (e.g. from the pipeline) when a key is missing.
+              attribute_name = rule_args.get("attribute_name", attribute_name)
+              attribute_value = rule_args.get("attribute_value", attribute_value)
+          data.attrs[attribute_name] = attribute_value
+          return data
+
+  Using this in a pipeline would look like this in YAML:
+
+  .. code-block:: yaml
+
+      pymorize:
+        version: unreleased
+
+      general:
+
+      pipelines:
+        - name: Set Attribute
+          actions:
+            - set_attribute
+          attribute_name: "my_attribute"
+          attribute_value: "my_value"
+      rules:
+        - input_patterns: [".*"]
+          cmor_variable: tas
+          pipelines: [Set Attribute]
+  
+  .. important::
+
+      When passing arguments that are *not* in the rule spec, be careful about where you place the information. If the rule and the pipeline conflict, the
+      :py:class:`~pymorize.rule.Rule` should win, because the rule is the most specific and the pipeline is the most general. To override a pipeline value
+      from the rule, nest it under the action's name with ``_args`` appended:
+
+      .. code-block:: yaml
+    
+            pymorize:
+              version: unreleased
+    
+            general:
+    
+            pipelines:
+              - name: Set Attribute
+                actions:
+                  - set_attribute
+                attribute_name: "my_attribute"
+                attribute_value: "my_value"
+            rules:
+              - input_patterns: [".*"]
+                cmor_variable: tas
+                pipelines: [Set Attribute]
+                set_attribute_args:
+                  attribute_name: "my_other_attribute"
+                  attribute_value: "my_other_value"
+
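The precedence described above (rule over pipeline over the action's own defaults) can be sketched as a plain dictionary merge. The helper ``resolve_action_args`` is hypothetical, and ``SimpleNamespace`` objects stand in for real rules; how ``pymorize`` actually hands pipeline-level values to an action is not shown in this diff, so they are passed explicitly here.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def resolve_action_args(
    action_name: str,
    rule_spec: Any,
    pipeline_args: Optional[Dict[str, Any]] = None,
    defaults: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: merge arguments so that rule > pipeline > defaults."""
    merged = dict(defaults or {})
    merged.update(pipeline_args or {})
    # Rule-level overrides live under "<action name>_args" and are applied last.
    merged.update(getattr(rule_spec, f"{action_name}_args", None) or {})
    return merged


defaults = {"attribute_name": "", "attribute_value": ""}
pipeline_args = {"attribute_name": "my_attribute", "attribute_value": "my_value"}

# Stand-ins for Rule objects: one overrides a single key, one overrides nothing.
rule_with_override = SimpleNamespace(set_attribute_args={"attribute_name": "my_other_attribute"})
rule_without = SimpleNamespace()

overridden = resolve_action_args("set_attribute", rule_with_override, pipeline_args, defaults)
inherited = resolve_action_args("set_attribute", rule_without, pipeline_args, defaults)
print(overridden)
print(inherited)
```

Merging key by key means a rule can override only the values it cares about and inherit the rest from the pipeline.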
+.. attention::
+
+   If you want more examples in the handbook, please open an issue or a pull request!
diff --git a/doc/index.rst b/doc/index.rst
@@ -13,6 +13,7 @@ Contents
    :maxdepth: 2
 
    installation
+   pymorize_building_blocks
    pymorize_config_file
    including_subcommand_plugins
    developer_guide

diff --git a/doc/pymorize_building_blocks.rst b/doc/pymorize_building_blocks.rst
@@ -0,0 +1,72 @@
+=====================================
+Usage: ``pymorize``'s Building Blocks
+=====================================
+
+The ``pymorize`` CLI has a few basic concepts you should be familiar with before you start using it. This guide
+will give you a brief overview of these concepts. Suggestions for improvements are always welcome!
+
+Configuration
+-------------
+
+The configuration is the central piece of the ``pymorize`` CLI. It is a YAML file that specifies the behavior of
+the CLI. The configuration file is divided into four sections:
+
+1. ``pymorize``: This section contains the configuration for the CLI itself. It specifies the program version, log verbosity, the location of the user configuration file, and the location of the log file.
+2. ``global``: This section contains information that will be passed to all pipelines. You will specify the location of the data directory (your model output files),
+   the output directory (where re-written data should be stored), the location of the CMOR tables, the location of your model's geometry description file (or files), and
+   any other information that may be needed by all pipelines.
+3. ``pipelines``: This section contains the configuration for the pipelines. Each pipeline is a sequence of operations that will be applied to the data. You can specify the name of the pipeline, the class
+   that implements the pipeline, and any parameters that the pipeline needs.
+4. ``rules``: This section contains the configuration for the rules. Each rule describes a set of files needed to produce a CMOR output variable. You must specify the CMOR variable of interest, the input
+   patterns to use to find the files, and the pipeline(s) to apply to the files.
+
+Pipelines
+---------
+
+:py:class:`~pymorize.pipeline.Pipeline` objects come in two flavors. The first is a predefined pipeline, which is
+attached to the configuration via a ``uses`` directive. In the user configuration file, you would specify it
+like this:
+
+  .. code-block:: yaml
+  
+      # ... other configuration
+      pipelines:
+        - name: my_pipeline
+          uses: pymorize.pipeline.DefaultPipeline
+      # ... other configuration
+
+Alternatively you can define your own pipeline by specifying the steps it should take. Here is an example of a
+custom pipeline:
+
+  .. code-block:: yaml
+  
+      # ... other configuration
+      pipelines:
+        - name: my_pipeline
+          steps:
+            - pymorize.generic.dummy_load_data
+            - pymorize.generic.dummy_process_data
+            - pymorize.generic.dummy_save_data
+      # ... other configuration
+
+Rules
+-----
+
+Rules are the heart of the ``pymorize`` CLI. They specify the files needed to produce a CMOR output variable. Each rule has a name, a CMOR variable, and a list of input patterns. The input patterns are used to find the files needed to produce the CMOR output variable. Here is an example of a rule:
+
+  .. code-block:: yaml
+  
+      # ... other configuration
+      rules:
+        - name: my_rule
+          cmor_variable: tas
+          patterns:
+            - 'tas_*.nc'
+          pipelines:
+            - my_pipeline
+      # ... other configuration
+
+  .. note::
+
+       If you do not specify a pipeline, the default pipeline will be run!
+
diff --git a/doc/requirements.txt b/doc/requirements.txt
@@ -3,3 +3,4 @@ sphinx-copybutton
 sphinx-tabs
 sphinx-toolbox
 sphinx-rtd-theme
+watchdog[watchmedo]