Commit
Merge pull request #47 from cedadev/1.3_revisions
1.3 revisions
dwest77a authored Jan 8, 2025
2 parents f8df80d + 5109cac commit e35a062
Showing 52 changed files with 2,814 additions and 993 deletions.
Binary file added docs/source/_images/CedaArchive0824.png

Binary file added docs/source/_images/DataDistributed.png
6 changes: 0 additions & 6 deletions docs/source/allocation.rst

This file was deleted.

136 changes: 0 additions & 136 deletions docs/source/assess-overview.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/source/assess.rst

This file was deleted.

77 changes: 6 additions & 71 deletions docs/source/cci_water.rst
@@ -8,9 +8,10 @@ A new *group* is created within the pipeline using the ``init`` operation as follows:

::

   python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v
   padocc init -G <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::

   Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages.
   The ``init`` phase is always run as a serial process, since it only involves creating the directories and config files required by the pipeline.
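
For instance, a sketch of the same command with DEBUG-level output enabled (doubling the verbose flag as described above):

.. code-block:: console

   padocc init -G <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -vv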

@@ -56,28 +57,6 @@ Ok great, we've initialised the pipeline for our new group! Here's a summary diagram
- validate.log
- status_log.csv

For peace of mind, and to check you understand the pipeline assessor tool, we suggest running this command next:

::

   python assess.py progress my_new_group

Your output should look something like this:

.. code-block:: console

   Group: my_new_group
   Total Codes: 4

   Pipeline Current:
   init : 4 [100.%] (Variety: 1)
       - complete : 4

   Pipeline Complete:
   complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully; no datasets have completed the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.
@@ -88,52 +67,8 @@ The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.

.. code-block:: console

   python group_run.py scan my_new_group
   python group_run.py compute my_new_group
   python group_run.py validate my_new_group
A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

   Group: cci_group_v1
   Total Codes: 361

   Pipeline Current:
   compute : 21 [5.8 %] (Variety: 2)
       - complete : 20
       - KeyError 'refs' : 1

   Pipeline Complete:
   complete : 185 [51.2%]
   blacklist : 155 [42.9%] (Variety: 8)
       - NonKerchunkable : 50
       - PartialDriver : 3
       - PartialDriverFail : 5
       - ExhaustedMemoryLimit : 56
       - ExhaustedTimeLimit : 18
       - ExhaustedTimeLimit* : 1
       - ValidationMemoryLimit : 21
       - ScipyDimIssue : 1
In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (see Blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset, we can use the command below:

.. code-block:: console

   python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E
This will match our ``compute``-phase errors with that message, and the (-E) flag will show the whole error log from that run. This may be enough to assess and fix the issue; otherwise, to rerun just this dataset, a rerun command will be suggested by the assessor:

.. code-block:: console

   Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
   Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d
This rerun command includes several flags; the most important here is the (-G) group flag, since we are now using the ``single_run`` script and so need to specify the group. The (-d) dryrun flag means no output files are produced, since we may need to test and rerun several times.


.. code-block:: console

   padocc scan -G my_new_group
   padocc compute -G my_new_group
   padocc validate -G my_new_group
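
If a phase fails for some projects, it can be rerun on its own once the issue is addressed; as a sketch, using only the flags introduced above:

.. code-block:: console

   padocc compute -G my_new_group -vv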
This section will be updated for the full release of v1.3 with additional content relating to the assessor tool.
7 changes: 0 additions & 7 deletions docs/source/compute.rst

This file was deleted.

65 changes: 65 additions & 0 deletions docs/source/deep_dive.rst
@@ -0,0 +1,65 @@
===================================
A Deeper Dive into PADOCC Mechanics
===================================

Revision Numbers
----------------

The PADOCC revision number for each product is auto-generated according to the following rules:

* All projects begin with the revision number ``1.1``.
* The first number denotes major updates to the product, for instance where a data source file has been replaced.
* The second number denotes minor changes like alterations to attributes and metadata.
* The letters prefixed to the revision numbers identify the file type for the product, as sketched below. For example, a zarr store has the letter ``z`` applied, while a Kerchunk (parquet) store has ``kp``.
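
A minimal sketch of how such a label could be assembled (the ``revision_label`` helper below is purely hypothetical, not part of the PADOCC API):

.. code-block:: python

   # Hypothetical illustration only: compose a revision label from a
   # file-type prefix and the major/minor revision numbers.
   PREFIXES = {"zarr": "z", "kerchunk_parquet": "kp"}

   def revision_label(file_type: str, major: int = 1, minor: int = 1) -> str:
       return f"{PREFIXES[file_type]}{major}.{minor}"

   print(revision_label("kerchunk_parquet"))  # kp1.1 - every project starts at 1.1
   print(revision_label("zarr", major=2))     # z2.1 - after a major update such as a replaced source file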

The Validation Report
---------------------

The ``ValidateDatasets`` class produces a validation report covering both data and metadata validation.
This is designed to be fairly simple to interpret, while still being machine-readable.
The headings that may appear in the report have the following meanings:

1. Metadata Report (with Examples)

   These are considered non-fatal errors that will need either a minor correction or can be ignored.

   * ``variables.time: {'type':'missing'...}`` - The time variable is missing from the specified product.
   * ``dims.all_dims: {'type':'order'}`` - The ordering of dimensions is not consistent across products.
   * ``attributes {'type':'ignore'...}`` - Attributes that have been ignored. These may have already been edited.
   * ``attributes {'type':'missing'...}`` - Attributes that are missing from the specified product file.
   * ``attributes {'type':'not_equal'...}`` - Attributes that are not equal across products.

2. Data Report

   These are considered **fatal** errors that need a major correction or possibly a fix to the pipeline itself.

   * ``size_errors`` - The size of the array is not consistent between products.
   * ``dim_errors`` - Arrays have inconsistent dimensions (where not ignored).
   * ``dim_size_errors`` - The dimensions are consistent for a variable but their sizes are not.
   * ``data_errors`` - The data arrays do not match across products; this is the most serious of all validation errors. The validator should give an idea of which array comparisons failed.
   * ``data_errors: {'type':'growbox_exceeded'...}`` - The variable in question could not be validated, as no area could be identified that is not empty of values.
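
As a rough sketch of how such a report might be consumed (the report layout here is an assumption inferred from the headings above, not the confirmed ``ValidateDatasets`` output):

.. code-block:: python

   # Assumed shape of a metadata report, inferred from the headings above;
   # the real ValidateDatasets output may differ. 'history' is just an
   # example attribute name.
   report = {
       "variables": {"time": {"type": "missing"}},
       "attributes": {"history": {"type": "not_equal"}},
   }

   # Non-fatal metadata issues can be triaged on their 'type' field.
   for section, entries in report.items():
       for name, issue in entries.items():
           print(f"{section}.{name}: {issue['type']}")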

BypassSwitch Options
--------------------

Certain non-fatal errors may be bypassed using the Bypass flag:
::

   Format: -b "D"

   Default: "D"   # Highlighted by a '*'

   "D" - * Skip driver failures - Pipeline tries different options for NetCDF (default).
         - Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError).
   "F" -   Skip scanning (fasttrack) and go straight to compute. Required if running compute
           before scan is attempted.
   "L" -   Skip adding links in compute (download links) - this will be required on ingest.
   "S" -   Skip errors when running a subset within a group. Record the error then move on
           to the next dataset.

Custom Pipeline Errors
----------------------

**A summary of the custom errors that may be encountered while running the pipeline.**

.. automodule:: padocc.core.errors
   :members:
   :show-inheritance:
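
A short sketch of how one of these errors might be handled in user code (this assumes ``KerchunkDriverFatalError``, named in the BypassSwitch section above, is importable from ``padocc.core.errors``; the surrounding call is a hypothetical stand-in):

.. code-block:: python

   # Sketch only: assumes KerchunkDriverFatalError (named in the
   # BypassSwitch section above) is importable from padocc.core.errors.
   from padocc.core.errors import KerchunkDriverFatalError

   def process_project():
       """Hypothetical stand-in for a compute-phase call on one project."""

   try:
       process_project()
   except KerchunkDriverFatalError:
       # All NetCDF driver options failed; the "D" bypass switch above
       # skips individual driver failures but not this fatal case.
       print("No working driver for this dataset.")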
8 changes: 0 additions & 8 deletions docs/source/errors.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/execution-source.rst

This file was deleted.
