Commit
Merge pull request #47 from cedadev/1.3_revisions
1.3 revisions
dwest77a authored Jan 8, 2025
2 parents f8df80d + 5109cac commit e35a062
Showing 52 changed files with 2,814 additions and 993 deletions.
Binary file added docs/source/_images/CedaArchive0824.png

Binary file added docs/source/_images/DataDistributed.png
6 changes: 0 additions & 6 deletions docs/source/allocation.rst

This file was deleted.

136 changes: 0 additions & 136 deletions docs/source/assess-overview.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/source/assess.rst

This file was deleted.

77 changes: 6 additions & 71 deletions docs/source/cci_water.rst
@@ -8,9 +8,10 @@ A new *group* is created within the pipeline using the ``init`` operation as follows:

::

   python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v
   padocc init -G <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::

   Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages.
   The ``init`` phase is always run as a serial process, since it only involves creating the directories and config files required by the pipeline.
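
For instance, a sketch of the same command with DEBUG-level output enabled (doubling the verbose flag as described above):

.. code-block:: console

   padocc init -G <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -vv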

@@ -56,28 +57,6 @@ Ok great, we've initialised the pipeline for our new group! Here's a summary diagram
- validate.log
- status_log.csv

For peace of mind, and to check you understand the pipeline assessor tool, we suggest running this command next:

::

   python assess.py progress my_new_group

Your output should look something like this:

.. code-block:: console

   Group: my_new_group
   Total Codes: 4

   Pipeline Current:
   init : 4 [100.%] (Variety: 1)
       - complete : 4

   Pipeline Complete:
   complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully; no datasets have completed the pipeline yet.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.
@@ -88,52 +67,8 @@ The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.

.. code-block:: console

   python group_run.py scan my_new_group
   python group_run.py compute my_new_group
   python group_run.py validate my_new_group
A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

   Group: cci_group_v1
   Total Codes: 361

   Pipeline Current:
   compute : 21 [5.8 %] (Variety: 2)
       - complete : 20
       - KeyError 'refs' : 1

   Pipeline Complete:
   complete : 185 [51.2%]
   blacklist : 155 [42.9%] (Variety: 8)
       - NonKerchunkable : 50
       - PartialDriver : 3
       - PartialDriverFail : 5
       - ExhaustedMemoryLimit : 56
       - ExhaustedTimeLimit : 18
       - ExhaustedTimeLimit* : 1
       - ValidationMemoryLimit : 21
       - ScipyDimIssue : 1
In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (see Blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset, we can use the command below:

.. code-block:: console

   python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E
This will match our ``compute``-phase errors with that message, and the (-E) flag will show the whole error log from that run. This may be enough to assess and fix the issue; otherwise, to rerun just this dataset, a rerun command will be suggested by the assessor:

.. code-block:: console

   Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
   Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d
This rerun command includes several flags; the most important here is the (-G) group flag, since we are now using the ``single_run`` script and so need to specify the group. The (-d) dryrun flag means no output files are produced, since we may need to test and rerun several times.


.. code-block:: console

   padocc scan -G my_new_group
   padocc compute -G my_new_group
   padocc validate -G my_new_group
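
If a phase fails for some projects, it can be rerun on its own once the issue is addressed; as a sketch, using only the flags introduced above:

.. code-block:: console

   padocc compute -G my_new_group -vv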
This section will be updated for the full release of v1.3 with additional content relating to the assessor tool.
7 changes: 0 additions & 7 deletions docs/source/compute.rst

This file was deleted.

65 changes: 65 additions & 0 deletions docs/source/deep_dive.rst
@@ -0,0 +1,65 @@
===================================
A Deeper Dive into PADOCC Mechanics
===================================

Revision Numbers
----------------

The PADOCC revision number for each product is auto-generated according to the following rules:

* All projects begin with the revision number ``1.1``.
* The first number denotes major updates to the product, for instance where a data source file has been replaced.
* The second number denotes minor changes like alterations to attributes and metadata.
* The letters prefixed to the revision numbers identify the file type for the product, as sketched below. For example, a zarr store has the letter ``z`` applied, while a Kerchunk (parquet) store has ``kp``.
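
A minimal sketch of how such a label could be assembled (the ``revision_label`` helper below is purely hypothetical, not part of the PADOCC API):

.. code-block:: python

   # Hypothetical illustration only: compose a revision label from a
   # file-type prefix and the major/minor revision numbers.
   PREFIXES = {"zarr": "z", "kerchunk_parquet": "kp"}

   def revision_label(file_type: str, major: int = 1, minor: int = 1) -> str:
       return f"{PREFIXES[file_type]}{major}.{minor}"

   print(revision_label("kerchunk_parquet"))  # kp1.1 - every project starts at 1.1
   print(revision_label("zarr", major=2))     # z2.1 - after a major update such as a replaced source file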

The Validation Report
---------------------

The ``ValidateDatasets`` class produces a validation report covering both data and metadata validation.
This is designed to be fairly simple to interpret, while still being machine-readable.
The headings that may appear in the report have the following meanings:

1. Metadata Report (with Examples)

   These are considered non-fatal errors that will need either a minor correction or can be ignored.

   * ``variables.time: {'type':'missing'...}`` - The time variable is missing from the specified product.
   * ``dims.all_dims: {'type':'order'}`` - The ordering of dimensions is not consistent across products.
   * ``attributes {'type':'ignore'...}`` - Attributes that have been ignored. These may have already been edited.
   * ``attributes {'type':'missing'...}`` - Attributes that are missing from the specified product file.
   * ``attributes {'type':'not_equal'...}`` - Attributes that are not equal across products.

2. Data Report

   These are considered **fatal** errors that need a major correction or possibly a fix to the pipeline itself.

   * ``size_errors`` - The size of the array is not consistent between products.
   * ``dim_errors`` - Arrays have inconsistent dimensions (where not ignored).
   * ``dim_size_errors`` - The dimensions are consistent for a variable but their sizes are not.
   * ``data_errors`` - The data arrays do not match across products; this is the most serious of all validation errors. The validator should give an idea of which array comparisons failed.
   * ``data_errors: {'type':'growbox_exceeded'...}`` - The variable in question could not be validated, as no area could be identified that is not empty of values.
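
As a rough sketch of how such a report might be consumed (the report layout here is an assumption inferred from the headings above, not the confirmed ``ValidateDatasets`` output):

.. code-block:: python

   # Assumed shape of a metadata report, inferred from the headings above;
   # the real ValidateDatasets output may differ. 'history' is just an
   # example attribute name.
   report = {
       "variables": {"time": {"type": "missing"}},
       "attributes": {"history": {"type": "not_equal"}},
   }

   # Non-fatal metadata issues can be triaged on their 'type' field.
   for section, entries in report.items():
       for name, issue in entries.items():
           print(f"{section}.{name}: {issue['type']}")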

BypassSwitch Options
--------------------

Certain non-fatal errors may be bypassed using the Bypass flag:
::

   Format: -b "D"

   Default: "D"   # Highlighted by a '*'

   "D" - * Skip driver failures - Pipeline tries different options for NetCDF (default).
         - Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError).
   "F" -   Skip scanning (fasttrack) and go straight to compute. Required if running compute
           before scan is attempted.
   "L" -   Skip adding links in compute (download links) - this will be required on ingest.
   "S" -   Skip errors when running a subset within a group. Record the error then move on
           to the next dataset.

Custom Pipeline Errors
----------------------

**A summary of the custom errors that may be encountered while running the pipeline.**

.. automodule:: padocc.core.errors
   :members:
   :show-inheritance:
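
A short sketch of how one of these errors might be handled in user code (this assumes ``KerchunkDriverFatalError``, named in the BypassSwitch section above, is importable from ``padocc.core.errors``; the surrounding call is a hypothetical stand-in):

.. code-block:: python

   # Sketch only: assumes KerchunkDriverFatalError (named in the
   # BypassSwitch section above) is importable from padocc.core.errors.
   from padocc.core.errors import KerchunkDriverFatalError

   def process_project():
       """Hypothetical stand-in for a compute-phase call on one project."""

   try:
       process_project()
   except KerchunkDriverFatalError:
       # All NetCDF driver options failed; the "D" bypass switch above
       # skips individual driver failures but not this fatal case.
       print("No working driver for this dataset.")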
8 changes: 0 additions & 8 deletions docs/source/errors.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/execution-source.rst

This file was deleted.
