
1.3 revisions #47

Merged · 35 commits · Jan 8, 2025
Commits
763767f  Test typing rendering for documentation (dwest77a, Dec 20, 2024)
b26f7a2  Removed import issue (dwest77a, Dec 20, 2024)
0def9e2  Fixed issue with FalseLogger (dwest77a, Dec 20, 2024)
c1245aa  Switched to iterator over generator (dwest77a, Dec 20, 2024)
3978bf1  Updated all auto docs (dwest77a, Dec 20, 2024)
50fb0b3  Updated index (dwest77a, Dec 20, 2024)
f32598c  Fixed issues with docs build (dwest77a, Dec 20, 2024)
5359735  Added initial Shepard configurations (dwest77a, Dec 20, 2024)
17e2390  Syntax changes and added shepard entrypoint (dwest77a, Dec 20, 2024)
7c7775b  Syntax fixes (dwest77a, Dec 20, 2024)
ea20785  Total overhaul of filehandlers for mypy consistency (dwest77a, Dec 23, 2024)
2881e6e  Updated all tests for filehandler refactors (dwest77a, Dec 23, 2024)
5b904e0  Major refactoring with filehandlers (dwest77a, Dec 23, 2024)
0a8149d  Refactorings due to filehandlers (dwest77a, Dec 23, 2024)
560e007  Minor edits (dwest77a, Dec 23, 2024)
05b25df  Added group merge/unmerge methods - pre-release (dwest77a, Dec 23, 2024)
3a4dbc1  Reordered docs pages (dwest77a, Dec 23, 2024)
a83a3d6  Added introductory documentation pages (dwest77a, Dec 23, 2024)
4cc1123  Added basic descriptions for several sections (dwest77a, Dec 23, 2024)
98e7872  Added brief messages about each phase (dwest77a, Dec 23, 2024)
3bbfdb0  Initial commit of shepard documentation (dwest77a, Dec 23, 2024)
28cd361  Various syntax changes in documentation, merged intro and phases, upd… (dwest77a, Jan 6, 2025)
6f9dec6  Minor syntax changes, removed old features of bypass switch no longer… (dwest77a, Jan 6, 2025)
d3d7202  Added CLI, various changes to enable CLI script (dwest77a, Jan 6, 2025)
39cb070  Added phase map (dwest77a, Jan 6, 2025)
07c495a  Added images for inspiration section (dwest77a, Jan 6, 2025)
3c7e2db  Added main padocc cli (dwest77a, Jan 6, 2025)
ede143b  Added temp file (dwest77a, Jan 7, 2025)
99b059e  updated temp file (dwest77a, Jan 7, 2025)
48e1d32  Various revisions (dwest77a, Jan 7, 2025)
0e52a63  Initial commit for lakes dimensional correction, also updated groups … (dwest77a, Jan 7, 2025)
2c40571  Removed unwanted committed files (dwest77a, Jan 7, 2025)
6f34f6e  Added release notes document (dwest77a, Jan 8, 2025)
449310c  Minor syntax changes (dwest77a, Jan 8, 2025)
5109cac  Updated validate test, removal of subset bypass as a bare parameter (dwest77a, Jan 8, 2025)
Binary file added docs/source/_images/CedaArchive0824.png
Binary file added docs/source/_images/DataDistributed.png
6 changes: 0 additions & 6 deletions docs/source/allocation.rst

This file was deleted.

136 changes: 0 additions & 136 deletions docs/source/assess-overview.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/source/assess.rst

This file was deleted.

77 changes: 6 additions & 71 deletions docs/source/cci_water.rst
@@ -8,9 +8,10 @@ A new *group* is created within the pipeline using the ``init`` operation as follows:

::

python group_run.py init <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v
padocc init -G <my_new_group> -i extensions/example_water_vapour/water_vapour.csv -v

.. note::

Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the ``-v`` *verbose* flag to indicate we want to see the ``[INFO]`` messages emitted by the pipeline. Adding a second ``v`` (``-vv``) would also show ``[DEBUG]`` messages.
Note also that the ``init`` phase always runs as a serial process, since it only involves creating the directories and config files required by the pipeline.

@@ -56,28 +57,6 @@ Ok great, we've initialised the pipeline for our new group! Here's a summary diagram:
- validate.log
- status_log.csv

For peace of mind, and to check you understand the pipeline assessor tool, we suggest running this command next:

::

python assess.py progress my_new_group

Upon which your output should look something like this:

.. code-block:: console

Group: my_new_group
Total Codes: 4

Pipeline Current:

init : 4 [100.%] (Variety: 1)
- complete : 4

Pipeline Complete:

complete : 0 [0.0 %]

All 4 of our datasets were initialised successfully; no datasets have yet completed the pipeline.

The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline.
@@ -88,52 +67,8 @@

.. code-block:: console

python group_run.py scan my_new_group
python group_run.py compute my_new_group
python group_run.py validate my_new_group

A more complex example of the errors you might encounter while running the pipeline can be found below:

.. code-block:: console

Group: cci_group_v1
Total Codes: 361

Pipeline Current:

compute : 21 [5.8 %] (Variety: 2)
- complete : 20
- KeyError 'refs' : 1

Pipeline Complete:

complete : 185 [51.2%]

blacklist : 155 [42.9%] (Variety: 8)
- NonKerchunkable : 50
- PartialDriver : 3
- PartialDriverFail : 5
- ExhaustedMemoryLimit : 56
- ExhaustedTimeLimit : 18
- ExhaustedTimeLimit* : 1
- ValidationMemoryLimit : 21
- ScipyDimIssue : 1

In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (see blacklisting in the Assessor Tool section).
Of the remaining 21 datasets, 20 have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset we can use the command below:

.. code-block:: console

python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E

This will match our ``compute``-phase error with that message, and the ``-E`` flag will give us the whole error log from that run. This may be enough to assess and fix the issue; otherwise, the assessor will suggest a command to rerun just this dataset:

.. code-block:: console

Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - <class 'KeyError'>'refs'
Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d

This rerun command includes several flags. The most important here is the ``-G`` group flag: since we are now using the ``single_run`` script, we must specify the group. The ``-d`` dryrun flag means no output files are produced, which is useful when we may need to test and rerun several times.


padocc scan -G my_new_group
padocc compute -G my_new_group
padocc validate -G my_new_group

This section will be updated for the full release of v1.3 with additional content relating to the assessor tool.
7 changes: 0 additions & 7 deletions docs/source/compute.rst

This file was deleted.

65 changes: 65 additions & 0 deletions docs/source/deep_dive.rst
@@ -0,0 +1,65 @@
===================================
A Deeper Dive into PADOCC Mechanics
===================================

Revision Numbers
----------------

The PADOCC revision numbers for each product are auto-generated using the following rules.

* All projects begin with the revision number ``1.1``.
* The first number denotes major updates to the product, for instance where a data source file has been replaced.
* The second number denotes minor changes like alterations to attributes and metadata.
* The letters prefixed to the revision numbers identify the file type for the product. For example a zarr store has the letter ``z`` applied, while a Kerchunk (parquet) store has ``kp``.
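
As an illustrative sketch only (the helper functions below are hypothetical, not part of the padocc API), revision strings following those rules could be composed and bumped like so, assuming a major bump resets the minor number:

.. code-block:: python

    def make_revision(file_type: str, major: int = 1, minor: int = 1) -> str:
        """Compose a revision label, e.g. 'kp1.1' or 'z2.3'."""
        return f"{file_type}{major}.{minor}"

    def bump(revision: str, major: bool = False) -> str:
        """Bump the minor number, or the major number when a source file is replaced."""
        prefix = revision.rstrip("0123456789.")  # file-type letters, e.g. 'kp' or 'z'
        maj, min_ = map(int, revision[len(prefix):].split("."))
        return f"{prefix}{maj + 1}.1" if major else f"{prefix}{maj}.{min_ + 1}"

    print(bump("kp1.1"))             # kp1.2 (minor metadata change)
    print(bump("z1.4", major=True))  # z2.1  (data source file replaced)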

The Validation Report
---------------------

The ``ValidateDatasets`` class produces a validation report for both data and metadata validations.
This is designed to be fairly simple to interpret, while still being machine-readable.
The headings that may be found in the report have the following meanings:

1. Metadata Report (with Examples)
These are considered non-fatal errors that will need either a minor correction or can be ignored.

* ``variables.time: {'type':'missing'...}`` - The time variable is missing from the specified product.
* ``dims.all_dims: {'type':'order'}`` - The ordering of dimensions is not consistent across products.
* ``attributes {'type':'ignore'...}`` - Attributes that have been ignored. These may have already been edited.
* ``attributes {'type':'missing'...}`` - Attributes that are missing from the specified product file.
* ``attributes {'type':'not_equal'...}`` - Attributes that are not equal across products.

2. Data Report
These are considered **fatal** errors that need a major correction or possibly a fix to the pipeline itself.

* ``size_errors`` - The size of the array is not consistent between products.
* ``dim_errors`` - Arrays have inconsistent dimensions (where not ignored).
* ``dim_size_errors`` - The dimensions are consistent for a variable but their sizes are not.
* ``data_errors`` - The data arrays do not match across products; this is the most serious of all validation errors.
  The validator should give an idea of which array comparisons failed.
* ``data_errors: {'type':'growbox_exceeded'...}`` - The variable in question could not be validated because no region of the array free of empty values could be identified.
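
Since the report is machine-readable, a typical workflow is to load it and treat anything under the data section as fatal. The sketch below assumes a nested-dictionary layout inferred from the headings above; the exact structure produced by ``ValidateDatasets`` may differ:

.. code-block:: python

    # Hypothetical report layout, for illustration only.
    report = {
        "metadata": {
            "variables.time": {"type": "missing"},
            "attributes": [{"type": "not_equal", "attr": "history"}],
        },
        "data": {
            "data_errors": [{"type": "growbox_exceeded", "variable": "tas"}],
        },
    }

    # Metadata entries are non-fatal; anything under 'data' needs attention.
    for heading, errors in report.get("data", {}).items():
        print(f"FATAL {heading}: {errors}")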

BypassSwitch Options
--------------------

Certain non-fatal errors may be bypassed using the Bypass flag:
::

Format: -b "D"

Default: "D" # Highlighted by a '*'

"D" - * Skip driver failures - Pipeline tries different options for NetCDF (default).
- Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError).
"F" - Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan
is attempted.
"L" - Skip adding links in compute (download links) - this will be required on ingest.
"S" - Skip errors when running a subset within a group. Record the error then move onto the next dataset.

Custom Pipeline Errors
----------------------

**A summary of the custom errors that may be encountered while running the pipeline.**

.. automodule:: padocc.core.errors
:members:
:show-inheritance:
8 changes: 0 additions & 8 deletions docs/source/errors.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/source/execution-source.rst

This file was deleted.
