diff --git a/docs/source/_images/CedaArchive0824.png b/docs/source/_images/CedaArchive0824.png new file mode 100644 index 0000000..facf566 Binary files /dev/null and b/docs/source/_images/CedaArchive0824.png differ diff --git a/docs/source/_images/DataDistributed.png b/docs/source/_images/DataDistributed.png new file mode 100644 index 0000000..688a7f2 Binary files /dev/null and b/docs/source/_images/DataDistributed.png differ diff --git a/docs/source/allocation.rst b/docs/source/allocation.rst deleted file mode 100644 index 3c8cb97..0000000 --- a/docs/source/allocation.rst +++ /dev/null @@ -1,6 +0,0 @@ -================= -Allocation Module -================= - -.. automodule:: pipeline.allocate - :members: \ No newline at end of file diff --git a/docs/source/assess-overview.rst b/docs/source/assess-overview.rst deleted file mode 100644 index 9d8368f..0000000 --- a/docs/source/assess-overview.rst +++ /dev/null @@ -1,136 +0,0 @@ -Assessor Tool -============= - -The assessor script ```assess.py``` is an all-purpose pipeline checking tool which can be used to assess: - - The current status of all datasets within a given group in the pipeline (which phase each dataset currently sits in) - - The errors/outputs associated with previous job runs. - - Specific logs from datasets which are presenting a specific type of error. - -An example command to run the assessor tool can be found below: -:: - - python assess.py - -Where the operation can be one of the below options: - - Progress: Get a general overview of the pipeline; how many datasets have completed or are stuck on each phase. - - Display: Display a specific type of information about the pipeline (blacklisted codes, datasets with virtual dimensions or using parquet) - - Match: Match specific attributes within the ``detail-cfg.json`` file (and save to a new ID). - - Summarise: Get an assessment of the data processed for this group - - Upgrade: Update the version of a set of kerchunk files (includes internal metadata standard updates (timestamped, reason provided). - - Cleanup: Remove cached files as part of group runs (errs, outs, repeat_ids etc.) - -1. Progress of a group ----------------------- - -To see the general status of the pipeline for a given group: -:: - - python assess.py progress - -An example output from this command can be seen below: -:: - - Group: cci_group_v1 - Total Codes: 361 - - scan : 1 [0.3 %] (Variety: 1) - - Complete : 1 - - complete : 185 [51.2%] (Variety: 1) - - complete : 185 - - unknown : 21 [5.8 %] (Variety: 1) - - no data : 21 - - blacklist : 162 [44.9%] (Variety: 7) - - NonKerchunkable : 50 - - PartialDriver : 3 - - PartialDriverFail : 5 - - ExhaustedMemoryLimit : 64 - - ExhaustedTimeLimit : 18 - - ExhaustedTimeLimit* : 1 - - ValidationMemoryLimit : 21 - -In this case there are 185 datasets that have completed the pipeline with 1 left to be scanned. The 21 unknowns have no log file so there is no information on these. This will be resolved in later versions where a `seek` function will automatically run when checking the progress, to fix gaps in the logs for missing datasets. - - -An example use case is to write out all datasets that require scanning to a new label (repeat_label): -:: - - python assess.py progress -p scan -r -W - - -The last flag ```-W``` is required when writing an output file from this program, otherwise the program will dryrun and produce no files. - -1.1. 
Checking errors --------------------- -Check what repeat labels are available already using: -:: - - python assess.py display -s labels - -For listing the status of all datasets from a previous repeat idL -:: - - python assess.py progress -r - - -For selecting a specific type of error (-e) and examine the full log for each example (-E) -:: - - python assess.py progress -r -e "type_of_error" -p scan -E - -Following from this, you may want to rerun the pipeline for just one type of error previously found: -:: - - python assess.py progress -r -e "type_of_error" -p scan -n -W - -.. Note:: - - If you are looking at a specific repeat ID, you can forego the phase (-p) flag, since it is expected this set would appear in the same phase anyway. - The (-W) write flag is also required for any commands that would output data to a file. If the file already exists, you will need to specify an override - level (-O or -OO) for merging or overwriting existing data (project code lists) respectively. - -2. Display options --------------------------- - -Check how many of the datasets in a group have virtual dimensions -:: - - python assess.py display -s virtuals - -3. Match Special Attributes ---------------------------- - -Find the project codes where a specific attribute in ``detail-cfg.json`` matches some given value -:: - - python assess.py match -c "links_added:False" - -4. Summarise data ------------------ - -Summarise the Native/Kerchunk data generated (thus far) for an existing group. -:: - - python assess.py summarise - -5. Upgrade Kerchunk version ---------------------------- - -Upgrade all kerchunk files (compute-validate stages) to a new version for a given reason. This is the 'formal' way of updating the version. -:: - - python assess.py upgrade -r -R "Reason for upgrade" -W -U "krX.X" # New version id - -6. Cleanup ----------- - -"Clean" or remove specific types of files: - - Errors/Outputs in the correct places - - "labels" i.e repeat_ids (including allocations and bands under that repeat_id) - -In the below example we will remove every created ``repeat_id`` (equivalent terminology to 'label') except for ``main``. -:: - - python assess.py cleanup -c labels diff --git a/docs/source/assess.rst b/docs/source/assess.rst deleted file mode 100644 index c508e61..0000000 --- a/docs/source/assess.rst +++ /dev/null @@ -1,5 +0,0 @@ -Assess Module -============= - -.. automodule:: assess - :members: \ No newline at end of file diff --git a/docs/source/cci_water.rst b/docs/source/cci_water.rst index 357f279..94ad2bd 100644 --- a/docs/source/cci_water.rst +++ b/docs/source/cci_water.rst @@ -8,9 +8,10 @@ A new *group* is created within the pipeline using the ``init`` operation as fol :: - python group_run.py init -i extensions/example_water_vapour/water_vapour.csv -v + padocc init -G -i extensions/example_water_vapour/water_vapour.csv -v .. note:: + Multiple flag options are available throughout the pipeline for more specific operations and methods. In the above case we have used the (-v) *verbose* flag to indicate we want to see the ``[INFO]`` messages put out by the pipeline. Adding a second (v) would also show ``[DEBUG]`` messages. Also the ``init`` phase is always run as a serial process since it just involves creating the directories and config files required by the pipeline. @@ -56,28 +57,6 @@ Ok great, we've initialised the pipeline for our new group! 
Here's a summary dia - validate.log - status_log.csv -For peace of mind and to check you understand the pipeline assessor tool we would suggest running this command next: - -:: - - python assess.py progress my_new_group - -Upon which your output should look something like this: - -.. code-block:: console - - Group: my_new_group - Total Codes: 4 - - Pipeline Current: - - init : 4 [100.%] (Variety: 1) - - complete : 4 - - Pipeline Complete: - - complete : 0 [0.0 %] - All 4 of our datasets were initialised successfully, no datasets are complete through the pipeline yet. The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which would complete the pipeline. @@ -88,52 +67,8 @@ The next steps are to ``scan``, ``compute``, and ``validate`` the datasets which .. code-block:: console - python group_run.py scan my_new_group - python group_run.py compute my_new_group - python group_run.py validate my_new_group - -An more complex example of what you might see while running the pipeline in terms of errors encountered can be found below: - -.. code-block:: console - - Group: cci_group_v1 - Total Codes: 361 - - Pipeline Current: - - compute : 21 [5.8 %] (Variety: 2) - - complete : 20 - - KeyError 'refs' : 1 - - Pipeline Complete: - - complete : 185 [51.2%] - - blacklist : 155 [42.9%] (Variety: 8) - - NonKerchunkable : 50 - - PartialDriver : 3 - - PartialDriverFail : 5 - - ExhaustedMemoryLimit : 56 - - ExhaustedTimeLimit : 18 - - ExhaustedTimeLimit* : 1 - - ValidationMemoryLimit : 21 - - ScipyDimIssue : 1 - -In this example ``cci_group_v1`` group, 185 of the datasets have completed the pipeline, while 155 have been excluded (See blacklisting in the Assessor Tool section). -Of the remaining 21 datasets, 20 of them have completed the ``compute`` phase and now need to be run through ``validate``, but one encountered a KeyError which needs to be inspected. To view the log for this dataset we can use the command below: - -.. code-block:: console - - python assess.py progress cci_group_v1 -e "KeyError 'refs'" -p compute -E - -This will match with our ``compute``-phase error with that message, and the (-E) flag will give us the whole error log from that run. This may be enough to assess and fix the issue but otherwise, to rerun just this dataset a rerun command will be suggested by the assessor: - -.. code-block:: console - - Project Code: 201601-201612-ESACCI-L4_FIRE-BA-MSI-fv1.1 - 'refs' - Rerun suggested command: python single_run.py compute 218 -G cci_group_v1 -vv -d - -This rerun command has several flags included, the most importand here is the (-G) group flag, since we need to use the ``single_run`` script so now need to specify the group. The (-d) dryrun flag will simply mean we are not producing any output files since we may need to test and rerun several times. - - + padocc scan -G my_new_group + padocc compute -G my_new_group + padocc validate -G my_new_group +This section will be updated for the full release of v1.3 with additional content relating to the assessor tool. \ No newline at end of file diff --git a/docs/source/compute.rst b/docs/source/compute.rst deleted file mode 100644 index d5c6fc3..0000000 --- a/docs/source/compute.rst +++ /dev/null @@ -1,7 +0,0 @@ -============== -Compute Module -============== - -.. 
automodule:: pipeline.compute - :members: - :show-inheritance: \ No newline at end of file diff --git a/docs/source/deep_dive.rst b/docs/source/deep_dive.rst new file mode 100644 index 0000000..5ea05bf --- /dev/null +++ b/docs/source/deep_dive.rst @@ -0,0 +1,65 @@ +=================================== +A Deeper Dive into PADOCC Mechanics +=================================== + +Revision Numbers +---------------- + +The PADOCC revision numbers for each product are auto-generated using the following rules. + + * All projects begin with the revision number ``1.1``. + * The first number denotes major updates to the product, for instance where a data source file has been replaced. + * The second number denotes minor changes like alterations to attributes and metadata. + * The letters prefixed to the revision numbers identify the file type for the product. For example a zarr store has the letter ``z`` applied, while a Kerchunk (parquet) store has ``kp``. + +The Validation Report +--------------------- + +The ``ValidateDatasets`` class produces a validation report for both data and metadata validations. +This is designed to be fairly simple to interpret, while still being machine-readable. +The following headings which may be found in the report have the following meanings: + +1. Metadata Report (with Examples) +These are considered non-fatal errors that will need either a minor correction or can be ignored. + +* ``variables.time: {'type':'missing'...}`` - The time variable is missing from the specified product. +* ``dims.all_dims: {'type':'order'}`` - The ordering of dimensions is not consistent across products. +* ``attributes {'type':'ignore'...}`` - Attributes that have been ignored. These may have already been edited. +* ``attributes {'type':'missing'...}`` - Attributes that are missing from the specified product file. +* ``attributes {'type':'not_equal'...}`` - Attributes that are not equal across products. + +2. Data Report +These are considered **fatal** errors that need a major correction or possibly a fix to the pipeline itself. + +* ``size_errors`` - The size of the array is not consistent between products. +* ``dim_errors`` - Arrays have inconsistent dimensions (where not ignored). +* ``dim_size_errors`` - The dimensions are consistent for a variable but their sizes are not. +* ``data_errors`` - The data arrays do not match across products, this is the most fatal of all validation errors. +The validator should give an idea of which array comparisons failed. +* ``data_errors: {'type':'growbox_exceeded'...}`` - The variable in question could not be validated as no area could be identified that is not empty of values. + +BypassSwitch Options +-------------------- + +Certain non-fatal errors may be bypassed using the Bypass flag: +:: + + Format: -b "D" + + Default: "D" # Highlighted by a '*' + + "D" - * Skip driver failures - Pipeline tries different options for NetCDF (default). + - Only need to turn this skip off if all drivers fail (KerchunkDriverFatalError). + "F" - Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan + is attempted. + "L" - Skip adding links in compute (download links) - this will be required on ingest. + "S" - Skip errors when running a subset within a group. Record the error then move onto the next dataset. + +Custom Pipeline Errors +---------------------- + +**A summary of the custom errors that are experienced through running the pipeline.** + +.. 
automodule:: padocc.core.errors + :members: + :show-inheritance: \ No newline at end of file diff --git a/docs/source/errors.rst b/docs/source/errors.rst deleted file mode 100644 index 1a379fe..0000000 --- a/docs/source/errors.rst +++ /dev/null @@ -1,8 +0,0 @@ -Custom Pipeline Errors -====================== - -**A summary of the custom errors that are experienced through running the pipeline.** - -.. automodule:: pipeline.errors - :members: - :show-inheritance: \ No newline at end of file diff --git a/docs/source/execution-source.rst b/docs/source/execution-source.rst deleted file mode 100644 index 399113e..0000000 --- a/docs/source/execution-source.rst +++ /dev/null @@ -1,8 +0,0 @@ -Pipeline Execution -================== - -.. automodule:: group_run - :members: - -.. automodule:: single_run - :members: \ No newline at end of file diff --git a/docs/source/execution.rst b/docs/source/execution.rst deleted file mode 100644 index b5d396c..0000000 --- a/docs/source/execution.rst +++ /dev/null @@ -1,192 +0,0 @@ -Pipeline Flags -============== - -==================== -BypassSwitch Options -==================== - -Certain non-fatal errors may be bypassed using the Bypass flag: -:: - - Format: -b "DBSCR" - - Default: "DBSCR" # Highlighted by a '*' - - "D" - * Skip driver failures - Pipeline tries different options for NetCDF (default). - - Only need to turn this skip off if all drivers fail (KerchunkFatalDriverError). - "B" - * Skip Box compute errors. - "S" - * Skip Soft fails (NaN-only boxes in validation) (default). - "C" - * Skip calculation (data sum) errors (time array typically cannot be summed) (default). - "X" - Skip initial shape errors, by attempting XKShape tolerance method (special case.) - "R" - * Skip reporting to status_log which becomes visible with assessor. Reporting is skipped - by default in single_run.py but overridden when using group_run.py so any serial - testing does not by default report the error experienced to the status log for that project. - "F" - Skip scanning (fasttrack) and go straight to compute. Required if running compute before scan - is attempted. - -======================== -Single Dataset Operation -======================== - -Run all single-dataset processes with the ``single-run.py`` script. - -.. code-block:: python - - usage: single_run.py [-h] [-f] [-v] [-d] [-Q] [-B] [-A] [-w WORKDIR] [-g GROUPDIR] [-p PROJ_DIR] - [-t TIME_ALLOWED] [-G GROUPID] [-M MEMORY] [-s SUBSET] - [-r REPEAT_ID] [-b BYPASS] [-n NEW_VERSION] [-m MODE] [-O OVERRIDE_TYPE] - phase proj_code - - Run a pipeline step for a single dataset - - positional arguments: - phase Phase of the pipeline to initiate - proj_code Project identifier code - - options: - -h, --help show this help message and exit - -f, --forceful Force overwrite of steps if previously done - -v, --verbose Print helpful statements while running - -d, --dryrun Perform dry-run (i.e no new files/dirs created) - -Q, --quality Create refs from scratch (no loading), use all NetCDF files in validation - -B, --backtrack Backtrack to previous position, remove files that would be created in this job. 
- -A, --alloc-bins Use binpacking for allocations (otherwise will use banding) - - -w WORKDIR, --workdir WORKDIR - Working directory for pipeline - -g GROUPDIR, --groupdir GROUPDIR - Group directory for pipeline - -p PROJ_DIR, --proj_dir PROJ_DIR - Project directory for pipeline - -t TIME_ALLOWED, --time-allowed TIME_ALLOWED - Time limit for this job - -G GROUPID, --groupID GROUPID - Group identifier label - -M MEMORY, --memory MEMORY - Memory allocation for this job (i.e "2G" for 2GB) - -s SUBSET, --subset SUBSET - Size of subset within group - -r REPEAT_ID, --repeat_id REPEAT_ID - Repeat id (1 if first time running, _ otherwise) - -b BYPASS, --bypass-errs BYPASS - Bypass switch options: See Above - - -n NEW_VERSION, --new_version NEW_VERSION - If present, create a new version - -m MODE, --mode MODE Print or record information (log or std) - -O OVERRIDE_TYPE, --override_type OVERRIDE_TYPE - Specify cloud-format output type, overrides any determination by pipeline. - -============================= -Multi-Dataset Group Operation -============================= - -Run all multi-dataset group processes within the pipeline using the ``group_run.py`` script. - -.. code-block:: python - - usage: group_run.py [-h] [-S SOURCE] [-e VENVPATH] [-i INPUT] [-A] [--allow-band-increase] [-f] [-v] [-d] [-Q] [-b BYPASS] [-B] [-w WORKDIR] [-g GROUPDIR] - [-p PROJ_DIR] [-G GROUPID] [-t TIME_ALLOWED] [-M MEMORY] [-s SUBSET] [-r REPEAT_ID] [-n NEW_VERSION] [-m MODE] - phase groupID - - Run a pipeline step for a group of datasets - - positional arguments: - phase Phase of the pipeline to initiate - groupID Group identifier code - - options: - -h, --help show this help message and exit - -S SOURCE, --source SOURCE - Path to directory containing master scripts (this one) - -e VENVPATH, --environ VENVPATH - Path to virtual (e)nvironment (excludes /bin/activate) - -i INPUT, --input INPUT - input file (for init phase) - -A, --alloc-bins input file (for init phase) - - --allow-band-increase - Allow automatic banding increase relative to previous runs. - - -f, --forceful Force overwrite of steps if previously done - -v, --verbose Print helpful statements while running - -d, --dryrun Perform dry-run (i.e no new files/dirs created) - -Q, --quality Quality assured checks - thorough run - - -b BYPASS, --bypass-errs BYPASS - Bypass switch options: See Above - - -B, --backtrack Backtrack to previous position, remove files that would be created in this job. - -w WORKDIR, --workdir WORKDIR - Working directory for pipeline - -g GROUPDIR, --groupdir GROUPDIR - Group directory for pipeline - -p PROJ_DIR, --proj_dir PROJ_DIR - Project directory for pipeline - -G GROUPID, --groupID GROUPID - Group identifier label - -t TIME_ALLOWED, --time-allowed TIME_ALLOWED - Time limit for this job - -M MEMORY, --memory MEMORY - Memory allocation for this job (i.e "2G" for 2GB) - -s SUBSET, --subset SUBSET - Size of subset within group - -r REPEAT_ID, --repeat_id REPEAT_ID - Repeat id (main if first time running, _ otherwise) - -n NEW_VERSION, --new_version NEW_VERSION - If present, create a new version - -m MODE, --mode MODE Print or record information (log or std) - -======================= -Assessor Tool Operation -======================= - -Perform assessments of groups within the pipeline using the ``assess.py`` script. - -.. 
code-block:: python - - usage: assess.py [-h] [-B] [-R REASON] [-s OPTION] [-c CLEANUP] [-U UPGRADE] [-l] [-j JOBID] [-p PHASE] [-r REPEAT_ID] [-n NEW_ID] [-N NUMBERS] [-e ERROR] [-E] [-W] - [-O] [-w WORKDIR] [-g GROUPDIR] [-v] [-m MODE] - operation groupID - - Run a pipeline step for a single dataset - - positional arguments: - operation Operation to perform - choose from ['progress', 'blacklist', 'upgrade', 'summarise', 'display', 'cleanup', 'match', - 'status_log'] - groupID Group identifier code for the group on which to operate. - - options: - -h, --help show this help message and exit - -B, --blacklist Use when saving project codes to the blacklist - - -R REASON, --reason REASON - Provide the reason for handling project codes when saving to the blacklist or upgrading - -s OPTION, --show-opts OPTION - Show options for jobids, labels, also used in matching and status_log. - -c CLEANUP, --clean-up CLEANUP - Clean up group directory of errors/outputs/labels - -U UPGRADE, --upgrade UPGRADE - Upgrade to new version - -l, --long Show long error message (no concatenation) - -j JOBID, --jobid JOBID - Identifier of job to inspect - -p PHASE, --phase PHASE - Pipeline phase to inspect - -r REPEAT_ID, --repeat_id REPEAT_ID - Inspect an existing ID for errors - -n NEW_ID, --new_id NEW_ID - Create a new repeat ID, specify selection of codes by phase, error etc. - -N NUMBERS, --numbers NUMBERS - Show project code IDs for lists of codes less than the N value specified here. - -e ERROR, --error ERROR - Inspect error of a specific type - -E, --examine Examine log outputs individually. - -W, --write Write outputs to files - -O, --overwrite Force overwrite of steps if previously done - -w WORKDIR, --workdir WORKDIR - Working directory for pipeline - -g GROUPDIR, --groupdir GROUPDIR - Group directory for pipeline - -v, --verbose Print helpful statements while running - -m MODE, --mode MODE Print or record information (log or std) \ No newline at end of file diff --git a/docs/source/extras.rst b/docs/source/extras.rst deleted file mode 100644 index de6e1dc..0000000 --- a/docs/source/extras.rst +++ /dev/null @@ -1,18 +0,0 @@ -Padocc Utility Scripts -====================== - -========= -Utilities -========= - -.. automodule:: pipeline.utils - :members: - :show-inheritance: - -======= -Logging -======= - -.. automodule:: pipeline.logs - :members: - :show-inheritance: \ No newline at end of file diff --git a/docs/source/group_source.rst b/docs/source/group_source.rst new file mode 100644 index 0000000..7f493e6 --- /dev/null +++ b/docs/source/group_source.rst @@ -0,0 +1,11 @@ +======================================== +GroupOperation Core and Mixin Behaviours +======================================== + +Source code for group operations and mixin behaviours. + +.. automodule:: padocc.operations.group + :members: + +.. automodule:: padocc.operations.mixins + :members: \ No newline at end of file diff --git a/docs/source/groups.rst b/docs/source/groups.rst new file mode 100644 index 0000000..8163057 --- /dev/null +++ b/docs/source/groups.rst @@ -0,0 +1,87 @@ +Groups in PADOCC +================ + +The advantage of using PADOCC over other tools for creating cloud-format files is the scalability built-in, with parallelisation and deployment in mind. +PADOCC allows the creation of groups of datasets, each with N source files, that can be operated upon as a single entity. +The operation can be applied to all or a subset of the datasets within the group with relative ease. 
Here we outline some basic functionality of the ``GroupOperation``. +See the source documentation page for more detail. + +Instantiating a Group +--------------------- + +A group is most easily created using a python terminal or Jupyter notebook, with a similar form to the below. + +.. code-block:: python + + from padocc.operations import GroupOperation + + my_group = GroupOperation( + 'mygroup', + workdir='path/to/dir', + verbose=1 + ) + +At the point of defining the group, all required files and folders are created on the file system with default +or initial values for some parameters. Further processing steps which incur changes to parameters will only be saved +upon completion of an operation. If in doubt, all files can be saved with current values using ``.save_files()`` +for the group. + +This is a blank group with no attached parameters, so the initial values in all created files will be blank or templated +with default values. To fill the group with actual data, we need to initialise from an input file. + +.. note:: + + In the future it will be possible to instantiate from other file types or records (e.g STAC) but for now the accepted + format is a csv file, where each entry fits the format: + ``project_code, /file/pattern/**/*.nc, /path/to/updates.json or empty, /path/to/removals.json or empty`` + +Initialisation from a File +-------------------------- + +A group can be initialised from a CSV file using: + +.. code-block:: python + + my_group.init_from_file('/path/to/csv.csv') + +Substitutions can be provided here if necessary, of the format: + +.. code-block:: python + + substitutions = { + 'init_file': { + 'swap/this/for':'this' + }, + 'dataset_file': { + 'swap/that/for':'that' + }, + 'datasets': { + 'swap/that/for':'these' + }, + } + +Where the respective sections relate to the following: + - Init file: Substitutions to the path to the provided CSV file + - Dataset file: Substitutions in the CSV file, specifically with the paths to ``.txt`` files or patterns. + - Datasets: Substitutions in the ``.txt`` file that lists each individual file in the dataset. + +Applying an operation +--------------------- + +Now we have an initialised group, in the same group instance we can apply an operation. + +.. code-block:: python + + mygroup.run('scan', mode='kerchunk') + +The operation/phase being applied is a positional argument and must be one of ``scan``, ``compute`` or ``validate``. +(``ingest/catalog`` may be added with the full version 1.3). There are also several keyword arguments that can be applied here: + - mode: The format to use for the operation (default is Kerchunk) + - repeat_id: If subsets have been produced for this group, use the subset ID, otherwise this defaults to ``main``. + - proj_code: For running a single project code within the group instead of all groups. + - subset: Used in combination with project code, if both are set they must be integers where the group is divided into ``subset`` sections, and this operation is concerned with the nth one given by ``proj_code`` which is now an integer. 
+ - bypass: BypassSwitch object for bypassing certain errors (see the Deep Dive section for more details) + +Merging or Unmerging +-------------------- +**currently in development - alpha release** \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index ad097fc..f384603 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -6,7 +6,7 @@ PADOCC - User Documentation ============================ -**padocc** (Pipeline to Aggregate Data for Optimised Cloud Capabilites) is a Python package (formerly **kerchunk-builder**) for aggregating data to enable methods of access for cloud-based applications. +**padocc** (Pipeline to Aggregate Data for Optimised Cloud Capabilites) is a Python package for aggregating data to enable methods of access for cloud-based applications. The pipeline makes it easy to generate data-aggregated access patterns in the form of Reference Files or Cloud Formats across different datasets simultaneously with validation steps to ensure the outputs are correct. @@ -14,49 +14,50 @@ Vast amounts of archival data in a variety of formats can be processed using the Currently supported input file formats: - NetCDF/HDF - - GeoTiff (**coming soon**) - - GRIB (**coming soon**) + - GeoTiff + - GRIB - MetOffice (**future**) -*padocc* is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr. +*padocc* is capable of generating both reference files with Kerchunk (JSON or Parquet) and cloud formats like Zarr. +Additionally, PADOCC creates CF-compliant aggregation files as part of the standard workflow, which means you get CFA-netCDF files as standard! +You can find out more about Climate Forecast Aggregations `here `_, these files are denoted with the extension ``.nca`` and can be opened using xarray with ``engine="CFA"`` if you have the ``CFAPyX`` package installed. -The pipeline consists of four central phases, with an additional phase for ingesting/cataloging the produced Kerchunk files. This is not part of the code-base of the pipeline currently but could be added in a future update. +The pipeline consists of three central phases, with an additional phase for ingesting/cataloging the produced Kerchunk files. +These phases represent operations that can be applied across groups of datasets in parallel, depending on the architecture of your system. +For further information around configuring PADOCC for parallel deployment please contact `daniel.westwood@stfc.ac.uk `_. + +The ingestion/cataloging phase is not currently implemented for public use but may be added in a future update. .. image:: _images/pipeline.png - :alt: Stages of the Kerchunk Pipeline + :alt: Stages of the PADOCC workflow .. toctree:: :maxdepth: 1 :caption: Contents: - Introduction + Inspiration + Steps to Run Padocc Getting Started - Example CCI Water Vapour - Padocc Flags/Options - Assessor Tool Overview - Error Codes + Example Operation + A Deep Dive Developer's Guide .. toctree:: :maxdepth: 1 - :caption: CLI Tool Source: + :caption: Operations: - Assessor Source - Control Scripts Source + The Project Operator + The Group Operator + SHEPARD .. toctree:: :maxdepth: 1 - :caption: Pipeline Source: + :caption: PADOCC Source: + + Projects + Groups + Filehandlers, Logs, and Utilities - Initialisation - Scanning - Compute - Validate - Allocations - Utils - - - Indices and Tables ================== @@ -72,9 +73,7 @@ PADOCC was developed at the Centre for Environmental Data Analysis, supported by .. 
image:: _images/ceda.png :width: 300 :alt: CEDA Logo - :width: 300 .. image:: _images/esa.png :width: 300 :alt: ESA Logo - :width: 300 diff --git a/docs/source/init.rst b/docs/source/init.rst deleted file mode 100644 index 1b3bdbc..0000000 --- a/docs/source/init.rst +++ /dev/null @@ -1,6 +0,0 @@ -===================== -Initialisation Module -===================== - -.. automodule:: pipeline.init - :members: \ No newline at end of file diff --git a/docs/source/inspiration.rst b/docs/source/inspiration.rst new file mode 100644 index 0000000..dd31799 --- /dev/null +++ b/docs/source/inspiration.rst @@ -0,0 +1,52 @@ +Inspiration for Cloud Formats and Aggregations +============================================== + +Data Archives +------------- + +The need for cloud-accessible analysis-ready data is increasing due to high demand for cloud-native applications and wider usability of data. +Current archival formats and access methods are insufficient for an increasing number of user needs, especially given the volume of data being +produced by various projects globally. + +.. image:: _images/CedaArchive0824.png + :alt: Contents of the CEDA Archive circa August 2024. + :align: center + +The CEDA-operated JASMIN data analysis facility has a current (2024) data archive of more than 30 Petabytes, with more datasets being ingested +daily. Around 25% of all datasets are in NetCDF/HDF formats which are well-optimised for HPC architecture, but do not typically perform as well +and are not as accessible for cloud-based applications. The standard NetCDF/HDF python readers for example require direct access to the source +files, so are not able to open files stored either in Object Storage (S3) or served via a download service, without first downloading the whole file. + +Distributed Data +---------------- + +The aim of distributed data aggregations is to make the access of data more effective when dealing with these vast libraries of data. +Directly accessing the platforms, like JASMIN, where the data is stored is not necessarily possible for all users, and we would like to avoid the dependence +on download services where GB/TBs of data is copied across multiple sites. Instead, the data may be accessed via a **reference/aggregation file** which provides +the instructions to fetch portions of the data, and applications reading the file are able to load data as needed rather than all at once (Lazy Loading). + +.. image:: _images/DataDistributed.png + :alt: A diagram of how the typcial Distributed Data methods operate. + :align: center + +Formats which provide effective remote data access are typically referred to as **Cloud Optimised Formats** (COFs) like `Zarr `_ and `Kerchunk `_, as in the diagram above. +Zarr stores contain individual **binary-encoded** files for each chunk of data in memory. Opening a Zarr store means accessing the top-level metadata which +informs the application reader how the data is structured. Subsequent calls to load the data will then only load the appropriate memory chunks. Kerchunk +functions similarly as a pointer to chunks of data in another location, however Kerchunk only references the existing chunk structure within NetCDF files, +rather than having each chunk as a separate file. + +PADOCC supports an additional format called CFA, which takes elements of both of these methods. CFA files store references to portions of the array, rather than ranges of bytes of compressed/uncompressed data like with Kerchunk. 
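For example, a CFA-netCDF aggregation file (extension ``.nca``) produced by PADOCC can be opened lazily with xarray via the ``CFA`` engine, provided the ``CFAPyX`` package is installed. The filename below is purely illustrative:

.. code-block:: python

    import xarray as xr

    # Open a CFA-netCDF aggregation produced by PADOCC.
    # Only the aggregation metadata is read at this point; the
    # underlying data chunks are fetched lazily when accessed.
    ds = xr.open_dataset('example_aggregation.nca', engine='CFA')

    print(ds)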
+These references are stored in NetCDF instead of JSON metadata files, which has the advantage of lazily-loaded references from a single file. Read more about CF Aggregations `here `_. + +A workflow for data conversion +------------------------------ + +PADOCC is a tool being actively developed at CEDA to enable large-scale conversion of archival data to some of these new cloud formats, to address the issues above. +Originally created as part of the ESA Climate Change Initiative project, PADOCC is steadily growing into an essential part of the CEDA ingestion pipeline. +New datasets deposited into the CEDA archive will soon be automatically converted by PADOCC and represented as part of the growing STAC catalog collection at CEDA. +Use of the catalogs is facilitated by the `CEDA DataPoint `_ package, which auto-configures for multiple different file types. + +The result of this new data architecture will be that users of CEDA data can discover and access data through our packages much faster and more efficiently than before, +without the need to learn to use many new formats. All the nuances of each dataset are handled by DataPoint, and use products created by PADOCC to facilitate fast search/access +to the data. + diff --git a/docs/source/misc_source.rst b/docs/source/misc_source.rst new file mode 100644 index 0000000..23e6bd9 --- /dev/null +++ b/docs/source/misc_source.rst @@ -0,0 +1,31 @@ +Padocc Filehandlers +====================== + +Filehandlers are an integral component of PADOCC on the filesystem. The filehandlers +connect directly to files within the pipeline directories for different groups and projects +and provide a seamless environment for fetching and saving values to these files. + +Filehandlers act like their respective data-types in most or all methods. +For example the ``JSONFileHandler`` acts like a dictionary, but with extra methods to close and save +the loaded data. Filehandlers can also be easily migrated or removed from the filesystem as part of other +processes. + +.. automodule:: padocc.core.filehandlers + :members: + :show-inheritance: + +========= +Utilities +========= + +.. automodule:: padocc.core.utils + :members: + :show-inheritance: + +======= +Logging +======= + +.. automodule:: padocc.core.logs + :members: + :show-inheritance: \ No newline at end of file diff --git a/docs/source/phases.rst b/docs/source/phases.rst new file mode 100644 index 0000000..d165111 --- /dev/null +++ b/docs/source/phases.rst @@ -0,0 +1,118 @@ +============================= +Phases of the PADOCC Pipeline +============================= + +.. image:: _images/padocc.png + :alt: Stages of the PADOCC workflow + +**Initialisation of a Group of Datasets** + + +The pipeline takes a CSV (or similar) input file from which to instantiate a ``GroupOperation``, which includes: + - creating subdirectories for all associated datasets (projects) + - creating multiple group files with information regarding this group. + +Scan +---- + +The first main phase of the pipeline involves scanning a subset of the native source files to determine certain parameters: + +* Ensure source files are compatible with one of the available converters for Kerchunk/Zarr etc.: +* Calculate expected memory (for job allocation later.) +* Calculate estimated chunk sizes and other values. +* Determine suggested file type, including whether to use JSON or Parquet for Kerchunk references. +* Identify Identical/Concat dims for use in **Compute** phase. 
+* Determine any other specific parameters for the dataset on creation and concatenation. + +A scan operation is performed across a group of datasets/projects to determine specific +properties of each project and some estimates of time/memory allocations that will be +required in later phases. + +The scan phase can be activated with the following: + +.. code:: python + + mygroup = GroupOperation( + 'my-group', + workdir='path/to/pipeline/directory' + ) + # Assuming this group has already been initialised from a file. + + mygroup.run('scan',mode='kerchunk') + +.. automodule:: padocc.phases.scan + :members: + +Compute +------- + +Building the Cloud/reference product for a dataset requires a multi-step process: + +Example for Kerchunk: + +* Create Kerchunk references for each archive-type file. +* Save cache of references for each file prior to concatenation. +* Perform concatenation (abort if concatenation fails, can load cache on second attempt). +* Perform metadata corrections (based on updates and removals specified at the start) +* Add Kerchunk history global attributes (creation time, pipeline version etc.) +* Reconfigure each chunk for remote access (replace local path with https:// download path) + +Computation will either refer to outright data conversion to a new format, +or referencing using one of the Kerchunk drivers to create a reference file. +In either case the computation may be extensive and require processing in the background +or deployment and parallelisation across the group of projects. + +Computation can be executed in serial for a group with the following: + +.. code:: python + + mygroup = GroupOperation( + 'my-group', + workdir='path/to/pipeline/directory' + ) + # Assuming this group has already been initialised and scanned + + mygroup.run('compute',mode='kerchunk') + +.. automodule:: padocc.phases.compute + :members: + :show-inheritance: + +Validate +-------- + +Cloud products must be validated against equivalent Xarray objects from CF Aggregations (CFA) where possible, or otherwise using the original NetCDF as separate Xarray Datasets. + +* Ensure all variables present in original files are present in the cloud products (barring exceptions where metadata has been altered/corrected) +* Ensure array shapes are consistent across the products. +* Ensure data representations are consistent (values in array subsets) + +The validation step produced a two-sectioned report that outlines validation warnings and errors with the data or metadata +around the project. See the documentation on the validation report for more details. + +It is advised to run the validator for all projects in a group to determine any issues +with the conversion process. Some file types or specific arrangements may produce unwanted effects +that result in differences between the original and new representations. This can be identified with the +validator which checks the Xarray representations and identifies differences in both data and metadata. + +.. code:: python + + mygroup = GroupOperation( + 'my-group', + workdir='path/to/pipeline/directory' + ) + # Assuming this group has already been initialised, scanned and computed + + mygroup.run('validate') + + # The validation reports will be saved to the filesystem for each project in this group + # as 'data_report.json' and 'metadata_report.json' + +.. 
automodule:: padocc.phases.validate + :members: + +Next Steps +---------- + +Cloud products that have been validated are moved to a ``complete`` directory with the project code as the name, plus the revision identifier `abX.X` - learn more about this in the deep dive section. +These can then be linked to a catalog or ingested into the CEDA archive where appropriate. diff --git a/docs/source/pipeline-overview.rst b/docs/source/pipeline-overview.rst deleted file mode 100644 index 3d1ddc1..0000000 --- a/docs/source/pipeline-overview.rst +++ /dev/null @@ -1,45 +0,0 @@ -Overview of Pipeline Phases -=========================== - -.. image:: _images/pipeline.png - :alt: Stages of the Kerchunk Pipeline - -**Init (Initialisation) Phase** - -The pipeline takes a CSV (or similar) input file and creates the necessary directories and config files for the pipeline to being running. - -**Scan Phase** - -Second phase of the pipeline involves scanning a subset of the NetCDF/HDF/Tiff files to determine certain parameters: - -* Ensure NetCDF/HDF/Tiff files can be converted successfully using one of the available drivers: -* Calculate expected memory (for job allocation later.) -* Calculate estimated chunk sizes and other values. -* Determine file-type (JSON or Parquet) for final Kerchunk file. -* Identify Identical/Concat dims for use in **Compute** phase. -* Determine any other specific parameters for the dataset on creation and concatenation. - -**Compute Phase** - -Building the Kerchunk file for a dataset requires a multi*step process: - -* Create Kerchunk references for each archive-type file. -* Save cache of references for each file prior to concatenation. -* Perform concatenation (abort if concatenation fails, can load cache on second attempt). -* Perform metadata corrections (based on updates and removals specified at the start) -* Add Kerchunk history global attributes (creation time, pipeline version etc.) -* Reconfigure each chunk for remote access (replace local path with https:// download path) - -**Validation Phase** - -Kerchunk files must be validated against equivalent Xarray objects from the original NetCDF: - -* Ensure all variables present in original files are present in Kerchunk (barring exceptions) -* Ensure array shapes are consistent across Kerchunk/NetCDF -* Ensure data representations are consistent (values in array subsets) - -Several options and switches can be configured for the validation step, see the BypassSwitch class. - -**Next Steps** - -Kerchunk files that have been validated are moved to a ``complete`` directory with the project code as the name, plus the kerchunk revision `krX.X`. These can then be linked to a catalog or ingested into the CEDA archive where appropriate. diff --git a/docs/source/project_source.rst b/docs/source/project_source.rst new file mode 100644 index 0000000..2420811 --- /dev/null +++ b/docs/source/project_source.rst @@ -0,0 +1,11 @@ +========================================== +ProjectOperation Core and Mixin Behaviours +========================================== + +Source code for individual project operations and mixin behaviours. + +.. automodule:: padocc.core.project + :members: + +.. 
automodule:: padocc.core.mixins + :members: \ No newline at end of file diff --git a/docs/source/projects.rst b/docs/source/projects.rst new file mode 100644 index 0000000..6edf7cd --- /dev/null +++ b/docs/source/projects.rst @@ -0,0 +1,61 @@ +Projects in PADOCC +================== + +To differentiate syntax of datasets/datafiles with other packages that have varying definitions of those terms, +PADOCC uses the term ``Project`` to refer to a set of files to be aggregated into a single 'Cloud Product'. + +The ``ProjectOperation`` class within PADOCC allows us to access all information about a specific dataset, including +fetching data from files within the pipeline directory. This class also inherits from several Mixin classes which +act as containers for specific behaviours for easier organisation and future debugging. + +Directory Mixin +--------------- + +The directory mixin class contains all behaviours relating to creating directories within a project (or group) in PADOCC. +This includes the inherited ability for any project to create its parent working directory and group directory if needed, as well +as a subdirectory for cached data files. The switch values ``forceful`` and ``dryrun`` are also closely tied to this +container class, as the creation of new directories may be bypassed/forced if they exist already, or bypassed completely in a dry run. + +Evaluations Mixin +----------------- + +Previously, all evaluations were handled by an assessor module (pre 1.3), but this has now been reorganised +into a mixin class for the projects themselves, meaning any project instance has the capacity for self-evaluation. The routines +grouped into this container class relate to the self analysis of details and parameters of the project and various +files: + - get last run: Determine the parameters used in the most recent operation for a project. + - get last status: Get the status of the most recent (completed) operation. + - get log contents: Examine the log contents for a specific project. + +This list will be expanded in the full release version 1.3 to include many more useful evaluators including +statistics that can be averaged across a group. + +Properties Mixin +---------------- + +A collection of dynamic properties about a specific project. The Properties Mixin class abstracts any +complications or calculations with retrieving specific parameters; some may come from multiple files, are worked out on-the-fly +or may be based on an external request. Properties currently included are: + - Outpath: The output path to a 'product', which could be a zarr store, kerchunk file etc. + - Outproduct: The name of the output product which includes the cloud format and version number. + - Revision/Version: Abstracts the construction of revision and version numbers for the project. + - Cloud Format: Kerchunk/Zarr etc. - value stored in the base config file and can be set manually for further processing. + - File Type: Extension applied to the output product, can be one of 'json' or 'parquet' for Kerchunk products. + - Source Format: Format(s) detected during scan - retrieved from the detail config file after scanning. + +The properties mixin also enables a manual adjustment of some properties, like cloud format or file type, but also enables +minor and major version increments. This will later be wrapped into an ``Updater`` module to enable easier updates to +Cloud Product data/metadata. 
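As a rough usage sketch, the properties above might be inspected and adjusted as follows. Note that the import path, constructor arguments and attribute names here are illustrative assumptions based on the descriptions in this section, not a definitive API reference:

.. code-block:: python

    from padocc.core.project import ProjectOperation

    # Assumed constructor arguments: a project code plus the pipeline
    # working directory and parent group (names are illustrative).
    project = ProjectOperation(
        'my_project_code',
        workdir='path/to/pipeline/directory',
        groupID='my-group',
    )

    # Read some dynamic properties (assumed attribute names).
    print(project.cloud_format)   # e.g. 'kerchunk'
    print(project.revision)       # e.g. 'kr1.1'

    # Manually switch the intended output format, then persist the
    # change to the project's files.
    project.cloud_format = 'zarr'
    project.save_files()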
+ +The Project Operator class +-------------------------- + +The 'core' behaviour of all classes is contained in the ``ProjectOperation`` class. +This class has public UI methods like ``info`` and ``help`` that give general information about a project, +and list some of the other public methods available respectively. + +Key Functions: + - Acts as an access point to all information and data about a project (dataset). + - Can adjust values within key files (abstracted) by setting specific parameters of the project instance and then using ``save_files``. + - Enables quick stats gathering for use with group statistics calculations. + - Can run any process on a project from the Project Operator. \ No newline at end of file diff --git a/docs/source/scan.rst b/docs/source/scan.rst deleted file mode 100644 index db892f7..0000000 --- a/docs/source/scan.rst +++ /dev/null @@ -1,6 +0,0 @@ -============== -Scanner Module -============== - -.. automodule:: pipeline.scan - :members: \ No newline at end of file diff --git a/docs/source/shepard.rst b/docs/source/shepard.rst new file mode 100644 index 0000000..4f93322 --- /dev/null +++ b/docs/source/shepard.rst @@ -0,0 +1,16 @@ +The SHEPARD Module +================== + +The latest development in the PADOCC package is the SHEPARD Module (coming in 2025). + +SHEPARD (Serial Handler for Enabling PADOCC Aggregations via Recurrent Deployment) is a component +designed as an entrypoint script within the PADOCC environment to automate the operation of the pipeline. +Groups of datasets (called ``flocks``) can be created by producing input files and placing them in a persistent +directory accessible to a deployment of PADOCC. The deployment operates an hourly check in this directory, +and picks up any added files or any changes to existing files. The groups specified can then be automatically +run through all sections of the pipeline. + +The deployment of SHEPARD at CEDA involves a Kubernetes Pod that has access to the JASMIN filesystem as well as +the capability to deploy to JASMIN's LOTUS cluster for job submissions. The idea will be for SHEPARD to run +continuously, slowly processing large sections of the CEDA archive and creating cloud formats that can be utilised +by other packages like DataPoint (see the Inspiration tab) that provide fast access to data. \ No newline at end of file diff --git a/docs/source/start.rst b/docs/source/start.rst index b801f3e..bcff1bf 100644 --- a/docs/source/start.rst +++ b/docs/source/start.rst @@ -13,6 +13,10 @@ If you need to clone the repository, either simply clone the main branch of the git clone git@github.com:cedadev/padocc.git +.. note:: + + The instructions below are specific to version 1.3 and later. To obtain documentation for pre-1.3, please contact `daniel.westwood@stfc.ac.uk `_. + Step 1: Set up Virtual Environment ---------------------------------- @@ -22,7 +26,7 @@ Step 1 is to create a virtual environment and install the necessary packages wit python -m venv name_of_venv; source name_of_venv/bin/activate; - pip install -r requirements.txt; + pip install ./; Step 2: Environment configuration @@ -32,11 +36,10 @@ Create a config file to set necessary environment variables. (Suggested to place .. code-block:: console export WORKDIR = /path/to/kerchunk-pipeline - export SRCDIR = /gws/nopw/j04/cedaproc/kerchunk_builder/kerchunk-builder - export KVENV = $SRCDIR/kvenv + export KVENV = /path/to/virtual/environment/venv -Now you should be set up to run the pipeline properly. For any of the pipeline scripts, running ```python