
Specifying the Organizational Structure of GeoZarr #34

Open
christophenoel opened this issue Jan 25, 2024 · 7 comments

@christophenoel

christophenoel commented Jan 25, 2024

ℹ️ Edit: This post has been updated to more accurately capture my original message's intent

One of the foundational steps in developing the GeoZarr specification should be to detail its organizational structure (typically based on Zarr objects). The initial version of GeoZarr outlines the GeoZarr Classes but does not detail the storage structure and format of the data model.
GeoZarr conventions rely on XArray (including its terminology, which borrows from CF conventions), which itself does not explicitly document the format.

Implicit structure of GeoZarr / XArray Zarr

GeoZarr organizes data in a way that is compatible with the structure of Zarr. This structure should be clearly defined, similar to how it is done in the documentation for NCZarr.

Example of structure:

SMOS.zarr/
├── .zgroup
├── .zattrs
├── .zmetadata
├── sea_ice_thickness/
│   ├── .zarray
│   └── .zattrs
├── time/
│   ├── .zarray
│   └── .zattrs
├── x/
└── y/

Here’s a simplified breakdown of how GeoZarr organizes its data, using XArray concepts as a foundation:

  • 📂 GeoZarr Dataset (represents a 'product' which contains multiple variables and child datasets): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on child items (data arrays, coordinates, etc.) in another file (.zmetadata).

Dataset .zattrs

{
    "date_created": "Mon Dec 12 09:29:59 2022",
    "grid": "NSIDC polar stereographic projection. https://nsidc.org/data/polar-stereo/ps_grids.html ",
    "institution": "Alfred-Wegener-Institut Helmholtz Zentrum (AWI)",
    "platform": "ESA Soil Moisture and Ocean Salinity (SMOS) mission",
    "processing_level": "l3c",
    "product_version": "3.3"
}
  • 🔢 GeoZarr DataArray (represents the actual data of an observation 'variable'): Maps to a Zarr Array and includes details about how the data is chunked and compressed, and its default fill value, in a metadata file (.zarray). Additionally, it holds geospatial information in another metadata file (.zattrs), including:
    • observation type, unit of measure or other CF conventions
    • _ARRAY_DIMENSIONS: provides the names of the dimensions (sibling arrays with the same names provide the coordinates)
    • grid_mapping: mapping of data to geographical projections (based on CF)

Data Array .zattrs

{
    "_ARRAY_DIMENSIONS": ["time","y","x"],
    "coordinates": "longitude latitude",
    "long_name": "SMOS sea ice thickness",
    "standard_name": "sea ice thickness",
    "units": "m",
    "grid_mapping": [...]
}
  • 🌍 GeoZarr Coordinate (defines how to label the dimensions of data arrays): Maps to a Zarr Array and includes metadata (.zattrs) specifying CF attributes and the dimensions it relates to (its size must match the size of the corresponding dimension of the data arrays that reference it).

An explicit explanation of how coordinates work within the GeoZarr context, especially how they interact with data arrays and how they enable spatial indexing, could provide clarity.
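To illustrate the kind of interaction meant here, the following is a minimal sketch of label-based lookup: a 1-D coordinate array turns a projected position into an integer index, which then addresses the data array. The values are made up (only the 12.5 km spacing is borrowed from the example), and `nearest_index` and `select` are hypothetical helper names, not part of any GeoZarr or XArray API.

```python
from bisect import bisect_left


def nearest_index(coord, value):
    """Return the index of the coordinate label closest to `value`.

    Assumes `coord` is 1-D and strictly increasing.
    """
    pos = bisect_left(coord, value)
    if pos == 0:
        return 0
    if pos == len(coord):
        return len(coord) - 1
    before, after = coord[pos - 1], coord[pos]
    return pos if (after - value) < (value - before) else pos - 1


# Illustrative values only: x and y labels in km, 12.5 km apart.
x = [0.0, 12.5, 25.0, 37.5]
y = [0.0, 12.5, 25.0]

# One 2-D slice of a data array whose dimensions are ("y", "x").
thickness = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]


def select(data, y_km, x_km):
    """Look up the cell nearest to a projected (x, y) position."""
    return data[nearest_index(y, y_km)][nearest_index(x, x_km)]
```

The data array itself stores no positions; it only names its dimensions, and the sibling coordinate arrays carry the labels that make this lookup possible.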

Coordinate .zattrs

{
    "_ARRAY_DIMENSIONS": [
        "x"
    ],
    "grid_spacing": "12.5 km",
    "long_name": "x coordinate of projection",
    "standard_name": "projection_x_coordinate",
    "units": "km"
}

Structure Overview

With SMOS dataset example:

[Image: structure overview diagram for the SMOS dataset example]
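The implicit structure described above can also be sketched in code. The example below uses only the Python standard library: it writes the Zarr v2 metadata files of a tiny SMOS-like store (no chunk data) and then walks the directory to separate coordinates from data variables through `_ARRAY_DIMENSIONS`. The shapes, dtype and helper names (`build_example`, `describe`) are assumptions made for illustration, and the rule "an array whose only dimension carries its own name is a coordinate" is the XArray-style reading of the structure, not an agreed GeoZarr rule.

```python
import json
import tempfile
from pathlib import Path


def write_json(path, doc):
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc))


def build_example(root):
    """Write only the metadata files of a tiny SMOS-like Zarr v2 store."""
    sizes = {"time": 2, "y": 3, "x": 4}  # made-up dimension lengths
    arrays = {
        "sea_ice_thickness": ["time", "y", "x"],
        "time": ["time"],
        "y": ["y"],
        "x": ["x"],
    }
    write_json(root / ".zgroup", {"zarr_format": 2})
    write_json(root / ".zattrs", {"processing_level": "l3c"})
    for name, dims in arrays.items():
        shape = [sizes[d] for d in dims]
        write_json(root / name / ".zarray", {
            "zarr_format": 2, "shape": shape, "chunks": shape,
            "dtype": "<f4", "compressor": None, "fill_value": None,
            "filters": None, "order": "C",
        })
        write_json(root / name / ".zattrs", {"_ARRAY_DIMENSIONS": dims})


def describe(root):
    """Split the child arrays of a group into coordinates and variables."""
    dims_of = {}
    for child in sorted(root.iterdir()):
        if child.is_dir() and (child / ".zarray").exists():
            attrs = json.loads((child / ".zattrs").read_text())
            dims_of[child.name] = attrs["_ARRAY_DIMENSIONS"]
    coords = {name for name, dims in dims_of.items() if dims == [name]}
    variables = {n: d for n, d in dims_of.items() if n not in coords}
    return coords, variables


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "SMOS.zarr"
    build_example(store)
    coords, variables = describe(store)
```

Note that nothing in the store states which arrays are coordinates; a reader has to infer it from names and dimensions, which is the kind of implicit convention a specification could make explicit.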

Structure Specification

🔍 The new structure might differ from XArray's typical approach. For example, the following changes may be considered:

  • If relevant, an additional object (.zcoordinate) might be defined as an alternative to .zattrs (to be assessed).
  • As consolidated metadata (.zmetadata) is considered to be abandoned, the variables and dimensions could be explicitly listed in an index (especially if additional assets are defined by the GeoZarr spec).
  • Additional types of coordinates might be defined (e.g., origin/offset), along with how to represent them.
  • It may be helpful to specify in which cases a dataset has child datasets and how to index the hierarchy (combined with a STAC object, the GeoZarr Dataset might be restricted to a single variable and only contain multiscales as child alternative variables).

Original (geo) Zarr discussions

The following old discussions related to the conventions initially created by xarray, NCZarr, etc. may help:

Early draft data model structure spec

🚧 List of statements to be assessed, improved and agreed:

Definitions of core elements:

  • Dataset: a collection of one or more EO data arrays that represents information about a measured or observed geospatial phenomenon captured at one or more locations and times. It can encompass various formats and types of data, such as granules (individual data points or images), geospatial time series (3D datasets capturing changes over time), or hyperspectral data (capturing a wide spectrum of light beyond visible light for each pixel).
  • DataArray variables: the observed variables (e.g., temperature, humidity, elevation), stored in arrays that rely on a shared coordinate set, ensuring all data align within the same spatial and/or temporal dimensions. TBC: DataArray variables with heterogeneous coordinate sets shall be provided in distinct Datasets.
  • AuxiliaryData variables: a GeoZarr Dataset can include auxiliary variables in arrays providing auxiliary information. Their dimensions and coordinate sets can be heterogeneous.
  • Coordinate variables: a coordinate is a variable that gives the actual latitude, longitude, vertical, time, spectrum, and/or custom positions of data points. Coordinates are used to locate data within a multidimensional space and are crucial for interpreting the values of the variables in the dataset.
  • GeoZarr Store: a cohesive collection of GeoZarr Datasets organised as a hierarchy.

Structure of Dataset:

  • Dataset Organisation: A Dataset must correspond to a Zarr Group, containing multiple data arrays (variables).
  • Child Datasets: In some contexts, a Dataset may include sub-groups (child datasets), typically for holding downscaling arrays. ❓
  • Dataset Metadata: Each Dataset must include a .zattrs file to store dataset-wide geospatial metadata, making use as much as possible of CF conventions, and include the attribute zgeo set to Dataset.
  • Variables Index: A Dataset must include a coordinate index (JSON array) referencing the coordinate names of the data array(s) and a data_arrays attribute listing the names of the DataArray variables. ❓

Structure of DataArray:

  • DataArray Organisation: A DataArray must map to a Zarr Array, with metadata detailing chunking, compression, and default fill values stored in a .zarray file.
  • DataArray Metadata: Each DataArray must include a .zattrs file to store array-wide geospatial metadata (making use as much as possible of CF conventions, including observation type and units) and include the attribute zgeo set to DataArray. TBD: exact list of recommended CF attributes ❓
  • DataArray Dimensions: the attribute _ARRAY_DIMENSIONS shall provide the names of the dimensions (sibling arrays with the same names provide the coordinates) as defined in the Dataset indexes. ❓
  • DataArray Projection: a DataArray must include an attribute to provide the projection of its data in the CF attribute grid_mapping.

Structure of Coordinate:

  • Coordinate Metadata: a Coordinate must include a .zattrs file to store coordinate geospatial metadata (with CF), and include the attribute zgeo set to Coordinate.
  • Spatial Indexing and Coordinates Interaction: TBD. The specification should define the interaction between coordinates and data arrays (spatial indexing mechanisms). ❓
  • Coordinate System Specification: GeoZarr Coordinates, defining the labeling of data array dimensions, must be represented as Zarr Arrays with metadata in .zattrs specifying CF attributes, dimension relations, grid spacing, and units.
  • Extended Coordinate Types: TBD. The specification may introduce additional types of coordinates (e.g., origin/offset) and detail their representation. ❓
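To make the draft statements above easier to assess, here is a minimal sketch of the `.zattrs` documents they would imply for the SMOS example. It is only one possible reading: the `zgeo` and `data_arrays` names come from the draft, while the key name `coordinates` for the coordinate index, the grid mapping variable name `crs`, and the three helper functions are assumptions introduced for illustration.

```python
import json


def dataset_attrs(coordinates, data_arrays, **cf_attrs):
    """Dataset-level .zattrs: type marker plus the two proposed indexes."""
    return {
        "zgeo": "Dataset",
        "coordinates": list(coordinates),  # hypothetical key for the coordinate index
        "data_arrays": list(data_arrays),
        **cf_attrs,
    }


def data_array_attrs(dims, grid_mapping, **cf_attrs):
    """DataArray-level .zattrs: dimensions and projection reference."""
    return {
        "zgeo": "DataArray",
        "_ARRAY_DIMENSIONS": list(dims),
        # In CF, grid_mapping names the variable that holds the projection.
        "grid_mapping": grid_mapping,
        **cf_attrs,
    }


def coordinate_attrs(name, **cf_attrs):
    """Coordinate-level .zattrs for a 1-D coordinate named after its dimension."""
    return {"zgeo": "Coordinate", "_ARRAY_DIMENSIONS": [name], **cf_attrs}


dataset = dataset_attrs(["time", "y", "x"], ["sea_ice_thickness"],
                        processing_level="l3c")
variable = data_array_attrs(["time", "y", "x"], "crs", units="m")
x_coord = coordinate_attrs("x", standard_name="projection_x_coordinate",
                           units="km")

print(json.dumps(dataset, indent=4))
```

With such markers a client could check that every name in `_ARRAY_DIMENSIONS` appears in the Dataset's coordinate index before reading any array data.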
@ethanrd

ethanrd commented Feb 7, 2024

The NCZarr convention link above is not the most current version. The most current version is in the netCDF-C docs at this very ugly URL [1].

The main difference is the change to storing NCZarr specific information as extra keys within the Zarr JSON objects (e.g. _nczarr_array in .zarray) instead of the earlier use of non-Zarr JSON objects (like .nczarray and .nczattr).

[1] Sorry for the multiple versions and ugly URL, we are working our way through a big clean-up/reorganization of our netCDF documentation.

@christophenoel christophenoel changed the title Origin of GeoZarr/NCZarr Conventions Specifying the Organizational Structure of GeoZarr Mar 6, 2024
@christophenoel
Author

Text edited.

@christophenoel
Author

As reported by @ethanrd and agreed, we aim to align GeoZarr terminology with CF terminology whenever possible; CF itself relies heavily on the NetCDF User Guide (NUG).

NetCDF

About dataset

A netCDF dataset contains dimensions, variables, and attributes, which all have both a name and an ID number by which they are identified.
(I have not found a formal definition of dataset.)

About group

Groups, like directories in a Unix file system, are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables.
Each group acts as an entire netCDF dataset in the classic model. That is, each group may have attributes, dimensions, and variables, as well as other groups.
The default group is the root group, which allows the classic netCDF data model to fit neatly into the new model.

About dimensions

A dimension may be used to represent a real physical dimension, for example, time, latitude, longitude, or height. A dimension might also be used to index other quantities, for example station or model-run-number. A netCDF dimension has both a name and a length.

About variables

Variables are used to store the bulk of the data in a netCDF dataset. A variable represents an array of values of the same type. A scalar value is treated as a 0-dimensional array. A variable has a name, a data type, and a shape described by its list of dimensions specified when the variable is created. A variable may also have associated attributes, which may be added, deleted or changed after the variable is created.

About coordinate variables

A variable with the same name as a dimension is called a coordinate variable. It typically defines a physical coordinate corresponding to that dimension. The above CDL example includes the coordinate variables lat, lon, level and time, defined as follows:

About attributes

NetCDF attributes are used to store data about the data (ancillary data or metadata), similar in many ways to the information stored in data dictionaries and schema in conventional database systems. Most attributes provide information about a specific variable. These are identified by the name (or ID) of that variable, together with the name of the attribute.
Some attributes provide information about the dataset as a whole and are called global attributes. These are identified by the attribute name together with a blank variable name (in CDL) or a special null "global variable" ID (in C or Fortran).
In netCDF-4 file, attributes can also be added at the group level.

CF definitions

auxiliary coordinate variable

Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

coordinate variable

We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time)], and it is defined as a numeric data type with values in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables.
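The CF definition quoted above is precise enough to express as a check. The sketch below tests the three stated conditions (one dimension with the same name as the variable, numeric values in strict monotonic order, no missing values). It assumes the values are already decoded into a Python list and that `None` or NaN stands in for a missing value; `is_coordinate_variable` is a hypothetical helper, not a function of any netCDF or CF library.

```python
import math


def is_coordinate_variable(name, dims, values):
    """Check the CF/NUG coordinate variable conditions on decoded values."""
    # One-dimensional, and named after its own dimension, e.g. time(time).
    if list(dims) != [name]:
        return False
    # Numeric data type with no missing values.
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return False
        if isinstance(v, float) and math.isnan(v):
            return False
    # Strictly increasing or strictly decreasing.
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing
```

Under this definition a 2-D `longitude(y, x)` array fails the first test, which is why CF calls it an auxiliary coordinate variable rather than a coordinate variable.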

@christine-e-smit

I'm a little confused by:

  • 📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (.zmetadata).

I think the .zmetadata is just a consolidated copy of all the metadata in all the .zattrs and .zarray files with the top level .zgroup metadata. That's been my experience with the zarr.convenience.consolidate_metadata function and that's what the documentation says. So the .zmetadata file does show you the structure of all the metadata but that's only because it just has all the metadata.
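A minimal sketch of that behaviour, using only the standard library rather than zarr itself: it gathers every `.zgroup`, `.zarray` and `.zattrs` document under a directory store into one mapping keyed by relative path. It assumes the Zarr v2 consolidated layout, with a `zarr_consolidated_format` marker and a `metadata` mapping; `consolidate` is a hypothetical stand-in and not the library function.

```python
import json
import os
import tempfile
from pathlib import Path

META_NAMES = {".zgroup", ".zarray", ".zattrs"}


def consolidate(root):
    """Collect all Zarr v2 metadata documents under `root` into one document."""
    metadata = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a stable order
        for fname in sorted(filenames):
            if fname in META_NAMES:
                path = Path(dirpath) / fname
                key = path.relative_to(root).as_posix()
                metadata[key] = json.loads(path.read_text())
    return {"zarr_consolidated_format": 1, "metadata": metadata}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "SMOS.zarr"
    (root / "x").mkdir(parents=True)
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    (root / ".zattrs").write_text(json.dumps({"processing_level": "l3c"}))
    (root / "x" / ".zarray").write_text(json.dumps({"zarr_format": 2, "shape": [4]}))
    (root / "x" / ".zattrs").write_text(json.dumps({"_ARRAY_DIMENSIONS": ["x"]}))
    consolidated = consolidate(root)
```

The result adds no information of its own: the hierarchy is visible only because the keys are the paths of the documents that were copied in.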

@christophenoel
Author

@christine-e-smit Absolutely, the .zmetadata indeed consolidates all metadata for groups and arrays within the specified store into a singular resource.

The statement in the definition doesn't contradict this; rather, it implies that having this consolidated metadata at the dataset level is mandatory, allowing libraries (like xarray) to understand the structure without needing to read each object individually.

@christophenoel
Author

Improvement:

📂 GeoZarr Dataset (represents a 'product' which contains multiple variables, and children dataset): Maps to a Zarr Group holding multiple types of data (variables) and possibly other datasets (sub-groups). It stores dataset-wide metadata in a file (.zattrs) and outlines the structure of its contents based on children items (data arrays, coordinates, etc.) in another file (.zmetadata) through consolidated metadata.

@christophenoel
Author

christophenoel commented May 15, 2024

I have not made much progress, but I would like to share some thoughts about the concept of dataset (coming from xarray, itself based on NetCDF).

The GeoZarr specification must balance two key objectives:

  • Supporting a wide range of source data files (including formats like netCDF with an arbitrary number of group levels).
  • Facilitating compatibility with a wide range of clients or web applications that read GeoZarr (thus specifying some constraints).

For this reason, I think that providing requirements around the Dataset (a group with coordinates and variables) is essential. It identifies a minimal Zarr structure for interpreting a set of raster variables while still allowing (not excluding) other types of data (e.g., secondary or auxiliary data, point clouds, ...) in other Zarr groups.

For example, the conformance class "http://www.opengis.net/spec/ogc-geozarr/1.0/conf/dataset" might include a requirement that defines the minimal aspects expected by a client. Following the xarray encoding of NetCDF:

Requirement 1: /req/core/dataset
  A. A GeoZarr may include a GeoZarr dataset at the root Zarr Group level or at any child level.
  B. A GeoZarr dataset must include the coordinates in child Zarr arrays.
  C. A GeoZarr dataset must include the variables in child Zarr arrays.
  D. A GeoZarr dataset must include only variables sharing the same coordinates.
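A client-side check of parts B to D could look roughly like the sketch below. It works on a mapping from store keys (such as "x/.zattrs") to parsed JSON, which is the shape of the `metadata` section of a consolidated `.zmetadata` document. Several points are assumptions: "sharing the same coordinates" is read as "having the same dimension list", coordinates are recognised as arrays whose only dimension carries their own name, auxiliary 2-D coordinates are not treated specially, and `check_dataset` is a hypothetical name.

```python
def check_dataset(metadata, group=""):
    """Return a list of violations of requirement parts B-D for one group."""
    prefix = f"{group}/" if group else ""
    suffix = "/.zarray"
    dims_of = {}
    for key in metadata:
        if not (key.startswith(prefix) and key.endswith(suffix)):
            continue
        name = key[len(prefix):-len(suffix)]
        if "/" in name:
            continue  # array belongs to a nested group, not to this one
        attrs = metadata.get(f"{prefix}{name}/.zattrs", {})
        dims_of[name] = attrs.get("_ARRAY_DIMENSIONS", [])

    coords = {n for n, d in dims_of.items() if d == [n]}
    variables = {n: d for n, d in dims_of.items() if n not in coords}

    problems = []
    if not variables:
        problems.append("C: no variable arrays found")
    for name, dims in variables.items():
        for dim in dims:
            if dim not in coords:
                problems.append(
                    f"B: {name} uses dimension {dim!r} without a sibling coordinate array"
                )
    if len({tuple(d) for d in variables.values()}) > 1:
        problems.append("D: variables do not share the same dimensions")
    return problems


good = {
    ".zgroup": {"zarr_format": 2},
    "sea_ice_thickness/.zarray": {},
    "sea_ice_thickness/.zattrs": {"_ARRAY_DIMENSIONS": ["time", "y", "x"]},
    "time/.zarray": {}, "time/.zattrs": {"_ARRAY_DIMENSIONS": ["time"]},
    "y/.zarray": {}, "y/.zattrs": {"_ARRAY_DIMENSIONS": ["y"]},
    "x/.zarray": {}, "x/.zattrs": {"_ARRAY_DIMENSIONS": ["x"]},
}
bad = dict(good)
bad["uncertainty/.zarray"] = {}
bad["uncertainty/.zattrs"] = {"_ARRAY_DIMENSIONS": ["y", "x"]}
```

Here `check_dataset(good)` returns an empty list, while `check_dataset(bad)` reports a part D violation because the two variables have different dimension lists.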

The relationship with metadata (which is key in cloud-native geospatial) is that I expect a STAC Item or STAC Collection to define asset objects (links) for each dataset, indicating a dedicated dataset media type that informs the client that the dataset can easily be displayed on a map or used in a Jupyter Notebook.

--- Reminder ---

📂 GeoZarr Dataset: a collection of one or more EO data arrays that represents information about a measured or observed geospatial phenomenon captured at one or more locations and times. It can encompass various formats and types of data, such as granules (individual data points or images), geospatial time series (3D datasets capturing changes over time), or hyperspectral data (capturing a wide spectrum of light beyond visible light for each pixel).

📦 GeoZarr Groups, like Zarr Groups, act as directories in a Unix file system and are hierarchically organized, to arbitrary depth. They can be used to organize large numbers of variables. Each group can have attributes, dimensions, variables, and other nested groups.
A GeoZarr Group may act as a Dataset and contain multiple Dataset child groups.
