Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unconfigured coordinate variables for SHOC Standard Datasets #132

Closed
sharon-tickell opened this issue May 31, 2024 · 5 comments
Closed

Unconfigured coordinate variables for SHOC Standard Datasets #132

sharon-tickell opened this issue May 31, 2024 · 5 comments

Comments

@sharon-tickell
Copy link

If you open a SHOC standard NetCDF file with emsarray, the xarray.Dataset has properties like:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back, record
Data variables: (12/95)
    z_grid               (k_grid) float64 -4e+03 -3.78e+03 ... 1.0 1e+20
    x_grid               (j_grid, i_grid) float64 142.1 142.2 ... 156.9 156.9
    x_centre             (j_centre, i_centre) float64 142.2 142.2 ... 156.9
    x_left               (j_left, i_left) float64 142.2 142.2 ... 156.9 156.9
    x_back               (j_back, i_back) float64 142.2 142.2 ... 156.9 156.9
    botz                 (j_centre, i_centre) float64 99.0 99.0 ... -4e+03
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 0.0 ....
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 3.182...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 1.478...
    source               (record, k_centre, j_centre, i_centre) float32 0.0 ....
    Age                  (record, k_centre, j_centre, i_centre) float32 0.257...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

Note that this includes quite a few dimensions which are not linked to coordinate variables, including the t variable which is actually the time-coordinate variable for the record dimension. Because that coordinate/dim combination is not configured, it is not possible to use xarray time coordinate slicing on a SHOC standard dataset.

It would be great if more (or all!) of the SHOC standard coordinate variables could be configured at

coordinate_names = {

@sharon-tickell sharon-tickell changed the title Configure all coordinate variables for SHOC Standard Datasets Unconfigured coordinate variables for SHOC Standard Datasets May 31, 2024
@sharon-tickell
Copy link
Author

Experiment 1 trying to find a workaround for the time dimension index:

time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
    ds = ds.set_coords(time_coord.name)
    ds = ds.set_index(indexes={time_dim: time_coord.name}, append=True)

this does make the record dimension have a coordinate index from t and allows time-coordinate slicing:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
  * record               (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back
Data variables: (12/94)
    z_grid               (k_grid) float64 ...
    x_grid               (j_grid, i_grid) float64 ...
    x_centre             (j_centre, i_centre) float64 ...
    x_left               (j_left, i_left) float64 ...
    x_back               (j_back, i_back) float64 ...
    botz                 (j_centre, i_centre) float64 ...
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 ...
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 ...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 ...
    source               (record, k_centre, j_centre, i_centre) float32 ...
    Age                  (record, k_centre, j_centre, i_centre) float32 ...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

But it also uses the dimension name as the coordinate name (they should be different) and removes t from the list of data variables. This subsequently makes calls to ds.ems.time_coordinate fail with an error like SHOC dataset did not have expected time coordinate 't'

@sharon-tickell
Copy link
Author

Experiment 2:

time_coord = ds.ems.time_coordinate
time_dim = time_coord.dims[0]
if len(time_coord.indexes) == 0:
    ds = ds.assign_coords(coords={time_dim: ds[time_coord.name]})

This also sets up an indexed coordinate variable that uses the dimension name:

<xarray.Dataset>
Dimensions:              (k_grid: 48, k_centre: 47, j_grid: 181, i_grid: 601,
                          j_centre: 180, i_centre: 600, j_left: 180,
                          i_left: 601, j_back: 181, i_back: 600, record: 8)
Coordinates:
    z_centre             (k_centre) float64 ...
    y_grid               (j_grid, i_grid) float64 ...
    y_centre             (j_centre, i_centre) float64 ...
    y_left               (j_left, i_left) float64 ...
    y_back               (j_back, i_back) float64 ...
  * record               (record) datetime64[ns] 2022-07-30T14:00:00 ... 2022...
Dimensions without coordinates: k_grid, k_centre, j_grid, i_grid, j_centre,
                                i_centre, j_left, i_left, j_back, i_back
Data variables: (12/95)
    z_grid               (k_grid) float64 -4e+03 -3.78e+03 ... 1.0 1e+20
    x_grid               (j_grid, i_grid) float64 142.1 142.2 ... 156.9 156.9
    x_centre             (j_centre, i_centre) float64 142.2 142.2 ... 156.9
    x_left               (j_left, i_left) float64 142.2 142.2 ... 156.9 156.9
    x_back               (j_back, i_back) float64 142.2 142.2 ... 156.9 156.9
    botz                 (j_centre, i_centre) float64 99.0 99.0 ... -4e+03
    ...                   ...
    dens                 (record, k_centre, j_centre, i_centre) float32 0.0 ....
    omega_ar_expose      (record, k_centre, j_centre, i_centre) float32 3.182...
    omega_ar_ex_time     (record, k_centre, j_centre, i_centre) timedelta64[ns] ...
    omega_ar_expose_3p5  (record, k_centre, j_centre, i_centre) float32 1.478...
    source               (record, k_centre, j_centre, i_centre) float32 0.0 ....
    Age                  (record, k_centre, j_centre, i_centre) float32 0.257...
Attributes: (12/18)
    title:                    GBR4_H2p0_B3p1_Cfur_Dnrt
    codehead:                 CSIRO Environmental Modelling Suite
    paramhead:                eReefs 4 km grid. Mile Furnas Catchment constan...
    description:              eReefs GBR 4k grid with rivers. Uses JCU bathy ...
    paramfile:                /datasets/work/ereefs/work/work/cem-worker/nrt_...
    ems_version:              v1.4.0 rev(6949)
    ...                       ...
    nce2:                     180
    nfe1:                     601
    nfe2:                     181
    ns3:                      2370250
    ns2:                      76327
    gridtype:                 NUMERICAL

BUT unlike the previous experiment, it doesn't remove the t data variable, so calls to ds.ems.time_coordinate still work. It's not quite right (I'm getting the impression that xarray can't cope very well with coordinate variables that have different names to their associated dimensions!), but it allows my use case to function well enough: they just have an extra 'record' coordinate variable floating about.

@sharon-tickell
Copy link
Author

Possibly related? pydata/xarray#4417

@mx-moth
Copy link
Contributor

mx-moth commented Jun 4, 2024

xarray handles indexing in a way that is not useful for SHOC standard datasets. Specifically it will only create indexes for coordinates whos variable name exactly matches the single dimension name it uses. Fixing this is (I think) part of https://github.com/pydata/xarray/projects/1, also see the discussion on the flexible indexing design document

As it stands, emsarray does not modify any datasets when they are opened. Adding an index on the time coordinate is useful, but doing so automatically poses some difficulties. As emsarray is exposed as an xarray accessor it isn't always involved in opening datasets. For instances where the dataset is opened through other means, emsarray is invoked lazily. This means there is no reliable way for emsarray to modify a dataset automatically in a way that will work consistently.

A new method such as dataset.ems.ensure_indexes() could be added that modifies a dataset in place to add all expected indexes. Users could call this method when they first open a dataset or receive the dataset from some other function, before calling any xarray methods that need indexing. I have written a similar function for another project:

def make_indexable(dataset: xarray.Dataset, name: Hashable) -> xarray.Dataset:
    """
    Ensure that `name` is an indexable coordinate in `dataset`.
    This is useful for adding indexes for coordinates whos name does not match the dimension name,
    e.g. a 'time' coordinate defined on the 'record' dimension.
    """
    if name not in dataset.coords:
        dataset = dataset.set_coords([name])
    if name not in dataset.indexes:
        dataset = dataset.set_xindex([name])
    return dataset

dataset = emsarray.open_dataset(...)
dataset = make_indexable(dataset, dataset.ems.get_time_name())

@mx-moth
Copy link
Contributor

mx-moth commented Jul 8, 2024

I am closing this issue, as the issue has little to do with emsarray itself. This has more to do with how xarray chooses what variables to automatically add indexes to. For the cases where xarray does not choose to add an index to a variable, the above comment has an applicable code example. The function uses no emsarray-specific functionality, and it is applicable to all dataset conventions where xarray may not add the desired indexes.

@mx-moth mx-moth closed this as completed Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants