
Feature Request: Hierarchical storage and processing in xarray #4118

Open · emilbiju opened this issue Jun 1, 2020 · 60 comments

Labels: enhancement, topic-DataTree (Related to the implementation of a DataTree class)

@emilbiju commented Jun 1, 2020

I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:

  • Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.

  • When two data arrays having a shared dimension but different coordinate values along that dimension are merged into a Dataset, the union of coordinate values from the two data arrays becomes the new coordinate set for that dimension. Consequently, when the value of a variable corresponding to a coordinate value is unknown, nan is used as a substitute, which wastes memory (see the example below).
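
For concreteness, this is already standard xarray merge behavior (a small runnable example; the array values are mine):

    import numpy as np
    import xarray as xr

    a = xr.DataArray([1.0, 2.0], coords={"x": [0, 1]}, dims="x", name="a")
    b = xr.DataArray([3.0, 4.0], coords={"x": [1, 2]}, dims="x", name="b")

    ds = xr.merge([a, b])  # outer join on "x" -> union coordinate [0, 1, 2]
    # ds["a"].values -> [ 1.,  2., nan]
    # ds["b"].values -> [nan,  3.,  4.]   (the nan padding is the memory wastage)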

I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.

To meet these requirements, I have implemented a data structure that also supports the below capabilities:

  • Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (let's call it dt) with child nodes: weather, satellite image and population. Each of these nodes has data arrays/subtrees under it.

[Screenshot: example tree with weather, satellite image and population child nodes, each containing data arrays/subtrees]

The mean over time of all data variables associated with weather can be obtained using dt.weather.mean('time'), which applies the function to sea_surface_temperature, dew_point_temperature, wind_speed and pressure (see the sketch after this list).

  • It can be encoded into the netCDF format, like xarray Datasets.
  • It supports item assignment at all hierarchical levels.
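
To make the first capability concrete, here is a minimal sketch of the intended usage (the Datatree constructor and path-style assignment are illustrative assumptions, not an existing API):

    import numpy as np
    import xarray as xr

    # Each leaf is an independent DataArray, so coordinate sets need not match.
    sst = xr.DataArray(np.random.rand(4), coords={"time": [0, 1, 2, 3]}, dims="time")
    wind = xr.DataArray(np.random.rand(3), coords={"time": [1, 2, 3]}, dims="time")

    # dt = Datatree()                                  # hypothetical constructor
    # dt['weather/sea_surface_temperature'] = sst
    # dt['weather/wind_speed'] = wind
    # dt.weather.mean('time')  # maps mean() over every leaf under 'weather'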

I would like to know whether such a data structure could be introduced in xarray and what challenges would be involved.

@jhamman (Member) commented Jun 1, 2020

@emilbiju - thanks for opening an issue here. You may want to take a look at the conversation in #1092.

@emilbiju (Author) commented Jun 2, 2020

Thanks @jhamman for sharing the link. Here are my thoughts:

For use-cases similar to the one I have mentioned, I think it would be more meaningful to allow the tree structure (calling it Datatree from here on) to exist as a separate data structure instead of residing within the Dataset. From what I understand, the xarray Dataset enforces that all its component variables share the same coordinate set for a given dimension name. This would again result in memory wastage with nan values when the value corresponding to a coordinate is unknown.

Besides, xarray only allows attribute access for getting (and not setting) values, but a separate data structure can allow attribute access for setting values as well. For example, the data structure that I have implemented would allow something like dt.weather = dt.weather.mean('time') to alter all the data arrays under the weather node.

I am currently using attribute-based access for accessing child nodes/data arrays in the Datatree as it appears to reflect the tree structure better, but as @shoyer has pointed out, tuple-based access might be easier to use programmatically.

Instead of using netCDF4 groups for encoding the Datatree, I am currently following a simple 3-step process:

  • Combine all the data arrays at the leaves of a Datatree object into a dataset.
  • Add an additional data array to the dataset that would contain an ancestor matrix (or any other array-like representation) that can encode the hierarchical structure with a coordinate set containing names of the tree nodes.
  • Use the xarray.Dataset.to_netcdf method to store it in a netCDF file.

Therefore, within the netCDF file, it would exist just as a Dataset. A specially implemented Datatree.open_datatree method can open the dataset, detect this additional array and recreate the tree structure to instantiate the object. I would like to know if using netCDF4 groups instead would provide any advantages over this approach.
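
A rough sketch of the ancestor-matrix idea from step 2 (my own illustration; the author's actual encoding may differ):

    import numpy as np
    import xarray as xr

    nodes = ["root", "weather", "sea_surface_temperature", "wind_speed"]
    parent = {"weather": "root",
              "sea_surface_temperature": "weather",
              "wind_speed": "weather"}

    def ancestor_matrix(nodes, parent):
        # anc[i, j] == 1 iff nodes[i] is an ancestor of nodes[j]
        idx = {name: i for i, name in enumerate(nodes)}
        anc = np.zeros((len(nodes), len(nodes)), dtype=np.int8)
        for child in nodes:
            p = parent.get(child)
            while p is not None:
                anc[idx[p], idx[child]] = 1
                p = parent.get(p)
        return xr.DataArray(anc, dims=("node", "descendant"),
                            coords={"node": nodes, "descendant": nodes})

    # Stored alongside the flattened leaves before to_netcdf(), this array
    # would give Datatree.open_datatree everything needed to rebuild the tree.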

@dcherian (Contributor) commented Jun 2, 2020

Thanks for writing this up @emilbiju. These are very interesting ideas.

  1. The nice thing about using NetCDF groups (or HDF5?) is that it is a standard and your data files are readable using other software.

  2. So far, xarray has been reluctant to add "groups" or this kind of hierarchical organization because of all the additional complexity involved (Dataset groups #1092)

  3. That said, there is definitely interest in a package that provides a high-level object composed of multiple xarray datasets (again Dataset groups #1092). So I encourage you to post your code online so others can try it out and iterate.

    a. For example, our friends over at ArviZ have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

[Diagram: the ArviZ InferenceData structure, composed of multiple Datasets]

@shoyer (Member) commented Jun 3, 2020

I would be open to exploring adding a hierarchical data structure into xarray (on an experimental basis, to start), but it would need someone with serious interest and time to make it happen. Certainly there are plenty of use cases across various fields.

@shoyer (Member) commented Jun 3, 2020

The data model you sketch out here looks very similar to what we discussed in #1092. I agree that the semantics are well defined.

The main question in my mind is whether it would make more sense to make an entirely new data structure (e.g., xarray.TreeDataset) or add in a new feature like groups to the existing xarray.Dataset.

Probably a new data structure would be easier at this point, because it would keep Dataset simpler and wouldn't break existing code that works on xarray.Dataset.

@jhamman (Member) commented Jan 6, 2021

@joshmoore - based on pangeo-forge/pangeo-forge-recipes#27 (comment), you may be interested in this issue. One way to do multiscale datasets in Xarray would be to use hierarchical groups (one group per scale).

@davidbrochart (Contributor) commented:

a. For example, our friends over at ArviZ have an InferenceData structure composed of multiple Datasets that is represented on-disk using NetCDF groups: https://arviz-devs.github.io/arviz/notebooks/XarrayforArviZ.html

Just a note that this link has moved to: https://arviz-devs.github.io/arviz/getting_started/XarrayforArviZ.html

@joshmoore commented:

Thanks for the link, @jhamman. The most immediate issue I ran into when trying to use xarray with OME-Zarr data does seem similar. A rough representation of one multiscale image is:

image_pyramid:
  |_ zyx_array_high_res
  |_ zyx_array_mid_res
  |_ zyx_array_low_res

but of course the x, y and z dimensions are of different sizes in each volume.
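
A small runnable illustration of the constraint (shapes invented for the example):

    import numpy as np
    import xarray as xr

    high = xr.DataArray(np.zeros((64, 512, 512)), dims=("z", "y", "x"))
    mid = xr.DataArray(np.zeros((32, 256, 256)), dims=("z", "y", "x"))
    low = xr.DataArray(np.zeros((16, 128, 128)), dims=("z", "y", "x"))

    # A single Dataset cannot hold all three scales:
    # xr.Dataset({"high": high, "mid": mid, "low": low})
    #   -> ValueError: conflicting sizes for dimension 'z'
    # One group per scale sidesteps the conflict:
    pyramid = {"high": high.to_dataset(name="image"),
               "mid": mid.to_dataset(name="image"),
               "low": low.to_dataset(name="image")}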

@thewtex (Contributor) commented Feb 10, 2021

@jhamman @joshmoore a prototype to bring together XArray and OME-Zarr/NGFF with multiple groups:
https://github.com/OpenImaging/miqa/blob/master/server/scripts/compress_encode.py

@rabernat (Contributor) commented:

On today's Xarray dev call, we discussed pursuing another CZI grant to support this feature in Xarray. The image pyramid use case would provide a strong link to the bioimaging community. @alexamici and the B-open folks seem enthusiastic.

I had to leave the meeting early, so I didn't hear the end of the conversation. But did we decide who might serve as PI for such a proposal?

@dcherian (Contributor) commented:

But did we decide who might serve as PI for such a proposal?

No.

@emilbiju are you interested in open-sourcing your work?

@benbovy (Member) commented Mar 18, 2021

FWIW, a while ago I wrote a mock-up (and probably outdated) DatasetNode class:

https://gist.github.com/benbovy/92e7c76220af1aaa4b3a0b65374e233a (nbviewer link)

@tacaswell commented:

This is related to some very recent work we have been doing at NSLS-II, primarily led by @danielballan.

@OriolAbril (Contributor) commented:

Not really sure if there is anything we can do from ArviZ to help with that; if there is, let us know and we'll do our best. cc @percygautam

@aurghs (Collaborator) commented Mar 25, 2021

@alexamici and I can write the technical part of the proposal.

@joshmoore commented:

Happy to provide assistance on the image pyramid (i.e. "multiscale") use case.

@rabernat (Contributor) commented Mar 25, 2021

So we have:

  • Numerous promising prototypes to draw from
  • A technical team who can write the proposal and execute the proposed work (@aurghs & @alexamici of B-open)
  • Numerous supporting use cases from the bioimaging (@joshmoore), condensed matter (@tacaswell), and Bayesian modeling (ArviZ; @OriolAbril) domains

We are just missing a PI, someone who is willing to put their name on top of the proposal and click submit. I have gone on record as committed to not leading any new proposals this year. And in any case, this is a good opportunity for someone else from the @pydata/xarray core dev team to try on a leadership role.

@danielballan (Contributor) commented:

I volunteer to contribute writing to this from the condensed matter / synchrotron user facility perspective.

@dcherian (Contributor) commented:

I can shoulder part of the load and help is definitely needed. LOI is due on Tuesday. I'll take a stab this evening and post a link.

@OriolAbril (Contributor) commented:

Here are some biomedical papers that are using ArviZ, and therefore xarray, even if most don't cite xarray and some don't cite ArviZ either. Topics are quite diverse: COVID, psychology, biomolecules, oncology...

Some recent ArviZ biomedical citations:
  • Arroyuelo, A., Vila, J., & Martin, O. A. (2020). Exploring the quality of protein structural models from a Bayesian perspective. bioRxiv.
  • Axen, S. D. (2020). Representing Ensembles of Molecules (Doctoral dissertation, UCSF).
  • Brauner, J. M., Mindermann, S., Sharma, M., Johnston, D., Salvatier, J., Gavenčiak, T., ... & Kulveit, J. (2021). Inferring the effectiveness of government interventions against COVID-19. Science, 371(6531).
  • Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2020). Trait Anxiety Effects on Late Phase Threatening Speech Processing: Evidence from EEG. bioRxiv.
  • Busch-Moreno, S., Tuomainen, J., & Vinson, D. (2021). Semantic and prosodic threat processing in trait anxiety: is repetitive thinking influencing responses?. Cognition and Emotion, 35(1), 50-70.
  • Dehning, J., Zierenberg, J., Spitzner, F. P., Wibral, M., Neto, J. P., Wilczek, M., & Priesemann, V. (2020). Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions. Science, 369(6500).
  • Heilbron, E., Martìn, O., & Fumagalli, E. (2020). Efectos protectores de los alimentos andinos contra el daño producido por el alcohol a nivel del epitelio intestinal, una aproximación estadística. Ciencia, Docencia y Tecnología, 31(61 nov-mar).
  • Legrand, N., Nikolova, N., Correa, C., Brændholt, M., Stuckert, A., Kildahl, N., ... & Allen, M. (2021). The heart rate discrimination task: a psychophysical method to estimate the accuracy and precision of interoceptive beliefs. bioRxiv.
  • Wang, Y. (2020, September). Data Analysis of Psychological Measurement of Intelligent Internet-assisted Sports Training based on Bio-Sensors. In 2020 International Conference on Smart Electronics and Communication (ICOSEC) (pp. 474-477). IEEE.
  • Wasserman, A., Shrager, J., & Shapiro, M. A Multilevel Bayesian Model for Precision Oncology.
  • Weindel, G., Anders, R., Alario, F. X., & Burle, B. (2020). Assessing model-based inferences in decision making with single-trial response time decomposition. Journal of Experimental Psychology: General.
  • Yamagata, Y. (2020). Simultaneous estimation of the effective reproducing number and the detection rate of COVID-19. arXiv e-prints, arXiv-2005.

@shoyer (Member) commented Mar 26, 2021

I'm excited to see this coming together! I would be happy to advise as well...

Side note: at some point, this would probably be worth adding to Xarray's official roadmap.

@aurghs (Collaborator) commented Mar 26, 2021

We could also provide a use-case in remote sensing: it would be really useful in interferometric processing for managing Sentinel-1 IW and EW SLC data, which has multiple tiles (bursts) partially overlapping in one direction (azimuth).

@TomNicholas (Contributor) commented:

This sounds like an interesting project - I'm also about to be able to work on xarray much more directly (thanks @rabernat ).

Should I add this as another xarray project board alongside explicit indexes and so on?

I wonder if this could find another domain use case in plasmapy as part of the overall plasma object @StanczakDominik? At the very least this would allow you to store all the various equilibrium and diagnostics information that goes in an EFIT file.

@StanczakDominik (Contributor) commented:

Whoa, that sounds awesome! Thanks for the heads up :) Definitely could be quite handy, looking forward to seeing how this develops. @rocco8773 this should be interesting for you as well :)

@TomNicholas (Contributor) commented Feb 14, 2022

We would like some opinions from the community on two different possible models for a tree-like structure in xarray.

A tree contains many groups, but the question is what constraints should be imposed on the contents of those groups (a minimal sketch of both models follows the list below).

  • Option (1) - Each group is a Dataset

    • Means that within each group the same restrictions apply as currently do within a single dataset, i.e. each dimension name is only associated with a single length, so there is effectively a common set of dimensions which variables can depend on.
    • Can't represent all files; in particular, can't represent a filetype where groups are allowed to have variables with inconsistent length dimensions (e.g. Zarr stores allow this, as all arrays are independent).
    • Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)
    • This means that sometimes you might need to put variables in adjacent groups at the same level of the tree, when you might rather want them together in the same group.
    • Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).
    • Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.
    • Metadata (i.e. .attrs) are arguably most useful when set at this level
    • Mental model is a (nested) dict of Datasets
    • Prototype is DataTree
  • Option (2) - Variables within groups are unconstrained

    • Means that within a single group each Variable can have any dimensions, of any length. There is no requirement that two variables which both depend on a dimension called "x" have to have the same length; one variable can have .sizes['x']=10 and the other .sizes['x']=20.
    • The main advantage of this is that it can represent a wider set of files (including all Zarr stores and a wider set of GRIB files)
    • Model maps more directly onto HDF5
    • Doesn't enforce the (arguably fairly arbitrary) constraint that if variables have a dimension of the same name, that dimension must also be the same length
    • Without consistency selection becomes ill-defined, but many other operations are fine (e.g. taking .mean())
    • Mental model is a (nested) dict of dicts of DataArrays
    • Prototype is xarray-DataGroups
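
A minimal sketch of the two mental models, using plain dicts in place of any real tree class (illustrative only):

    import numpy as np
    import xarray as xr

    # Option 1: a (nested) dict of Datasets -- within a group, a dimension
    # name maps to a single length.
    option1 = {"weather": xr.Dataset({"t": ("x", np.zeros(10)),
                                      "p": ("x", np.zeros(10))})}

    # Option 2: a (nested) dict of dicts of DataArrays -- two arrays in the
    # same group may disagree about the length of "x".
    option2 = {"weather": {"t": xr.DataArray(np.zeros(10), dims="x"),
                           "p": xr.DataArray(np.zeros(20), dims="x")}}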

This is by no means the only question, and we have various choices to make within these options.

The questions for the potential users here are:

  • Do you have use cases which one of these designs could handle but the other couldn't?
  • How important to you is being able to support all valid files of these certain formats?
  • Which of these designs is clearer/more intuitive/more appealing to you?

(@alexamici , @shoyer, @jhamman, @aurghs please edit this comment to add anything I've missed)

@mraspaud (Contributor) commented Feb 15, 2022

Thanks for launching this discussion @TomNicholas!
I'm a core dev of pytroll/satpy, which handles earth-observing satellite data. I got interested in DataTree because we have data from the same instruments available at multiple resolutions, hence not fitting into a single Dataset.
For us, Option 1 probably feels better. Even when having data at multiple resolutions, it is still a limited number of resolutions, so splitting them into groups seems the natural way to go.
We do not use the features you mention in Zarr or GRIB, as the majority of the satellite data we use is provided in netCDF nowadays.
Don't hesitate to ask if you want to know more or if something is unclear; we are really interested in these developments, so if we can help that way...

@alexamici (Collaborator) commented Feb 17, 2022

@TomNicholas (cc @mraspaud)

Do you have use cases which one of these designs could handle but the other couldn't?

The two main classes of on-disk formats I know of that cannot always be represented in the "group is a Dataset" approach are:

  • in netCDF following the CF conventions for groups, it is legal for an array to refer to a dimension or a coordinate in a different group, so arrays in the same group may have dimensions with the same name but different size / coordinate values (this was the original motivation to explore the DataGroup approach)
  • the current spec for the Next-generation file formats (NGFF) for bio-imaging has all scales of the same 5D data in the same group. (cc @joshmoore)

I don't have an example at hand, but my impression is that satellite products that use HDF5 file format also place arrays with inconsistent dimensions / coordinates in the same group.

@shoyer (Member) commented Feb 17, 2022 via email

@alexamici (Collaborator) commented Feb 17, 2022

@TomNicholas I also have a few comments on the comparison:

  • Option (1) - Each group is a Dataset

    • Model maps more directly onto netCDF (though still not exactly, because netCDF has dimensions as separate objects)

This is only true for flat netCDF files; once you introduce groups in a netCDF file AND accept CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.

  • Enforcing consistency between variables guarantees certain operations are always well-defined (in particular selection via an integer index like in .isel).
  • Guarantees that all valid operations on a Dataset are also valid operations on a single group of a DataTree - so API can be essentially identical to Dataset.

Both points are only true for the DataArrays in a single group; once you broadcast any operation to subgroups, the two implementations would share the same limitations (dimensions in subgroups can be inconsistent in both cases).

In my opinion the advantage for the DataTree is minimal.

  • Metadata (i.e. .attrs) are arguably most useful when set at this level

The two approaches are identical in this respect; group attributes are mapped in the same way to DataTree and DataGroup.

I share your views on all other points.

@kmuehlbauer (Contributor) commented:

@alexamici

  • in netCDF following the CF conventions for groups, it is legal for an array to refer to a dimension or a coordinate in a different group, so arrays in the same group may have dimensions with the same name but different size / coordinate values (this was the original motivation to explore the DataGroup approach)

I'm having difficulty understanding your point above with respect to the scoping rules from the CF document. Shouldn't it be impossible to create two arrays (in the same group) having dimensions with exactly the same name from different groups? Looking at the example at https://github.com/alexamici/xarray-datagroup, there are coordinates with name "/lat" vs "lat". Aren't those two different names? Maybe I'm missing something essential here.

@alexamici (Collaborator) commented Feb 17, 2022

@kmuehlbauer in the representation I use the fully qualified name for the dimension / coordinate, but the corresponding DataArray will use the basename, e.g. both arrays will have lat as a coordinate. Sorry for the confusion, I need to add more context to the README.

@kmuehlbauer (Contributor) commented:

in the representation I use the fully qualified name for the dimension / coordinate, but the corresponding DataArray will use the basename, e.g. both arrays will have lat as a coordinate. Sorry for the confusion, I need to add more context to the README.

Thanks for clarifying. I'm wondering if that can be a source of misunderstanding. How should the user differentiate them? After all, dimensions which share the name lat are different entities, and it should be possible to tell them apart somehow. I think I'm slowly getting to the bottom of this (representation in dictionaries, duplicate keys) and I really need to look into the implementation. I'll open an issue over at xarray-datagroup if I have more questions, to not clutter the discussion here.

@TomNicholas (Contributor) commented:

This is only true for flat netCDF files; once you introduce groups in a netCDF file AND accept CF conventions, the DataGroup approach can map 100% of the files, while the DataTree approach fails on an (admittedly small) class of them.

@alexamici can you expand on the role of the CF conventions in this statement? Are you talking about CF conventions allowing one variable in one group to refer to dimension present in another group, or something else?

@OriolAbril (Contributor) commented:

I am not sure I completely understand option 2, but option 1 seems a better fit to what we are doing at ArviZ (so far we are managing quite well with the InferenceData mentioned above which is a collection of independent xarray datasets). In our case, well defined selection for multiple variables at the same time (i.e. at the dataset level) is very useful.

I was also wondering what changes (if any) each option would imply when using apply_ufunc.

@LunarLanding commented Feb 22, 2022

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable lengths in the worst.
For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants). Leaf nodes would hold data (see the sketch below).
On merge, dimensions could be bubbled up as long as length (and labels) matched.
Operations with dimensions would then go down to the corresponding dimension level before applying the operator, i.e. container['A/B'].mean('time') would be different from container['A'].mean('time')['B'].
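
A minimal runnable sketch (assumptions entirely mine) of node-level dimensions that descendants inherit by walking up toward the root:

    class Node:
        def __init__(self, parent=None, dims=None, data=None):
            self.parent, self.dims, self.data = parent, dims or {}, data

        def resolve_dim(self, name):
            # return the nearest enclosing definition of dimension `name`
            node = self
            while node is not None:
                if name in node.dims:
                    return node.dims[name]
                node = node.parent
            raise KeyError(name)

    root = Node(dims={"time": 100})
    a = Node(parent=root)              # group "A" inherits "time"
    b = Node(parent=a, data="leaf B")  # leaf "A/B" sees "time" too
    assert b.resolve_dim("time") == 100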

Datagroup and Datatree are subcases of this general structure, which could be enforced via flags/checks.
Option 1 is where the extremities of the tree are a node with two sets of child nodes, dimension labels and n-dimensional arrays.
Option 2 is where the extremities of the tree are a node with a child node for a n-dimensional array A, and a sibling node for each dimension of A, containing the corresponding labels.

I'm sure I'm missing some big issue with the mental model I have, for instance I haven't thought of transformations at all and about coordinates. But for clarity I tried to write it down below.

The most general structure for a dataset I can think of is a directed graph.
Each node A is an n-dimensional (sparse) array, where each dimension D optionally points to a one-dimensional node B with the same length.

To get a hierarchical structure, we:

  • add edges of a different color, each with a label
  • restrict their graph to a tree T
  • add labels to each dimension D

We can resolve D's target by (A) checking for a sibling in T with the same name, and then going up one level and goto (A).

Multi-indexes (multi-dimensional (sparse) labels) generalize this model, but require tuple labels in T's edges, i.e.:
h/j/a[x,y,z] has a sibling h/j/(x,y)[x,y], with z's labels being one level above, i.e. h/z[z] (the notation a[b] means a map from index b to value a).

@TomNicholas (Contributor) commented Feb 22, 2022

Hi @LunarLanding , thanks for your ideas!

For this, it would make more sense to be able to have dimensions (with optional labels and coordinates) assigned to nodes (and these would be inherited by any descendants).

It sounds a bit like what you are suggesting is essentially a model in which dimensions are explicit objects, which can be referred to from other groups, like in netCDF. (NetCDF has "dimension IDs".)

This would be a bit of a departure from the model that xarray.Dataset currently uses, because right now dimensions aren't really unique entities, they are just a collective label for a shared dimension of a set of Variable objects.

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable lengths in the worst.

By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

Is there a specific use case which you think would require explicit dimensions to solve?

@TomNicholas (Contributor) commented:

Also thanks @OriolAbril, it's useful to have an ArviZ perspective.

I was also wondering what changes (if any) would each option imply when using apply_ufunc

I see apply_ufunc as a Variable-level operation, i.e. it doesn't know about the relationship between different Variables unless you explicitly feed it multiple variables. Therefore whether we choose model 1 or 2 probably doesn't affect apply_ufunc much.

In either case I imagine all we might need to do is slightly extend apply_ufunc to also map over variables in a group of a tree if given one, and provide examples of using map_over_subtree or similar to map your apply_ufunc operation over multiple groups in a tree (sketched below). If the user is trying to do something more complicated (like getting one variable from one level of a tree and another variable from another level, then feeding both into apply_ufunc), then I would make the user responsible for fetching the variables, and also for putting the results back into the intended place in the tree.
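
A rough sketch of what that could look like (map_over_subtree and open_datatree are names borrowed from the DataTree prototype; the exact signatures here are assumptions):

    import xarray as xr

    def normalize(ds):
        # an ordinary Dataset -> Dataset computation built on apply_ufunc
        return xr.apply_ufunc(lambda a: (a - a.mean()) / a.std(), ds)

    # dt = open_datatree("example.nc")         # hypothetical loader
    # result = dt.map_over_subtree(normalize)  # applies normalize to each group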

@LunarLanding commented Mar 4, 2022

Often I run a function over a dataset, with each call outputting a hierarchical data structure, containing fixed dimensions in the best cases and variable lengths in the worst.

By "variable" length, do you mean that the length of dimensions differs between variables in the same group, or just that you don't know the length of the dimension in advance?

I mean that I might have, for instance, a map from 2 variables to data, i.e. (x,y)->c, that I can write as a DataArray XY with two dimensions x and y and the values being c.
Then I have a function f such that f(c)->d[g(c)], i.e. it yields an array whose length depends on c.
I wish I could say: apply f to XY, building a variable-length array as you get the output. It could be stored as a sparse matrix (X,Y,G).
This is a bit out of scope for this discussion, but it is related, since creating a differently named group per dimension length is often mentioned as a workaround (which does not scale when you have 1000x(variable-length dimension) data).

Is there a specific use case which you think would require explicit dimensions to solve?

The use-case is iteratively adding values to a dataset by mapping functions over multiple variables / dimensions in arbitrary compositions.
This happens in the context of data analysis, where you start with some source data and then iteratively create analysis functions, and then want to query / display / do statistics/reductions on the set of original data + analysis.
Explicit hierarchical dimensions allow for merging and referring to data with no collisions in a single datatree/group.

PS: in netCDF-4, dimensions are seen by children, which matches what I previously posted; in HDF5, nodes are hard links to the actual data, which might be exactly the xarray-datagroup posted above.

Example of an ideal data structure

The data structure that is most useful for this kind of analysis is an arbitrary graph of n-dimensional arrays; forcing the graph to have hierarchical access allows optional organization; the graph itself can exist as python objects for nodes and references for edges.
If the tree is not necessary/required, everything can be placed on the first level, as is done in a Dataset.

Example:

Notation

  • a:b value a has type b
  • t[...,n,...] : type of data array of values of type t, with axis of length n
  • D(n(,l)) dimension of size n with optional labels l
  • A(t,*(dims:tuple[D])) : type of data array of values of type t, with dimensions dims
  • a tree node T is either:
    • a dict from hashables to tree nodes, dict[Hashable,T]
    • a dimension D
    • a data array A
  • a[*tags]:=a[tag[0]][tag[1]]...[tag[len(tag)-1]]
  • map(f,*args:A,dims:tuple[D]) maps f over args broadcasting over dims

Start with a 2-dimensional DataArray:

 d0
 (
    Graph : (
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
    )
    Tree : (
        {
            'x':x,
            'y':y,
            'v':v,
        }
    )
 )

Map a function f that introduces a new dimension w with constant labels f_w_l:int[f_w_n] (through map_blocks or apply_ufunc) and add it to d0:

 f : x:float->(
    Graph:
        f_w->D(f_w_n,f_w_l)
        a->A(float,f_w)
        b->A(float)
    Tree:
        {
         'w':f_w,
         'a':a,
         'b':b,
        })

 d1=d0.copy()
 d1['f']=map(
        f,
        d0['v'],
        (d0['x'],d0['y'])
    )
 d1
  (
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            }
        }
 )

Map a function g that has a dimension of the same name but a different meaning, and therefore possibly a different length g_w_n and labels g_w_l:

g : x:float->(
    Graph:
        g_w->D(g_w_n,g_w_l)
        a->A(float,g_w)
        b->A(float)
    Tree:
        {
         'w':g_w,
         'a':a,
         'b':b,
        })

 d2=d1.copy()
 d2['g']=map(
        g,
        d1['v'],
        (d1['x'],d1['y'])
    )
 d2
  (
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
        g_w->D(g_w_n,g_w_l)
        g_a->A(float,x,y,g_w)
        g_b->A(float,x,y)
        
    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            },
            'g':{
                'w':g_w,
                'a':g_a,
                'b':g_b,
            }
            
        }
 )

Notice that both f and g output a dimension named 'w' but that they have different lengths and possibly different meanings.

Suppose I now want to run an analysis on f's and g's output, with a function h that takes two a's and outputs a float.
Then d3 looks like:

h : a1:float,a2:float->(
    Graph:
        r->A(float)
    Tree:
        r
    )

 d3=d2.copy()
 d3['f_g_aa']=map(
        h,
        d2['f','a'],d2['g','a'],
        (d2['x'],d2['y'],d2['f','w'],d2['g','w'])
    )
 d3
 {
    Graph :
        x->D(x_n,float[x_n])
        y->D(y_n)
        v->A(float,x,y)
        f_w->D(f_w_n,f_w_l)
        f_a->A(float,x,y,f_w)
        f_b->A(float,x,y)
        g_w->D(g_w_n,g_w_l)
        g_a->A(float,x,y,g_w)
        g_b->A(float,x,y)
        f_g_aa->A(float,x,y,f_w,g_w)

    Tree :
        {
            'x':x,
            'y':y,
            'v':v,
            'f':{
                'w':f_w,
                'a':f_a,
                'b':f_b,
            },
            'g':{
                'w':g_w,
                'a':g_a,
                'b':g_b,
            }
            'f_g_aa': f_g_aa
        }
 }

Compared to what I posted before, I dropped resolving the dimension for an array by its position in the hierarchy, since it would be inapplicable when a variable refers to dimensions in a different branch of the tree.

@tacaswell commented:

@LunarLanding You may also be interested in awkward array.

@jakirkham commented:

Wanted to note issue ( carbonplan/ndpyramid#10 ) here, which may be of interest to people here.

Also we are thinking about a Dask blogpost in this space if people have thoughts on what we should include and/or are interested in being involved. Details in issue ( dask/dask-blog#141 ).
