Consider indicating inherited coordinates & dimensions in DataTree repr #9463

shoyer · 2024-09-09T15:55:37Z

What is your issue?

Showing these in the repr makes it harder to understand the DataTree at a glance. In particular, it makes it impossible to immediately see if coordinates are duplicated or not, e.g., consider these two cases, which have the same repr but would serialize to Zarr differently:

In [6]: DataTree.from_dict({'/': Dataset(coords={'x': [1]}), '/sub': Dataset({'foo': 1})})
Out[6]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 1)
│   Coordinates:
│     * x        (x) int64 8B 1
│   Data variables:
│       *empty*
└── Group: /sub
        Dimensions:  (x: 1)
        Coordinates:
          * x        (x) int64 8B 1
        Data variables:
            foo      int64 8B 1

In [8]: DataTree.from_dict({'/': Dataset(coords={'x': [1]}), '/sub': Dataset({'foo': 1}, coords={'x': [1]})})
Out[8]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 1)
│   Coordinates:
│     * x        (x) int64 8B 1
│   Data variables:
│       *empty*
└── Group: /sub
        Dimensions:  (x: 1)
        Coordinates:
          * x        (x) int64 8B 1
        Data variables:
            foo      int64 8B 1

The simplest way to indicate inherited coordinates/dimensions would be to not display them at all. But maybe there is a different way we could indicate such dimensions/variables (less prominently).

The text was updated successfully, but these errors were encountered:

shoyer · 2024-09-09T17:27:42Z

Consider the following DataTree with inherited coordinates:

tree = DataTree.from_dict({
    '/': Dataset(coords={'x': [1]}),
    '/first_child': None,
    '/second_child': Dataset({'foo': ('x', [0])}),
})

Here is the current repr of the root and child nodes:


In [19]: tree
Out[19]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 1)
│   Coordinates:
│     * x        (x) int64 8B 1
│   Data variables:
│       *empty*
├── Group: /first_child
└── Group: /second_child
        Dimensions:  (x: 1)
        Coordinates:
          * x        (x) int64 8B 1
        Data variables:
            foo      (x) int64 8B 0

In [20]: tree['first_child']
Out[20]:
<xarray.DataTree 'first_child'>
Group: /first_child

In [21]: tree['second_child']
Out[21]:
<xarray.DataTree 'second_child'>
Group: /second_child
    Dimensions:  (x: 1)
    Coordinates:
      * x        (x) int64 8B 1
    Data variables:
        foo      (x) int64 8B 0

I think these could be improved by removing inherited coordinate/dimensions, but only if they already already showed on their parent in the same repr. In cases where we are printing a child node only, we should also indicate inherited coordinates/dimensions (because they determine what is valid on the child node). These could look something like:

In [19]: tree
Out[19]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 1)
│   Coordinates:
│     * x        (x) int64 8B 1
├── Group: /first_child
└── Group: /second_child
        Dimensions:  (x: 1)
        Data variables:
            foo      (x) int64 8B 0

In [20]: tree['first_child']
Out[20]:
<xarray.DataTree 'first_child'>
Group: /first_child
    Inherited dimensions:  (x: 1)
    Inherited coordinates:
      * x                  (x) int64 8B 1

In [21]: tree['second_child']
Out[21]:
<xarray.DataTree 'second_child'>
Group: /second_child
    Dimensions:  (x: 1)
    Inherited coordinates:
      * x        (x) int64 8B 1
    Data variables:
        foo      (x) int64 8B 0

TomNicholas · 2024-09-09T17:38:59Z

I think I generally like this suggestion, but the "Inherited dimensions" gets messy if you have a child node with some dimensions inherited and some not. Since dimensions are not in fact inherited (coordinate variables are inherited, which may have new dimensions), I think we should display "Dimensions: (x: 1)" regardless of whether or not the dimension came from an inherited coordinate or a local coordinate.

shoyer · 2024-09-09T17:44:33Z

Since dimensions are not in fact inherited (coordinate variables are inherited, which may have new dimensions)

Dimensions do get inherited. Consider:

In [28]: tree = DataTree(Dataset({'foo': ('x', [1, 2])}))

In [29]: tree = DataTree(Dataset({'foo': ('x', [1, 2])}), children={'child': DataTree()})

In [30]: tree
Out[30]:
<xarray.DataTree>
Group: /
│   Dimensions:  (x: 2)
│   Dimensions without coordinates: x
│   Data variables:
│       foo      (x) int64 16B 1 2
└── Group: /child

In [31]: tree['child'].ds = Dataset({'bar': ('x', [3])})
ValueError: group '/child' is not aligned with its parents:
Group:
    Dimensions:  (x: 1)
    Dimensions without coordinates: x
    Data variables:
        bar      (x) int64 8B 3
From parents:
    Dimensions:  (x: 2)
    Dimensions without coordinates: x

TomNicholas · 2024-09-09T17:51:08Z

My understanding from #9457 (comment) was that inheritance of dimensions during parent-child alignment is an internal implementation detail, and your example above is literally the only possible time that dimension will ever be displayed to the user. For all other intents and purposes I would have said that dimensions are not inherited - they are built from inheritable variables.

shoyer · 2024-09-09T17:55:31Z

My understanding from #9457 (comment) was that inheritance of dimensions during parent-child alignment is an internal implementation detail, and your example above is literally the only possible time that dimension will ever be displayed to the user. For all other intents and purposes I would have said that dimensions are not inherited - they are built from inheritable variables.

I would consider the invalid Dataset object (to make the error message) an internal implementation detail, but the inherited dimension should be considered part of the data model of the child DataTree.

TomNicholas · 2024-09-09T18:04:14Z

but the inherited dimension should be considered part of the data model of the child DataTree.

Maybe for developers, but I'm still not convinced that users should have to care about this distinction. After all, we do not allow users any public way to create a Dataset that has a dimension with no corresponding variable, so from an API perspective I would argue therefore dimensions are not part of the public data model.

To be concrete, if we have a child with both "inherited" and non-inherited dimensions like this

tree = DataTree.from_dict({
    '/': Dataset(coords={'x': [1]}),
    '/child': Dataset({'foo': ('y', [0])}),
})

your suggestion would display the dimensions by separating them into inherited and non-inherited (I've chosen to display them adjacent to the lists of coordinate variables):

In [9]: tree['child']
Out[9]: 
<xarray.DataTree 'child'>
Group: /child
    Inherited Dimensions:  (x: 1)
    Inherited Coordinates:
      * x        (x) int64 8B 1
    Dimensions:  (y: 1)
    Dimensions without coordinates: y
    Data variables:
        foo      (y) int64 8B 0

whereas mine would simply list them all under a common Dimensions heading, with no indication of which came from inherited coordinate variables:

In [9]: tree['child']
Out[9]: 
<xarray.DataTree 'child'>
Group: /child
    Dimensions:  (x: 1, y: 1)
    Inherited Coordinates:
      * x        (x) int64 8B 1
    Dimensions without coordinates: y
    Data variables:
        foo      (y) int64 8B 0

I think one way the latter is nice is that now you can still see all the accessible dimensions on the entire child dataset on one line.

EDIT: If I had 100 inherited coordinate variables, the former layout means I have to scroll down a long way to find out the dimensions of the node's data variables. But if I put everything at the top like this (where the node could have additional non-inherited coordinates in general),

In [9]: tree['child']
Out[9]: 
<xarray.DataTree 'child'>
Group: /child
    Inherited Dimensions:  (x: 1)
    Dimensions:  (y: 1)
    Inherited Coordinates:
      * x        (x) int64 8B 1
      ...
    Coordinates:
      ...
    Dimensions without coordinates: y
    Data variables:
        foo      (y) int64 8B 0
        ...

now Dimensions could potentially be quite far away from the Data variables/Coordinates which depend on those non-inherited Dimensions.

shoyer · 2024-09-09T18:12:43Z

I agree, users don't need to userstand the distinction between dimensions & inherited dimensions. From a user perspective, neither can be edited directly.

In that case, do we always display inherited dimensions as part of Dimensions? Or only if they are not already shown on a parent?

TomNicholas · 2024-09-09T18:19:02Z

In that case, do we always display inherited dimensions as part of Dimensions? Or only if they are not already shown on a parent?

Good question. I think that if our general approach above is motivated by a principle of "only show dimensions immediately adjacent to any variables that actually depend on those dimensions", then for consistency we should only show inherited dimensions if not already shown on a parent.

Fixes pydata#9463

* Update DataTree repr to indicate inheritance Fixes pydata#9463 * fix whitespace * add more repr tests, fix failure * fix failure on windows * fix repr for inherited dimensions

shoyer added needs triage Issue that has not been reviewed by xarray team member topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 9, 2024

shoyer mentioned this issue Sep 9, 2024

Fix inheritance in DataTree.copy() #9457

Merged

2 tasks

TomNicholas mentioned this issue Sep 9, 2024

Clarify which coordinate variables are inherited in the datatree repr #9464

Closed

TomNicholas mentioned this issue Sep 9, 2024

ds.coords.dims can display dims not present in the coordinate vars #9466

Open

shoyer added a commit to shoyer/xarray that referenced this issue Sep 9, 2024

Update DataTree repr to indicate inheritance

67cc75f

Fixes pydata#9463

shoyer mentioned this issue Sep 9, 2024

Update DataTree repr to indicate inheritance #9470

Merged

2 tasks

shoyer closed this as completed in #9470 Sep 10, 2024

shoyer closed this as completed in a6bacfe Sep 10, 2024

This was referenced Sep 12, 2024

Track merging datatree into xarray #8572

Closed

DAS-2155 - Merge datatree documentation into main docs. #9033

Merged

DataTree repr showing inherited coordinates as non-inherited #9499

Closed

TomNicholas mentioned this issue Oct 16, 2024

Update Datatree html repr to indicate inheritance #9633

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider indicating inherited coordinates & dimensions in DataTree repr #9463

Consider indicating inherited coordinates & dimensions in DataTree repr #9463

shoyer commented Sep 9, 2024

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024 •

edited

Loading

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024 •

edited

Loading

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024

Consider indicating inherited coordinates & dimensions in DataTree repr #9463

Consider indicating inherited coordinates & dimensions in DataTree repr #9463

Comments

shoyer commented Sep 9, 2024

What is your issue?

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024 • edited Loading

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024 • edited Loading

shoyer commented Sep 9, 2024

TomNicholas commented Sep 9, 2024

TomNicholas commented Sep 9, 2024 •

edited

Loading

TomNicholas commented Sep 9, 2024 •

edited

Loading