Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Name permanence #116

Closed
TomNicholas opened this issue Jun 17, 2022 · 5 comments · Fixed by #172
Closed

Name permanence #116

TomNicholas opened this issue Jun 17, 2022 · 5 comments · Fixed by #172
Labels
bug Something isn't working design question

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Jun 17, 2022

Xarray Datasets don't alter the names of the objects they store:

In [1]: import xarray as xr

In [2]: da = xr.DataArray(name="b")

In [3]: ds = xr.Dataset()

In [4]: ds['a'] = da

In [5]: ds["a"].name
Out[5]: 'a'

In [6]: da.name
Out[6]: 'b'

After #41 (and #115 ensures it) then DataTree objects behave similarly for data variables they store

In [7]: from datatree import DataTree

In [8]: root = DataTree()

In [9]: root["a"] = da

In [10]: root["a"].name
Out[10]: 'a'

In [11]: da.name
Out[11]: 'b'

However, currently DataTree objects do alter the name of child DataTree nodes that they store.

In [12]: subtree = DataTree(name="b")

In [13]: root = DataTree()

In [14]: root["a"] = subtree

In [15]: root["a"].name
Out[15]: 'a'

In [16]: subtree.name
Out[16]: 'a'

I noticed this in #115, but fixing it might be a bit complex.

@TomNicholas TomNicholas added bug Something isn't working design question labels Jun 17, 2022
@TomNicholas
Copy link
Member Author

The reason fixing this is complicated is that Dataset doesn't actually store DataArray objects, instead it stores lower-level Variable objects, which are unnamed. It then recreates a new DataArray object when __getitem__ is called.

Implementing a similar solution with DataTree objects would require DataTree to not store other DataTree objects, but instead store a lower-level unnamed node object. Then when DataTree.__getitem__ was called we would recreate a new DataTree object from the underlying node object.

I'm trying to think about whether that design might cause any other consistency problems though...

@TomNicholas
Copy link
Member Author

@shoyer wondering if you have an opinions on this? Do you think this name behaviour is a subtle but important feature of xarray or a unimportant detail?

@shoyer
Copy link

shoyer commented Jul 12, 2022

I do think this behavior is important. I would definitely consider alternative data structures for internal storage.

@TomNicholas
Copy link
Member Author

Okay I think a design like this would work, where DataNode is analogous to Variable in that it's a private internal class that contains data but doesn't know it's own name.

Notice that you could technically have an entire tree of only DataNodes (because they still store the names of children via dict keys), but we would ensure the user always gets returned a DataTree which points to DataNodes.

from typing import TypeVar, Generic, Optional, OrderedDict, Dict, Mapping, Hashable, Any, Union

from xarray.core.variable import Variable
from xarray.core.utils import Frozen


Tree = TypeVar("Tree", bound="TreeNode")


class TreeNode(Generic[Tree]):
    """Unnamed tree node"""

    _parent: Optional[Tree]
    _children: OrderedDict[str, Tree]

    @property
    def children(self) -> Mapping[str, Tree]:
        return Frozen(self._children)


class NamedNodeMixin:
    """
    Enables a node to know its own name.

    Implements path-like relationships to other nodes in its tree.
    """

    _name: Optional[str]

    @property
    def name(self) -> Union[str, None]:
        """The name of this node."""
        return self._name

    @property
    def path(self) -> str:
        ...

    def relative_to(self: NamedNodeMixin, other: NamedNodeMixin) -> str:
        ...


class DataNode(TreeNode, Generic[Tree]):
    """Unnamed tree node which also stores data."""

    # TODO this could potentially inherit from Dataset?

    _parent: Optional[DataNode]
    _children: OrderedDict[str, DataNode]
    _variables: Dict[Hashable, Variable]
    ...


class DataTree(DataNode, NamedNodeMixin, Generic[Tree]):
    _name: Optional[str]
    _parent: Optional[DataTree]
    _children: OrderedDict[str, DataNode]
    _variables: Dict[Hashable, Variable]
    ...

    @property
    def children(self) -> Mapping[str, DataTree]:
        children_as_datatrees = {
            name: DataTree._construct_from_datanode(
                name=name, datanode=child,
            )
            for name, child in self._children
        }
        return Frozen(children_as_datatrees)

    @classmethod
    def _construct_direct(
            cls,
            name: str | None = None,
            parent: DataTree | None = None,
            children: OrderedDict[str, DataNode] = None,
            variables: dict[Any, Variable] = None,
    ) -> DataTree:
        """Shortcut around __init__ for internal use when we want to skip costly validation."""
        ...

    @classmethod
    def _construct_from_datanode(cls, name: str, datanode: DataNode) -> DataTree:
        return DataTree._construct_direct(
            name=name,
            parent=datanode._parent,
            children=datanode._children,
            variables=datanode._variables,
        )

    def _construct_datatree(self, name: str) -> DataTree:
        """Used when a single child node is requested"""
        child_datanode = self._children[name]
        return DataTree._construct_from_datanode(name=name, datanode=child_datanode)

@TomNicholas
Copy link
Member Author

Alternatively Stephan suggested today that this might be achieved just by ensuring that group insertion occurs by making a shallow copy instead.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working design question
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants