Hierarchical labelling of Dataset variables #4665

Closed
Robileo opened this issue Dec 9, 2020 · 3 comments

Robileo commented Dec 9, 2020

Context and problem
In my everyday work I use pandas to store and process system simulation results. One feature of the pandas DataFrame that the xarray Dataset lacks prevents me from switching to xarray.

I often have to store hierarchical data such as "rotational speed of engine A of system B". In a DataFrame, I label this variable with a tuple ('system_B', 'engine_A', 'speed'). This results in a hierarchical labelling of the DataFrame variables: df.system_B.engine_A.speed gives a Series containing the speed values, and df.system_B gives a DataFrame containing all the variables of system B.

Here is an example:

import numpy as np
import pandas as pd

t = np.arange(0, 5, 0.5)
speed = np.cos(t)

columns = pd.MultiIndex.from_tuples([('system_B', 'engine_A', 'speed')])

df = pd.DataFrame(speed, columns=columns, index=t)

print(df.system_B.engine_A.speed)  # Gives a pd.Series
print(df.system_B)  # Gives a pd.DataFrame

I can make a Dataset from this DataFrame, but accessing the variables is more difficult (in v0.16.2):

import xarray as xr
ds = xr.Dataset(df)
print(ds[('system_B', 'engine_A', 'speed')]) # OK
print(ds.system_B.engine_A.speed)  # FAIL
print(ds[('system_B')]) # FAIL

I cannot benefit from IPython code completion, and I cannot easily access all the variables of system_B.

Solution
A solution could be to copy the behaviour of the pandas DataFrame:

  • The internal structure of Dataset does not change. The Dataset stays flat: all the variables share the same dimensions and the same coordinates. Only the access to variables labelled with tuples changes.
  • In my example: when the user types ds.system_B and no variable named system_B is found, xarray returns a Dataset containing all the variables labelled with a tuple beginning with 'system_B'. As a result, ds.system_B gives a Dataset and ds.system_B.engine_A.speed gives a DataArray.
  • If possible, the solution should be compatible with IPython code completion.
  • Additional features could be added, such as the reorder_levels method of the pandas DataFrame.
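
The lookup rule described in the list above can be sketched in plain Python, independently of xarray. This is only an illustration under stated assumptions: `TupleNamespace` is a hypothetical class, not an xarray API, and a plain dict keyed by tuples stands in for the flat Dataset. The sample values are placeholders.

```python
class TupleNamespace:
    """Hypothetical sketch: attribute access over a flat mapping keyed by tuples."""

    def __init__(self, variables):
        # The storage stays flat: one dict entry per tuple label.
        self._variables = dict(variables)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        variables = self._variables
        # A variable labelled exactly (name,) takes precedence.
        if (name,) in variables:
            return variables[(name,)]
        # Otherwise gather every variable whose label begins with `name`
        # and drop that first level from the labels.
        sub = {
            key[1:]: value
            for key, value in variables.items()
            if len(key) > 1 and key[0] == name
        }
        if not sub:
            raise AttributeError(name)
        return TupleNamespace(sub)

    def __dir__(self):
        # Listing first-level names here is what lets tab completion see them.
        first_levels = {key[0] for key in self._variables if key}
        return sorted(set(super().__dir__()) | first_levels)

    def keys(self):
        return list(self._variables)


ds = TupleNamespace({
    ("system_B", "engine_A", "speed"): [1.0, 0.9],      # placeholder values
    ("system_B", "engine_A", "position"): [0.0, 0.5],   # placeholder values
    ("system_A", "temperature"): [300.0, 301.0],        # placeholder values
})

print(ds.system_B.keys())          # labels below system_B, first level removed
print(ds.system_B.engine_A.speed)  # the stored value itself
```

In this sketch the hierarchy exists only at access time: each attribute lookup filters the flat mapping by prefix, so nothing about the underlying storage has to change.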

Additional context
Hierarchical data have already been discussed in previous issues, including #1092 and #4118. The questions related to this topic are complex: should levels share coordinates with each other? How should slicing, concatenation, etc. be handled?
The feature I propose does not fulfil all the needs, but it allows hierarchical ordering of variables without changing the internal structure of Dataset. This feature has already been proposed in this comment of #1092 but does not seem to have been implemented since then.

Member

jhamman commented Dec 9, 2020

@Robileo - thanks for the interesting question. Quick question: would it work to reshape your data into an N-D array, i.e. t x engine x system?

Author

Robileo commented Dec 10, 2020

@jhamman thanks for your answer. In the following situation, making an N-D array would work:

|
|-engine_A
|    |---- speed
|
|-engine_B
|    |---- speed
|
|-engine_C
|    |---- speed

The array speed would have a dimension engine with coordinates ('A', 'B', 'C'). But my data often look like this:

|-engine
|    |---- speed
|    |---- position
|
|-cooling_system
|    |---- temperature
|    |---- pump
|           |---- pressure
|           |---- mass_flow

In that case, finding common dimensions between all the variables is much less intuitive. But labeling the variables with tuples is quite easy:

(engine, speed)
(engine, position)
(cooling_system, temperature)
(cooling_system, pump, pressure)
(cooling_system, pump, mass_flow)
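
As a rough sketch of how the tree above maps to these tuple labels, a nested dict can be flattened recursively. The helper name `flatten` and the `None` placeholders are assumptions made for illustration; real variables would hold arrays.

```python
def flatten(tree, prefix=()):
    """Yield (label, value) pairs, where label is the tuple path to each leaf."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict):
            # Inner node: descend and extend the path.
            yield from flatten(child, path)
        else:
            # Leaf: the path is the variable's tuple label.
            yield path, child


# Same structure as the tree above; None stands in for the actual data.
tree = {
    "engine": {"speed": None, "position": None},
    "cooling_system": {
        "temperature": None,
        "pump": {"pressure": None, "mass_flow": None},
    },
}

labels = [label for label, _ in flatten(tree)]
for label in labels:
    print(label)
```

The labels have different lengths (two or three levels), which is why no single set of shared dimensions fits this data naturally.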

Pandas supports this hierarchical labeling pretty well. But when I run several simulations my data are multi-dimensional, and xarray becomes very interesting. If xarray had this feature, it would be the ideal tool!

Member

shoyer commented Jul 2, 2021

Closing this issue to keep discussion consolidated in #4118.
