Dataset.to_dataframe() dimension order is not alphabetically sorted by default #9653

Closed
5 tasks done
mgunyho opened this issue Oct 21, 2024 · 4 comments · Fixed by #9662

@mgunyho
Contributor

mgunyho commented Oct 21, 2024

What happened?

Hi, I noticed that the documentation for Dataset.to_dataframe() says that "by default, dimensions are sorted alphabetically". This is in contrast with DataArray.to_dataframe(), where the order is given by the order of the dimensions in the DataArray, which was discussed in this comment.

However, Dataset.to_dataframe() does not in fact sort the dimensions alphabetically, as this example on current main (8f6e45b) shows:

import xarray as xr
ds = xr.Dataset({
    "foo": xr.DataArray(0, coords=[("y", [1, 2, 3]), ("x", [4, 5, 6])]), 
})
print(ds.to_dataframe()) 

I get

     foo
y x     
1 4    0
  5    0
  6    0
2 4    0
  5    0
  6    0
3 4    0
  5    0
  6    0

What did you expect to happen?

The dimensions in the output should be sorted alphabetically, like this:

     foo
x y     
4 1    0
  2    0
  3    0
5 1    0
  2    0
  3    0
6 1    0
  2    0
  3    0
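
For reference, a minimal sketch of how the alphabetical order shown above can be requested explicitly rather than expected by default. It assumes the `dim_order` parameter of `Dataset.to_dataframe()` and that all dimension names are strings (otherwise `sorted()` may fail); the variable name `df_sorted` is only illustrative.

```python
import xarray as xr

# Same dataset as in the report: "foo" has dimensions ("y", "x").
ds = xr.Dataset({
    "foo": xr.DataArray(0, coords=[("y", [1, 2, 3]), ("x", [4, 5, 6])]),
})

# Request alphabetical dimension order explicitly via dim_order.
# sorted() works here only because both dimension names are strings.
df_sorted = ds.to_dataframe(dim_order=sorted(ds.sizes))

# Index level names of the resulting MultiIndex.
print(list(df_sorted.index.names))
```

With this call the index levels should come out as x first, then y, matching the expected table.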

Minimal Complete Verifiable Example

See above

MCVE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.7 (main, Oct 1 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)]
python-bits: 64
OS: Linux
OS-release: 6.11.3-200.fc40.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.9.1.dev73+g8f6e45ba
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

mgunyho added the bug and needs triage labels Oct 21, 2024
@keewis
Collaborator

keewis commented Oct 21, 2024

this looks like a documentation bug: we can't really sort non-string names alphabetically, so instead we should remove that claim. PRs welcome!
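
The point about non-string names can be illustrated in plain Python. This is a sketch under the assumption that dimension names may be arbitrary hashable objects rather than only strings; `try_sort` is a hypothetical helper, not part of xarray.

```python
def try_sort(names):
    """Return the names sorted, or None if they cannot be compared."""
    try:
        return sorted(names)
    except TypeError:
        # Python 3 refuses to order values of unrelated types, e.g. str vs int.
        return None


# All-string names sort alphabetically without trouble.
print(try_sort(["y", "x"]))

# Mixing a string name with an integer name has no defined ordering.
print(try_sort(["x", 0]))
```

A default that sorts names would therefore either fail or need a special case for mixed-type names, which is why preserving the existing order is the simpler rule to document.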

keewis added the topic-documentation label and removed the needs triage label Oct 21, 2024
@mgunyho
Contributor Author

mgunyho commented Oct 21, 2024

Makes sense, I was also a bit surprised to find this inconsistent behavior discussed in that issue comment.

I suppose the correct wording would be something like "the dimensions are in the order in which they appear in the DataArrays in the dataset"? This seems to be the behavior, based on trying different orders of the dictionary elements in this example:

import xarray as xr

ds = xr.Dataset({
    "foo": xr.DataArray(coords=[("x", [1, 2, 3]), ("y", [1, 2, 3])]),
    "bar": xr.DataArray(coords=[("y", [1, 2, 3]), ("x", [1, 2, 3])]),
    "baz": xr.DataArray(coords=[("x", [1, 2, 3])]),
    "qux": xr.DataArray(coords=[("y", [1, 2, 3])]),
})

print(ds.to_dataframe())

dcherian removed the bug label Oct 21, 2024
@shoyer
Member

shoyer commented Oct 21, 2024

We used to sort dimension names in Dataset.dims, which in turn determined the order of the DataFrame index levels. This is no longer the case; see #4753.

So yes, this is definitely worthy of updating/fixing the documentation!

@shoyer
Member

shoyer commented Oct 21, 2024

> I suppose the correct wording would be something like "the dimensions are in the order in which they appear in the DataArrays in the dataset"? This seems to be the behavior, based on trying different orders of the dictionary elements in this example:

I would say: "Dimensions appear in the same order as Dataset.sizes (which is also the order of appearance on variables)."
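
A small check of that wording, as a sketch rather than a specification: it assumes an xarray version in which dimension names are no longer sorted (that is, after the change referenced as #4753), and the names `dims_in_dataset` and `levels_in_frame` are illustrative.

```python
import xarray as xr

# "bar" introduces dimensions in the order ("y", "x"); "baz" only uses "x".
ds = xr.Dataset({
    "bar": xr.DataArray(0, coords=[("y", [1, 2, 3]), ("x", [4, 5, 6])]),
    "baz": xr.DataArray(0, coords=[("x", [4, 5, 6])]),
})

# Order of dimensions as reported by the dataset itself.
dims_in_dataset = list(ds.sizes)

# Order of the MultiIndex levels produced by to_dataframe().
levels_in_frame = list(ds.to_dataframe().index.names)

print(dims_in_dataset, levels_in_frame)
```

If the proposed wording is accurate, the two printed lists are identical, whatever order the dataset happens to hold.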
