-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
isel returns one index per coordinate for multi-coordinate custom indexes #10063
Comments
Honestly, I only use the most basic isel operations since I found many times that these other ones had surprising behaviors. |
If by "other ones" you mean pandas multi-index then yes I agree some behavior may be surprising... hopefully this will improve in the future. PandasMultiIndex is currently the main multi-coordinate xarray index that I'm aware of (letting aside experiments and prototypes). Other more recent multi-coordinate indexes (such as CoordinateTransformIndex added in #9543) enable long-awaited features so I think it is important to make it work besides the more basic cases. Would it be OK to revert the fast path isel function unless (or until) we can fix it with multi-coordinate indexes or find alternative optimization approaches? |
Honestly, I think that xarray performance has been a pain point for many users beyond just me. Its easy to create nice syntax for toy examples, but when you get to 100's of variables of metadata, which is why one would come to xarray, then the ability to index quickly, is critical. I understand the desire to have multi-indexers, but many times, I've simply used "orthogonal indexing" which doesn't require the use of a specific multi-indexer. I suspect many users are in the same boat. its funny that this issue is happening now, I was hoping to submit benchmark PRs based on the dataset i used in that PR. I've honestly been using xarray for years, and never used I really would be sad if these performance optimizations would be reverted. |
So that I don't forget to say it, thank you for the thorough investigation. Is there any way to have "MultiIndexers" declare themselves? I am happy if the optimizations only apply to the case where there exactly 0 multiindexers in the dataset. |
The Xarray index refactor has been initiated not a long time ago and is still work in progress (documentation, etc.) so I expect it will still take some time before it is more widely used at its full potential across the Xarray ecosystem. But the potential is there (#7041) and it has been part of the roadmap since quite some time.
So has been other aspects such as the issues linked in #9543. (on a related note, coordinate transforms may bring large performance gains for pretty common use cases).
I agree on this. There is probably still many places where unnecessary variable / index copies or iterations can be avoided. This inevitably happens after many refactors and organic growth. On such code base this would require a good amount of effort to clean it up, though.
I'm afraid there isn't any currently. Anyway, I'll try to find optimization alternatives working with multiindexers without affecting too much the benchmarks that you've done in #9002. |
IMO we can prioritize correctness over performance. For now, can we opt-in to the fastpath when all indexers are |
That sounds reasonable to me. |
I agree. |
I was trying to find a way to detect this, but it isn't obvious to me how. |
Thank you @benbovy for elevating my "solution" |
What happened?
#9002 introduced a fast path function for applying
isel
to each index found in a Dataset or DataArray. This function doesn't seem to work for multi-coordinate indexes, as notified by a comment in the function and as illustrated below:A special case has been added for PandasMultiIndex to avoid that issue. However we expect there will be other multi-coordinate Xarray indexes besides PandasMultiIndex (see, e.g., #10055 for a concrete example), living either in Xarray or in 3rd-party libraries. This makes difficult skipping the fast path function based solely on the index type.
From #9002:
@hmaarrfk any idea on how to avoid or optimize this in a way that works for arbitrary multi-coordinate indexes?
Another possible optimization could be caching
_id_coord_names
more aggressively (e.g., copy the cached value - if any - inIndexes.copy()
) and/or caching theIndexes
object returned byDataset.xindexes
(currently a new object is created at each call).What did you expect to happen?
A unique index for each index coordinate after
isel
:Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
xarray: 2024.1.2.dev898+g406b03b70
pandas: 2.2.3
numpy: 2.0.2
scipy: 1.15.1
netCDF4: 1.7.2
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: 2.18.4
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.12.1
distributed: 2024.12.1
matplotlib: 3.10.0
cartopy: None
seaborn: None
numbagg: 0.8.2
fsspec: 2024.12.0
cupy: None
pint: None
sparse: None
flox: 0.9.15
numpy_groupies: 0.11.2
setuptools: 75.8.0
pip: 24.3.1
conda: None
pytest: 8.3.4
mypy: 1.15.0
IPython: 8.31.0
sphinx: None
The text was updated successfully, but these errors were encountered: