Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible indexes: add Index base class and xindexes properties #5102

Merged
merged 23 commits into from
May 11, 2021

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Apr 2, 2021

This PR clears up the path for flexible indexes:

  • it adds a new IndexAdapter Index base class that is meant to be inherited by all xarray-compatible indexes (built-in or 3rd-party)
  • PandasIndexAdapter now inherits from IndexAdapter Index
  • the xarray_obj.xindexes properties return Index (PandasIndexAdapter) instances. xarray_obj.indexes properties still return pandas.Index instances.

The latter is a breaking change, although I'm not sure if the indexes property has been made public yet.

This is still work in progress, there are many broken tests that are not fixed yet. (EDIT: all tests should be fixed now).

There's a lot of dirty fixes to avoid circular dependencies and in the many places where we still need direct access to the pandas.Index objects, but I'd expect that these will be cleaned-up further in the refactoring.

@shoyer
Copy link
Member

shoyer commented Apr 2, 2021

  • the xarray_obj.indexes properties now returns IndexAdapter (PandasIndexAdapter) instances instead of pandas.Index instances

The latter is a breaking change, although I'm not sure if the indexes property has been made public yet.

This is indeed unfortunately a public API, so we should think about how to roll this out with minimal disruption.

For example: maybe .indexes should continue to return pandas.Index objects for now, by unwrapped IndexAdapters sorted in ._indexes?

@benbovy
Copy link
Member Author

benbovy commented Apr 3, 2021

maybe .indexes should continue to return pandas.Index objects for now, by unwrapped IndexAdapters sorted in ._indexes?

Yes we could make a special case for pandas indexes. This would also make the refactoring easier now since .indexes is used internally in quite many places that expect pandas.Index objects. Not sure if it's a good solution in the mid/long term, though.

For example, it would be nice to move the logic implemented in convert_label_indexer into PandasIndexAdapter and PandasMultiIndexAdapter classes and call methods of those classes instead of dealing directly with pandas index objects. For this example (and perhaps others) we could use _indexes internally. Alternatively it might make sense to have an .index_adapters property to make things a bit clearer (this may be welcome even for internal purpose IMO).

@shoyer
Copy link
Member

shoyer commented Apr 4, 2021

Rather than xarray.IndexAdapter, maybe we should just call this new object xarray.Index? Calling this object an "adapter" diminishing its importance in Xarray's future API.

I agree that switching the return type of .indexes is probably worthy of a breaking change -- but that breaking change should be done intentionally, once the new indexing functionality works and we are ready to make a major release. We may also want a deprecation cycle. What we don't want to do is change things in an incomplete way now, in a way that makes it hard for us to issue a bug-fix release.

To make development easier, I would suggest adding a new attribute to xarray.Dataset and xarray.DataArray that exposes the new data model, e.g., perhaps .xindexes as short for "xarray indexes". We would then:

  1. Immediately switch xarray to use .xindexes instead of .indexes internally.
  2. Once the new indexing functionality is ready, encourage users to gradually switch from .indexes -> .xindexes by issuing a FutureWarning warning.
  3. After an appropriate period of time, consider making .indexes an alias for .xindexes in a breaking release.

@benbovy
Copy link
Member Author

benbovy commented Apr 6, 2021

Agreed for adding another property along with a couple of depreciation cycles for smooth transition on what is returned by .indexes. I've suggested .index_adapters in my previous comment but I prefer .xindexes.

xarray.Index makes sense too. For subclasses wrapping another index class perhaps it might be still be relevant to use the Adapter suffix, for example to distinguish between a pandas.Index object and a xarray.PandasIndexAdapter object.

@benbovy benbovy marked this pull request as ready for review April 29, 2021 14:09
@benbovy benbovy marked this pull request as draft April 29, 2021 14:10
@pep8speaks
Copy link

pep8speaks commented Apr 29, 2021

Hello @benbovy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-10 15:16:46 UTC

@benbovy benbovy marked this pull request as ready for review April 30, 2021 14:36
@benbovy
Copy link
Member Author

benbovy commented Apr 30, 2021

This is ready for review!

I implemented @shoyer's suggestions and finished updating the (many) places where Xarray directly uses pandas indexes.

I also added a to_pandas_index() method to the Index base class so that we can easily raise when an Xarray operation doesn't support flexible indexes (i.e., indexes that cannot be cast to a pd.Index).

There might still be some quick fixes that we could clean up now, but I think that most things will have to be cleaned up later in the refactoring once additional features/classes/etc. are implemented.

I haven't wrote tests for the Index base class yet as this is still very much preliminary work, I plan to do it in follow-up PRs once the API becomes more stable.

@benbovy benbovy changed the title Flexible indexes: add and use IndexAdapter base class Flexible indexes: add Index base class and xindexes properties Apr 30, 2021
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great start! I have only a few minor concerns :)

xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
xarray/tests/test_dataarray.py Outdated Show resolved Hide resolved
xarray/tests/test_dataarray.py Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
@benbovy
Copy link
Member Author

benbovy commented May 4, 2021

Thanks for the review @shoyer, I addressed your comments.

@benbovy benbovy mentioned this pull request May 4, 2021
5 tasks
Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments from someone who's excited for fast and lazy coords. :)

xarray/core/indexes.py Outdated Show resolved Hide resolved
Comment on lines +118 to +120
@property
def shape(self) -> Tuple[int]:
return (len(self.array),)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These len() operations adds up and shape is a very popular property in xarray code. I think you should consider caching the property with for example @pandas.util.cache_readonly to speed it up quite a bit.

The cache would have to be cleared if self.array changed size after init, but should that even be allowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably be welcome indeed. Let's save this for later, we'll need to see if/how we can formalize an (optional) nd-array interface for any xarray.Index (which would be required to reuse the index data as coordinate data like it's the case for pandas indexes but maybe other xarray indexes in the future).

The cache would have to be cleared if self.array changed size after init, but should that even be allowed?

That should not be allowed.

xarray/core/indexes.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataarray.py Outdated Show resolved Hide resolved

return result

def transpose(self, order) -> pd.Index:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this not return a PandasIndex?

Copy link
Member Author

@benbovy benbovy May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That's something I overlooked, probably among other things. That would indeed make sense to return a PandasIndex, although it might not be necessary (at least for now) as the returned pd.Index is later converted into a PandasIndex elsewhere (e.g., like in Variable.transpose). That might even not be desirable as currently (if I'm correct) creating a new PandasIndex from a PandasIndex re-builds the whole underlying pd.Index. Or alternatively we need a fastpath creation for this specific case.

@benbovy
Copy link
Member Author

benbovy commented May 10, 2021

Should we merge this?

In follow-up PRs, I plan to:

  • Create a PandasMultiIndex class and refactor all places in Xarray that rely on pd.MultiIndex (internal changes only).
  • Add public API to Xarray Index for label-based data selection and refactor/move the logic in convert_label_indexer into the Xarray PandasIndex and PandasMultiIndex classes.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- let's merge this and do improvements / further work in follow-up PRs :)

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/core/indexes.py Outdated Show resolved Hide resolved
@shoyer
Copy link
Member

shoyer commented May 10, 2021

(Feel free to self-merge after fixing the merge conflict! My suggested fix can be done later, I don't want this to block you)

@benbovy
Copy link
Member Author

benbovy commented May 11, 2021

All right let's merge this! Thanks everyone for your review comments.

@benbovy benbovy merged commit 6e14df6 into pydata:master May 11, 2021
dcherian added a commit to TomNicholas/xarray that referenced this pull request May 13, 2021
* upstream/master:
  combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  Explained what a deprecation cycle is (pydata#5289)
  Code cleanup (pydata#5234)
  FacetGrid docstrings (pydata#5293)
  Add whats new for dataset interpolation with non-numerics (pydata#5297)
  Allow dataset interpolation with different datatypes (pydata#5008)
  Flexible indexes: add Index base class and xindexes properties (pydata#5102)
  pre-commit: autoupdate hook versions (pydata#5280)
  convert the examples for apply_ufunc to doctest (pydata#5279)
  fix the new whatsnew section
  Ensure `HighLevelGraph` layers are `Layer` instances (pydata#5271)
dcherian added a commit to matzegoebel/xarray that referenced this pull request May 13, 2021
* upstream/master: (23 commits)
  combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  Explained what a deprecation cycle is (pydata#5289)
  Code cleanup (pydata#5234)
  FacetGrid docstrings (pydata#5293)
  Add whats new for dataset interpolation with non-numerics (pydata#5297)
  Allow dataset interpolation with different datatypes (pydata#5008)
  Flexible indexes: add Index base class and xindexes properties (pydata#5102)
  pre-commit: autoupdate hook versions (pydata#5280)
  convert the examples for apply_ufunc to doctest (pydata#5279)
  fix the new whatsnew section
  Ensure `HighLevelGraph` layers are `Layer` instances (pydata#5271)
  New whatsnew section
  Release-workflow: Bug fix (pydata#5273)
  more maintenance on whats-new.rst (pydata#5272)
  v0.18.0 release highlights (pydata#5266)
  Fix exception when display_expand_data=False for file-backed array. (pydata#5235)
  Warn ignored keep attrs (pydata#5265)
  Disable workflows on forks (pydata#5267)
  fix the built wheel test (pydata#5270)
  pypi upload workflow maintenance (pydata#5269)
  ...
dcherian added a commit to gcaria/xarray that referenced this pull request May 13, 2021
…e_units

* upstream/master:
  combine keep_attrs and combine_attrs in apply_ufunc (pydata#5041)
  Explained what a deprecation cycle is (pydata#5289)
  Code cleanup (pydata#5234)
  FacetGrid docstrings (pydata#5293)
  Add whats new for dataset interpolation with non-numerics (pydata#5297)
  Allow dataset interpolation with different datatypes (pydata#5008)
  Flexible indexes: add Index base class and xindexes properties (pydata#5102)
benbovy added a commit to benbovy/xarray that referenced this pull request Jul 29, 2021
Revert some changes made in pydata#5102 + additional (temporary) fixes.
benbovy added a commit that referenced this pull request Aug 9, 2021
* split index / coordinate variable(s)

- Pass Variable objects to xarray.Index constructor
- The index should create IndexVariable objects (`coords` attribute)
- PandasIndex: IndexVariable wraps PandasIndexingAdpater wraps pd.Index

* one PandasIndexingAdapter subclass for multiindex

* fastpath Index init + from_pandas_index classmethods

* use classmethod constructors instead

* add Index.copy and Index.__getitem__ methods

* wip: clean-up

Revert some changes made in #5102 + additional (temporary) fixes.

* clean-up

* add PandasIndex and PandasMultiIndex tests

* remove unused import

* doc: update what's new

* use xindexes in map_blocks + temp fix

Dataset constructor doesn't accept xarray indexes yet. Create new
coordinates from the underlying pandas indexes.

* update what's new with #5670

* typo
@benbovy benbovy deleted the index-adapter-base-classes branch March 29, 2022 07:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants