Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement iloc-getitem using parse-don't-validate approach #13534

Merged
merged 25 commits into from
Jul 14, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jun 8, 2023

Description

To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (an appropriate
tagged union using new dataclasses). At the libcudf level, there are four
styles of indexing we can do:

  1. index by slice
  2. index by mask
  3. index by map
  4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we
only do inspection once.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner June 8, 2023 14:46
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 8, 2023
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 8, 2023
Copy link
Contributor Author

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some signposts

python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/api/types.py Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
@shwina
Copy link
Contributor

shwina commented Jun 21, 2023

Just a few general comments for now. This is shaping up to look very nice!

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a first pass through. This is great! Much cleaner and more consistent treatment of how indexers are set up. I've verified the general approach, but I'll need to take another pass to check that all the actual conditions in each case are valid. Will do that ASAP.

To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (a tagged
union using an enum + tuple). At the libcudf level, there are four
styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we
only do inspection once.

- Closes rapidsai#13013
- Closes rapidsai#13267
- Closes rapidsai#13515
Can't use libcudf.copying.gather since we need to do some
post-processing on categorical and struct columns. Staying in the
Series API gets us that for free.
Also use dataclasses as poor man's ADTs rather than tuple with tag
field.

Some renaming.
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a major improvement IMO. I don't have much to suggest here; the changeset is quite large, and generally everything looks good, so I'm inclined to merge sooner rather than later and seek incremental improvements.

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/copy_types.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wence-! I have some comments, as well as the offline discussion about how to use a constructor/factory here for validation.

python/cudf/cudf/core/copy_types.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
Rather than having free functions to construct the witness types, the
default constructor validates correctness, and a classmethod
from_column_unchecked allows one to build a witness type asserting
correctness by fiat.
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks @wence- !

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dramatically more usable / readable than the previous state, in my view. Excellent work!! I have a few comments, then this should be good to go.

python/cudf/cudf/core/copy_types.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/copy_types.py Outdated Show resolved Hide resolved
Comment on lines 142 to 144
raise IndexError("Boolean mask must have bool dtype")
if len(column) != nrows:
raise IndexError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for these to be TypeError and ValueError, respectively. Similarly for other classes in this file.

The docs for IndexError say that if an index is not an integer, TypeError is raised.

My preference for a ValueError is that this occurs during the construction of a gather map with an invalid value, whereas typically I only see IndexError raised when an out-of-bounds access is being performed. Here, the access never actually occurs because of the validation. If the gather could attempt a disallowed access, then perhaps IndexError would be suitable for that case.

https://docs.python.org/3/library/exceptions.html#IndexError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I picked IndexError because otherwise I need to catch ValueError and raise IndexError in (for example) DataFrame.take and xxx.iloc[out-of-bounds-index]. I can do that, but potentially it hides other problems

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But changed the first to TypeError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- also mentioned offline that this aligns with pandas. Please align with pandas, I hadn't considered the impact there.

python/cudf/cudf/core/copy_types.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved
python/cudf/cudf/core/join/join.py Show resolved Hide resolved
python/cudf/cudf/core/indexing_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock. Thanks for your work on this @wence-, the result is quite nice.

@wence-
Copy link
Contributor Author

wence- commented Jul 14, 2023

/merge

@rapids-bot rapids-bot bot merged commit e0ffbd7 into rapidsai:branch-23.08 Jul 14, 2023
53 checks passed
@wence-
Copy link
Contributor Author

wence- commented Jul 14, 2023

Thanks everyone!

@wence- wence- deleted the wence/fea/indexing-parse branch July 14, 2023 09:15
wence- added a commit to wence-/cuspatial that referenced this pull request Jul 17, 2023
The cudf-internal _gather and _apply_boolean_mask methods now accept
tight types rather than arbitrary columns. So we must adapt to that
change here.
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Jul 18, 2023
As title, addresses upstream cudf change rapidsai/cudf#13534.

Fixes #1222

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Mark Harris (https://github.com/harrism)
  - H. Thomson Comer (https://github.com/thomcom)

URL: #1219
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
4 participants