-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement iloc-getitem using parse-don't-validate approach #13534
Implement iloc-getitem using parse-don't-validate approach #13534
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some signposts
Just a few general comments for now. This is shaping up to look very nice! |
080be95
to
9b8ec14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a first pass through. This is great! Much cleaner and more consistent treatment of how indexers are set up. I've verified the general approach, but I'll need to take another pass to check that all the actual conditions in each case are valid. Will do that ASAP.
To simplify the low-level implementation of iloc-based getitem on both Series and DataFrames, change the dispatching approach to parse the user-provided "unstructured" key into structured data (a tagged union using an enum + tuple). At the libcudf level, there are four styles of indexing we can do: 1. index by slice 2. index by mask 3. index by map 4. index by scalar iloc keys are parsed into information that tags them by type and normalises the key to an appropriate column or other low-level object. This centralises the business logic for index parsing in a single place, and ensures that downstream consumers of the validated and normalised indexer don't need to inspect it again to determine what to do. Note that we treat index by scalar as composition of index by map with get_element (since that simplifies the logic when extracting the single row of a dataframe: we want to keep it on device), but the scalar "type tag" allows us to determine this unambiguously without reinspecting the key. The major benefits will come when updating loc-based getitem (where the parsing rules are more complicated, but eventually turn into one of the above four cases). In this latter case, we will no longer attempt to turn a loc-based key into a "user-facing" key for iloc, but rather will call directly into the pre-parsed interface. That said, we already provide some performance improvements since we only do inspection once. - Closes rapidsai#13013 - Closes rapidsai#13267 - Closes rapidsai#13515
Can't use libcudf.copying.gather since we need to do some post-processing on categorical and struct columns. Staying in the Series API gets us that for free.
Also use dataclasses as poor man's ADTs rather than tuple with tag field. Some renaming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a major improvement IMO. I don't have much to suggest here; the changeset is quite large, and generally everything looks good, so I'm inclined to merge sooner rather than later and seek incremental improvements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wence-! I have some comments, as well as the offline discussion about how to use a constructor/factory here for validation.
Rather than having free functions to construct the witness types, the default constructor validates correctness, and a classmethod from_column_unchecked allows one to build a witness type asserting correctness by fiat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks @wence- !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dramatically more usable / readable than the previous state, in my view. Excellent work!! I have a few comments, then this should be good to go.
python/cudf/cudf/core/copy_types.py
Outdated
raise IndexError("Boolean mask must have bool dtype") | ||
if len(column) != nrows: | ||
raise IndexError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for these to be TypeError
and ValueError
, respectively. Similarly for other classes in this file.
The docs for IndexError
say that if an index is not an integer, TypeError
is raised.
My preference for a ValueError
is that this occurs during the construction of a gather map with an invalid value, whereas typically I only see IndexError
raised when an out-of-bounds access is being performed. Here, the access never actually occurs because of the validation. If the gather could attempt a disallowed access, then perhaps IndexError
would be suitable for that case.
https://docs.python.org/3/library/exceptions.html#IndexError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I picked IndexError
because otherwise I need to catch ValueError
and raise IndexError
in (for example) DataFrame.take
and xxx.iloc[out-of-bounds-index]
. I can do that, but potentially it hides other problems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But changed the first to TypeError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wence- also mentioned offline that this aligns with pandas. Please align with pandas, I hadn't considered the impact there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock. Thanks for your work on this @wence-, the result is quite nice.
/merge |
Thanks everyone! |
The cudf-internal _gather and _apply_boolean_mask methods now accept tight types rather than arbitrary columns. So we must adapt to that change here.
As title, addresses upstream cudf change rapidsai/cudf#13534. Fixes #1222 Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Mark Harris (https://github.com/harrism) - H. Thomson Comer (https://github.com/thomcom) URL: #1219
Description
To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (an appropriate
tagged union using new dataclasses). At the libcudf level, there are four
styles of indexing we can do:
iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.
This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.
The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.
That said, we already provide some performance improvements since we
only do inspection once.
Ellipsis
#13267Checklist