Implement iloc-getitem using parse-don't-validate approach #13534

wence- · 2023-06-08T14:46:14Z

Description

To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (an appropriate
tagged union using new dataclasses). At the libcudf level, there are four
styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as the composition of
index by map with get_element (since that simplifies the logic when
extracting a single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.
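To make the idea concrete, here is a minimal sketch of the parsing step over plain Python lists rather than cudf columns. The class names (SliceIndexer, MaskIndexer, MapIndexer, ScalarIndexer) and the functions parse_iloc_key and getitem are hypothetical and chosen for illustration only; they are not claimed to match the code in indexing_utils.py.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Union


# Hypothetical tagged union: one dataclass per indexing style, all sharing
# the field name ``key`` so consumers can treat them uniformly.
@dataclass(frozen=True)
class SliceIndexer:
    key: range  # slice normalised to concrete row positions


@dataclass(frozen=True)
class MaskIndexer:
    key: list[bool]  # exactly one boolean per row


@dataclass(frozen=True)
class MapIndexer:
    key: list[int]  # non-negative, in-bounds row positions


@dataclass(frozen=True)
class ScalarIndexer:
    key: list[int]  # length-1 gather map; the tag records "scalar"


Indexer = Union[SliceIndexer, MaskIndexer, MapIndexer, ScalarIndexer]


def _wrap(i: Any, n: int) -> int:
    """Validate one position and wrap negative values into [0, n)."""
    if isinstance(i, bool) or not isinstance(i, int):
        raise TypeError(f"Positions must be integers, not {type(i).__name__}")
    if not -n <= i < n:
        raise IndexError("Positional index is out of bounds")
    return i + n if i < 0 else i


def parse_iloc_key(key: Any, n: int) -> Indexer:
    """Inspect the unstructured key once and return a tagged, normalised indexer."""
    if isinstance(key, slice):
        return SliceIndexer(range(*key.indices(n)))
    if isinstance(key, int):
        return ScalarIndexer([_wrap(key, n)])
    if isinstance(key, list):
        if key and all(isinstance(k, bool) for k in key):
            if len(key) != n:
                raise IndexError("Boolean mask has wrong length")
            return MaskIndexer(list(key))
        return MapIndexer([_wrap(k, n) for k in key])
    raise TypeError(f"Cannot index by location with {type(key).__name__}")


def getitem(rows: list, indexer: Indexer):
    """Dispatch on the tag only; the key is never re-inspected."""
    if isinstance(indexer, MaskIndexer):
        return [r for r, keep in zip(rows, indexer.key) if keep]
    gathered = [rows[i] for i in indexer.key]
    # Scalar indexing is a gather followed by extracting the single element.
    return gathered[0] if isinstance(indexer, ScalarIndexer) else gathered
```

In this sketch, all validation (bounds, types, mask length) happens in parse_iloc_key, so getitem can assume its input is well formed.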

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.
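As a rough illustration of that direction (not the implementation planned for loc), a label-based key could be parsed straight into the same kind of positional structure, without first building a user-facing iloc key. The names MapIndexer, ScalarIndexer, and parse_loc_key are hypothetical, and the sketch assumes unique labels held in a plain list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapIndexer:
    key: list  # row positions to gather


@dataclass(frozen=True)
class ScalarIndexer:
    key: list  # length-1 gather map tagged as a scalar lookup


def parse_loc_key(key, labels):
    """Turn a label or list of labels into a pre-parsed positional indexer.

    Assumes ``labels`` are unique; a missing label raises KeyError.
    """
    positions = {label: i for i, label in enumerate(labels)}
    if isinstance(key, list):
        return MapIndexer([positions[k] for k in key])
    return ScalarIndexer([positions[key]])
```

The result can then be handed directly to whatever consumes pre-parsed iloc indexers, so the key is inspected once in total.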

That said, we already provide some performance improvements since we
only do inspection once.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence-

Some signposts

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/core/series.py

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/api/types.py

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/core/indexed_frame.py

shwina · 2023-06-21T11:47:32Z

Just a few general comments for now. This is shaping up to look very nice!

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/core/copy_types.py

python/cudf/cudf/core/join/join.py

python/cudf/cudf/core/series.py

python/cudf/cudf/core/indexing_utils.py

vyasr

Just did a first pass through. This is great! Much cleaner and more consistent treatment of how indexers are set up. I've verified the general approach, but I'll need to take another pass to check that all the actual conditions in each case are valid. Will do that ASAP.

python/cudf/cudf/core/_base_index.py

To simplify the low-level implementation of iloc-based getitem on both Series and DataFrames, change the dispatching approach to parse the user-provided "unstructured" key into structured data (a tagged union using an enum + tuple). At the libcudf level, there are four styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a single place, and ensures that downstream consumers of the validated and normalised indexer don't need to inspect it again to determine what to do. Note that we treat index by scalar as composition of index by map with get_element (since that simplifies the logic when extracting the single row of a dataframe: we want to keep it on device), but the scalar "type tag" allows us to determine this unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where the parsing rules are more complicated, but eventually turn into one of the above four cases). In this latter case, we will no longer attempt to turn a loc-based key into a "user-facing" key for iloc, but rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we only do inspection once.

- Closes rapidsai#13013
- Closes rapidsai#13267
- Closes rapidsai#13515

Can't use libcudf.copying.gather since we need to do some post-processing on categorical and struct columns. Staying in the Series API gets us that for free.

Also use dataclasses as poor man's ADTs rather than tuple with tag field. Some renaming.

vyasr

This is a major improvement IMO. I don't have much to suggest here; the changeset is quite large, and generally everything looks good, so I'm inclined to merge sooner rather than later and seek incremental improvements.

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/core/copy_types.py

python/cudf/cudf/core/algorithms.py

python/cudf/cudf/core/copy_types.py

bdice

Thanks @wence-! I have some comments, as well as the offline discussion about how to use a constructor/factory here for validation.

python/cudf/cudf/core/copy_types.py

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/core/indexed_frame.py

python/cudf/cudf/core/indexing_utils.py

python/cudf/cudf/core/series.py

python/cudf/cudf/tests/test_indexing.py

Rather than having free functions to construct the witness types, the default constructor validates correctness, and a classmethod from_column_unchecked allows one to build a witness type asserting correctness by fiat.
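A minimal sketch of that pattern, using plain Python lists in place of cudf columns. The names BooleanMask and from_column_unchecked come from this PR; the constructor signature, the attribute name column, and the helper apply_boolean_mask are assumptions made for illustration.

```python
class BooleanMask:
    """Witness that ``column`` is a valid boolean mask for ``nrows`` rows."""

    def __init__(self, column, nrows):
        # The default constructor validates, so holding an instance is
        # evidence that the checks have already passed.
        if not all(isinstance(v, bool) for v in column):
            raise TypeError("Boolean mask must have bool dtype")
        if len(column) != nrows:
            raise IndexError(
                f"Mask of length {len(column)} does not match {nrows} rows"
            )
        self.column = column

    @classmethod
    def from_column_unchecked(cls, column):
        """Build a witness without validation; the caller asserts correctness."""
        self = cls.__new__(cls)  # bypass __init__ and its checks
        self.column = column
        return self


def apply_boolean_mask(rows, mask: BooleanMask):
    """Hypothetical consumer: trusts the witness and does no further checks."""
    return [row for row, keep in zip(rows, mask.column) if keep]
```

The unchecked path is meant for internal callers that already know the column is valid (for example, because they just computed it), so they can skip the cost of revalidating.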

shwina

This is great! Thanks @wence- !

bdice

This is dramatically more usable / readable than the previous state, in my view. Excellent work!! I have a few comments, then this should be good to go.

python/cudf/cudf/core/copy_types.py

bdice · 2023-07-12T12:17:03Z

python/cudf/cudf/core/copy_types.py

+            raise IndexError("Boolean mask must have bool dtype")
+        if len(column) != nrows:
+            raise IndexError(


I vote for these to be TypeError and ValueError, respectively. Similarly for other classes in this file.

The docs for IndexError say that if an index is not an integer, TypeError is raised.

My preference for a ValueError is that this occurs during the construction of a gather map with an invalid value, whereas typically I only see IndexError raised when an out-of-bounds access is being performed. Here, the access never actually occurs because of the validation. If the gather could attempt a disallowed access, then perhaps IndexError would be suitable for that case.

https://docs.python.org/3/library/exceptions.html#IndexError

wence- · 2023-07-12

I think I picked IndexError because otherwise I need to catch ValueError and re-raise it as IndexError in (for example) DataFrame.take and xxx.iloc[out-of-bounds-index]. I can do that, but it potentially hides other problems.

wence- · 2023-07-12

But I changed the first to TypeError.

bdice · 2023-07-13

@wence- also mentioned offline that this aligns with pandas. Please align with pandas; I hadn't considered the impact there.
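For reference, the TypeError / IndexError distinction from the Python docs cited above can be seen on a built-in list. This sketch only demonstrates standard Python behaviour, not cudf or pandas; the helper name classify_failure is made up for illustration.

```python
def classify_failure(seq, key):
    """Report which exception, if any, ``seq[key]`` raises."""
    try:
        seq[key]
    except TypeError:
        return "TypeError"  # the key has the wrong type (e.g. a string)
    except IndexError:
        return "IndexError"  # the key is an integer but out of range
    return "ok"


values = [10, 20, 30]
print(classify_failure(values, "1"))  # wrong key type
print(classify_failure(values, 10))   # out-of-bounds position
print(classify_failure(values, 1))    # valid access
```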

python/cudf/cudf/core/copy_types.py

python/cudf/cudf/core/indexed_frame.py

python/cudf/cudf/core/join/join.py

python/cudf/cudf/core/indexing_utils.py

bdice

Approving to unblock. Thanks for your work on this @wence-, the result is quite nice.

wence- · 2023-07-14T09:15:18Z

/merge

wence- · 2023-07-14T09:15:29Z

Thanks everyone!

The cudf-internal _gather and _apply_boolean_mask methods now accept tight types rather than arbitrary columns. So we must adapt to that change here.

As title, addresses upstream cudf change rapidsai/cudf#13534.

Fixes #1222

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Mark Harris (https://github.com/harrism)
- H. Thomson Comer (https://github.com/thomcom)

URL: #1219

wence- requested a review from a team as a code owner June 8, 2023 14:46

wence- requested review from vyasr and galipremsagar June 8, 2023 14:46

github-actions bot added the Python (Affects Python cuDF API.) label Jun 8, 2023

wence- added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jun 8, 2023

wence- added this to the Pandas API Alignment and Coverage milestone Jun 8, 2023

wence- commented Jun 8, 2023

wence- commented Jun 9, 2023

python/cudf/cudf/core/indexing_utils.py (outdated, resolved)

shwina reviewed Jun 13, 2023

python/cudf/cudf/core/series.py (outdated, resolved)

shwina reviewed Jun 13, 2023

python/cudf/cudf/core/series.py (outdated, resolved)

shwina reviewed Jun 21, 2023

wence- commented Jun 22, 2023

python/cudf/cudf/core/indexing_utils.py (outdated, resolved)

wence- commented Jun 22, 2023

python/cudf/cudf/core/indexing_utils.py (outdated, resolved)

wence- force-pushed the wence/fea/indexing-parse branch from 080be95 to 9b8ec14 on June 22, 2023 17:22

wence- commented Jun 22, 2023

python/cudf/cudf/core/copy_types.py (resolved)

wence- commented Jun 22, 2023

python/cudf/cudf/core/join/join.py (outdated, resolved)

wence- commented Jun 22, 2023

python/cudf/cudf/core/series.py (outdated, resolved)

vyasr reviewed Jun 22, 2023

python/cudf/cudf/core/indexing_utils.py (outdated, resolved)

vyasr reviewed Jun 22, 2023

wence- commented Jun 23, 2023

python/cudf/cudf/core/_base_index.py (resolved)

wence- added 8 commits June 23, 2023 10:23

Add iloc-getitem benchmarks

93c1d21

Length-1 categoricals are not scalars

64b093e

Type annotate the frame in SeriesIlocIndexer

6b649bd

TypeAlias from typing_extensions for py 3.9

8ad58a8

Use _gather for scalar indexing

b43d93a

Can't use libcudf.copying.gather since we need to do some post-processing on categorical and struct columns. Staying in the Series API gets us that for free.

Introduce GatherMap and BooleanMask

a479a34

Also use dataclasses as poor man's ADTs rather than tuple with tag field. Some renaming.

Minor simplifications

5e4af4a

Indexer dataclasses have the same field name

dbf56b8

vyasr approved these changes Jun 30, 2023

python/cudf/cudf/core/dataframe.py (outdated, resolved)

Merge remote-tracking branch 'upstream/branch-23.08' into wence/fea/i…

ad1b21a

…ndexing-parse

shwina reviewed Jun 30, 2023

python/cudf/cudf/core/copy_types.py (outdated, resolved)

bdice reviewed Jun 30, 2023

python/cudf/cudf/core/algorithms.py (outdated, resolved)

python/cudf/cudf/core/copy_types.py (outdated, resolved)

bdice requested changes Jul 11, 2023

wence- added 8 commits July 11, 2023 17:00

Refactor GatherMap and BooleanMask construction

b763ebb

Rather than having free functions to construct the witness types, the default constructor validates correctness, and a classmethod from_column_unchecked allows one to build a witness type asserting correctness by fiat.

Remove walrus

12e66fc

Adapt benchmark

b92638d

Minor docstring fixes

99d3da1

Clarify comment and fix keep_index handling in _slice

b046539

Clarify scope of pytest.raises

1ace86a

Numpydoc formatting

e547372

Simplify clamping to range

892ee14

shwina approved these changes Jul 12, 2023

Fix some cases missed in refactor

803fbc0

bdice reviewed Jul 12, 2023

wence- added 3 commits July 12, 2023 15:27

A few more small fixes

762eb1c

Don't xfail, but rather pytest.raises

943c58e

Merge remote-tracking branch 'upstream/branch-23.08' into wence/fea/i…

dffdc4e

…ndexing-parse

bdice approved these changes Jul 13, 2023

rapids-bot bot merged commit e0ffbd7 into rapidsai:branch-23.08 Jul 14, 2023
53 checks passed

wence- deleted the wence/fea/indexing-parse branch July 14, 2023 09:15

wence- added a commit to wence-/cuspatial that referenced this pull request Jul 17, 2023

Adapt internal API to rapidsai/cudf#13534

ab43e7d

The cudf-internal _gather and _apply_boolean_mask methods now accept tight types rather than arbitrary columns. So we must adapt to that change here.

This was referenced Jul 17, 2023

Update GeoDataFrame to Use the Structured GatherMap Class rapidsai/cuspatial#1219

Merged

Fix cuspatial _gather API calls to reflect upstream cudf changes rapidsai/cuspatial#1222

Closed

wence- mentioned this pull request Jul 18, 2023

Parse (non-MultiIndex) label-based keys to structured data #13717

Open

3 tasks
