Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse (non-MultiIndex) label-based keys to structured data #13717

Open
wants to merge 16 commits into
base: branch-23.10
Choose a base branch
from

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jul 18, 2023

Description

Following on from #13534, this extends the scheme to handle label-based lookups as long as the index is not a multiindex.

As is the case for positional indexing, all of the different ways one can index a frame with labels eventually boil down to indexing by slice, boolean mask, map, or scalar in libcudf. loc-based keys are parsed into information that tags them by type. Since this information is the same as is used for iloc indexing, we can then dispatch to the same "internal" calls that don't do further bounds-checking or normalisation: rather than converting a label-based lookup to an argument we can pass to iloc-getitem (which must reinterpret it), we just take the decision straight away.

The next stage (which will help to remove a bunch of code) will be to handle multiindex keys, but that will be sufficiently complicated that I'd rather do it separately.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner July 18, 2023 15:28
@wence- wence- requested review from bdice and isVoid July 18, 2023 15:28
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 18, 2023
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 18, 2023
@wence- wence- requested review from vyasr and shwina and removed request for isVoid July 18, 2023 15:30
@wence-
Copy link
Contributor Author

wence- commented Jul 18, 2023

Explicitly requested the same set of reviewers as #13534.

On the one hand, I would have liked to handle multiindex lookups as well in one change, but unfortunately I think the rules are too complicated and I'm still waiting for clarification on a bunch of ambiguities in the way pandas handles things. As a consequence this doesn't remove as much code as one would have hoped.

@wence- wence- force-pushed the wence/fea/indexing-loc-getitem branch from 15c955e to c5f8eb4 Compare July 18, 2023 15:39
Copy link
Contributor Author

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some signposts

Comment on lines -480 to -486
def _select_by_names(self, names: abc.Sequence) -> Self:
return self.__class__(
{key: self[key] for key in names},
multiindex=self.multiindex,
level_names=self.level_names,
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I introduced this in #13534, but I realised it's better if the column key parsing returns a ColumnAccessor (rather than column names) so it's no longer necessary.

Comment on lines +406 to +409
if len(set(keys)) != len(keys):
raise NotImplementedError(
"cudf DataFrames do not support repeated column names"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now loudly raises when getting from a ColumnAccessor would produce duplicate column names

Comment on lines -125 to -139
def __getitem__(self, arg):
if (
isinstance(self._frame.index, MultiIndex)
or self._frame._data.multiindex
):
# This try/except block allows the use of pandas-like
# tuple arguments into MultiIndex dataframes.
try:
return self._getitem_tuple_arg(arg)
except (TypeError, KeyError, IndexError, ValueError):
return self._getitem_tuple_arg((arg, slice(None)))
else:
if not isinstance(arg, tuple):
arg = (arg, slice(None))
return self._getitem_tuple_arg(arg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer shared between loc and iloc cases.

Comment on lines +223 to +237
row_key, (
col_is_scalar,
ca,
) = indexing_utils.destructure_dataframe_loc_indexer(
arg, self._frame
)
row_spec = indexing_utils.parse_row_loc_indexer(
row_key, self._frame.index
)
return self._frame._getitem_preprocessed(
row_spec, col_is_scalar, ca
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new approach (which doesn't handle multiindex lookups yet).

Comment on lines -329 to +292
pos_range = _get_label_range_or_mask(
self._frame.index, key[0].start, key[0].stop, key[0].step
indexer = indexing_utils.find_label_range_or_mask(
key[0], self._frame.index
)
idx = self._frame.index[pos_range]
index = self._frame.index
if isinstance(indexer, indexing_utils.EmptyIndexer):
idx = index[0:0:1]
elif isinstance(indexer, indexing_utils.SliceIndexer):
idx = index[indexer.key]
else:
idx = index[indexer.key.column]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved _get_label_range_or_mask into indexing_utils and return structured data (which for now we must pull apart here because I haven't touched setitem yet).

Comment on lines +508 to +512
parsed_key = index.find_label_range(key)
if len(range(len(index))[parsed_key]) == 0:
return EmptyIndexer()
else:
return SliceIndexer(parsed_key)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only bit that is relevant for pandas-2 compat.

Comment on lines +563 to +564
# TODO: is this right?
key = key._get_decategorized_column()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right. If the key is a categorical and the index is not, we should treat the key as "array-like" a decategorise it before looking up the values it encodes.

Comment on lines +574 to +576
if isinstance(index, cudf.DatetimeIndex):
# Try to turn strings into datetimes
key = cudf.core.column.as_column(key, dtype=index.dtype)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be astype?

Comment on lines +444 to +446
return GatherMap.from_column_unchecked(
rgather, len(haystack), nullify=False
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice here how we don't need to bounds-check the gather map (since the merge guarantees everything is in-bounds).

Comment on lines +613 to +615
raise NotImplementedError(
"This code path is not designed for multiindices"
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things that need to happen in this case:

  1. individual keys can have more cases (e.g. (a, b):(c, d) is a valid slice for a multiindex if the multiindex is ordered), a list-like can be not just single-level labels, but tuples of labels)
  2. If we get a tuple of per-level indexers we must combine them to a single output indexer and the semantics of this combination in pandas are ambiguous in many cases.

Raise NotImplementedError if indexing would try and produce a new
object with duplicate key names. This avoids silently losing
information.
Rather than just returning column names, since we will immediately
construct the ColumnAccessor, make that and return it.
If the index is non-numeric, Series.__getitem__ should fall back to
positional lookup for integer keys.
Similar to positional indexing, implement parsing to structured
IndexingSpec data for label indexing. This does not yet treat
MultiIndex lookups for which the rules for combining the multi-level
keys are more complicated.
Trying to move all the special-case handling into one place.
Additionally, handle step != 1 correctly for datetime indexing.
@wence- wence- force-pushed the wence/fea/indexing-loc-getitem branch from ed805cd to 546fef5 Compare September 1, 2023 11:31
@wence- wence- requested a review from a team as a code owner September 1, 2023 11:31
@wence- wence- requested review from a team as code owners September 1, 2023 11:31
@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 1, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wence- wence- changed the base branch from branch-23.08 to branch-23.10 September 1, 2023 11:31
@wence- wence- removed request for a team September 1, 2023 11:31
@rapidsai rapidsai deleted a comment from review-notebook-app bot Sep 1, 2023
@wence-
Copy link
Contributor Author

wence- commented Sep 1, 2023

/ok to test

@vyasr
Copy link
Contributor

vyasr commented May 22, 2024

@wence- did this stall out just waiting for reviews? If so, I'm very sorry I lost track of it. Please bring it up to date and I'm happy to review this ASAP (probably for 24.08 not 24.06 at this point though).

@wence-
Copy link
Contributor Author

wence- commented Jul 31, 2024

@wence- did this stall out just waiting for reviews? If so, I'm very sorry I lost track of it. Please bring it up to date and I'm happy to review this ASAP (probably for 24.08 not 24.06 at this point though).

Sorry, it got rather struck on my end, I will try and resurrect for 24.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
2 participants