[DISCUSSION] Replacing `na_sentinel` with null mask for `Series.label_encoding` #5622

galipremsagar · 2020-07-02T14:41:35Z

What is your question?

Opening up the discussion into a new issue that happened in PR 5619

We might want to retain na_sentinel for feature parity with Pandas Series.factorize(which internally calls label_encoding). But the proposal here is to default label_encoding's na_sentinel to None and set rows to nulls when we don't have a na_sentinel value supplied.

cc: @kkraus14 @rgsl888prabhu @shwina

The text was updated successfully, but these errors were encountered:

kkraus14 · 2020-07-09T00:30:09Z

Some thoughts:

Series.label_encoding is very much a legacy function from the early days of pygdf where I don't know if we want to support it as a public API moving forward, especially considering Pandas doesn't have it as an API.

The main difference between Series.label_encoding and Series.factorize is that you can pass an existing dictionary for the categories to Series.label_encoding instead of it calculating the uniques. In Pandas, this is effectively the same as doing pd.Categorical(values, categories=dictionary) which returns a pd.Categorical object which you can access .codes from. Note that pd.Categorical uses -1 as a sentinel value to indicate a value that isn't in the dictionary, whereas cudf's CategoricalColumn implementation uses nulls. When we support cudf.Categorical, we'll have the same functionality as Series.label_encoding in the same API as Pandas, albeit using null instead of -1 for missing values.

Due to this, I'd propose we change the cudf.Series.factorize API to default na_sentinel=None instead of na_sentinel=-1. If na_sentinel is None, that indicates we should use null as the missing data value. Since the implementation under the hood will likely generate nulls for missing data anyway, if a user passes a value other than None, we can just use a fillna afterwards to get expected results.

shwina · 2020-07-09T11:26:28Z

Input here is also welcome: #5498
relevant PR: #5572

github-actions · 2021-03-14T19:12:42Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr · 2022-07-23T00:17:32Z

At this point label_encoding has been removed. We need to address #6946 to implement Keith's proposal above of using None as the default.

#9474 is tangentially related.

vyasr · 2022-12-17T00:23:07Z

As of pandas 1.5 the na_sentinel parameter is deprecated. In 2.0 there will only be a new parameter use_na_sentinel. I'm going to repurpose #6946 to track that and close this issue as resolved since label_encoding is long gone now.

galipremsagar added question Further information is requested Needs Triage Need team to review and classify and removed Needs Triage Need team to review and classify labels Jul 2, 2020

galipremsagar mentioned this issue Jul 2, 2020

[REVIEW] Fix issue related to typecasting when using a CategoricalDtype #5619

Merged

galipremsagar added the Python Affects Python cuDF API. label Jul 2, 2020

kkraus14 added proposal Change current process or code and removed question Further information is requested labels Aug 27, 2020

github-actions bot added the inactive-90d label Mar 14, 2021

vyasr closed this as completed Dec 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] Replacing `na_sentinel` with null mask for `Series.label_encoding` #5622

[DISCUSSION] Replacing `na_sentinel` with null mask for `Series.label_encoding` #5622

galipremsagar commented Jul 2, 2020

kkraus14 commented Jul 9, 2020

shwina commented Jul 9, 2020 •

edited

Loading

github-actions bot commented Mar 14, 2021

vyasr commented Jul 23, 2022

vyasr commented Dec 17, 2022

[DISCUSSION] Replacing na_sentinel with null mask for Series.label_encoding #5622

[DISCUSSION] Replacing na_sentinel with null mask for Series.label_encoding #5622

Comments

galipremsagar commented Jul 2, 2020

kkraus14 commented Jul 9, 2020

shwina commented Jul 9, 2020 • edited Loading

github-actions bot commented Mar 14, 2021

vyasr commented Jul 23, 2022

vyasr commented Dec 17, 2022

[DISCUSSION] Replacing `na_sentinel` with null mask for `Series.label_encoding` #5622

[DISCUSSION] Replacing `na_sentinel` with null mask for `Series.label_encoding` #5622

shwina commented Jul 9, 2020 •

edited

Loading