Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSSION] Replacing na_sentinel with null mask for Series.label_encoding #5622

Closed
galipremsagar opened this issue Jul 2, 2020 · 5 comments
Labels
proposal Change current process or code Python Affects Python cuDF API.

Comments

@galipremsagar
Copy link
Contributor

What is your question?

Opening up the discussion into a new issue that happened in PR 5619

We might want to retain na_sentinel for feature parity with Pandas Series.factorize(which internally calls label_encoding). But the proposal here is to default label_encoding's na_sentinel to None and set rows to nulls when we don't have a na_sentinel value supplied.

cc: @kkraus14 @rgsl888prabhu @shwina

@galipremsagar galipremsagar added question Further information is requested Needs Triage Need team to review and classify and removed Needs Triage Need team to review and classify labels Jul 2, 2020
@galipremsagar galipremsagar added the Python Affects Python cuDF API. label Jul 2, 2020
@kkraus14
Copy link
Collaborator

kkraus14 commented Jul 9, 2020

Some thoughts:

Series.label_encoding is very much a legacy function from the early days of pygdf where I don't know if we want to support it as a public API moving forward, especially considering Pandas doesn't have it as an API.

The main difference between Series.label_encoding and Series.factorize is that you can pass an existing dictionary for the categories to Series.label_encoding instead of it calculating the uniques. In Pandas, this is effectively the same as doing pd.Categorical(values, categories=dictionary) which returns a pd.Categorical object which you can access .codes from. Note that pd.Categorical uses -1 as a sentinel value to indicate a value that isn't in the dictionary, whereas cudf's CategoricalColumn implementation uses nulls. When we support cudf.Categorical, we'll have the same functionality as Series.label_encoding in the same API as Pandas, albeit using null instead of -1 for missing values.

Due to this, I'd propose we change the cudf.Series.factorize API to default na_sentinel=None instead of na_sentinel=-1. If na_sentinel is None, that indicates we should use null as the missing data value. Since the implementation under the hood will likely generate nulls for missing data anyway, if a user passes a value other than None, we can just use a fillna afterwards to get expected results.

@shwina
Copy link
Contributor

shwina commented Jul 9, 2020

Input here is also welcome: #5498
relevant PR: #5572

@kkraus14 kkraus14 added proposal Change current process or code and removed question Further information is requested labels Aug 27, 2020
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@vyasr
Copy link
Contributor

vyasr commented Jul 23, 2022

At this point label_encoding has been removed. We need to address #6946 to implement Keith's proposal above of using None as the default.

#9474 is tangentially related.

@vyasr
Copy link
Contributor

vyasr commented Dec 17, 2022

As of pandas 1.5 the na_sentinel parameter is deprecated. In 2.0 there will only be a new parameter use_na_sentinel. I'm going to repurpose #6946 to track that and close this issue as resolved since label_encoding is long gone now.

@vyasr vyasr closed this as completed Dec 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Change current process or code Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

4 participants