Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate na_sentinel, add use_na_sentinel #46910

Closed
rhshadrach opened this issue Apr 30, 2022 · 12 comments · Fixed by #47157
Closed

Deprecate na_sentinel, add use_na_sentinel #46910

rhshadrach opened this issue Apr 30, 2022 · 12 comments · Fixed by #47157
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API - Consistency Internal Consistency of API/Behavior Deprecate Functionality to remove in pandas
Milestone

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Apr 30, 2022

Currently specifying na_sentinel=None in pd.factorize will use the sentinel -1 and set (internally) dropna=False. EA's factorize currently doesn't allow na_sentinel being None, and #46601 added the dropna argument to EA's factorize.

It seems best to me to change pd.factorize to match EA's factorize. Tagging as 1.5 since if we are to deprecate, I'd like to get it in.

cc @jbrockmendel @jorisvandenbossche @TomAugspurger

@rhshadrach rhshadrach added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Deprecate Functionality to remove in pandas API - Consistency Internal Consistency of API/Behavior labels Apr 30, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone Apr 30, 2022
@jbrockmendel
Copy link
Member

+1

@jorisvandenbossche
Copy link
Member

The other option could also be to allow na_sentinel to be None in EA.factorize? (so to align that with pd.factorize, instead of the other way around)

We actually added dropna at some point, and then removed it again: #35667, #35852 (I didn't yet look back in detail to understand the reasoning or whether I still agree with it ;)).

@rhshadrach
Copy link
Member Author

Thanks @jorisvandenbossche! When I was reasoning about it in #46601 (comment), I was focused on the code:

dropna = True
if na_sentinel is None:
na_sentinel = -1
dropna = False

thinking that na_sentinel=-1 was used. But of course, that's only temporary; after this block, the sentinel gets replace by a nonnegative code. With this in mind, na_sentinel=None as the argument now makes a lot more sense to me. It still seems a bit odd to me that na_sentinel=None adds the null value to uniques (which is where dropna seems to make more sense). But since thena_sentinel value doesn't get used when "dropna=False", it makes more sense to me to have this all be one parameter.

@jbrockmendel
Copy link
Member

Are there known use cases where users pass na_sentinel other than -1 or None? e.g. i dont think ive ever seen na_sentinel=14.

@rhshadrach
Copy link
Member Author

A positive sentinel is likely to break things

ser = pd.Series([1, 2, np.nan, 3])
print(pd.factorize(ser, na_sentinel=2))

(array([0, 1, 2, 2]), Float64Index([1.0, 2.0, 3.0], dtype='float64'))

I don't know any use case for specifying na_sentinel in pd.factorize, and if there is one, a user can change the sentinel after with a single line of code.

+1 on deprecating the specifying of the sentinel value. In that case, there are two options:

  • code NAs as -1, do not include null values in uniques
  • na_sentinel is a nonnegative integer; include null values in uniques

This is then very similar to pyarrow's null_encoding argument in dictionary_encode being "mask" or "encode"; perhaps those would be good names going forward? If that is a good direction, do we prefer "null_encoding" vs "na_encoding" (thinking particularly of pd.NA here).

@rhshadrach
Copy link
Member Author

@jorisvandenbossche - any thoughts on the proposal to deprecate specifying the specific value of na_sentinel?

@rhshadrach
Copy link
Member Author

@jorisvandenbossche - friendly ping

@rhshadrach
Copy link
Member Author

Assuming we go forward with my suggestion in #46910 (comment), with the EA interface currently marked as experimental, we have the option of removing the ability to set na_sentinel immediately or we could have the signature (more closely) match pd.factorize and deprecate in parallel.

Proposal 1 - Deprecate na_sentinel in pd.factorize; change EA to future state immediately:

  • pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
  • ExtensionArray.factorize(self, na_encoding: str = 'mask')

Proposal 2 - Deprecate na_sentinel in pd.factorize; have EA match (with na_sentinel also deprecated):

  • pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
  • ExtensionArray.factorize(self, na_sentinel: int | None = -1, na_encoding: str = 'mask')

Since pd.factorize doesn't match ExtensionArray.factorize today, that makes me think Proposal 1 would be a better route. Any preference @jbrockmendel?

@jorisvandenbossche
Copy link
Member

I suppose there are theoretical use case of other na_sentinel values as -1. For example taking a large positive number, so missing values sorts last instead of first (not sure that actually makes sense). Whether anybody actually uses this in practice, I also highly doubt. So deprecating it seems fine.

with the EA interface currently marked as experimental,

While we might still label it like that, there are lots of people using it, so I would prefer something with a smooth upgrade path for external EA implementors (so I prefer something more in the line of proposal 2).

For naming, one problem with "mask" is that this words fits less in the pandas context of using "-1" as sentinel (in pyarrow it actually returns an array with nulls (not -1's), which are implemented as a kind of a mask (not fully sure the naming comes from that, though)). Maybe na_encoding="sentinel" instead? (although not fully sure this is actually clearer)

@rhshadrach
Copy link
Member Author

rhshadrach commented May 24, 2022

Thanks @jorisvandenbossche - I don't have any objection to calling it a "mask", but do see how this usage is a bit of a stretch from a typical use. I'm also happy with your suggestion of na_encoding=["sentinel"|"encode"]. One other option that comes to mind is na_sentinel=[True|False]. But being the same name as the current argument might make deprecation confusing.

@jbrockmendel
Copy link
Member

I agree with joris about a smooth deprecation path. No real preference otherwise.

@rhshadrach
Copy link
Member Author

After trying a few ways, I find myself preferring use_na_sentinel=[True|False]. Will be putting up a PR shortly, but can always change if desired.

@rhshadrach rhshadrach changed the title Deprecate na_sentinel being None in pd.factorize, add dropna Deprecate na_sentinel, add use_na_sentinel Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API - Consistency Internal Consistency of API/Behavior Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants