Deprecate na_sentinel, add use_na_sentinel #46910

rhshadrach · 2022-04-30T22:58:41Z

Currently specifying na_sentinel=None in pd.factorize will use the sentinel -1 and set (internally) dropna=False. EA's factorize currently doesn't allow na_sentinel being None, and #46601 added the dropna argument to EA's factorize.

It seems best to me to change pd.factorize to match EA's factorize. Tagging as 1.5 since if we are to deprecate, I'd like to get it in.

cc @jbrockmendel @jorisvandenbossche @TomAugspurger

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2022-05-01T21:40:23Z

+1

jorisvandenbossche · 2022-05-02T21:18:13Z

The other option could also be to allow na_sentinel to be None in EA.factorize? (so to align that with pd.factorize, instead of the other way around)

We actually added dropna at some point, and then removed it again: #35667, #35852 (I didn't yet look back in detail to understand the reasoning or whether I still agree with it ;)).

rhshadrach · 2022-05-02T21:32:25Z

Thanks @jorisvandenbossche! When I was reasoning about it in #46601 (comment), I was focused on the code:

pandas/pandas/core/algorithms.py

Lines 715 to 718 in 60ac973

    
           dropna = True 
        
           if na_sentinel is None: 
        
               na_sentinel = -1 
        
               dropna = False

thinking that na_sentinel=-1 was used. But of course, that's only temporary; after this block, the sentinel gets replace by a nonnegative code. With this in mind, na_sentinel=None as the argument now makes a lot more sense to me. It still seems a bit odd to me that na_sentinel=None adds the null value to uniques (which is where dropna seems to make more sense). But since thena_sentinel value doesn't get used when "dropna=False", it makes more sense to me to have this all be one parameter.

jbrockmendel · 2022-05-13T23:25:34Z

Are there known use cases where users pass na_sentinel other than -1 or None? e.g. i dont think ive ever seen na_sentinel=14.

rhshadrach · 2022-05-14T14:45:15Z

A positive sentinel is likely to break things

ser = pd.Series([1, 2, np.nan, 3])
print(pd.factorize(ser, na_sentinel=2))

(array([0, 1, 2, 2]), Float64Index([1.0, 2.0, 3.0], dtype='float64'))

I don't know any use case for specifying na_sentinel in pd.factorize, and if there is one, a user can change the sentinel after with a single line of code.

+1 on deprecating the specifying of the sentinel value. In that case, there are two options:

code NAs as -1, do not include null values in uniques
na_sentinel is a nonnegative integer; include null values in uniques

This is then very similar to pyarrow's null_encoding argument in dictionary_encode being "mask" or "encode"; perhaps those would be good names going forward? If that is a good direction, do we prefer "null_encoding" vs "na_encoding" (thinking particularly of pd.NA here).

rhshadrach · 2022-05-17T21:55:30Z

@jorisvandenbossche - any thoughts on the proposal to deprecate specifying the specific value of na_sentinel?

rhshadrach · 2022-05-23T01:46:27Z

@jorisvandenbossche - friendly ping

rhshadrach · 2022-05-24T20:56:56Z

Assuming we go forward with my suggestion in #46910 (comment), with the EA interface currently marked as experimental, we have the option of removing the ability to set na_sentinel immediately or we could have the signature (more closely) match pd.factorize and deprecate in parallel.

Proposal 1 - Deprecate na_sentinel in pd.factorize; change EA to future state immediately:

pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
ExtensionArray.factorize(self, na_encoding: str = 'mask')

Proposal 2 - Deprecate na_sentinel in pd.factorize; have EA match (with na_sentinel also deprecated):

pd.factorize(values, sort: bool = False, na_sentinel: int | None = -1, size_hint: int | None = None, na_encoding: str = 'mask')
ExtensionArray.factorize(self, na_sentinel: int | None = -1, na_encoding: str = 'mask')

Since pd.factorize doesn't match ExtensionArray.factorize today, that makes me think Proposal 1 would be a better route. Any preference @jbrockmendel?

jorisvandenbossche · 2022-05-24T21:33:17Z

I suppose there are theoretical use case of other na_sentinel values as -1. For example taking a large positive number, so missing values sorts last instead of first (not sure that actually makes sense). Whether anybody actually uses this in practice, I also highly doubt. So deprecating it seems fine.

with the EA interface currently marked as experimental,

While we might still label it like that, there are lots of people using it, so I would prefer something with a smooth upgrade path for external EA implementors (so I prefer something more in the line of proposal 2).

For naming, one problem with "mask" is that this words fits less in the pandas context of using "-1" as sentinel (in pyarrow it actually returns an array with nulls (not -1's), which are implemented as a kind of a mask (not fully sure the naming comes from that, though)). Maybe na_encoding="sentinel" instead? (although not fully sure this is actually clearer)

rhshadrach · 2022-05-24T21:43:21Z

Thanks @jorisvandenbossche - I don't have any objection to calling it a "mask", but do see how this usage is a bit of a stretch from a typical use. I'm also happy with your suggestion of na_encoding=["sentinel"|"encode"]. One other option that comes to mind is na_sentinel=[True|False]. But being the same name as the current argument might make deprecation confusing.

jbrockmendel · 2022-05-25T22:35:16Z

I agree with joris about a smooth deprecation path. No real preference otherwise.

rhshadrach · 2022-05-27T18:38:12Z

After trying a few ways, I find myself preferring use_na_sentinel=[True|False]. Will be putting up a PR shortly, but can always change if desired.

rhshadrach added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Deprecate Functionality to remove in pandas API - Consistency Internal Consistency of API/Behavior labels Apr 30, 2022

rhshadrach added this to the 1.5 milestone Apr 30, 2022

rhshadrach mentioned this issue Apr 30, 2022

BUG: algorithms.factorize moves null values when sort=False #46601

Merged

4 tasks

rhshadrach mentioned this issue May 28, 2022

DEPR: na_sentinel in factorize #47157

Merged

4 tasks

mroeschke closed this as completed in #47157 Jun 24, 2022

rhshadrach changed the title ~~Deprecate na_sentinel being None in pd.factorize, add dropna~~ Deprecate na_sentinel, add use_na_sentinel Oct 30, 2022

This was referenced Dec 17, 2022

[DISCUSSION] Replacing na_sentinel with null mask for Series.label_encoding rapidsai/cudf#5622

Closed

[FEA] Support use_na_sentinel in factorize rapidsai/cudf#6946

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deprecate na_sentinel, add use_na_sentinel #46910

Deprecate na_sentinel, add use_na_sentinel #46910

rhshadrach commented Apr 30, 2022 •

edited

Loading

jbrockmendel commented May 1, 2022

jorisvandenbossche commented May 2, 2022

rhshadrach commented May 2, 2022

jbrockmendel commented May 13, 2022

rhshadrach commented May 14, 2022

rhshadrach commented May 17, 2022

rhshadrach commented May 23, 2022

rhshadrach commented May 24, 2022

jorisvandenbossche commented May 24, 2022

rhshadrach commented May 24, 2022 •

edited

Loading

jbrockmendel commented May 25, 2022

rhshadrach commented May 27, 2022

Deprecate na_sentinel, add use_na_sentinel #46910

Deprecate na_sentinel, add use_na_sentinel #46910

Comments

rhshadrach commented Apr 30, 2022 • edited Loading

jbrockmendel commented May 1, 2022

jorisvandenbossche commented May 2, 2022

rhshadrach commented May 2, 2022

jbrockmendel commented May 13, 2022

rhshadrach commented May 14, 2022

rhshadrach commented May 17, 2022

rhshadrach commented May 23, 2022

rhshadrach commented May 24, 2022

jorisvandenbossche commented May 24, 2022

rhshadrach commented May 24, 2022 • edited Loading

jbrockmendel commented May 25, 2022

rhshadrach commented May 27, 2022

rhshadrach commented Apr 30, 2022 •

edited

Loading

rhshadrach commented May 24, 2022 •

edited

Loading