BUG: algorithms.factorize moves null values when sort=False #46601

rhshadrach · 2022-04-01T16:46:57Z

closes BUG: groupby with nans always places nans last #46584
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

In the example below, result_index has nan moved to the end even though sort=False. This is the order used in every groupby reduction result, and it is why transform currently returns wrong results.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 3, np.nan, 1, 2], 'b': [3, 4, 5, 6, 7]})
print(df.groupby('a', sort=False, dropna=False).grouper.result_index)

# main
Float64Index([1.0, 3.0, 2.0, nan], dtype='float64', name='a')
# this PR
Float64Index([1.0, 3.0, nan, 2.0], dtype='float64', name='a')
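For readers who want the intended ordering spelled out without pandas, here is a minimal pure-Python sketch of "factorize in order of first appearance, nulls included". The helper names `is_null` and `factorize_in_order` are hypothetical and exist only for this illustration; this is not the code in pandas/core/algorithms.py, and the null check is a simplification of pandas' rules.

```python
import math


def is_null(value):
    """Treat None and float NaN as null (a simplification, assumed here)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def factorize_in_order(values, use_na_sentinel=True):
    """Return (codes, uniques) with uniques in order of first appearance.

    Hypothetical helper: with use_na_sentinel=True nulls are coded as -1
    and left out of uniques; with False they get a regular code at the
    position where the first null appears.
    """
    codes = []
    uniques = []
    positions = {}       # key -> index in `uniques`
    null_key = object()  # one shared key so all nulls land in one group
    for value in values:
        if is_null(value):
            if use_na_sentinel:
                codes.append(-1)
                continue
            key = null_key
        else:
            key = value
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(value)
        codes.append(positions[key])
    return codes, uniques


codes, uniques = factorize_in_order(
    [1.0, 3.0, float("nan"), 1.0, 2.0], use_na_sentinel=False
)
print(codes)    # [0, 1, 2, 0, 3]
print(uniques)  # [1.0, 3.0, nan, 2.0]
```

The null group sits third because the first null is the third distinct value encountered, which matches the "this PR" ordering above.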

cc @jbrockmendel @jreback

pandas/core/algorithms.py

rhshadrach · 2022-04-01T16:53:39Z

pandas/tests/test_algos.py

-                np.array([0, 2, 1, 0], dtype=np.dtype("intp")),
-                np.array(["a", "b", np.nan], dtype=object),
+                np.array([0, 1, 2, 0], dtype=np.dtype("intp")),
+                np.array(["a", None, "b"], dtype=object),


The change from np.nan to None here was an unintended side-effect; is this undesirable?

IMO this could be called a bug fix, since a user can now preserve None in their input data when using pd.factorize. Probably good to add a whatsnew note for this change.

Thanks! I completely forgot about this open question. Will do.

pandas/core/algorithms.py

pandas/core/arrays/base.py

rhshadrach · 2022-04-19T22:00:41Z

@jbrockmendel - I've implemented the EA paths. There is a discrepancy between pd.factorize and EA's factorize (as well as factorize_array): the former can take na_sentinel=None, whereas the latter two require na_sentinel to be an integer. Because of this, I added a dropna argument to EA's factorize as well as factorize_array. This argument could also be called ignore_na, but I don't think that name is used anywhere in the public API, so dropna seemed better.

I wonder if we should add dropna to pd.factorize and deprecate na_sentinel=None there. We could also go the other route and allow na_sentinel=None in EA's factorize. No strong opinion, but I find na_sentinel=None meaning "Don't drop null values from uniques; and code nulls as -1" not very clear.

rhshadrach · 2022-04-29T22:02:26Z

@jbrockmendel - gentle ping.

jreback

over to you @jbrockmendel

pandas/core/algorithms.py

jbrockmendel · 2022-04-30T01:44:55Z

pandas/core/arrays/masked.py


        # check that factorize_array correctly preserves dtype.
        assert uniques.dtype == self.dtype.numpy_dtype, (uniques.dtype, self.dtype)

-        uniques_ea = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
+        # Make room for a null value if we're not ignoring it and it exists


would it make sense to share any of this with the ArrowArray version? not for this PR, but could have a TODO

Yes, will add a TODO. Once we drop support for pyarrow < 4.0 we won't need this logic in ArrowArray, but 4.0 is only a year old at this point so that will be a while.
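To make the "make room for a null value" step concrete, here is a rough pure-Python sketch of one way it could work. It assumes the underlying factorization has already coded nulls as -1 and returned the non-null uniques in order of first appearance. The function name `insert_null_group` and its signature are hypothetical and are not taken from masked.py.

```python
def insert_null_group(codes, uniques, null_placeholder=None):
    """Give nulls a real code at their first-appearance position.

    Assumptions for this sketch: `codes` uses -1 for nulls, and `uniques`
    holds only non-null values in order of first appearance.
    Returns (new_codes, new_uniques, mask), where mask marks the null slot.
    """
    if -1 not in codes:
        # No nulls present, so there is nothing to make room for.
        return list(codes), list(uniques), [False] * len(uniques)

    first_null = codes.index(-1)
    # The null's code is the number of distinct values seen before it.
    na_code = max(codes[:first_null], default=-1) + 1

    new_codes = []
    for code in codes:
        if code == -1:
            new_codes.append(na_code)
        elif code >= na_code:
            new_codes.append(code + 1)  # shift groups that come after the null
        else:
            new_codes.append(code)

    new_uniques = list(uniques)
    new_uniques.insert(na_code, null_placeholder)
    mask = [False] * len(new_uniques)
    mask[na_code] = True
    return new_codes, new_uniques, mask


print(insert_null_group([0, 1, -1, 0, 2], [1, 3, 2]))
# ([0, 1, 2, 0, 3], [1, 3, None, 2], [False, False, True, False])
```

The mask plays the role of the boolean array passed alongside the values when building a masked array: the placeholder value under a True entry is never read.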

jbrockmendel · 2022-04-30T01:47:58Z

pandas/core/arrays/string_.py

+    @classmethod
+    def _from_factorized(cls, values, original):
+        assert values.dtype == original._ndarray.dtype
+        # When dropna (i.e. ignore_na) is False, can get -1 from nulls


is there any way we could avoid this?

Thanks - yes. Changed _values_for_factorize from -1 to None.

jbrockmendel · 2022-04-30T01:48:18Z

pandas/core/groupby/grouper.py

            # we make a CategoricalIndex out of the cat grouper
-            # preserving the categories / ordered attributes
+            # preserving the categories / ordered attributes;
+            # doesn't (yet) handle dropna=False


GH ref for the "yet"?

Opened #46909, will add a reference in this comment.

rhshadrach · 2022-07-09T15:55:37Z

@jbrockmendel - friendly ping

jbrockmendel · 2022-07-11T17:10:23Z

doc/source/whatsnew/v1.5.0.rst

@@ -278,6 +278,8 @@ Other enhancements
 - :meth:`DatetimeIndex.astype` now supports casting timezone-naive indexes to ``datetime64[s]``, ``datetime64[ms]``, and ``datetime64[us]``, and timezone-aware indexes to the corresponding ``datetime64[unit, tzname]`` dtypes (:issue:`47579`)
 - :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`)
 - :meth:`RangeIndex.union` now can return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`)
+- The method :meth:`.ExtensionArray.factorize` can now be passed ``use_na_sentinel=False`` for determining how null values are to be treated. (:issue:`46601`)


"be passed" -> "takes" or "accepts", i.e. avoid passive voice

Thanks - fixed.

jbrockmendel · 2022-07-11T21:34:25Z

friendly ping

Thanks for your patience. This is near the top of my queue.

jbrockmendel · 2022-07-21T15:25:37Z

pandas/core/algorithms.py

            # Avoid using catch_warnings when possible
            # GH#46910 - TimelikeOps has deprecated signature
            codes, uniques = values.factorize(  # type: ignore[call-arg]
-                use_na_sentinel=True
+                use_na_sentinel=na_sentinel is not None


im trying to follow this and keep getting lost here. is any of this going to get simpler once deprecations are enforced?

Yes - but how much depends on what you mean by "this". There is a lot of playing around with arguments (what I think you're highlighting here) that will all go away. But what I regard as the main complication this PR introduces, noted in the TODO immediately below, will not go away just with the deprecation.

For that TODO, some changes to safe_sort are necessary. For the bulk of cases the changes are straightforward and small. However, if you have multiple Python types in an object array (e.g. float and string) along with both np.nan and None, I have yet to find a good way to sort. I plan to revisit this in the next few days.

I looked at safe_sort again, and @phofl already solved much of the issue in #47331. I opened a PR into this branch here to see what the execution of the TODO mentioned above would look like: rhshadrach#2

However, changing safe_sort in this way will induce another behavior change:

df = pd.DataFrame({'a': ['x', None, np.nan], 'b': [1, 2, 3]})
print(df.groupby('a', dropna=False).sum())

# main
     b
a
x    1
NaN  5

# feature
      b
a
x     1
None  2
NaN   3

No other tests besides the one changed failed locally. While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.

@jbrockmendel - friendly ping.

While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.

Agreed both that the "feature" behavior looks more correct and that it should be done separate from this PR.

Taking another look now.

(Coming in fairly cold), I think I'm getting lost too here. I am not fully comprehending why we check na_sentinel == -1 or na_sentinel is None and then use use_na_sentinel=na_sentinel is not None

Yea - there are some gymnastics here no doubt. The idea behind this block is to avoid using catch_warnings whenever possible.

Old API: na_sentinel is either an integer or None
New API: use_na_sentinel is False or True

The correspondence is:

use_na_sentinel False is equivalent to na_sentinel being None
use_na_sentinel True is equivalent to na_sentinel being -1

Note that the new API has no equivalent for an old-API na_sentinel other than -1 or None. So we can use the new argument precisely when (a) the function has said argument and (b) na_sentinel is either -1 or None. In such a case, the correspondence from na_sentinel to use_na_sentinel is given by use_na_sentinel = na_sentinel is not None.
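The correspondence can be written down as a small standalone function. This is only an illustrative sketch: `translate_na_sentinel` is a hypothetical name, not a pandas function, and check (a), whether the callee actually accepts use_na_sentinel, is left out.

```python
def translate_na_sentinel(na_sentinel):
    """Map an old-API na_sentinel to the new-API use_na_sentinel.

    Returns (can_translate, use_na_sentinel). Only None and -1 have an
    equivalent in the new API; for any other value the second item is None
    and the caller has to fall back to the old argument.
    """
    if na_sentinel is None or na_sentinel == -1:
        return True, na_sentinel is not None
    return False, None


print(translate_na_sentinel(None))  # (True, False)
print(translate_na_sentinel(-1))    # (True, True)
print(translate_na_sentinel(-2))    # (False, None)
```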

Perfect, just the explanation I needed. Thanks!

I'll add this correspondence as a comment to the top of this function; we can remove it when the deprecation is enforced.

jbrockmendel · 2022-08-14T19:06:50Z

@mroeschke im flailing on reviewing this. can you tag in?

mroeschke

One whatsnew comment, one comprehension comment, merge conflict; otherwise, LGTM

mroeschke

LGTM just one merge conflict

mroeschke · 2022-08-18T16:09:36Z

Great! Thanks for the effort on this @rhshadrach

…ev#46601)

* BUG: algos.factorizes moves null values when sort=False
* Implementation for pyarrow < 4.0
* fixup
* fixups
* test fixup
* type-hints
* improvements
* remove override of _from_factorized in string_.py
* Rework na_sentinel/dropna/ignore_na
* fixups
* fixup for pyarrow < 4.0
* whatsnew note
* doc fixup
* fixups
* fixup whatsnew note
* whatsnew note; comment on old vs new API

Co-authored-by: asv-bot <pandas.benchmarks@gmail.com>

rhshadrach added the Bug, Groupby, and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Apr 1, 2022

jbrockmendel reviewed Apr 1, 2022

BUG: algos.factorizes moves null values when sort=False

670c2e8

rhshadrach force-pushed the factorize_na branch from 8e66efb to 670c2e8 Compare April 19, 2022 21:55

rhshadrach changed the title ~~WIP/BUG: algorithms.factorize moves null values when sort=False~~ BUG: algorithms.factorize moves null values when sort=False Apr 19, 2022

rhshadrach marked this pull request as ready for review April 19, 2022 22:02

rhshadrach added 8 commits April 19, 2022 19:25

Implementation for pyarrow < 4.0

98c6c18

fixup

007329b

fixups

ffaf20c

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

58e5556

…orize_na

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

c600e9a

…orize_na

test fixup

f7326bd

type-hints

b0ec48a

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

395c9cf

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

jreback added this to the 1.5 milestone Apr 26, 2022

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

d0796ed

…orize_na

jreback approved these changes Apr 30, 2022

jbrockmendel reviewed Apr 30, 2022

rhshadrach requested a review from jbrockmendel June 28, 2022 01:30

rhshadrach mentioned this pull request Jun 29, 2022

DOC: missed behavior explaination of sort=False for groupby #47529

Closed

1 task

rhshadrach added 3 commits July 2, 2022 06:58

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

0b85a3d

…orize_na

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

bc3f426

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

57a05a7

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

jbrockmendel reviewed Jul 11, 2022

rhshadrach added 2 commits July 11, 2022 16:34

fixup whatsnew note

9c35dd0

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

a7c3538

…orize_na

rhshadrach added 2 commits July 11, 2022 17:39

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

b27bda0

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

b45ace7

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst, pandas/tests/groupby/test_groupby_dropna.py)

jbrockmendel reviewed Jul 21, 2022

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

c4cfbc6

…orize_na

mroeschke mentioned this pull request Aug 15, 2022

RLS: 1.5 #45223

Closed

mroeschke reviewed Aug 15, 2022

rhshadrach added 2 commits August 16, 2022 20:55

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

ecb182c

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

whatsnew note; comment on old vs new API

7143a52

rhshadrach requested a review from mroeschke August 17, 2022 12:00

mroeschke approved these changes Aug 17, 2022

Merge branch 'main' of https://github.com/pandas-dev/pandas into fact…

82b61b6

…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)

mroeschke merged commit 56a7184 into pandas-dev:main Aug 18, 2022

rhshadrach deleted the factorize_na branch September 10, 2022 16:49

rhshadrach mentioned this pull request Sep 11, 2022

BUG: groupby doesn't identify null values when sort=False #48506

Closed

phofl mentioned this pull request Sep 19, 2022

BUG: dropna affects observed in DataFrame.groupby() since v1.5 #48645

Closed

3 tasks

sm-Fifteen mentioned this pull request Sep 26, 2022

BUG: pandas 1.5 fails to groupby on (nullable) Int64 column with dropna=False #48794

Closed

3 tasks

FenderJazz mentioned this pull request Oct 18, 2022

BUG: groupby aggregation with dropna=False, nullable integer dtype and NA generates NumPy IndexError #49173

Closed

3 tasks
