BUG: stabilize sort_values algorithms for Series and time-like Indices #37310

AlexKirko · 2020-10-21T12:52:49Z

closes BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False #35922
18 tests changed / 18 passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Problem

Our sort_values functions currently behave differently for different objects: for most Index subclasses they are stable when sorting in descending order (this was introduced by #35604), but for DateTime-like Index subclasses and Series they are unstable. This isn't good as sorting should be stable across the board.
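To make the distinction concrete, here is a minimal plain-Python sketch (not the pandas implementation; both helper names are made up for illustration). Reversing an ascending stable sort also reverses each run of duplicates, whereas a stable descending sort keeps duplicates in their original order.

```python
def argsort_desc_by_reversing(values):
    """Stable ascending argsort, then reverse: duplicates come out in reversed order."""
    ascending = sorted(range(len(values)), key=lambda i: values[i])
    return ascending[::-1]


def argsort_desc_stable(values):
    """Descending argsort that keeps equal values in their original order."""
    # Python's sorted() keeps equal elements in original order even with reverse=True.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


values = [2, 1, 2, 1]
reversed_order = argsort_desc_by_reversing(values)  # [2, 0, 3, 1]
stable_order = argsort_desc_stable(values)          # [0, 2, 1, 3]
```

In the first result the two 2s appear as positions 2 then 0, which is the unstable behavior this PR is about; in the second they appear as 0 then 2.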

Details

Came across this one while introducing missing-value support to Index.sort_values in #35604, so I had to limit that PR to non-DateTime-like Index subclasses. The problem was that we had different expectations for sorting stability baked into our test suite, so unifying sorting algorithms and missing-value support needed a bunch of careful test changes and altering both sort_values and algorithms in sorting.py.

Since this PR necessarily includes changes in several places, I have commented on all the changes made in the code and the unusual changes in the tests to make reviewing the code easier (see "On test changes" below).

On test changes

Most changes I made in the tests are for cases where we were expecting an unstable sort, or expected NaNs to be sorted to the beginning of a list of duplicates for ascending sort and to the end for descending (we forced this by inserting NaN-likes at position 0 and reversing when sorting in descending order in Series.sort_values).

Default behavior changes

Since DateTime-like Index subclasses now support na_position using the same implementation as the other Index subclasses, they now sort missing values to the end of the Index by default.
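As a rough illustration of what respecting na_position means, here is a plain-Python sketch. The function nargsort_sketch and the helper is_missing are hypothetical names, and None / float NaN stand in for NaT; this is not the actual nargsort code in sorting.py.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a stand-in for NaT in this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nargsort_sketch(values, ascending=True, na_position="last"):
    """Return positions that sort `values` stably, with missing values grouped."""
    if na_position not in ("first", "last"):
        raise ValueError(f"invalid na_position: {na_position}")
    missing = [i for i, v in enumerate(values) if is_missing(v)]
    present = [i for i, v in enumerate(values) if not is_missing(v)]
    # Stable sort of the non-missing positions; reverse=True keeps duplicate order.
    present.sort(key=lambda i: values[i], reverse=not ascending)
    return missing + present if na_position == "first" else present + missing


values = [3, None, 1, 3, None]
default_order = nargsort_sketch(values, ascending=False)                    # [0, 3, 2, 1, 4]
na_first = nargsort_sketch(values, ascending=False, na_position="first")    # [1, 4, 0, 3, 2]
```

With the default na_position="last", the missing entries end up at the end regardless of sort direction, which is the new default described above.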

Performance

Ran the full benchmark suite, and there are no performance regressions.

Out-of-scope

The only type I haven't touched so far is MultiIndex. It can't be sorted the same way through nargsort, and I don't think we should be doing it in this PR, if at all (stabilizing descending-order MultiIndex.sort_values will definitely be a PITA, and it's a very narrow use case, in my opinion).

pep8speaks · 2020-10-21T12:52:54Z

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-31 04:50:22 UTC

AlexKirko · 2020-10-21T14:04:24Z

Stabilized the sorts, now looking for good ways to deal with 134 failing tests. Will likely take me a couple more days. Some tests have to be altered, but I'd like to change as few tests as necessary to get this done.

pandas/core/algorithms.py

NaT is expected to be last in the list of duplicate value counts; guarantee that by finding it and moving it to the end of the array (consider giving this up: the code ends up cluttered). Also alter test_value_counts to make sure that expectations match the new sort_values stability.

AlexKirko · 2020-10-28T08:29:41Z

@jreback
I think this is ready for another look.

No longer process missing values separately in Series.sort_values, just pass everything to ensure_key_mapped.
Since we now pass the Series instead of the underlying array, we can now safely cast in ensure_key_mapped.
Clean up legacy try / except in nargsort (doesn't affect test suite, plus tested manually).
Answer @jbrockmendel 's question and add the answer to the issue for future reference.
Rewrite and expand whatsnew, create "Other API changes" section in whatsnew (under Enhancements), and put it there.

More detailed answers to your comments in replies above.

The test failures on Win py38 and on 32-bit are happening for many PRs and have nothing to do with this one (the Win py38 failures are being troubleshot in #37455, and the 32-bit failures are being examined in #37473).

~~Update: Web and Docs shouldn't be complaining, probably missed something. Checking them out.~~ Had forgotten a colon. Fixed it.

AlexKirko · 2020-10-30T07:54:12Z

@jreback
Most of the unrelated stuff has been fixed and merged. The Windows py38_np18 pipeline still fails, but that's unrelated and being examined in #37455 by @jbrockmendel

jreback

really a nice cleanup. just a small code comment, ping on green-ish.

jreback · 2020-10-30T14:53:07Z

pandas/core/series.py

        else:
-            raise ValueError(f"invalid na_position: {na_position}")
+            sorted_index = nargsort(self._values, kind, ascending, na_position)


great, i think can write like

    values_to_sort = ensure_key_mapped(self, key)._values if key else self._values
    sorted_index = nargsort(values_to_sort, kind, ascending, na_position)
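The shape of that suggestion (map through the key if there is one, then hand the mapped values to a single stable sort) can be sketched in plain Python. The function sort_values_sketch is a hypothetical stand-in, not Series.sort_values, and it ignores missing values entirely.

```python
def sort_values_sketch(values, key=None, ascending=True):
    """Sort `values` stably, optionally comparing on key(value) instead of value."""
    # Mirror the one-liner: use the key-mapped values if a key was given.
    values_to_sort = [key(v) for v in values] if key else list(values)
    order = sorted(
        range(len(values)),
        key=lambda i: values_to_sort[i],
        reverse=not ascending,
    )
    return [values[i] for i in order]


words = ["pear", "fig", "plum", "kiwi", "yam"]
by_length_desc = sort_values_sketch(words, key=len, ascending=False)
# ['pear', 'plum', 'kiwi', 'fig', 'yam']
```

Here the key (len) produces duplicates, and the words of equal length keep their original relative order even in descending order.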

jreback · 2020-10-30T14:53:21Z

pandas/core/series.py

        else:
-            raise ValueError(f"invalid na_position: {na_position}")
+            sorted_index = nargsort(self._values, kind, ascending, na_position)


leave your comment on L3290 as well

jreback · 2020-10-30T14:54:05Z

pandas/tests/extension/base/methods.py

+            if ser.nunique() == 2:
+                expected = ser.iloc[[0, 1, 2]]
+            else:
+                expected = ser.iloc[[1, 0, 2]]


ok this is fine

AlexKirko · 2020-10-31T05:52:15Z

@jreback
Made the change, all green-ish.

jreback · 2020-10-31T14:49:19Z

thanks @AlexKirko very nice!

jorisvandenbossche · 2020-12-23T17:46:05Z

doc/source/whatsnew/v1.2.0.rst

+Other API changes
+^^^^^^^^^^^^^^^^^
+
+- Sorting in descending order is now stable for :meth:`Series.sort_values` and :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses. This will affect sort order when sorting :class:`DataFrame` on multiple columns, sorting with a key function that produces duplicates, or requesting the sorting index when using :meth:`Index.sort_values`. When using :meth:`Series.value_counts`, count of missing values is no longer the last in the list of duplicate counts, and its position corresponds to the position in the original :class:`Series`. When using :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses, NaTs ignored the ``na_position`` argument and were sorted to the beginning. Now they respect ``na_position``, the default being ``last``, same as other :class:`Index` subclasses. (:issue:`35922`)


Hey @AlexKirko, is it possible this is not only changed for datetime-like Index subclasses?

I see on released pandas this:

In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]:
1    2
0    2
dtype: int64

while on master / 1.2.0rc:

In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]:
0    2
1    2
dtype: int64

Of course, it's the resulting Series that is sorted, not the Index. And I suppose for Series it's for all types?

(this change turned out to uncover a bug in statsmodels, statsmodels/statsmodels#7215)
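A plain-Python sketch of why tied counts can come out in either order (the helper names are hypothetical and this is not the value_counts implementation): counting keeps values in first-seen order, so the order of ties depends only on whether the descending sort of the counts is stable or is an ascending sort that gets reversed.

```python
from collections import Counter


def value_counts_stable(values):
    """Count values, then sort by count descending with a stable sort."""
    counts = Counter(values)  # keeps first-seen order of the distinct values
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def value_counts_reversed(values):
    """Count values, sort by count ascending, then reverse the result."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: item[1])[::-1]


stable_result = value_counts_stable([0, 1, 0, 1])      # [(0, 2), (1, 2)]
reversed_result = value_counts_reversed([0, 1, 0, 1])  # [(1, 2), (0, 2)]
```

The stable variant lists 0 before 1, matching the order of first appearance; the reversed variant flips the tie. Code that relies on a particular order among tied counts would see a difference between the two.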

AlexKirko added 5 commits October 20, 2020 14:26

BUG: stabilize sorting in Series.sort_values

151c425

DOC: add comment to nargsort call in Series.sort_values

7332255

use nargsort with indices: Period, DateTime, TimeDelta

546b9fa

Merge branch 'master' into stable-dupe-sort

12eb535

mv NaNs to the end of dupe lists in value_counts

9b51d42

CLN: remove extra comment indents

7805d1e

jreback added the Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) label Oct 21, 2020

attempt to mimic previous count_values behavior by reversing before sort

2b75f78

AlexKirko added the Bug label Oct 21, 2020

AlexKirko added this to the 1.2 milestone Oct 21, 2020

AlexKirko added 3 commits October 21, 2020 17:08

CLN: clean-up unnecessary import

2844a97

Revert "attempt to mimic previous count_values behavior by reversing …

965a547

…before sort" This reverts commit 2b75f78.

TST: alter tests in test_algos

29d47ee

AlexKirko commented Oct 21, 2020

pandas/core/algorithms.py (outdated)

AlexKirko added 14 commits October 21, 2020 17:58

TST: alter value_counts dupe order in boolean/test_function

151196d

REFACT: use tuple unpacking for element swap

aff28ac

DOC: clarify comments in algorithms/value_counts

0b8aae9

stop forcing NaN-like to be at the end of dupe order

e7cebc4

TST: NaN-like is now first among duplicates in count_values

06931e0

CLN: remove unnecessary is_bool import in series.py

0b24c3e

TST: value_counts NaN dupe order change in test_string.py

1b98bff

TST: value_counts NaN dupe order in test_value_counts.py

4076f0c

CLN: rm unnecessary assignment from test_value_counts

08aadd3

TST: expect stable sort in extension/base/methods.py

6f904e6

BUG: support objs that raise when cast to their class

5c7eea9

TST: fix stable sort expectation in test_sort_values in methods.py

75aad12

TST: change top expect for dupe counts in frame/test_describe

e503dca

AlexKirko added 9 commits October 28, 2020 10:03

Merge branch 'master' into stable-dupe-sort

8de6ac9

DOC: clarify whatsnew

18bb141

CLN: clean up unnecessary newlines in sorting.py

9b97302

Also bring back the cast comment

DOC: clarify whatsnew some more

3a88ebe

bring back na_position validation in Series.sort_values

d41789e

DOC: add to whatsnew

812f312

DOC: clarify NaTs sorting changes in whatsnew

e28ce4d

DOC: add other api changes to whatsnew; move doc there

0719633

CLN: run black

61ac60d

AlexKirko requested a review from jreback October 28, 2020 08:29

AlexKirko added 6 commits October 29, 2020 09:39

Merge branch 'master' into stable-dupe-sort

d495064

DOC: attempt fixing malformed link in whatsnew

36932cd

Revert "DOC: attempt fixing malformed link in whatsnew"

2156c64

This reverts commit 36932cd.

DOC: fix broken link in whatsnew

e6f5741

restart tests

c823043

Merge branch 'master' into stable-dupe-sort

cd66748

jreback requested changes Oct 30, 2020

AlexKirko added 2 commits October 31, 2020 07:48

REFACT: clean up key if/else in Series.sort_values

37a6439

Merge branch 'master' into stable-dupe-sort

d09da99

AlexKirko requested a review from jreback October 31, 2020 05:51

jreback approved these changes Oct 31, 2020

jreback merged commit 109ee11 into pandas-dev:master Oct 31, 2020

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: stabilize sort_values algorithms for Series and time-like Indices (

2d41cf9

pandas-dev#37310)

ukarroum pushed a commit to ukarroum/pandas that referenced this pull request Nov 2, 2020

BUG: stabilize sort_values algorithms for Series and time-like Indices (

b9c9afe

pandas-dev#37310)

AlexKirko deleted the stable-dupe-sort branch November 5, 2020 08:13

jorisvandenbossche reviewed Dec 23, 2020
