Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stabilize sort_values algorithms for Series and time-like Indices #37310

Merged
merged 79 commits into from
Oct 31, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Oct 21, 2020

Problem

Our sort_values functions currently behave differently for different objects: for most Index subclasses they are stable when sorting in descending order (this was introduced by #35604), but for DateTime-like Index subclasses and Series they are unstable. This isn't good as sorting should be stable across the board.

Details

Came across this one while introducing missing-value support to Index.sort_values in #35604, so I had to limit that PR to non-DateTime-like Index subclasses. The problem was that we had different expectations for sorting stability baked into our test suite, so unifying sorting algorithms and missing-value support needed a bunch of careful test changes and altering both sort_values and algorithms in sorting.py.

Since this PR necessarily includes changes in several places, I have commented on all the changes made in the code and the unusual changes in the tests to make reviewing the code easier (see "On test changes" below).

On test changes

Most changes I made in the tests are for cases where we were expecting an unstable sort or expected NaNs to be sorted to the beginning of a list of duplicates for ascending sort and to the end for descending (we forced this by inserting NaN-likes at 0 position and reversing when sorting in descending order in Series.sort_values).

Default behavior changes

Since DateTime-like Index subclasses now support na_position using the same implementation as the other Index subclasses, they now sort missing values to the end of the Index by default.

Performance

Ran the full benchmark suite, and there are no performance regressions.

Out-of-scope

The only type I didn't touch so far is MultiIndex. It can't be sorted the same way through nargsort, and I don't think we should be doing it in this PR, if it all (stabilizing descending order MultiIndex.sort_values will definitely be a PITA, and it's a very narrow use case, in my opinion).

@pep8speaks
Copy link

pep8speaks commented Oct 21, 2020

Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-31 04:50:22 UTC

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Oct 21, 2020
@AlexKirko AlexKirko added the Bug label Oct 21, 2020
@AlexKirko AlexKirko added this to the 1.2 milestone Oct 21, 2020
@AlexKirko
Copy link
Member Author

Stabilized the sorts, now looking for good ways to deal with 134 failing tests. Will likely take me a couple more days. Some tests have to be altered, but I'd like to change as few tests as necessary to get this done.

pandas/core/algorithms.py Outdated Show resolved Hide resolved
@AlexKirko
Copy link
Member Author

AlexKirko commented Oct 28, 2020

@jreback
I think this is ready for another look.

  1. No longer process missing values separately in Series.sort_values, just pass everything to ensure_key_mapping.
  2. Since we now pass the Series instead of the underlying array, we can now safely cast in ensure_key_mapping.
  3. Clean up legacy try / except in nargsort (doesn't affect test suite, plus tested manually).
  4. Answer @jbrockmendel 's question and add the answer to the issue for future reference.
  5. Rewrite and expand whatsnew, create "Other API changes" section in whatsnew (under Enhancements), and put it there.

More detailed answers to your comments in replies above.

The test fails on Win py38 and on 32-bit are happening for many PRs and have nothing to do with this one (Win 38 fails are going through troubleshooting in #37455, 32-bit fails are examined in #37473)

Update: Web and Docs shouldn't be complaining, probably missed something. Checking them out. Had forgotten a colon. Fixed it.

@AlexKirko
Copy link
Member Author

AlexKirko commented Oct 30, 2020

@jreback
Most of the unrelated stuff has been fixed and merged. The Windows py38_np18 pipeline still fails, but that's unrelated and being examined in #37455 by @jbrockmendel

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really a nice cleanup. just a small code comment, ping on green-ish.

else:
raise ValueError(f"invalid na_position: {na_position}")
sorted_index = nargsort(self._values, kind, ascending, na_position)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, i think can write like

values_to_sort = ensure_key_mapped(self, key)._values if key else self._values
sorted_index = nargsort(self._values, kind, ascending, na_position)

else:
raise ValueError(f"invalid na_position: {na_position}")
sorted_index = nargsort(self._values, kind, ascending, na_position)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave your comment on L3290 as well

if ser.nunique() == 2:
expected = ser.iloc[[0, 1, 2]]
else:
expected = ser.iloc[[1, 0, 2]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is fine

@AlexKirko
Copy link
Member Author

@jreback
Made the change, all green-ish.

@jreback jreback merged commit 109ee11 into pandas-dev:master Oct 31, 2020
@jreback
Copy link
Contributor

jreback commented Oct 31, 2020

thanks @AlexKirko very nice!

@AlexKirko AlexKirko deleted the stable-dupe-sort branch November 5, 2020 08:13
Other API changes
^^^^^^^^^^^^^^^^^

- Sorting in descending order is now stable for :meth:`Series.sort_values` and :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses. This will affect sort order when sorting :class:`DataFrame` on multiple columns, sorting with a key function that produces duplicates, or requesting the sorting index when using :meth:`Index.sort_values`. When using :meth:`Series.value_counts`, count of missing values is no longer the last in the list of duplicate counts, and its position corresponds to the position in the original :class:`Series`. When using :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses, NaTs ignored the ``na_position`` argument and were sorted to the beggining. Now they respect ``na_position``, the default being ``last``, same as other :class:`Index` subclasses. (:issue:`35992`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @AlexKirko, is it possible this is not only changed for datetime-like Index subclasses?

I see on released pandas this:

In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]: 
1    2
0    2
dtype: int64

while on master / 1.2.0rc:

In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]: 
0    2
1    2
dtype: int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, it's the resulting Series that is sorted, not the Index. And I suppose for Series it's for all types?

(this change turned out to uncover a bug in statsmodels, statsmodels/statsmodels#7215)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Index.sort_values and Series.sort_values reverse duplicate order when ascending=False
5 participants