-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: stabilize sort_values algorithms for Series and time-like Indices #37310
Conversation
Hello @AlexKirko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-31 04:50:22 UTC |
Stabilized the sorts, now looking for good ways to deal with 134 failing tests. Will likely take me a couple more days. Some tests have to be altered, but I'd like to change as few tests as necessary to get this done. |
…before sort" This reverts commit 2b75f78.
NaT is expected to be last in th elist of duplicate value counts guarantee that by finding it and moving to the end of the array (consider giving it up: the code ends up cluttered) Also alter test_value_counts to make sure that expectations match new sort_values stability
Also bring back the cast comment
@jreback
More detailed answers to your comments in replies above. The test fails on Win py38 and on 32-bit are happening for many PRs and have nothing to do with this one (Win 38 fails are going through troubleshooting in #37455, 32-bit fails are examined in #37473)
|
This reverts commit 36932cd.
@jreback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really a nice cleanup. just a small code comment, ping on green-ish.
pandas/core/series.py
Outdated
else: | ||
raise ValueError(f"invalid na_position: {na_position}") | ||
sorted_index = nargsort(self._values, kind, ascending, na_position) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, i think can write like
values_to_sort = ensure_key_mapped(self, key)._values if key else self._values
sorted_index = nargsort(self._values, kind, ascending, na_position)
pandas/core/series.py
Outdated
else: | ||
raise ValueError(f"invalid na_position: {na_position}") | ||
sorted_index = nargsort(self._values, kind, ascending, na_position) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave your comment on L3290 as well
if ser.nunique() == 2: | ||
expected = ser.iloc[[0, 1, 2]] | ||
else: | ||
expected = ser.iloc[[1, 0, 2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is fine
@jreback |
thanks @AlexKirko very nice! |
Other API changes | ||
^^^^^^^^^^^^^^^^^ | ||
|
||
- Sorting in descending order is now stable for :meth:`Series.sort_values` and :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses. This will affect sort order when sorting :class:`DataFrame` on multiple columns, sorting with a key function that produces duplicates, or requesting the sorting index when using :meth:`Index.sort_values`. When using :meth:`Series.value_counts`, count of missing values is no longer the last in the list of duplicate counts, and its position corresponds to the position in the original :class:`Series`. When using :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses, NaTs ignored the ``na_position`` argument and were sorted to the beggining. Now they respect ``na_position``, the default being ``last``, same as other :class:`Index` subclasses. (:issue:`35992`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @AlexKirko, is it possible this is not only changed for datetime-like Index subclasses?
I see on released pandas this:
In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]:
1 2
0 2
dtype: int64
while on master / 1.2.0rc:
In [2]: pd.Index([0, 1, 0, 1]).value_counts()
Out[2]:
0 2
1 2
dtype: int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, it's the resulting Series that is sorted, not the Index. And I suppose for Series it's for all types?
(this change turned out to uncover a bug in statsmodels, statsmodels/statsmodels#7215)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Problem
Our
sort_values
functions currently behave differently for different objects: for mostIndex
subclasses they are stable when sorting in descending order (this was introduced by #35604), but for DateTime-likeIndex
subclasses andSeries
they are unstable. This isn't good as sorting should be stable across the board.Details
Came across this one while introducing missing-value support to
Index.sort_values
in #35604, so I had to limit that PR to non-DateTime-likeIndex
subclasses. The problem was that we had different expectations for sorting stability baked into our test suite, so unifying sorting algorithms and missing-value support needed a bunch of careful test changes and altering bothsort_values
and algorithms insorting.py
.Since this PR necessarily includes changes in several places, I have commented on all the changes made in the code and the unusual changes in the tests to make reviewing the code easier (see "On test changes" below).
On test changes
Most changes I made in the tests are for cases where we were expecting an unstable sort or expected NaNs to be sorted to the beginning of a list of duplicates for ascending sort and to the end for descending (we forced this by inserting NaN-likes at 0 position and reversing when sorting in descending order in
Series.sort_values
).Default behavior changes
Since DateTime-like
Index
subclasses now supportna_position
using the same implementation as the otherIndex
subclasses, they now sort missing values to the end of theIndex
by default.Performance
Ran the full benchmark suite, and there are no performance regressions.
Out-of-scope
The only type I didn't touch so far is MultiIndex. It can't be sorted the same way through nargsort, and I don't think we should be doing it in this PR, if it all (stabilizing descending order MultiIndex.sort_values will definitely be a PITA, and it's a very narrow use case, in my opinion).