Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: stabilize sort_values algorithms for Series and time-like Indices #37310

Merged
merged 79 commits into from
Oct 31, 2020
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
79 commits
Select commit Hold shift + click to select a range
151c425
BUG: stabilize sorting in Series.sort_values
AlexKirko Oct 20, 2020
7332255
DOC: add comment to nargsort call in Series.sort_values
AlexKirko Oct 20, 2020
546b9fa
use nargsort with indices: Period, DateTime, TimeDelta
AlexKirko Oct 21, 2020
12eb535
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 21, 2020
9b51d42
mv NaNs to the end of dupe lists in value_counts
AlexKirko Oct 21, 2020
7805d1e
CLN: remove extra comment indents
AlexKirko Oct 21, 2020
2b75f78
attempt to mimic previous count_values behavior by reversing before sort
AlexKirko Oct 21, 2020
2844a97
CLN: clean-up unnecessary import
AlexKirko Oct 21, 2020
965a547
Revert "attempt to mimic previous count_values behavior by reversing …
AlexKirko Oct 21, 2020
29d47ee
TST: alter tests in test_algos
AlexKirko Oct 21, 2020
151196d
TST: alter value_counts dupe order in boolean/test_function
AlexKirko Oct 21, 2020
d71bfc8
mv NaT to end of dupe sort order, alter test_value_counts
AlexKirko Oct 21, 2020
aff28ac
REFACT: use tuple unpacking for element swap
AlexKirko Oct 21, 2020
0b8aae9
DOC: clarify comments in algorithms/value_counts
AlexKirko Oct 22, 2020
e7cebc4
stop forcing NaN-like to be at the end of dupe order
AlexKirko Oct 22, 2020
06931e0
TST: NaN-like is now first among duplicates in count_values
AlexKirko Oct 22, 2020
0b24c3e
CLN: remove unnecessary is_bool import in series.py
AlexKirko Oct 22, 2020
1b98bff
TST: value_counts NaN dupe order change in test_string.py
AlexKirko Oct 22, 2020
4076f0c
TST: value_counts NaN dupe order in test_value_counts.py
AlexKirko Oct 22, 2020
08aadd3
CLN: rm unnecessary assignment from test_value_counts
AlexKirko Oct 22, 2020
6f904e6
TST: expect stable sort in extension/base/methods.py
AlexKirko Oct 22, 2020
5c7eea9
BUG: support objs that raise when cast to their class
AlexKirko Oct 22, 2020
75aad12
TST: fix stable sort expectation in test_sort_values in methods.py
AlexKirko Oct 22, 2020
e503dca
TST: change top expect for dupe counts in frame/test_describe
AlexKirko Oct 22, 2020
759de34
BUG: clean up crutches in Series.nlargest
AlexKirko Oct 22, 2020
12741f2
TST: change dupe order expect in frame/methods/test_value_counts
AlexKirko Oct 22, 2020
d433952
TST: change dupe order expectation in indexes/datetimes/test_ops
AlexKirko Oct 22, 2020
dc906df
TST: specify na-position in indexes/datetimes/test_ops.py
AlexKirko Oct 22, 2020
dbf295e
TST: specify na-position in indexes/period/test_ops.py
AlexKirko Oct 22, 2020
cbe528e
TST: remove xfail from test_order_stability_compat
AlexKirko Oct 22, 2020
6658b73
TST: change dupe order expect for indexes/timedeltas/test_ops
AlexKirko Oct 22, 2020
076fa7a
BUG: reintroduce ascending param error-catching to sort_values
AlexKirko Oct 22, 2020
cdf63a3
BUG: reintroduce proper key func support to Series.sort_values
AlexKirko Oct 22, 2020
dd30ec9
BUG: fix bug in key func support
AlexKirko Oct 22, 2020
5cac5c7
TST: alter dupe order expect in series/methods/test_value_counts
AlexKirko Oct 22, 2020
7a68a45
CLN: remove unused variable in Series.sort_values
AlexKirko Oct 23, 2020
4c72dbf
BUG: set values in key func support in Series.sort_values
AlexKirko Oct 23, 2020
3599591
DOC: add whatsnew
AlexKirko Oct 23, 2020
3ee9b8b
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 23, 2020
d63293c
BUG: add SparseArray sorting with key func support
AlexKirko Oct 23, 2020
a5c8f65
TST: remove datetime-like xfails when testing indices with missing
AlexKirko Oct 23, 2020
fc90ea9
TST: fix expect dupe sort order in doctests
AlexKirko Oct 23, 2020
408abe0
BUG: fix bug in Series reconstruction after sorting
AlexKirko Oct 23, 2020
5f53cfc
TST: fix expect dupe sort order in more doctests
AlexKirko Oct 23, 2020
4370f27
BUG: support key func changing ndarray dtype
AlexKirko Oct 23, 2020
6099344
REFACT: reuse keyed
AlexKirko Oct 23, 2020
bfa2b28
TST: remove datetime-like xfails from invalid_na_position
AlexKirko Oct 23, 2020
bc004ec
TST: fix expect dupe sort order in more doctests in base.py
AlexKirko Oct 23, 2020
cd7111e
CLN: remove unnecessary imports in indexes/test_common.py
AlexKirko Oct 23, 2020
5fbbf7d
CLN: run black
AlexKirko Oct 23, 2020
95b9f25
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 23, 2020
1dc9b89
incorporate both conflicting whatsnew
AlexKirko Oct 27, 2020
2946e46
try removing if in ensure_key_mapped to find tests
AlexKirko Oct 27, 2020
488596c
BUG: just apply key func to the whole Series in sort_values
AlexKirko Oct 27, 2020
c8cd8cd
remove legacy try/except with default to quicksort
AlexKirko Oct 27, 2020
ab71697
DOC: expand whatsnew
AlexKirko Oct 27, 2020
05f60d5
CLN: remove unnecessary SparseArray import
AlexKirko Oct 27, 2020
8669e89
restart tests
AlexKirko Oct 27, 2020
00b454c
restart tests again (windows_np18 misbehaving)
AlexKirko Oct 27, 2020
5d5f3d0
restart tests again
AlexKirko Oct 27, 2020
3d7f47c
DOC: add information on sorting NaTs to whatsnew
AlexKirko Oct 27, 2020
351a003
DOC: phrasing change in whatsnew
AlexKirko Oct 27, 2020
8de6ac9
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 28, 2020
18bb141
DOC: clarify whatsnew
AlexKirko Oct 28, 2020
9b97302
CLN: clean up unnecessary newlines in sorting.py
AlexKirko Oct 28, 2020
3a88ebe
DOC: clarify whatsnew some more
AlexKirko Oct 28, 2020
d41789e
bring back na_position validation in Series.sort_values
AlexKirko Oct 28, 2020
812f312
DOC: add to whatsnew
AlexKirko Oct 28, 2020
e28ce4d
DOC: clarify NaTs sorting changes in whatsnew
AlexKirko Oct 28, 2020
0719633
DOC: add other api changes to whatsnew; move doc there
AlexKirko Oct 28, 2020
61ac60d
CLN: run black
AlexKirko Oct 28, 2020
d495064
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 29, 2020
36932cd
DOC: attempt fixing malformed link in whatsnew
AlexKirko Oct 29, 2020
2156c64
Revert "DOC: attempt fixing malformed link in whatsnew"
AlexKirko Oct 29, 2020
e6f5741
DOC: fix broken link in whatsnew
AlexKirko Oct 29, 2020
c823043
restart tests
AlexKirko Oct 29, 2020
cd66748
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 30, 2020
37a6439
REFACT: clean up key if/else in Series.sort_values
AlexKirko Oct 31, 2020
d09da99
Merge branch 'master' into stable-dupe-sort
AlexKirko Oct 31, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 12 additions & 2 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -822,9 +822,19 @@ def value_counts_arraylike(values, dropna: bool):

mask = isna(values)
if not dropna and mask.any():
# GH 35922. Series.sort_values is stable now, so need to
# append NaN counts or move to the end to make sure they are
# sorted toward the end when calling value_counts
if not isna(keys).any():
keys = np.insert(keys, 0, np.NaN)
counts = np.insert(counts, 0, mask.sum())
keys = np.append(keys, np.NaN)
counts = np.append(counts, mask.sum())
else:
nan_pos = np.where(np.isnan(keys))
keys[nan_pos] = keys[-1]
keys[-1] = np.NaN
tmp = counts[nan_pos]
counts[nan_pos] = counts[-1]
counts[-1] = tmp
AlexKirko marked this conversation as resolved.
Show resolved Hide resolved

keys = _reconstruct_data(keys, original.dtype, original)

Expand Down
4 changes: 1 addition & 3 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -4510,9 +4510,7 @@ def sort_values(

# GH 35584. Sort missing values according to na_position kwarg
# ignore na_position for MultiIndex
if not isinstance(
self, (ABCMultiIndex, ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex)
):
if not isinstance(self, ABCMultiIndex):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support everything except MultiIndex (see PR header).

_as = nargsort(
items=idx, ascending=ascending, na_position=na_position, key=key
)
Expand Down
48 changes: 3 additions & 45 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@
from pandas.core.indexing import check_bool_indexer
from pandas.core.internals import SingleBlockManager
from pandas.core.shared_docs import _shared_docs
from pandas.core.sorting import ensure_key_mapped
from pandas.core.sorting import nargsort
from pandas.core.strings import StringMethods
from pandas.core.tools.datetimes import to_datetime

Expand Down Expand Up @@ -3274,52 +3274,10 @@ def sort_values(
"sort in-place you must create a copy"
)

def _try_kind_sort(arr):
arr = ensure_key_mapped(arr, key)
arr = getattr(arr, "_values", arr)

# easier to ask forgiveness than permission
try:
# if kind==mergesort, it can fail for object dtype
return arr.argsort(kind=kind)
except TypeError:
# stable sort not available for object dtype
# uses the argsort default quicksort
return arr.argsort(kind="quicksort")

arr = self._values
sorted_index = np.empty(len(self), dtype=np.int32)

bad = isna(arr)

good = ~bad
idx = ibase.default_index(len(self))

argsorted = _try_kind_sort(self[good])

if is_list_like(ascending):
if len(ascending) != 1:
raise ValueError(
f"Length of ascending ({len(ascending)}) must be 1 for Series"
)
ascending = ascending[0]

if not is_bool(ascending):
raise ValueError("ascending must be boolean")

if not ascending:
argsorted = argsorted[::-1]

if na_position == "last":
n = good.sum()
sorted_index[:n] = idx[good][argsorted]
sorted_index[n:] = idx[bad]
elif na_position == "first":
n = bad.sum()
sorted_index[n:] = idx[good][argsorted]
sorted_index[:n] = idx[bad]
else:
raise ValueError(f"invalid na_position: {na_position}")
# GH 35922. Make sorting stable by leveraging nargsort
sorted_index = nargsort(arr, kind, ascending, na_position, key)

result = self._constructor(arr[sorted_index], index=self.index[sorted_index])

Expand Down
11 changes: 10 additions & 1 deletion pandas/core/sorting.py
Original file line number Diff line number Diff line change
Expand Up @@ -374,7 +374,16 @@ def nargsort(
if not ascending:
non_nans = non_nans[::-1]
non_nan_idx = non_nan_idx[::-1]
indexer = non_nan_idx[non_nans.argsort(kind=kind)]

# GH 35922. Move support for object sort here from Series.sort_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is not super useful, what is an expl of why you need this try/except e.g. L380 move here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant, since I removed the try/except.

try:
# if kind==mergesort, it can fail for object dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can this fail?

Copy link
Member Author

@AlexKirko AlexKirko Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I moved this code from Series.sort_values, fearing that I don't understand why it is necessary, but it doesn't seem to be. I've removed the try/except, tested sorting Series of object with different kind settings, and ran the test suite, and nothing is going wrong. I think we can safely drop it.

indexer = non_nan_idx[non_nans.argsort(kind=kind)]
except TypeError:
# stable sort not available for object dtype
# uses the argsort default quicksort
indexer = non_nan_idx[non_nans.argsort(kind="quicksort")]

if not ascending:
indexer = indexer[::-1]
# Finally, place the NaNs at the end or the beginning according to
Expand Down