BUG: stabilize sort_values algorithms for Series and time-like Indices #37310

Merged Oct 31, 2020 · 79 commits
Changes from 48 commits

Commits
All commits are by AlexKirko.

151c425  BUG: stabilize sorting in Series.sort_values (Oct 20, 2020)
7332255  DOC: add comment to nargsort call in Series.sort_values (Oct 20, 2020)
546b9fa  use nargsort with indices: Period, DateTime, TimeDelta (Oct 21, 2020)
12eb535  Merge branch 'master' into stable-dupe-sort (Oct 21, 2020)
9b51d42  mv NaNs to the end of dupe lists in value_counts (Oct 21, 2020)
7805d1e  CLN: remove extra comment indents (Oct 21, 2020)
2b75f78  attempt to mimic previous count_values behavior by reversing before sort (Oct 21, 2020)
2844a97  CLN: clean-up unnecessary import (Oct 21, 2020)
965a547  Revert "attempt to mimic previous count_values behavior by reversing … (Oct 21, 2020)
29d47ee  TST: alter tests in test_algos (Oct 21, 2020)
151196d  TST: alter value_counts dupe order in boolean/test_function (Oct 21, 2020)
d71bfc8  mv NaT to end of dupe sort order, alter test_value_counts (Oct 21, 2020)
aff28ac  REFACT: use tuple unpacking for element swap (Oct 21, 2020)
0b8aae9  DOC: clarify comments in algorithms/value_counts (Oct 22, 2020)
e7cebc4  stop forcing NaN-like to be at the end of dupe order (Oct 22, 2020)
06931e0  TST: NaN-like is now first among duplicates in count_values (Oct 22, 2020)
0b24c3e  CLN: remove unnecessary is_bool import in series.py (Oct 22, 2020)
1b98bff  TST: value_counts NaN dupe order change in test_string.py (Oct 22, 2020)
4076f0c  TST: value_counts NaN dupe order in test_value_counts.py (Oct 22, 2020)
08aadd3  CLN: rm unnecessary assignment from test_value_counts (Oct 22, 2020)
6f904e6  TST: expect stable sort in extension/base/methods.py (Oct 22, 2020)
5c7eea9  BUG: support objs that raise when cast to their class (Oct 22, 2020)
75aad12  TST: fix stable sort expectation in test_sort_values in methods.py (Oct 22, 2020)
e503dca  TST: change top expect for dupe counts in frame/test_describe (Oct 22, 2020)
759de34  BUG: clean up crutches in Series.nlargest (Oct 22, 2020)
12741f2  TST: change dupe order expect in frame/methods/test_value_counts (Oct 22, 2020)
d433952  TST: change dupe order expectation in indexes/datetimes/test_ops (Oct 22, 2020)
dc906df  TST: specify na-position in indexes/datetimes/test_ops.py (Oct 22, 2020)
dbf295e  TST: specify na-position in indexes/period/test_ops.py (Oct 22, 2020)
cbe528e  TST: remove xfail from test_order_stability_compat (Oct 22, 2020)
6658b73  TST: change dupe order expect for indexes/timedeltas/test_ops (Oct 22, 2020)
076fa7a  BUG: reintroduce ascending param error-catching to sort_values (Oct 22, 2020)
cdf63a3  BUG: reintroduce proper key func support to Series.sort_values (Oct 22, 2020)
dd30ec9  BUG: fix bug in key func support (Oct 22, 2020)
5cac5c7  TST: alter dupe order expect in series/methods/test_value_counts (Oct 22, 2020)
7a68a45  CLN: remove unused variable in Series.sort_values (Oct 23, 2020)
4c72dbf  BUG: set values in key func support in Series.sort_values (Oct 23, 2020)
3599591  DOC: add whatsnew (Oct 23, 2020)
3ee9b8b  Merge branch 'master' into stable-dupe-sort (Oct 23, 2020)
d63293c  BUG: add SparseArray sorting with key func support (Oct 23, 2020)
a5c8f65  TST: remove datetime-like xfails when testing indices with missing (Oct 23, 2020)
fc90ea9  TST: fix expect dupe sort order in doctests (Oct 23, 2020)
408abe0  BUG: fix bug in Series reconstruction after sorting (Oct 23, 2020)
5f53cfc  TST: fix expect dupe sort order in more doctests (Oct 23, 2020)
4370f27  BUG: support key func changing ndarray dtype (Oct 23, 2020)
6099344  REFACT: reuse keyed (Oct 23, 2020)
bfa2b28  TST: remove datetime-like xfails from invalid_na_position (Oct 23, 2020)
bc004ec  TST: fix expect dupe sort order in more doctests in base.py (Oct 23, 2020)
cd7111e  CLN: remove unnecessary imports in indexes/test_common.py (Oct 23, 2020)
5fbbf7d  CLN: run black (Oct 23, 2020)
95b9f25  Merge branch 'master' into stable-dupe-sort (Oct 23, 2020)
1dc9b89  incorporate both conflicting whatsnew (Oct 27, 2020)
2946e46  try removing if in ensure_key_mapped to find tests (Oct 27, 2020)
488596c  BUG: just apply key func to the whole Series in sort_values (Oct 27, 2020)
c8cd8cd  remove legacy try/except with default to quicksort (Oct 27, 2020)
ab71697  DOC: expand whatsnew (Oct 27, 2020)
05f60d5  CLN: remove unnecessary SparseArray import (Oct 27, 2020)
8669e89  restart tests (Oct 27, 2020)
00b454c  restart tests again (windows_np18 misbehaving) (Oct 27, 2020)
5d5f3d0  restart tests again (Oct 27, 2020)
3d7f47c  DOC: add information on sorting NaTs to whatsnew (Oct 27, 2020)
351a003  DOC: phrasing change in whatsnew (Oct 27, 2020)
8de6ac9  Merge branch 'master' into stable-dupe-sort (Oct 28, 2020)
18bb141  DOC: clarify whatsnew (Oct 28, 2020)
9b97302  CLN: clean up unnecessary newlines in sorting.py (Oct 28, 2020)
3a88ebe  DOC: clarify whatsnew some more (Oct 28, 2020)
d41789e  bring back na_position validation in Series.sort_values (Oct 28, 2020)
812f312  DOC: add to whatsnew (Oct 28, 2020)
e28ce4d  DOC: clarify NaTs sorting changes in whatsnew (Oct 28, 2020)
0719633  DOC: add other api changes to whatsnew; move doc there (Oct 28, 2020)
61ac60d  CLN: run black (Oct 28, 2020)
d495064  Merge branch 'master' into stable-dupe-sort (Oct 29, 2020)
36932cd  DOC: attempt fixing malformed link in whatsnew (Oct 29, 2020)
2156c64  Revert "DOC: attempt fixing malformed link in whatsnew" (Oct 29, 2020)
e6f5741  DOC: fix broken link in whatsnew (Oct 29, 2020)
c823043  restart tests (Oct 29, 2020)
cd66748  Merge branch 'master' into stable-dupe-sort (Oct 30, 2020)
37a6439  REFACT: clean up key if/else in Series.sort_values (Oct 31, 2020)
d09da99  Merge branch 'master' into stable-dupe-sort (Oct 31, 2020)
doc/source/whatsnew/v1.2.0.rst (1 change: 1 addition & 0 deletions)
@@ -531,6 +531,7 @@ Other
- Fixed metadata propagation in the :class:`Series.dt` and :class:`Series.str` accessors and :class:`DataFrame.duplicated` and :class:`DataFrame.stack` methods (:issue:`28283`)
- Bug in :meth:`Index.union` behaving differently depending on whether operand is a :class:`Index` or other list-like (:issue:`36384`)
- Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError``, from a bare ``Exception`` previously (:issue:`35744`)
- Sorting in descending order being unstable when using :meth:`Series.sort_values` and :meth:`Index.sort_values` for DateTime-like :class:`Index` subclasses (:issue:`35992`)
Review comment (Contributor): Can you expand this slightly to say that the sorted results will differ from prior versions? I think this should also move to the API changes section.

Reply (AlexKirko, Member Author, Oct 27, 2020): @jreback Expanded the message quite a bit, created the "Other API changes" section in whatsnew, and moved the doc there.
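To illustrate the behavior the whatsnew entry describes, here is a small sketch of my own (not part of the PR; it assumes a pandas build with this change applied): a stable descending sort keeps duplicate timestamps in their original relative order.

```python
import pandas as pd

# Two equal timestamps at positions 0 and 2
idx = pd.DatetimeIndex(["2020-01-02", "2020-01-01", "2020-01-02"])

ordered, indexer = idx.sort_values(return_indexer=True, ascending=False)

# With a stable descending sort, the tied "2020-01-02" entries keep their
# original relative order, so the indexer should be [0, 2, 1].
print(indexer)
```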


.. ---------------------------------------------------------------------------

pandas/core/algorithms.py (4 changes: 1 addition & 3 deletions)
@@ -1181,10 +1181,8 @@ def compute(self, method: str) -> Series:

# slow method
if n >= len(self.obj):
reverse_it = self.keep == "last" or method == "nlargest"
ascending = method == "nsmallest"
slc = np.s_[::-1] if reverse_it else np.s_[:]
return dropped[slc].sort_values(ascending=ascending).head(n)
return dropped.sort_values(ascending=ascending).head(n)
Note (AlexKirko, Member Author): Now that sort_values is stable, we don't need this extra reversal stuff for nlargest (it was used to stabilize the sort).
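A small sketch of my own (not from the PR) of the equivalence this slow path now relies on: with a stable sort underneath, sort_values(ascending=False).head(n) already keeps tied values in their original order, so no explicit reversal is needed.

```python
import pandas as pd

s = pd.Series([3, 3, 1], index=["a", "b", "c"])

# Tied values keep their original order ("a" before "b") in both cases.
print(s.nlargest(2))
print(s.sort_values(ascending=False).head(2))
```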


# fast method
arr, pandas_dtype = _ensure_data(dropped.values)
pandas/core/base.py (16 changes: 8 additions & 8 deletions)
@@ -933,9 +933,9 @@ def value_counts(
>>> index = pd.Index([3, 1, 2, 3, 4, np.nan])
>>> index.value_counts()
3.0 2
4.0 1
2.0 1
1.0 1
2.0 1
4.0 1
dtype: int64

With `normalize` set to `True`, returns the relative frequency by
@@ -944,9 +944,9 @@
>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0 0.4
4.0 0.2
2.0 0.2
1.0 0.2
2.0 0.2
4.0 0.2
dtype: float64

**bins**
@@ -957,8 +957,8 @@
number of half-open bins.

>>> s.value_counts(bins=3)
(2.0, 3.0] 2
(0.996, 2.0] 2
(2.0, 3.0] 2
(3.0, 4.0] 1
dtype: int64

@@ -968,10 +968,10 @@

>>> s.value_counts(dropna=False)
3.0 2
NaN 1
4.0 1
2.0 1
1.0 1
2.0 1
4.0 1
NaN 1
dtype: int64
"""
result = value_counts(
pandas/core/frame.py (4 changes: 2 additions & 2 deletions)
@@ -5535,8 +5535,8 @@ def value_counts(
>>> df.value_counts()
num_legs num_wings
4 0 2
6 0 1
2 2 1
6 0 1
dtype: int64

>>> df.value_counts(sort=False)
@@ -5556,8 +5556,8 @@
>>> df.value_counts(normalize=True)
num_legs num_wings
4 0 0.50
6 0 0.25
2 2 0.25
6 0 0.25
dtype: float64
"""
if subset is None:
pandas/core/generic.py (2 changes: 1 addition & 1 deletion)
@@ -10108,7 +10108,7 @@ def describe(
categorical
count 3
unique 3
top f
top d
freq 1

Excluding numeric columns from a ``DataFrame`` description.
pandas/core/indexes/base.py (4 changes: 1 addition & 3 deletions)
@@ -4525,9 +4525,7 @@ def sort_values(

# GH 35584. Sort missing values according to na_position kwarg
# ignore na_position for MultiIndex
if not isinstance(
self, (ABCMultiIndex, ABCDatetimeIndex, ABCTimedeltaIndex, ABCPeriodIndex)
):
if not isinstance(self, ABCMultiIndex):
Note (AlexKirko, Member Author): Support everything except MultiIndex (see the PR header).
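Because datetime-like indexes now go through nargsort as well, na_position is honored for them. A quick sketch of my own (not from the PR, assuming this change is applied):

```python
import pandas as pd

idx = pd.DatetimeIndex([pd.NaT, "2020-01-02", "2020-01-01"])

# NaT now sorts to the end by default; pass na_position="first" to get the
# old placement at the front.
print(idx.sort_values())
print(idx.sort_values(na_position="first"))
```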

_as = nargsort(
items=idx, ascending=ascending, na_position=na_position, key=key
)
pandas/core/series.py (58 changes: 19 additions & 39 deletions)
@@ -72,7 +72,7 @@
from pandas.core.aggregation import aggregate, transform
from pandas.core.arrays import ExtensionArray
from pandas.core.arrays.categorical import CategoricalAccessor
from pandas.core.arrays.sparse import SparseAccessor
from pandas.core.arrays.sparse import SparseAccessor, SparseArray
import pandas.core.common as com
from pandas.core.construction import (
array as pd_array,
@@ -92,7 +92,7 @@
from pandas.core.indexing import check_bool_indexer
from pandas.core.internals import SingleBlockManager
from pandas.core.shared_docs import _shared_docs
from pandas.core.sorting import ensure_key_mapped
from pandas.core.sorting import ensure_key_mapped, nargsort
from pandas.core.strings import StringMethods
from pandas.core.tools.datetimes import to_datetime

@@ -3274,29 +3274,6 @@ def sort_values(
"sort in-place you must create a copy"
)

def _try_kind_sort(arr):
arr = ensure_key_mapped(arr, key)
arr = getattr(arr, "_values", arr)

# easier to ask forgiveness than permission
try:
# if kind==mergesort, it can fail for object dtype
return arr.argsort(kind=kind)
except TypeError:
# stable sort not available for object dtype
# uses the argsort default quicksort
return arr.argsort(kind="quicksort")

arr = self._values
sorted_index = np.empty(len(self), dtype=np.int32)

bad = isna(arr)

good = ~bad
idx = ibase.default_index(len(self))

argsorted = _try_kind_sort(self[good])

if is_list_like(ascending):
if len(ascending) != 1:
raise ValueError(
@@ -3307,21 +3284,24 @@ def _try_kind_sort(arr):
if not is_bool(ascending):
raise ValueError("ascending must be boolean")

if not ascending:
argsorted = argsorted[::-1]

if na_position == "last":
n = good.sum()
sorted_index[:n] = idx[good][argsorted]
sorted_index[n:] = idx[bad]
elif na_position == "first":
n = bad.sum()
sorted_index[n:] = idx[good][argsorted]
sorted_index[:n] = idx[bad]
else:
raise ValueError(f"invalid na_position: {na_position}")
arr = self._values

result = self._constructor(arr[sorted_index], index=self.index[sorted_index])
if key:
if isinstance(arr, SparseArray):
# SparseArray doesn't store NaNs item-by-item, so pass everything
arr = ensure_key_mapped(self, key)._values
Note (AlexKirko, Member Author): SparseArray doesn't support item assignment, so we just pass everything. The key func automatically runs only on non-NaNs.

Reply (Contributor): Hmm, it would be nice to unify this.

Reply (AlexKirko, Member Author): I agree, but so far I have been unable to. I would appreciate any suggestions. The problem is that SparseArray doesn't support setting elements, so you can't do arr[good] = something. The workarounds I was able to Google are far too ugly to use.

Reply (Contributor): Why do you need to convert this to _values before the assignment? I think that would make this agnostic (and you could just select ._values when passing to nargsort).

Reply (AlexKirko, Member Author): Thanks! Since we got rid of processing non-missing and missing values separately, this is now fixed.
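For context, a minimal sketch of my own (not from the PR) of the limitation being discussed; it assumes SparseArray still rejects item assignment with a TypeError, as in recent pandas versions.

```python
import numpy as np
import pandas as pd

arr = pd.arrays.SparseArray([1.0, np.nan, 2.0])

# SparseArray stores data sparsely and does not implement __setitem__,
# so an in-place masked assignment like arr[mask] = keyed is not possible.
try:
    arr[0] = 5.0
except TypeError as err:
    print("item assignment not supported:", err)
```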

else:
good = ~isna(arr)
keyed = ensure_key_mapped(self[good], key)._values
arr = arr.astype(keyed.dtype)
arr[good] = keyed

# GH 35922. Make sorting stable by leveraging nargsort
sorted_index = nargsort(arr, kind, ascending, na_position)

result = self._constructor(
self._values[sorted_index], index=self.index[sorted_index]
)
Note (AlexKirko, Member Author): Since we change arr when we sort with a key function, use the original values here.
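A short illustration of my own (not from the PR) of why the original values matter: the key's output only determines the ordering, while the returned Series must still contain the unmodified values.

```python
import pandas as pd

s = pd.Series(["b", "B", "a"])

# Sorted by the lower-cased key, but the result keeps the original strings;
# the tie between "b" and "B" is resolved stably (original order preserved).
print(s.sort_values(key=lambda x: x.str.lower()))
```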


if ignore_index:
result.index = ibase.default_index(len(sorted_index))
pandas/core/sorting.py (16 changes: 14 additions & 2 deletions)
@@ -374,7 +374,16 @@ def nargsort(
if not ascending:
non_nans = non_nans[::-1]
non_nan_idx = non_nan_idx[::-1]
indexer = non_nan_idx[non_nans.argsort(kind=kind)]

# GH 35922. Move support for object sort here from Series.sort_values
Review comment (Contributor): This comment is not super useful; what would help is an explanation of why you need this try/except, e.g. why the code from L380 was moved here.

Reply (AlexKirko, Member Author): No longer relevant, since I removed the try/except.

try:
# if kind==mergesort, it can fail for object dtype
Review comment (Contributor): Why can this fail?

Reply (AlexKirko, Member Author, Oct 27, 2020): @jreback I moved this code from Series.sort_values, fearing that I didn't understand why it was necessary, but it doesn't seem to be. I've removed the try/except, tested sorting object-dtype Series with different kind settings, and run the test suite, and nothing goes wrong. I think we can safely drop it.

indexer = non_nan_idx[non_nans.argsort(kind=kind)]
except TypeError:
# stable sort not available for object dtype
# uses the argsort default quicksort
indexer = non_nan_idx[non_nans.argsort(kind="quicksort")]

if not ascending:
indexer = indexer[::-1]
# Finally, place the NaNs at the end or the beginning according to
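Related to the discussion above, a quick check of my own (not from the PR): on the NumPy versions pandas currently supports, a stable/mergesort argsort works fine on comparable object-dtype data, which is consistent with dropping the quicksort fallback.

```python
import numpy as np

arr = np.array(["b", "a", "b"], dtype=object)

# Mergesort (stable) on object dtype: ties keep their original order.
print(np.argsort(arr, kind="mergesort"))  # expected: [1 0 2]
```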
@@ -499,7 +508,10 @@ def ensure_key_mapped(values, key: Optional[Callable], levels=None):
result = Index(result)
else:
type_of_values = type(values)
result = type_of_values(result) # try to revert to original type otherwise
# GH 35922. Support sorting objects that raise when cast to their type
if not isinstance(result, type_of_values):
# try to revert to original type otherwise
result = type_of_values(result)
Note (AlexKirko, Member Author): Some of the supported types don't like being cast to themselves.

Review comment (Contributor): What does this fail on?

Reply (AlexKirko, Member Author): It seems other fixes made this safeguard unnecessary. Now that we pass a Series into the function, there are no more failures. If I remember correctly, it was failing on tz-aware datetimes: a key function would strip the timezone information, and then the cast would fail because it wouldn't know what to do with the timezones in values.

except TypeError:
raise TypeError(
f"User-provided `key` function returned an invalid type {type(result)} \
pandas/tests/arrays/boolean/test_function.py (4 changes: 2 additions & 2 deletions)
@@ -77,11 +77,11 @@ def test_ufunc_reduce_raises(values):
def test_value_counts_na():
arr = pd.array([True, False, pd.NA], dtype="boolean")
result = arr.value_counts(dropna=False)
expected = pd.Series([1, 1, 1], index=[True, False, pd.NA], dtype="Int64")
expected = pd.Series([1, 1, 1], index=[False, True, pd.NA], dtype="Int64")
tm.assert_series_equal(result, expected)

result = arr.value_counts(dropna=True)
expected = pd.Series([1, 1], index=[True, False], dtype="Int64")
expected = pd.Series([1, 1], index=[False, True], dtype="Int64")
tm.assert_series_equal(result, expected)


pandas/tests/arrays/string_/test_string.py (2 changes: 1 addition & 1 deletion)
@@ -301,7 +301,7 @@ def test_arrow_roundtrip():
def test_value_counts_na():
arr = pd.array(["a", "b", "a", pd.NA], dtype="string")
result = arr.value_counts(dropna=False)
expected = pd.Series([2, 1, 1], index=["a", "b", pd.NA], dtype="Int64")
expected = pd.Series([2, 1, 1], index=["a", pd.NA, "b"], dtype="Int64")
tm.assert_series_equal(result, expected)

result = arr.value_counts(dropna=True)
pandas/tests/base/test_value_counts.py (12 changes: 8 additions & 4 deletions)
@@ -153,16 +153,16 @@ def test_value_counts_bins(index_or_series):
# these return the same
res4 = s1.value_counts(bins=4, dropna=True)
intervals = IntervalIndex.from_breaks([0.997, 1.5, 2.0, 2.5, 3.0])
exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 3, 1, 2]))
exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 1, 3, 2]))
tm.assert_series_equal(res4, exp4)

res4 = s1.value_counts(bins=4, dropna=False)
intervals = IntervalIndex.from_breaks([0.997, 1.5, 2.0, 2.5, 3.0])
exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 3, 1, 2]))
exp4 = Series([2, 1, 1, 0], index=intervals.take([0, 1, 3, 2]))
tm.assert_series_equal(res4, exp4)

res4n = s1.value_counts(bins=4, normalize=True)
exp4n = Series([0.5, 0.25, 0.25, 0], index=intervals.take([0, 3, 1, 2]))
exp4n = Series([0.5, 0.25, 0.25, 0], index=intervals.take([0, 1, 3, 2]))
tm.assert_series_equal(res4n, exp4n)

# handle NA's properly
@@ -239,7 +239,11 @@ def test_value_counts_datetime64(index_or_series):
tm.assert_series_equal(result, expected_s)

result = s.value_counts(dropna=False)
expected_s[pd.NaT] = 1
# GH 35922. NaN-like now sorts to the beginning of duplicate counts
idx = pd.to_datetime(
["2010-01-01 00:00:00", "2008-09-09 00:00:00", pd.NaT, "2009-01-01 00:00:00"]
)
expected_s = Series([3, 2, 1, 1], index=idx)
tm.assert_series_equal(result, expected_s)

unique = s.unique()
pandas/tests/extension/base/methods.py (6 changes: 5 additions & 1 deletion)
@@ -125,7 +125,11 @@ def test_sort_values(self, data_for_sorting, ascending, sort_by_key):
result = ser.sort_values(ascending=ascending, key=sort_by_key)
expected = ser.iloc[[2, 0, 1]]
if not ascending:
expected = expected[::-1]
# GH 35922. Expect stable sort
if ser.nunique() == 2:
expected = ser.iloc[[0, 1, 2]]
else:
expected = ser.iloc[[1, 0, 2]]
Note (AlexKirko, Member Author, Oct 23, 2020): Some of our parametrizations in data_for_sorting contain duplicates and some don't. Since they are all structured the same way, we can detect the different expected sort orders by checking the number of unique values. If it looks ugly to you, we could split data_for_sorting for this test.

Reply (Contributor): OK, this is fine.
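For intuition, a tiny sketch of my own (not the fixture used by these tests, and assuming a pandas build with this PR's change) of what a stable descending sort means for duplicates:

```python
import pandas as pd

s = pd.Series([2, 1, 2])

# Using an explicitly stable kind: the two tied 2s keep positions 0 then 2,
# so the resulting index order is [0, 2, 1].
print(s.sort_values(ascending=False, kind="mergesort").index.tolist())
```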


self.assert_series_equal(result, expected)

pandas/tests/frame/methods/test_describe.py (4 changes: 2 additions & 2 deletions)
@@ -56,7 +56,7 @@ def test_describe_bool_frame(self):
)
result = df.describe()
expected = DataFrame(
{"bool_data_1": [4, 2, True, 2], "bool_data_2": [4, 2, True, 3]},
{"bool_data_1": [4, 2, False, 2], "bool_data_2": [4, 2, True, 3]},
index=["count", "unique", "top", "freq"],
)
tm.assert_frame_equal(result, expected)
@@ -79,7 +79,7 @@ def test_describe_bool_frame(self):
)
result = df.describe()
expected = DataFrame(
{"bool_data": [4, 2, True, 2], "str_data": [4, 3, "a", 2]},
{"bool_data": [4, 2, False, 2], "str_data": [4, 3, "a", 2]},
index=["count", "unique", "top", "freq"],
)
tm.assert_frame_equal(result, expected)
pandas/tests/frame/methods/test_value_counts.py (6 changes: 3 additions & 3 deletions)
@@ -48,7 +48,7 @@ def test_data_frame_value_counts_default():
expected = pd.Series(
data=[2, 1, 1],
index=pd.MultiIndex.from_arrays(
[(4, 6, 2), (0, 0, 2)], names=["num_legs", "num_wings"]
[(4, 2, 6), (0, 2, 0)], names=["num_legs", "num_wings"]
),
)

@@ -65,7 +65,7 @@ def test_data_frame_value_counts_normalize():
expected = pd.Series(
data=[0.5, 0.25, 0.25],
index=pd.MultiIndex.from_arrays(
[(4, 6, 2), (0, 0, 2)], names=["num_legs", "num_wings"]
[(4, 2, 6), (0, 2, 0)], names=["num_legs", "num_wings"]
),
)

@@ -78,7 +78,7 @@ def test_data_frame_value_counts_single_col_default():
result = df.value_counts()
expected = pd.Series(
data=[2, 1, 1],
index=pd.MultiIndex.from_arrays([[4, 6, 2]], names=["num_legs"]),
index=pd.MultiIndex.from_arrays([[4, 2, 6]], names=["num_legs"]),
)

tm.assert_series_equal(result, expected)
pandas/tests/indexes/datetimes/test_ops.py (6 changes: 3 additions & 3 deletions)
@@ -231,15 +231,15 @@ def test_order_without_freq(self, index_dates, expected_dates, tz_naive_fixture)
index = DatetimeIndex(index_dates, tz=tz, name="idx")
expected = DatetimeIndex(expected_dates, tz=tz, name="idx")

ordered = index.sort_values()
ordered = index.sort_values(na_position="first")
Note (AlexKirko, Member Author, Oct 23, 2020): NaTs used to be sorted to the beginning regardless of na_position (because their numerical representation is a large negative number); this was an ancillary bug. Now they sort to the end by default, and the user needs to supply na_position to get the previous behavior.

tm.assert_index_equal(ordered, expected)
assert ordered.freq is None

ordered = index.sort_values(ascending=False)
tm.assert_index_equal(ordered, expected[::-1])
assert ordered.freq is None

ordered, indexer = index.sort_values(return_indexer=True)
ordered, indexer = index.sort_values(return_indexer=True, na_position="first")
tm.assert_index_equal(ordered, expected)

exp = np.array([0, 4, 3, 1, 2])
@@ -249,7 +249,7 @@ def test_order_without_freq(self, index_dates, expected_dates, tz_naive_fixture)
ordered, indexer = index.sort_values(return_indexer=True, ascending=False)
tm.assert_index_equal(ordered, expected[::-1])

exp = np.array([2, 1, 3, 4, 0])
exp = np.array([2, 1, 3, 0, 4])
tm.assert_numpy_array_equal(indexer, exp, check_dtype=False)
assert ordered.freq is None
