Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix isin with nans and large arrays #36266

Merged
merged 7 commits into from
Sep 19, 2020
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.3.rst
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ Bug fixes
- Bug in :meth:`Series.str.startswith` and :meth:`Series.str.endswith` with ``category`` dtype not propagating ``na`` parameter (:issue:`36241`)
- Bug in :class:`Series` constructor where integer overflow would occur for sufficiently large scalar inputs when an index was provided (:issue:`36291`)
- Bug in :meth:`DataFrame.stack` raising a ``ValueError`` when stacking :class:`MultiIndex` columns based on position when the levels had duplicate names (:issue:`36353`)
- Bug in :meth:`Series.isin` and :meth:`DataFrame.isin` when using ``NaN`` and a row length above 1,000,000 (:issue:`22205`)

.. ---------------------------------------------------------------------------

Expand Down
7 changes: 6 additions & 1 deletion pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -440,7 +440,12 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
# GH16012
# Ensure np.in1d doesn't get object types or it *may* throw an exception
if len(comps) > 1_000_000 and not is_object_dtype(comps):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry a little bit too late. The whole point of doing this for len(comps) > 1_000_000, was that numpy was deemed to be faster (which is probably no loner the case btw, see #22205 (comment)), adding any, isnan, logical_or on top (with all the cache misses and temporary objects) will make this branch much slower. So probably it is best just to drop the whole branch and always keep f = htable.ismember_object (unless it is is_integer_dtype of cause).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u run the asvs and check here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I have opened RP #36611 with my suggestion and some benchmarks, which show that numpy's in1d is only faster when here are very few unique values.

f = np.in1d
# If the the values include nan we need to check for nan explicitly
# since np.nan it not equal to np.nan
if np.isnan(values).any():
f = lambda c, v: np.logical_or(np.in1d(c, v), np.isnan(c))
else:
f = np.in1d
elif is_integer_dtype(comps):
try:
values = values.astype("int64", copy=False)
Expand Down
18 changes: 17 additions & 1 deletion pandas/tests/test_algos.py
Original file line number Diff line number Diff line change
Expand Up @@ -801,7 +801,6 @@ def test_i8(self):
tm.assert_numpy_array_equal(result, expected)

def test_large(self):

s = pd.date_range("20000101", periods=2000000, freq="s").values
result = algos.isin(s, s[0:2])
expected = np.zeros(len(s), dtype=bool)
Expand Down Expand Up @@ -841,6 +840,23 @@ def test_same_nan_is_in(self):
result = algos.isin(comps, values)
tm.assert_numpy_array_equal(expected, result)

def test_same_nan_is_in_large(self):
Hanspagh marked this conversation as resolved.
Show resolved Hide resolved
# issue:`22205`
Hanspagh marked this conversation as resolved.
Show resolved Hide resolved
s = np.tile(1.0, 1_000_001)
Hanspagh marked this conversation as resolved.
Show resolved Hide resolved
s[0] = np.nan
result = algos.isin(s, [np.nan, 1])
expected = np.ones(len(s), dtype=bool)
tm.assert_numpy_array_equal(result, expected)

def test_same_nan_is_in_large_series(self):
# issue:`#25395`
Hanspagh marked this conversation as resolved.
Show resolved Hide resolved
s = np.tile(1.0, 1_000_001)
series = pd.Series(s)
s[0] = np.nan
result = series.isin([np.nan, 1])
expected = pd.Series(np.ones(len(s), dtype=bool))
tm.assert_series_equal(result, expected)

def test_same_object_is_in(self):
# GH 22160
# there could be special treatment for nans
Expand Down