BUG: fix isin with nans and large arrays #36266

Hanspagh · 2020-09-10T11:40:06Z

Adds an np.isnan check when nan is among the values passed to isin and the array is large enough to trigger the np.in1d path, since np.in1d on its own never matches nan.
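A minimal sketch of the idea described above, not the code from this PR: `isin_nan_aware` is a hypothetical helper name, and `np.isin` is used here in place of `np.in1d`. It assumes float64 inputs and shows that plain NumPy membership never matches nan, while OR-ing in `np.isnan` does.

```python
import numpy as np


def isin_nan_aware(comps, values):
    """Membership test that lets NaN in `values` match NaN in `comps`.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    comps = np.asarray(comps, dtype="float64")
    values = np.asarray(values, dtype="float64")
    # Plain NumPy membership: NaN never compares equal to NaN here.
    mask = np.isin(comps, values)
    if np.isnan(values).any():
        # NaN was requested, so mark the NaN entries of comps as matches too.
        mask = np.logical_or(mask, np.isnan(comps))
    return mask


comps = np.array([1.0, np.nan, 3.0])
print(np.isin(comps, [np.nan]))         # the NaN entry is not matched
print(isin_nan_aware(comps, [np.nan]))  # the NaN entry is matched
print(isin_nan_aware(comps, [3.0]))     # no NaN requested, NaN stays unmatched
```

The extra `np.isnan` pass only runs when nan is actually among the requested values, so inputs without nan take the same path as before.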

closes Inconsistent handling of nan-float64 in Series.isin() #22205
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Hanspagh · 2020-09-10T13:28:33Z

Seems tests are failing, but it seems unrelated to this?

pandas/core/algorithms.py

dsaxton · 2020-09-11T02:05:16Z

> Seems tests are failing, but it seems unrelated to this?

Yes, those are unrelated

dsaxton · 2020-09-11T02:09:00Z

Thanks @Hanspagh, can you add a release note for 1.1.3?

Hanspagh · 2020-09-11T07:15:45Z

Changes made as requested, let me know if there is anything else needed

pandas/tests/test_algos.py

doc/source/whatsnew/v1.1.3.rst

jreback

pls also merge master, ping when green.

pandas/tests/test_algos.py

Hanspagh · 2020-09-14T07:17:47Z

Updated as requested

jreback

minor comments, ping on green.

doc/source/whatsnew/v1.1.3.rst

pandas/tests/test_algos.py

Hanspagh · 2020-09-16T11:42:34Z

Fixed

doc/source/whatsnew/v1.1.3.rst

Hanspagh · 2020-09-17T07:26:34Z

Done

simonjayhawkins

Thanks @Hanspagh minor nit re consistency of issue number comments

pandas/tests/test_algos.py

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

Hanspagh · 2020-09-17T17:11:11Z

I can do a rebase if needed?

simonjayhawkins · 2020-09-17T17:43:17Z

> I can do a rebase if needed?

git pull will normally get the changes from the commit suggestion locally.

to update the PR with the latest changes on master see https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#updating-your-pull-request

jreback · 2020-09-19T02:14:29Z

thanks @Hanspagh

lumberbot-app · 2020-09-19T02:14:58Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.1.x
$ git pull

Cherry-pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 aed64e85eb17edb0e55013868b1aa4e44e977a36

You will likely have some merge/cherry-pick conflicts here; fix them and commit:

$ git commit -am 'Backport PR #36266: BUG: fix isin with nans and large arrays'

Push to a named branch:

$ git push YOURFORK 1.1.x:auto-backport-of-pr-36266-on-1.1.x

Create a PR against branch 1.1.x, I would have named this PR:

"Backport PR #36266 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

simonjayhawkins · 2020-09-19T09:53:26Z

#36385 (comment) could have been responsible for not being able to auto backport

Co-authored-by: Hans <hanspagh@gmail.com>

realead · 2020-09-21T07:25:46Z

pandas/core/algorithms.py

@@ -440,7 +440,12 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
    # GH16012
    # Ensure np.in1d doesn't get object types or it *may* throw an exception
    if len(comps) > 1_000_000 and not is_object_dtype(comps):


Sorry, I'm a little late to this. The whole point of taking this branch for len(comps) > 1_000_000 was that numpy was deemed to be faster (which is probably no longer the case, by the way, see #22205 (comment)). Adding any, isnan, and logical_or on top (with all the cache misses and temporary objects) will make this branch much slower. So it is probably best to drop the whole branch and always keep f = htable.ismember_object (unless it is is_integer_dtype, of course).

jreback · 2020-09-21

can u run the asvs and check here?

realead · 2020-09-24

@jreback I have opened PR #36611 with my suggestion and some benchmarks, which show that numpy's in1d is only faster when there are very few unique values.
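To make the cost argument in this thread concrete, here is a rough timing sketch. It is not the asv benchmark from PR #36611 and does not exercise the pandas hashtable path; it only compares a single `np.isin` pass (used here in place of `np.in1d`) with the same pass followed by the any/isnan/logical_or steps. The array sizes and the names `plain` and `nan_aware` are assumptions chosen for illustration, and the timings depend on the machine, so no numbers are claimed.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # kept small for a quick run; the pandas branch applies above 1_000_000
comps = rng.random(n)
comps[::10] = np.nan  # sprinkle NaNs into the array being searched
values = np.append(rng.random(1_000), np.nan)


def plain(comps, values):
    # Single membership pass; NaN is never matched.
    return np.isin(comps, values)


def nan_aware(comps, values):
    # Membership pass plus any/isnan/logical_or, each a further pass
    # over the data that allocates a temporary boolean array.
    mask = np.isin(comps, values)
    if np.isnan(values).any():
        mask = np.logical_or(mask, np.isnan(comps))
    return mask


t_plain = timeit.timeit(lambda: plain(comps, values), number=5)
t_nan = timeit.timeit(lambda: nan_aware(comps, values), number=5)
print(f"plain:     {t_plain:.4f}s")
print(f"nan-aware: {t_nan:.4f}s")
```

Whether the extra passes matter in practice depends on the number of unique values and on how the hashtable alternative performs, which is what the benchmarks referenced above are meant to settle.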

dsaxton reviewed Sep 11, 2020

pandas/core/algorithms.py (outdated)

dsaxton added the Algos, Bug, and Missing-data labels Sep 11, 2020

dsaxton changed the title ~~fix isin with nans and large arrays~~ BUG: fix isin with nans and large arrays Sep 11, 2020

dsaxton reviewed Sep 11, 2020

pandas/tests/test_algos.py

dsaxton reviewed Sep 11, 2020

doc/source/whatsnew/v1.1.3.rst (outdated)

jreback requested changes Sep 12, 2020

pandas/tests/test_algos.py

jreback added this to the 1.1.3 milestone Sep 12, 2020

Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch 2 times, most recently from 0ac189c to b31f4e2 Compare September 14, 2020 07:17

jreback requested changes Sep 15, 2020

doc/source/whatsnew/v1.1.3.rst (outdated)

pandas/tests/test_algos.py (outdated)

pandas/tests/test_algos.py (outdated)

Hanspagh added 4 commits September 16, 2020 13:41

fix isin with nans and large arrays

656b6b4

use .any() instead of any() + whatsnew entry

246cab5

test series.isin

25d48c0

update whats new

859cbf6

Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch from 49aeb2d to 7f3d217 Compare September 16, 2020 11:41

dsaxton reviewed Sep 16, 2020

doc/source/whatsnew/v1.1.3.rst (outdated)

docs

3679c14

Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch from 7f3d217 to 3679c14 Compare September 17, 2020 07:26

simonjayhawkins reviewed Sep 17, 2020

pandas/tests/test_algos.py (outdated)

pandas/tests/test_algos.py (outdated)

Hanspagh and others added 2 commits September 17, 2020 19:10

Update pandas/tests/test_algos.py

4e4359b

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

Update pandas/tests/test_algos.py

53ab240

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

jreback approved these changes Sep 19, 2020

jreback merged commit aed64e8 into pandas-dev:master Sep 19, 2020

lumberbot-app bot added the Still Needs Manual Backport label Sep 19, 2020

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Sep 19, 2020

Backport PR pandas-dev#36266:: BUG: fix isin with nans and large arrays

fc4df5d

simonjayhawkins mentioned this pull request Sep 19, 2020

Backport PR #36266:: BUG: fix isin with nans and large arrays #36474

Merged

simonjayhawkins removed the Still Needs Manual Backport label Sep 19, 2020

simonjayhawkins added a commit that referenced this pull request Sep 19, 2020

Backport PR #36266:: BUG: fix isin with nans and large arrays (#36474)

1aba960

Co-authored-by: Hans <hanspagh@gmail.com>

realead reviewed Sep 21, 2020

asishm mentioned this pull request Oct 13, 2020

BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype, and index_col are provided, and file has >1M rows #37094

Closed

3 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: fix isin with nans and large arrays (pandas-dev#36266)

8523c68
