Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix isin with nans and large arrays #36266

Merged
merged 7 commits into from
Sep 19, 2020

Conversation

Hanspagh
Copy link
Contributor

@Hanspagh Hanspagh commented Sep 10, 2020

Does a np.isnan if nan is given to isin and we have a large enough array to trigger the np.in1d path

@Hanspagh
Copy link
Contributor Author

Seems tests are failing, but it seems unrelated to this?

pandas/core/algorithms.py Outdated Show resolved Hide resolved
@dsaxton
Copy link
Member

dsaxton commented Sep 11, 2020

Seems tests are failing, but it seems unrelated to this?

Yes, those are unrelated

@dsaxton dsaxton added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 11, 2020
@dsaxton dsaxton changed the title fix isin with nans and large arrays BUG: fix isin with nans and large arrays Sep 11, 2020
@dsaxton
Copy link
Member

dsaxton commented Sep 11, 2020

Thanks @Hanspagh, can you add a release note for 1.1.3?

@Hanspagh
Copy link
Contributor Author

Hanspagh commented Sep 11, 2020

Changes as requested, let mere know if there is anything else needed

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls also merge master, ping when green.

pandas/tests/test_algos.py Show resolved Hide resolved
@jreback jreback added this to the 1.1.3 milestone Sep 12, 2020
@Hanspagh Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch 2 times, most recently from 0ac189c to b31f4e2 Compare September 14, 2020 07:17
@Hanspagh
Copy link
Contributor Author

Updated as requested

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, ping on green.

doc/source/whatsnew/v1.1.3.rst Outdated Show resolved Hide resolved
pandas/tests/test_algos.py Outdated Show resolved Hide resolved
pandas/tests/test_algos.py Outdated Show resolved Hide resolved
@Hanspagh Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch from 49aeb2d to 7f3d217 Compare September 16, 2020 11:41
@Hanspagh
Copy link
Contributor Author

Fixed

@Hanspagh Hanspagh force-pushed the fix-isin-with-nan-and-large-array branch from 7f3d217 to 3679c14 Compare September 17, 2020 07:26
@Hanspagh
Copy link
Contributor Author

Done

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Hanspagh minor nit re consistency of issue number comments

pandas/tests/test_algos.py Outdated Show resolved Hide resolved
pandas/tests/test_algos.py Outdated Show resolved Hide resolved
Hanspagh and others added 2 commits September 17, 2020 19:10
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
@Hanspagh
Copy link
Contributor Author

I can do a rebase if needed?

@simonjayhawkins
Copy link
Member

I can do a rebase if needed?

git pull will normally get the changes from the commit suggestion locally.

to update the PR with the latest changes on master see https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#updating-your-pull-request

@jreback jreback merged commit aed64e8 into pandas-dev:master Sep 19, 2020
@jreback
Copy link
Contributor

jreback commented Sep 19, 2020

thanks @Hanspagh

@lumberbot-app
Copy link

lumberbot-app bot commented Sep 19, 2020

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
$ git checkout 1.1.x
$ git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
$ git cherry-pick -m1 aed64e85eb17edb0e55013868b1aa4e44e977a36
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
$ git commit -am 'Backport PR #36266: BUG: fix isin with nans and large arrays'
  1. Push to a named branch :
git push YOURFORK 1.1.x:auto-backport-of-pr-36266-on-1.1.x
  1. Create a PR against branch 1.1.x, I would have named this PR:

"Backport PR #36266 on branch 1.1.x"

And apply the correct labels and milestones.

Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instruction are inaccurate, feel free to suggest an improvement.

@simonjayhawkins
Copy link
Member

#36385 (comment) could have been responsible for not being able to auto backport

@@ -440,7 +440,12 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
# GH16012
# Ensure np.in1d doesn't get object types or it *may* throw an exception
if len(comps) > 1_000_000 and not is_object_dtype(comps):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry a little bit too late. The whole point of doing this for len(comps) > 1_000_000, was that numpy was deemed to be faster (which is probably no loner the case btw, see #22205 (comment)), adding any, isnan, logical_or on top (with all the cache misses and temporary objects) will make this branch much slower. So probably it is best just to drop the whole branch and always keep f = htable.ismember_object (unless it is is_integer_dtype of cause).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u run the asvs and check here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I have opened RP #36611 with my suggestion and some benchmarks, which show that numpy's in1d is only faster when here are very few unique values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent handling of nan-float64 in Series.isin()
5 participants