
Inconsistent handling of nan-float64 in Series.isin() #22205

Closed
realead opened this issue Aug 5, 2018 · 2 comments · Fixed by #36266
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments


realead commented Aug 5, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

N = 10**6
s = pd.Series(np.full(N, np.nan, dtype=np.float64), copy=False)
s.isin([np.nan]).all()

results in True, however

s = pd.Series(np.full(N + 1, np.nan, dtype=np.float64), copy=False)
s.isin([np.nan]).all()

results in False; moreover, s.isin([np.nan]).any() also results in False.

Problem description

Obviously, the result should be the same in both cases.
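The discrepancy traces back to IEEE 754 NaN semantics: NaN compares unequal to everything, including itself, so any equality-based membership test misses it, while a hash-table look-up that special-cases NaN finds it. A minimal illustration, independent of pandas (using np.isin, the modern wrapper around np.in1d; contains_nan_aware is a hypothetical helper for illustration only):

```python
import numpy as np

# NaN is never equal to anything, not even itself (IEEE 754):
print(np.nan == np.nan)                                  # False

# np.isin/np.in1d test membership via ==, so NaN is never "in" anything:
print(np.isin(np.array([np.nan]), np.array([np.nan])))   # [False]

# A containment check that treats NaN as equal to NaN
# (as pandas' hash table does) finds it:
def contains_nan_aware(value, values):
    return any(v == value or (np.isnan(v) and np.isnan(value))
               for v in values)

print(contains_nan_aware(np.nan, [np.nan]))              # True
```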

Expected Output

IMO True is more consistent with the behavior of Pandas in other functions (such as pd.unique()).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.28.3
numpy: 1.13.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None


realead commented Aug 5, 2018

The problem at the core of this issue is that pandas' unique and numpy's unique have different notions of NaN in the float64 case:

>>> pd.unique([np.nan, np.nan])
array([ nan])
>>> np.unique([np.nan, np.nan])
array([ nan,  nan])

For N > 10^6, pandas' isin switches from its hash-table-based membership test to numpy's np.in1d:

    f = lambda x, y: htable.ismember_object(x, values)
    # GH16012
    # Ensure np.in1d doesn't get object types or it *may* throw an exception
    if len(comps) > 1000000 and not is_object_dtype(comps):
        f = lambda x, y: np.in1d(x, y)

Here is the whole code.
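One way to reconcile the two paths would be to patch the NaN matches back in after the equality-based test. The sketch below is a hypothetical wrapper for illustration only, not the actual fix merged in #36266 (it uses np.isin, the modern wrapper around np.in1d, and assumes float64 inputs):

```python
import numpy as np

def in1d_nan_aware(comps, values):
    """Membership test that, unlike plain np.isin/np.in1d, treats NaN
    as a member of a value set that contains NaN.

    Hypothetical helper for illustration -- not pandas' actual fix.
    """
    comps = np.asarray(comps, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    result = np.isin(comps, values)
    # np.isin relies on ==, and nan != nan, so add NaN matches manually:
    if np.isnan(values).any():
        result |= np.isnan(comps)
    return result

print(in1d_nan_aware([np.nan, 1.0, 2.0], [np.nan, 2.0]))
# [ True False  True]
```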


realead commented Aug 5, 2018

It seems that np.in1d is used because it is faster.

I don't understand the in1d algorithm well enough to tell what is going on, but here are my benchmarks and timings:

import numpy as np
import pandas.core.algorithms as algos
uses_pandas = np.arange(10**6)
uses_numpy = np.arange(10**6+1)
small = np.arange(2)
small2 = np.arange(10)
large = np.arange(10**7)
%timeit algos.isin(uses_pandas, small)
%timeit algos.isin(uses_numpy, small)
%timeit algos.isin(uses_pandas, small2)
%timeit algos.isin(uses_numpy, small2)
%timeit algos.isin(uses_pandas, large)
%timeit algos.isin(uses_numpy, large)

results in

            pandas    numpy
small       4.4 ms    2.4 ms
small2      5.5 ms    10.9 ms
large       277 ms    699 ms

To me it seems that for large inputs of look-up values, the hash-map approach could be the better one.

Actually, there is an optimization if the size of the second array is very small:

    if len(ar2) < 10 * len(ar1) ** 0.145 or contains_object:
        if invert:
            mask = np.ones(len(ar1), dtype=bool)
            for a in ar2:
                mask &= (ar1 != a)
        else:
            mask = np.zeros(len(ar1), dtype=bool)
            for a in ar2:
                mask |= (ar1 == a)
        return mask

which leads to O(n*m) behavior (n, m being the sizes of the input arrays) for small sizes m of the second array. This linear behavior can also be seen in the timings.

The running times of the algorithms are:

pandas: O(n + m), if the hash-table look-up can be assumed to be O(1)
np.in1d: O(n*log(n) + m*log(m) + (N+M)*log(N+M)), if there are N/M unique elements in the arrays

So it looks like np.in1d is only faster if there are fewer than about 5 elements in the second array.
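The complexity argument above can be checked with a toy benchmark. The sketch below uses a Python set as a stand-in for pandas' C hash table (so absolute timings will be far from pandas' numbers; only the scaling behavior is of interest), and np.isin as the sort-based path:

```python
import time
import numpy as np

def isin_hash(comps, values):
    # O(n + m): build a hash set of the values, then one expected-O(1)
    # look-up per element of comps.
    lookup = set(values.tolist())
    return np.fromiter((x in lookup for x in comps.tolist()),
                       dtype=bool, count=len(comps))

def isin_sort(comps, values):
    # Sort-based, like np.in1d's general path: roughly O((n+m) log(n+m)).
    return np.isin(comps, values)

comps = np.arange(10**6, dtype=np.int64)
values = np.arange(10**7, dtype=np.int64)

for f in (isin_hash, isin_sort):
    t0 = time.perf_counter()
    f(comps, values)
    print(f.__name__, time.perf_counter() - t0)
```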

@jbrockmendel jbrockmendel added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 6, 2018
@jreback jreback added this to the Contributions Welcome milestone Sep 4, 2020
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Sep 4, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.1.3 Sep 12, 2020