
Inconsistent handling of nan-float64 in Series.isin() #22205

Closed
realead opened this issue Aug 5, 2018 · 2 comments · Fixed by #36266
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments


realead commented Aug 5, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

N = 10**6
s = pd.Series(np.full(N, np.nan, dtype=np.float64), copy=False)
s.isin([np.nan]).all()

results in True, however

s = pd.Series(np.full(N + 1, np.nan, dtype=np.float64), copy=False)
s.isin([np.nan]).all()

results in False; moreover, s.isin([np.nan]).any() also results in False.

Problem description

Obviously, the result should be the same in both cases.
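The discrepancy traces back to IEEE 754 NaN semantics: NaN compares unequal to everything, including itself, so any equality-based membership test misses it, while a hash-table look-up that special-cases NaN finds it. A minimal illustration, independent of pandas (using np.isin, the modern wrapper around np.in1d; contains_nan_aware is a hypothetical helper for illustration only):

```python
import numpy as np

# NaN is never equal to anything, not even itself (IEEE 754):
print(np.nan == np.nan)                                  # False

# np.isin/np.in1d test membership via ==, so NaN is never "in" anything:
print(np.isin(np.array([np.nan]), np.array([np.nan])))   # [False]

# A containment check that treats NaN as equal to NaN
# (as pandas' hash table does) finds it:
def contains_nan_aware(value, values):
    return any(v == value or (np.isnan(v) and np.isnan(value))
               for v in values)

print(contains_nan_aware(np.nan, [np.nan]))              # True
```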

Expected Output

IMO True is more consistent with the behavior of Pandas in other functions (such as pd.unique()).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.28.3
numpy: 1.13.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None


realead commented Aug 5, 2018

The problem at the core of this issue is that pandas' unique and numpy's unique have different notions of NaN in the float64 case:

>>> pd.unique([np.nan, np.nan])
array([ nan])
>>> np.unique([np.nan, np.nan])
array([ nan,  nan])

For N > 10^6, pandas' isin switches from its hash-table-based membership test to numpy's np.in1d:

    f = lambda x, y: htable.ismember_object(x, values)
    # GH16012
    # Ensure np.in1d doesn't get object types or it *may* throw an exception
    if len(comps) > 1000000 and not is_object_dtype(comps):
        f = lambda x, y: np.in1d(x, y)

Here is the whole code.
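One way to reconcile the two paths would be to patch the NaN matches back in after the equality-based test. The sketch below is a hypothetical wrapper for illustration only, not the actual fix merged in #36266 (it uses np.isin, the modern wrapper around np.in1d, and assumes float64 inputs):

```python
import numpy as np

def in1d_nan_aware(comps, values):
    """Membership test that, unlike plain np.isin/np.in1d, treats NaN
    as a member of a value set that contains NaN.

    Hypothetical helper for illustration -- not pandas' actual fix.
    """
    comps = np.asarray(comps, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    result = np.isin(comps, values)
    # np.isin relies on ==, and nan != nan, so add NaN matches manually:
    if np.isnan(values).any():
        result |= np.isnan(comps)
    return result

print(in1d_nan_aware([np.nan, 1.0, 2.0], [np.nan, 2.0]))
# [ True False  True]
```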


realead commented Aug 5, 2018

It seems that np.in1d is used because it is faster.

I don't understand the in1d algorithm well enough to tell what is going on, but here are my benchmarks and timings:

import numpy as np
import pandas.core.algorithms as algos
uses_pandas = np.arange(10**6)
uses_numpy = np.arange(10**6+1)
small = np.arange(2)
small2 = np.arange(10)
large = np.arange(10**7)
%timeit algos.isin(uses_pandas, small)
%timeit algos.isin(uses_numpy, small)
%timeit algos.isin(uses_pandas, small2)
%timeit algos.isin(uses_numpy, small2)
%timeit algos.isin(uses_pandas, large)
%timeit algos.isin(uses_numpy, large)

results in

            pandas    numpy
small       4.4 ms    2.4 ms
small2      5.5 ms    10.9 ms
large       277 ms    699 ms

To me it seems that for large inputs of look-up values, the hash-map approach could be the better one.

Actually, there is an optimization if the size of the second array is very small:

    if len(ar2) < 10 * len(ar1) ** 0.145 or contains_object:
        if invert:
            mask = np.ones(len(ar1), dtype=bool)
            for a in ar2:
                mask &= (ar1 != a)
        else:
            mask = np.zeros(len(ar1), dtype=bool)
            for a in ar2:
                mask |= (ar1 == a)
        return mask

which leads to O(n*m) behavior (n, m being the sizes of the input arrays) for small sizes m of the second array. This linear behavior can also be seen in the timings.

The running times of the algorithms are:

pandas: O(n + m), if the hash-table look-up can be assumed to be O(1)
np.in1d: O(n*log(n) + m*log(m) + (N+M)*log(N+M)), if there are N/M unique elements in the arrays

So it looks like np.in1d is only faster if there are fewer than about 5 elements in the second array.
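The complexity argument above can be checked with a toy benchmark. The sketch below uses a Python set as a stand-in for pandas' C hash table (so absolute timings will be far from pandas' numbers; only the scaling behavior is of interest), and np.isin as the sort-based path:

```python
import time
import numpy as np

def isin_hash(comps, values):
    # O(n + m): build a hash set of the values, then one expected-O(1)
    # look-up per element of comps.
    lookup = set(values.tolist())
    return np.fromiter((x in lookup for x in comps.tolist()),
                       dtype=bool, count=len(comps))

def isin_sort(comps, values):
    # Sort-based, like np.in1d's general path: roughly O((n+m) log(n+m)).
    return np.isin(comps, values)

comps = np.arange(10**6, dtype=np.int64)
values = np.arange(10**7, dtype=np.int64)

for f in (isin_hash, isin_sort):
    t0 = time.perf_counter()
    f(comps, values)
    print(f.__name__, time.perf_counter() - t0)
```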

@jbrockmendel jbrockmendel added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 6, 2018
@jreback jreback added this to the Contributions Welcome milestone Sep 4, 2020
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Sep 4, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.1.3 Sep 12, 2020