PERF: Faster SparseArray.getitem for boolean masks (#23122) #44955

bdrum · 2021-12-18T11:35:19Z

closes Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122
tests added / passed
Ensure all linting tests pass, see here for how to run them
The issue Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122 contains recommendation from @TomAugspurger to avoid densifying when boolean sparse array is used as index. I have tried few solutions including usage direct sp_index.indices in a case of fill_value is False and np.setdiff1d for opposite case.
Usage of direct indexing is working very fast, but calculating setdiff1d is working very slow, the good solution for both case (much faster than setdiff1d and bit slower than direct indices) using masks. Here is the plot of comparison old (densifying) implementation and new one (masks):

I understand that using mask will take more memory, but I don't think that it is crucial point here especially for bool8 type, but let's discuss.
We can combine both approaches and use direct indices when fill value is False and use mask in the opposite case.
One more question to discuss related with case when True value of Boolean sparse array lead to the fill value of original array. I assume that in this case fill value should be return.

I've got benchmark results via pytest-benchmark framework.

Code for benchmarking:

import pytest
from pandas.arrays import SparseArray
import numpy as np


class TestPerf:
    def setup_method(self, method):
        d = 0.8
        self.n = 1_000_000
        self.data = np.random.rand(self.n)
        genInd = lambda d: np.unique(
            np.random.randint(
                low=0, high=self.n - 1, size=int(self.n * d), dtype=np.int32
            )
        )
        self.inds = genInd(d)
        barrs = np.full(shape=self.n, fill_value=True, dtype=np.bool8)
        arr_dens_inds = genInd(d)

        self.data[self.inds] = np.nan
        barrs[arr_dens_inds] = False

        self.sp_arrs = SparseArray(self.data)
        self.sp_barrs_false = SparseArray(barrs, dtype=np.bool8, fill_value=False)
        self.sp_barrs_true = SparseArray(~barrs[d], dtype=np.bool8, fill_value=True)

    def perf_sparse_boolean_indexing_fv_true(self):
        return self.sp_arrs[self.sp_barrs_true]

    def perf_sparse_boolean_indexing_fv_false(self):
        return self.sp_arrs[self.sp_barrs_false]

    def test_perf_sparse_boolean_indexing_fv_true(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_true)

    def test_perf_sparse_boolean_indexing_fv_false(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_false)

pandas/core/arrays/sparse/array.py

jreback

can you add a whatsnew note for the perf improvement. is there a bug fix here too?
do we have asv's that cover these cases?

pandas/tests/extension/test_sparse.py

pandas/tests/arrays/sparse/test_array.py

bdrum · 2021-12-20T14:59:48Z

Yes, here also the bugfix, I needed to fix them because some test was failed. Seems it would be better to create separate feature.
It related to incorrect version of isna:

# before fix
s = pd.arrays.SparseArray([0,2,3,4,0,0,0],fill_value=0)
s.isna()
#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([1, 2, 3])

#after fix
s
#[0, 2, 3, 4, 0, 0, 0]
#Fill: 0
#IntIndex
#Indices: array([1, 2, 3])

s.isna()

#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([], dtype=int32)

About asv, no, sorry I've missed it in contribution guides (btw as well as result variable in tests...)
I will try to add asv benchmarks.

jreback · 2021-12-20T15:11:45Z

@bdrum fine on the bugfix, just add a whatsnew note describing (and you can reference this PR number)

bdrum · 2021-12-21T10:51:53Z

@bdrum fine on the bugfix, just add a whatsnew note describing (and you can reference this PR number)

@jreback Thanks!
Also I've made a changes again according to discussion with @jbrockmendel
And fixed bugs with indexes after applying unary operations.

What about increasing performance. I've added benchmark to asv, but looks strange

       before           after         ratio
     [47eb2198]       [1bb10c75]
     <master>         <dev23122>
+      22.3±0.5ms         44.0±4ms     1.97  sparse.GetItemMask.time_mask(True)
-     1.62±0.05ms         48.9±8μs     0.03  sparse.GetItemMask.time_mask(False)

Because when I ran the same code via timeit, I've got huge performance increasing for False fill_value and a bit (about 5-7%) when fill_value is True.

bdrum · 2021-12-21T19:45:46Z

Sorry, a bit confusion after merging conflicts...

Now I have 26 failed test in my branch as well as in current master:

=============================================================== short test summary info =============================================================== FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other6] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-DataFrame-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[numexpr-tzlocal()] - OSError: [... FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other6] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-DataFrame-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[python-tzlocal()] - OSError: [E... FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[3] - AssertionError: asser... FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[rhs1] - AssertionError: as... ====================================== 26 failed, 4 passed, 2 skipped, 151259 deselected, 29 warnings in 52.32s =======================================

jbrockmendel · 2021-12-21T22:47:41Z

Can you post the traceback for one of the arithmetic tests that failed?

bdrum · 2021-12-22T07:49:52Z

Can you post the traceback for one of the arithmetic tests that failed?

Sure. Here it is:

FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]

<?xml version="1.0" encoding="utf-8"?><testsuites><testsuite name="pytest" errors="0" failures="1" skipped="0" tests="1" time="0.974" timestamp="2021-12-22T10:48:13.621099" hostname="RUMLAB"><testcase classname="pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons" name="test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]" time="0.012"><failure message="OSError: [Errno 22] Invalid argument">self = &lt;pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons object at 0x000001F5FE6B3790&gt;
other = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
tz_naive_fixture = tzlocal()
box_with_array = &lt;class 'pandas.core.indexes.base.Index'&gt;

    @pytest.mark.parametrize(
        "other",
        [
            # GH#4968 invalid date/int comparisons
            list(range(10)),
            np.arange(10),
            np.arange(10).astype(np.float32),
            np.arange(10).astype(object),
            pd.timedelta_range("1ns", periods=10).array,
            np.array(pd.timedelta_range("1ns", periods=10)),
            list(pd.timedelta_range("1ns", periods=10)),
            pd.timedelta_range("1 Day", periods=10).astype(object),
            pd.period_range("1971-01-01", freq="D", periods=10).array,
            pd.period_range("1971-01-01", freq="D", periods=10).astype(object),
        ],
    )
    def test_dt64arr_cmp_arraylike_invalid(
        self, other, tz_naive_fixture, box_with_array
    ):
        tz = tz_naive_fixture
    
        dta = date_range("1970-01-01", freq="ns", periods=10, tz=tz)._data
        obj = tm.box_expected(dta, box_with_array)
&gt;       assert_invalid_comparison(obj, other, box_with_array)

pandas\tests\arithmetic\test_datetime64.py:122: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\tests\arithmetic\common.py:111: in assert_invalid_comparison
    result = xbox2(left == right)
pandas\core\ops\common.py:70: in new_method
    return method(self, other)
pandas\core\arraylike.py:40: in __eq__
    return self._cmp_method(other, operator.eq)
pandas\core\indexes\base.py:6591: in _cmp_method
    result = op(self._values, other)
pandas\core\ops\common.py:70: in new_method
    return method(self, other)
pandas\core\arraylike.py:40: in __eq__
    return self._cmp_method(other, operator.eq)
pandas\core\arrays\datetimelike.py:1024: in _cmp_method
    op, np.asarray(self.astype(object)), other
pandas\core\arrays\datetimes.py:666: in astype
    return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
pandas\core\arrays\datetimelike.py:413: in astype
    converted = ints_to_pydatetime(
pandas\_libs\tslibs\vectorized.pyx:177: in pandas._libs.tslibs.vectorized.ints_to_pydatetime
    local_value = tz_convert_utc_to_tzlocal(value, tz)
pandas\_libs\tslibs\tzconversion.pyx:383: in pandas._libs.tslibs.tzconversion.tz_convert_utc_to_tzlocal
    return _tz_convert_tzlocal_utc(utc_val, tz, to_utc=False, fold=fold)
pandas\_libs\tslibs\tzconversion.pyx:594: in pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc
    delta = _tzlocal_get_offset_components(val, tz, to_utc, fold)
pandas\_libs\tslibs\tzconversion.pyx:554: in pandas._libs.tslibs.tzconversion._tzlocal_get_offset_components
    dt = dt.astimezone(tz)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:144: in fromutc
    return f(self, dt)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:261: in fromutc
    _fold = self._fold_status(dt, dt_wall)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:196: in _fold_status
    if self.is_ambiguous(dt_wall):
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:254: in is_ambiguous
    naive_dst = self._naive_is_dst(dt)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = tzlocal(), dt = datetime.datetime(1969, 12, 31, 21, 0, tzinfo=tzlocal())

    def _naive_is_dst(self, dt):
        timestamp = _datetime_to_timestamp(dt)
&gt;       return time.localtime(timestamp + time.timezone).tm_isdst
E       OSError: [Errno 22] Invalid argument

..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:260: OSError</failure></testcase></testsuite></testsuites>

bdrum · 2021-12-22T08:18:56Z

asv report:

       before           after         ratio
     [2cc9ab3e]       [4f58c8a8]
     <master>         <dev23122>
-     1.69±0.01ms         46.4±1μs     0.03  sparse.GetItemMask.time_mask(False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

[ 50.00%] · For pandas commit 4f58c8a8 <dev23122> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· sparse.GetItemMask.time_mask                                                                                                           ok
[ 75.00%] ··· ============ ============
               fill_value
              ------------ ------------
                  True      24.3±0.2ms
                 False       46.4±1μs
                  nan       22.1±0.5ms
              ============ ============

[ 75.00%] · For pandas commit 2cc9ab3e <master> (round 2/2):
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· sparse.GetItemMask.time_mask                                                                                                           ok
               fill_value
              ------------ -------------
                  True       24.8±0.8ms
                 False      1.69±0.01ms
                  nan        23.6±0.5ms
              ============ =============

jreback · 2021-12-23T23:06:24Z

@bdrum how's this coming?

bdrum · 2021-12-24T06:25:45Z

@jreback Thanks!)

I think that is done right now. I resolved your changes request: modified whatsnew file and added benchmark to asv.
So do you have any comments or remarks?

jreback · 2021-12-24T16:34:37Z

some failing builds which are relevant benchmarks and i think the doc build

bdrum · 2021-12-24T17:38:54Z

Ah, you right that's my fault. Doc doesn't know what is pull, in this case as I can see I have to use :issue: and number of this pr. Sorry about that.

I just noticed that some pr was merged despite on the fact that not all check was done, so that is the reason why I wasn't very attentive about that. Now I will.

Let me fix it.

jreback · 2021-12-24T18:08:18Z

I just noticed that some pr was merged despite on the fact that not all check was done, so that is the reason why I wasn't very attentive about that. Now I will.

well i check the failures to see if they are unrelated but generally we need all green to avoid issues

)

bdrum · 2021-12-25T20:08:50Z

@jreback Now failed checks looks like not by my fault.

jreback · 2021-12-27T13:58:21Z

thanks @bdrum

phofl · 2021-12-29T14:16:40Z

This broke some things with greater comparisons. We should fix them before 1.4 is released or revert here.

s = pd.arrays.SparseArray([1, 2, 3, 4, np.nan, np.nan], fill_value=np.nan)
s[s>2]

returned

[3.0, 4.0]
Fill: nan
IntIndex
Indices: array([0, 1], dtype=int32)

which was correct. Now this returns

[1.0, 2.0, 3.0, 4.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

bdrum force-pushed the dev23122 branch from 8b18286 to 7973c8b Compare December 18, 2021 11:38

bdrum commented Dec 18, 2021

View reviewed changes

pandas/core/arrays/sparse/array.py Outdated Show resolved Hide resolved

bdrum changed the title ~~PERF: Faster SparseArray.__get_item__ for boolean masks (#23122)~~ PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) Dec 18, 2021

jbrockmendel reviewed Dec 18, 2021

View reviewed changes

pandas/core/arrays/sparse/array.py Show resolved Hide resolved

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Sparse Sparse Data Type labels Dec 19, 2021

jreback added this to the 1.4 milestone Dec 19, 2021

jreback requested changes Dec 19, 2021

View reviewed changes

pandas/tests/extension/test_sparse.py Outdated Show resolved Hide resolved

pandas/tests/arrays/sparse/test_array.py Outdated Show resolved Hide resolved

bdrum force-pushed the dev23122 branch 2 times, most recently from 136a880 to 73fd5c4 Compare December 21, 2021 11:17

bdrum marked this pull request as draft December 21, 2021 11:19

bdrum force-pushed the dev23122 branch from 73fd5c4 to 6566435 Compare December 21, 2021 11:23

bdrum marked this pull request as ready for review December 21, 2021 11:31

bdrum marked this pull request as draft December 21, 2021 18:45

bdrum added a commit to bdrum/pandas that referenced this pull request Dec 21, 2021

Remove deprecation of fill_value for SparseDType[bool](pandas-dev#44955)

4f58c8a

bdrum force-pushed the dev23122 branch from 59c0043 to 4f58c8a Compare December 21, 2021 19:42

bdrum mentioned this pull request Dec 21, 2021

BUG: Sparse[bool] type allows any fill_value #45004

Closed

3 tasks

bdrum marked this pull request as ready for review December 21, 2021 20:10

bdrum requested a review from jreback December 21, 2021 20:13

bdrum mentioned this pull request Dec 23, 2021

BUG: SparseArray doesn't recalculate indices after comparison with scalar #44956

Closed

3 tasks

bdrum force-pushed the dev23122 branch from 4f58c8a to f299229 Compare December 24, 2021 06:59

PERF: Faster SparseArray.__getitem__ for boolean masks (pandas-dev#23122

88978af

)

bdrum force-pushed the dev23122 branch from f299229 to 88978af Compare December 24, 2021 19:39

jreback approved these changes Dec 27, 2021

View reviewed changes

jreback merged commit 6ba47ba into pandas-dev:master Dec 27, 2021

phofl mentioned this pull request Dec 29, 2021

BUG: Regression in getitem for SparseArray with greater comparison #45110

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Faster SparseArray.getitem for boolean masks (#23122) #44955

PERF: Faster SparseArray.getitem for boolean masks (#23122) #44955

bdrum commented Dec 18, 2021

jreback left a comment

bdrum commented Dec 20, 2021 •

edited

Loading

jreback commented Dec 20, 2021

bdrum commented Dec 21, 2021

bdrum commented Dec 21, 2021

jbrockmendel commented Dec 21, 2021

bdrum commented Dec 22, 2021 •

edited

Loading

bdrum commented Dec 22, 2021

jreback commented Dec 23, 2021

bdrum commented Dec 24, 2021 •

edited

Loading

jreback commented Dec 24, 2021

bdrum commented Dec 24, 2021 •

edited

Loading

jreback commented Dec 24, 2021

bdrum commented Dec 25, 2021

jreback commented Dec 27, 2021

phofl commented Dec 29, 2021

PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) #44955

PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) #44955

Conversation

bdrum commented Dec 18, 2021

jreback left a comment

Choose a reason for hiding this comment

bdrum commented Dec 20, 2021 • edited Loading

jreback commented Dec 20, 2021

bdrum commented Dec 21, 2021

bdrum commented Dec 21, 2021

jbrockmendel commented Dec 21, 2021

bdrum commented Dec 22, 2021 • edited Loading

bdrum commented Dec 22, 2021

jreback commented Dec 23, 2021

bdrum commented Dec 24, 2021 • edited Loading

jreback commented Dec 24, 2021

bdrum commented Dec 24, 2021 • edited Loading

jreback commented Dec 24, 2021

bdrum commented Dec 25, 2021

jreback commented Dec 27, 2021

phofl commented Dec 29, 2021

PERF: Faster SparseArray.getitem for boolean masks (#23122) #44955

PERF: Faster SparseArray.getitem for boolean masks (#23122) #44955

bdrum commented Dec 20, 2021 •

edited

Loading

bdrum commented Dec 22, 2021 •

edited

Loading

bdrum commented Dec 24, 2021 •

edited

Loading

bdrum commented Dec 24, 2021 •

edited

Loading