Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) #44955

Merged
merged 1 commit into from
Dec 27, 2021

Conversation

bdrum
Copy link
Contributor

@bdrum bdrum commented Dec 18, 2021

  • closes Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • The issue Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122 contains recommendation from @TomAugspurger to avoid densifying when boolean sparse array is used as index. I have tried few solutions including usage direct sp_index.indices in a case of fill_value is False and np.setdiff1d for opposite case.
    Usage of direct indexing is working very fast, but calculating setdiff1d is working very slow, the good solution for both case (much faster than setdiff1d and bit slower than direct indices) using masks. Here is the plot of comparison old (densifying) implementation and new one (masks):

    I understand that using mask will take more memory, but I don't think that it is crucial point here especially for bool8 type, but let's discuss.
    We can combine both approaches and use direct indices when fill value is False and use mask in the opposite case.
    One more question to discuss related with case when True value of Boolean sparse array lead to the fill value of original array. I assume that in this case fill value should be return.

I've got benchmark results via pytest-benchmark framework.

Code for benchmarking:

import pytest
from pandas.arrays import SparseArray
import numpy as np


class TestPerf:
    def setup_method(self, method):
        d = 0.8
        self.n = 1_000_000
        self.data = np.random.rand(self.n)
        genInd = lambda d: np.unique(
            np.random.randint(
                low=0, high=self.n - 1, size=int(self.n * d), dtype=np.int32
            )
        )
        self.inds = genInd(d)
        barrs = np.full(shape=self.n, fill_value=True, dtype=np.bool8)
        arr_dens_inds = genInd(d)

        self.data[self.inds] = np.nan
        barrs[arr_dens_inds] = False

        self.sp_arrs = SparseArray(self.data)
        self.sp_barrs_false = SparseArray(barrs, dtype=np.bool8, fill_value=False)
        self.sp_barrs_true = SparseArray(~barrs[d], dtype=np.bool8, fill_value=True)

    def perf_sparse_boolean_indexing_fv_true(self):
        return self.sp_arrs[self.sp_barrs_true]

    def perf_sparse_boolean_indexing_fv_false(self):
        return self.sp_arrs[self.sp_barrs_false]

    def test_perf_sparse_boolean_indexing_fv_true(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_true)

    def test_perf_sparse_boolean_indexing_fv_false(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_false)

@bdrum bdrum changed the title PERF: Faster SparseArray.__get_item__ for boolean masks (#23122) PERF: Faster SparseArray.__getitem__ for boolean masks (#23122) Dec 18, 2021
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Sparse Sparse Data Type labels Dec 19, 2021
@jreback jreback added this to the 1.4 milestone Dec 19, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note for the perf improvement. is there a bug fix here too?
do we have asv's that cover these cases?

pandas/tests/extension/test_sparse.py Outdated Show resolved Hide resolved
pandas/tests/arrays/sparse/test_array.py Outdated Show resolved Hide resolved
@bdrum
Copy link
Contributor Author

bdrum commented Dec 20, 2021

Yes, here also the bugfix, I needed to fix them because some test was failed. Seems it would be better to create separate feature.
It related to incorrect version of isna:

# before fix
s = pd.arrays.SparseArray([0,2,3,4,0,0,0],fill_value=0)
s.isna()
#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([1, 2, 3])

#after fix
s
#[0, 2, 3, 4, 0, 0, 0]
#Fill: 0
#IntIndex
#Indices: array([1, 2, 3])

s.isna()

#[False, False, False, False, False, False, False]
#Fill: False
#IntIndex
#Indices: array([], dtype=int32)

About asv, no, sorry I've missed it in contribution guides (btw as well as result variable in tests...)
I will try to add asv benchmarks.

@jreback
Copy link
Contributor

jreback commented Dec 20, 2021

@bdrum fine on the bugfix, just add a whatsnew note describing (and you can reference this PR number)

@bdrum
Copy link
Contributor Author

bdrum commented Dec 21, 2021

@bdrum fine on the bugfix, just add a whatsnew note describing (and you can reference this PR number)

@jreback Thanks!
Also I've made a changes again according to discussion with @jbrockmendel
And fixed bugs with indexes after applying unary operations.

What about increasing performance. I've added benchmark to asv, but looks strange

       before           after         ratio
     [47eb2198]       [1bb10c75]
     <master>         <dev23122>
+      22.3±0.5ms         44.0±4ms     1.97  sparse.GetItemMask.time_mask(True)
-     1.62±0.05ms         48.9±8μs     0.03  sparse.GetItemMask.time_mask(False)

Because when I ran the same code via timeit, I've got huge performance increasing for False fill_value and a bit (about 5-7%) when fill_value is True.

@bdrum bdrum force-pushed the dev23122 branch 2 times, most recently from 136a880 to 73fd5c4 Compare December 21, 2021 11:17
@bdrum bdrum marked this pull request as draft December 21, 2021 11:19
@bdrum bdrum marked this pull request as ready for review December 21, 2021 11:31
@bdrum bdrum marked this pull request as draft December 21, 2021 18:45
bdrum added a commit to bdrum/pandas that referenced this pull request Dec 21, 2021
@bdrum
Copy link
Contributor Author

bdrum commented Dec 21, 2021

Sorry, a bit confusion after merging conflicts...

Now I have 26 failed test in my branch as well as in current master:

=============================================================== short test summary info =============================================================== FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other6] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Series-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-DataFrame-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-array-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[numexpr-tzlocal()] - OSError: [... FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Index-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other6] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-Series-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-DataFrame-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other3] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other7] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[python-tzlocal()-array-other9] FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_mixed_invalid[python-tzlocal()] - OSError: [E... FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[3] - AssertionError: asser... FAILED pandas/tests/indexing/test_chaining_and_caching.py::TestChaining::test_detect_chained_assignment_warning_stacklevel[rhs1] - AssertionError: as... ====================================== 26 failed, 4 passed, 2 skipped, 151259 deselected, 29 warnings in 52.32s =======================================

@bdrum bdrum marked this pull request as ready for review December 21, 2021 20:10
@bdrum bdrum requested a review from jreback December 21, 2021 20:13
@jbrockmendel
Copy link
Member

Can you post the traceback for one of the arithmetic tests that failed?

@bdrum
Copy link
Contributor Author

bdrum commented Dec 22, 2021

Can you post the traceback for one of the arithmetic tests that failed?

Sure. Here it is:

FAILED pandas/tests/arithmetic/test_datetime64.py::TestDatetime64ArrayLikeComparisons::test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]

<?xml version="1.0" encoding="utf-8"?><testsuites><testsuite name="pytest" errors="0" failures="1" skipped="0" tests="1" time="0.974" timestamp="2021-12-22T10:48:13.621099" hostname="RUMLAB"><testcase classname="pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons" name="test_dt64arr_cmp_arraylike_invalid[numexpr-tzlocal()-Index-other3]" time="0.012"><failure message="OSError: [Errno 22] Invalid argument">self = &lt;pandas.tests.arithmetic.test_datetime64.TestDatetime64ArrayLikeComparisons object at 0x000001F5FE6B3790&gt;
other = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
tz_naive_fixture = tzlocal()
box_with_array = &lt;class 'pandas.core.indexes.base.Index'&gt;

    @pytest.mark.parametrize(
        "other",
        [
            # GH#4968 invalid date/int comparisons
            list(range(10)),
            np.arange(10),
            np.arange(10).astype(np.float32),
            np.arange(10).astype(object),
            pd.timedelta_range("1ns", periods=10).array,
            np.array(pd.timedelta_range("1ns", periods=10)),
            list(pd.timedelta_range("1ns", periods=10)),
            pd.timedelta_range("1 Day", periods=10).astype(object),
            pd.period_range("1971-01-01", freq="D", periods=10).array,
            pd.period_range("1971-01-01", freq="D", periods=10).astype(object),
        ],
    )
    def test_dt64arr_cmp_arraylike_invalid(
        self, other, tz_naive_fixture, box_with_array
    ):
        tz = tz_naive_fixture
    
        dta = date_range("1970-01-01", freq="ns", periods=10, tz=tz)._data
        obj = tm.box_expected(dta, box_with_array)
&gt;       assert_invalid_comparison(obj, other, box_with_array)

pandas\tests\arithmetic\test_datetime64.py:122: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\tests\arithmetic\common.py:111: in assert_invalid_comparison
    result = xbox2(left == right)
pandas\core\ops\common.py:70: in new_method
    return method(self, other)
pandas\core\arraylike.py:40: in __eq__
    return self._cmp_method(other, operator.eq)
pandas\core\indexes\base.py:6591: in _cmp_method
    result = op(self._values, other)
pandas\core\ops\common.py:70: in new_method
    return method(self, other)
pandas\core\arraylike.py:40: in __eq__
    return self._cmp_method(other, operator.eq)
pandas\core\arrays\datetimelike.py:1024: in _cmp_method
    op, np.asarray(self.astype(object)), other
pandas\core\arrays\datetimes.py:666: in astype
    return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
pandas\core\arrays\datetimelike.py:413: in astype
    converted = ints_to_pydatetime(
pandas\_libs\tslibs\vectorized.pyx:177: in pandas._libs.tslibs.vectorized.ints_to_pydatetime
    local_value = tz_convert_utc_to_tzlocal(value, tz)
pandas\_libs\tslibs\tzconversion.pyx:383: in pandas._libs.tslibs.tzconversion.tz_convert_utc_to_tzlocal
    return _tz_convert_tzlocal_utc(utc_val, tz, to_utc=False, fold=fold)
pandas\_libs\tslibs\tzconversion.pyx:594: in pandas._libs.tslibs.tzconversion._tz_convert_tzlocal_utc
    delta = _tzlocal_get_offset_components(val, tz, to_utc, fold)
pandas\_libs\tslibs\tzconversion.pyx:554: in pandas._libs.tslibs.tzconversion._tzlocal_get_offset_components
    dt = dt.astimezone(tz)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:144: in fromutc
    return f(self, dt)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:261: in fromutc
    _fold = self._fold_status(dt, dt_wall)
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\_common.py:196: in _fold_status
    if self.is_ambiguous(dt_wall):
..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:254: in is_ambiguous
    naive_dst = self._naive_is_dst(dt)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = tzlocal(), dt = datetime.datetime(1969, 12, 31, 21, 0, tzinfo=tzlocal())

    def _naive_is_dst(self, dt):
        timestamp = _datetime_to_timestamp(dt)
&gt;       return time.localtime(timestamp + time.timezone).tm_isdst
E       OSError: [Errno 22] Invalid argument

..\..\..\Anaconda3\envs\pandas-dev\lib\site-packages\dateutil\tz\tz.py:260: OSError</failure></testcase></testsuite></testsuites>

@bdrum
Copy link
Contributor Author

bdrum commented Dec 22, 2021

asv report:

       before           after         ratio
     [2cc9ab3e]       [4f58c8a8]
     <master>         <dev23122>
-     1.69±0.01ms         46.4±1μs     0.03  sparse.GetItemMask.time_mask(False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
[ 50.00%] · For pandas commit 4f58c8a8 <dev23122> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· sparse.GetItemMask.time_mask                                                                                                           ok
[ 75.00%] ··· ============ ============
               fill_value
              ------------ ------------
                  True      24.3±0.2ms
                 False       46.4±1μs
                  nan       22.1±0.5ms
              ============ ============

[ 75.00%] · For pandas commit 2cc9ab3e <master> (round 2/2):
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· sparse.GetItemMask.time_mask                                                                                                           ok
               fill_value
              ------------ -------------
                  True       24.8±0.8ms
                 False      1.69±0.01ms
                  nan        23.6±0.5ms
              ============ =============

@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

@bdrum how's this coming?

@bdrum
Copy link
Contributor Author

bdrum commented Dec 24, 2021

@jreback Thanks!)

I think that is done right now. I resolved your changes request: modified whatsnew file and added benchmark to asv.
So do you have any comments or remarks?

@jreback
Copy link
Contributor

jreback commented Dec 24, 2021

some failing builds which are relevant benchmarks and i think the doc build

@bdrum
Copy link
Contributor Author

bdrum commented Dec 24, 2021

Ah, you right that's my fault. Doc doesn't know what is pull, in this case as I can see I have to use :issue: and number of this pr. Sorry about that.

I just noticed that some pr was merged despite on the fact that not all check was done, so that is the reason why I wasn't very attentive about that. Now I will.

Let me fix it.

@jreback
Copy link
Contributor

jreback commented Dec 24, 2021

I just noticed that some pr was merged despite on the fact that not all check was done, so that is the reason why I wasn't very attentive about that. Now I will.

well i check the failures to see if they are unrelated but generally we need all green to avoid issues

@bdrum
Copy link
Contributor Author

bdrum commented Dec 25, 2021

@jreback Now failed checks looks like not by my fault.

@jreback jreback merged commit 6ba47ba into pandas-dev:master Dec 27, 2021
@jreback
Copy link
Contributor

jreback commented Dec 27, 2021

thanks @bdrum

@phofl
Copy link
Member

phofl commented Dec 29, 2021

This broke some things with greater comparisons. We should fix them before 1.4 is released or revert here.

s = pd.arrays.SparseArray([1, 2, 3, 4, np.nan, np.nan], fill_value=np.nan)
s[s>2]

returned

[3.0, 4.0]
Fill: nan
IntIndex
Indices: array([0, 1], dtype=int32)

which was correct. Now this returns

[1.0, 2.0, 3.0, 4.0]
Fill: nan
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Efficiency of SparseArray.__getitem__(SparseArray[bool])
4 participants