ENH/PERF: optional mask in nullable dtypes #42012

mzeitlin11 · 2021-06-15T05:45:44Z

closes ENH/PERF: allow mask to be optional in our masked ExtensionArrays #30435
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Benchmarks:


       before           after         ratio
     [e3d57c98]       [ea9f1a89]
     <master>         <perf/optional_mask~1>
+        985±70ns      2.42±0.04μs     2.45  array.IntegerArray.time_constructor
+     1.11±0.08μs       2.66±0.1μs     2.38  array.BooleanArray.time_constructor
+     11.5±0.06μs       15.3±0.3μs     1.34  series_methods.NanOps.time_func('argmax', 1000, 'boolean')
+      12.6±0.4μs       16.3±0.6μs     1.29  boolean.TimeLogicalOps.time_and_array
+        25.9±1μs         32.4±3μs     1.25  boolean.TimeLogicalOps.time_or_scalar
+      13.3±0.5μs       16.2±0.6μs     1.22  boolean.TimeLogicalOps.time_or_array
+        27.5±1μs         33.3±2μs     1.21  boolean.TimeLogicalOps.time_and_scalar
+      14.5±0.4μs       17.4±0.4μs     1.20  boolean.TimeLogicalOps.time_xor_array
+      34.8±0.6μs         41.8±2μs     1.20  series_methods.Isna.time_isna_no_nans('Int64')
+      29.2±0.4μs         34.7±1μs     1.19  boolean.TimeLogicalOps.time_xor_scalar
+      27.2±0.3μs         31.0±3μs     1.14  array.IntegerArray.time_from_integer_array
-      66.9±0.6μs       58.6±0.7μs     0.88  series_methods.NanOps.time_func('skew', 1000, 'Int64')
-        78.0±2μs         66.9±1μs     0.86  series_methods.NanOps.time_func('std', 1000, 'boolean')
-      67.4±0.9μs       57.6±0.8μs     0.85  series_methods.NanOps.time_func('kurt', 1000, 'boolean')
-      66.9±0.8μs       56.9±0.7μs     0.85  series_methods.NanOps.time_func('kurt', 1000, 'Int64')
-        67.6±2μs       57.4±0.6μs     0.85  series_methods.NanOps.time_func('var', 1000, 'Int64')
-      10.8±0.2μs       9.12±0.5μs     0.85  series_methods.NanOps.time_func('sum', 1000, 'boolean')
-        75.4±5μs         63.5±1μs     0.84  series_methods.NanOps.time_func('std', 1000, 'int32')
-            5000             4016     0.80  array.NullableArrayMemory.track_array_size('Float32')
-        150±20μs        119±0.6μs     0.80  series_methods.NanOps.time_func('sem', 1000, 'int8')
-      8.81±0.2μs       6.92±0.2μs     0.79  series_methods.NanOps.time_func('max', 1000, 'boolean')
-     6.25±0.06μs       4.83±0.1μs     0.77  series_methods.All.time_all(1000, 'slow', 'boolean')
-     9.78±0.08μs       7.44±0.2μs     0.76  series_methods.NanOps.time_func('prod', 1000, 'Int64')
-     1.40±0.02ms      1.03±0.08ms     0.73  series_methods.NanOps.time_func('prod', 1000000, 'boolean')
-      9.44±0.1μs       6.83±0.1μs     0.72  series_methods.NanOps.time_func('min', 1000, 'Int64')
-      12.9±0.2μs         9.12±2μs     0.70  series_methods.NanOps.time_func('mean', 1000, 'Int64')
-     4.26±0.06ms       2.97±0.4ms     0.70  series_methods.NanOps.time_func('std', 1000000, 'boolean')
-        170±10μs          108±8μs     0.64  algos.isin.IsIn.time_isin('boolean')
-      10.2±0.3μs       6.50±0.1μs     0.63  series_methods.NanOps.time_func('max', 1000, 'Int64')
-      4.14±0.2ms       2.60±0.2ms     0.63  series_methods.NanOps.time_func('var', 1000000, 'boolean')
-      14.4±0.2μs       8.83±0.1μs     0.61  series_methods.NanOps.time_func('mean', 1000, 'boolean')
-        6.27±1ms       3.70±0.1ms     0.59  series_methods.NanOps.time_func('std', 1000000, 'Int64')
-     1.01±0.01ms         557±10μs     0.55  series_methods.NanOps.time_func('sum', 1000000, 'boolean')
-     1.27±0.04ms         696±10μs     0.55  series_methods.NanOps.time_func('prod', 1000000, 'Int64')
-            2000             1016     0.51  array.NullableArrayMemory.track_array_size('Int8')
-            2000             1016     0.51  array.NullableArrayMemory.track_array_size('boolean')
-         116±5μs         58.2±2μs     0.50  algos.isin.IsIn.time_isin_mismatched_dtype('boolean')
-       925±400μs          338±6μs     0.37  series_methods.NanOps.time_func('sum', 1000000, 'Int64')
-     1.03±0.06ms         371±30μs     0.36  series_methods.NanOps.time_func('mean', 1000000, 'Int64')
-      1.95±0.2ms          329±5μs     0.17  series_methods.NanOps.time_func('max', 1000000, 'Int64')
-      1.98±0.2ms          324±3μs     0.16  series_methods.NanOps.time_func('min', 1000000, 'Int64')
-         454±7μs         69.6±1μs     0.15  series_methods.All.time_all(1000000, 'slow', 'boolean')
-        454±10μs         64.4±2μs     0.14  series_methods.Any.time_any(1000000, 'slow', 'boolean')
-         430±6μs         43.7±2μs     0.10  series_methods.Any.time_any(1000000, 'fast', 'boolean')
-         431±7μs         42.2±1μs     0.10  series_methods.All.time_all(1000000, 'fast', 'boolean')
-         505±4μs         42.4±3μs     0.08  series_methods.NanOps.time_func('min', 1000000, 'boolean')
-        496±10μs       13.6±0.6μs     0.03  series_methods.NanOps.time_func('max', 1000000, 'boolean')

The decrease in the isna benchmark was somewhat surprising, but happens because of the following:

In [12]: mask = np.zeros(100000, dtype=np.bool_)

In [13]: %timeit mask.copy()  # what happens in master
2 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit np.zeros(100000, dtype=np.bool_)  # essentially what happens in this pr
2.7 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The argmax performance decrease is caused by the isna slowdown because there are a bunch of isna.any() calls. I think this decrease could be prevented (and performance increased overall) by adding argmax to our masked reductions (or avoiding the repeated isna.any() some other way), but that seems better left for a different pr.
The TimeLogicalOps benchmark slowdown is occurring because of the mask.any() check added in the BooleanArray constructor - this slowdown could be avoided by more aggressively figuring out when mask can (or won't) be None, but that would complicate the logic somewhat, so would plan to leave as a potential followup.
Finally, the slowdown in IntegerArray and BooleanArray is also due to this added mask.any() check. This could be mitigated by determining earlier when missing values are not possible in the coerce_to_array functions, but that adds reasonable complexity to already complex functions.

mzeitlin11 · 2021-06-15T05:49:31Z

pandas/core/arrays/masked.py

@@ -404,20 +453,17 @@ def isin(self, values) -> BooleanArray:  # type: ignore[override]
        from pandas.core.arrays import BooleanArray

        result = isin(self._data, values)
-        if self._hasna:
+        if self._mask is not None:


There are multiple changes like this where _mask is not None was used instead of _hasna. _hasna seems more resistant to any future refactoring in how the mask works, but then we need an additional assert self._mask is not None afterwards anyway for mypy.

jorisvandenbossche · 2021-06-17T11:39:49Z

@mzeitlin11 thanks a lot for looking into this!

There is one problem I recently realized while thinking this through (but didn't yet comment about that on the issue), and that is: how to ensure "views" still work?

So for example, assume I create an array with no NAs (the mask will be None), then I create a view of this array (with a slice, or the view() method), and then it is expected that modifications in the one array will be reflected in the other array. But when introducing missing values, the mask gets created on the view, and this doesn't propagate to the parent (or vice versa):

In [7]: arr = pd.array([1, 2, 3, 4])

In [8]: arr2 = arr[:]

In [9]: arr2[0] = pd.NA

In [10]: arr2
Out[10]: 
<IntegerArray>
[<NA>, 2, 3, 4]
Length: 4, dtype: Int64

In [11]: arr   # <-- not mutated, but should have been
Out[11]: 
<IntegerArray>
[1, 2, 3, 4]
Length: 4, dtype: Int64

One possible solution would be to create a small Mask class that is stored on the array instead of np.ndarray | None, and this object could then actually be shared between views. And it's this object that could "hide" whether the mask is present as an ndarray or not, and if the object gets modified to add an ndarray as mask (eg when setting a NA), then the change can be reflected in the all arrays that share this Mask object.

Such a class could also have some helper methods then like has_na(), maybe_has_na() (only check if the mask is not None), or get_ndarray_mask(), etc.

However, one problem with this (that I only realize now while writing it down), is that if using such a shared Mask object, we would need to keep track of the offset/length on the Array class in case the array is sliced (because the mask would present the full mask of the parent array, and when asking for the mask array or whether there are NAs present, we of course only need the sliced part), which seems like a considerable complication ..

Another option would be to only support "read-only views" (so that if an array is a view of another, it can't be mutated), but that's certainly something that would need to be discussed.

cc @jbrockmendel @TomAugspurger

mzeitlin11 · 2021-06-17T17:25:08Z

There is one problem I recently realized while thinking this through (but didn't yet comment about that on the issue), and that is: how to ensure "views" still work?

Thanks for bringing this up, that's a serious complication that never crossed my mind :)

For now I'll try messing around with some workarounds like you mention, though I worry they might complicate the implementation too much to be worth it.

TomAugspurger · 2021-06-19T00:08:33Z

I think the only way to handle this properly is for the second array to be a view on the original IntegerArray object, rather than it's component ndarrays like we do in __getitem__ today. That seems like the only way to ensure that the state is actually ensured to be consistent.

I don't know exactly what this would look like in code, but maybe some kind of .base property with a type Optional[IntegerArray]. Then any references to ._data or ._mask would go through .base if it's present.

mzeitlin11 · 2021-06-20T01:41:59Z

Thanks for the insight @TomAugspurger and @jorisvandenbossche! Will close this for now until I make any progress on a new approach.

mzeitlin11 added 12 commits June 12, 2021 12:45

WIP

3ce7ce8

WIP

c2a08b6

WIP

8a9c97e

WIP

ed1f373

WIP

f25dcf6

WIP

e0637b4

Enforce invariant

e3758d3

Fixup benchmark

2cdb200

Fixup whatsnew

a20f820

Merge remote-tracking branch 'upstream/master' into perf/optional_mask

5e2c66e

Move whatsnew

ea9f1a8

Update benchmarks

6ce34be

mzeitlin11 added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Jun 15, 2021

mzeitlin11 commented Jun 15, 2021

View reviewed changes

mzeitlin11 added 2 commits June 15, 2021 09:46

mypy fixup

015fa8b

Merge remote-tracking branch 'upstream/master' into perf/optional_mask

3ca64f6

mzeitlin11 closed this Jun 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH/PERF: optional mask in nullable dtypes #42012

ENH/PERF: optional mask in nullable dtypes #42012

mzeitlin11 commented Jun 15, 2021

mzeitlin11 Jun 15, 2021

jorisvandenbossche commented Jun 17, 2021

mzeitlin11 commented Jun 17, 2021

TomAugspurger commented Jun 19, 2021

mzeitlin11 commented Jun 20, 2021

ENH/PERF: optional mask in nullable dtypes #42012

ENH/PERF: optional mask in nullable dtypes #42012

Conversation

mzeitlin11 commented Jun 15, 2021

mzeitlin11 Jun 15, 2021

Choose a reason for hiding this comment

jorisvandenbossche commented Jun 17, 2021

mzeitlin11 commented Jun 17, 2021

TomAugspurger commented Jun 19, 2021

mzeitlin11 commented Jun 20, 2021