Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/PERF: optional mask in nullable dtypes #42012

Closed
wants to merge 14 commits into from

Conversation

mzeitlin11
Copy link
Member

Benchmarks:


       before           after         ratio
     [e3d57c98]       [ea9f1a89]
     <master>         <perf/optional_mask~1>
+        985±70ns      2.42±0.04μs     2.45  array.IntegerArray.time_constructor
+     1.11±0.08μs       2.66±0.1μs     2.38  array.BooleanArray.time_constructor
+     11.5±0.06μs       15.3±0.3μs     1.34  series_methods.NanOps.time_func('argmax', 1000, 'boolean')
+      12.6±0.4μs       16.3±0.6μs     1.29  boolean.TimeLogicalOps.time_and_array
+        25.9±1μs         32.4±3μs     1.25  boolean.TimeLogicalOps.time_or_scalar
+      13.3±0.5μs       16.2±0.6μs     1.22  boolean.TimeLogicalOps.time_or_array
+        27.5±1μs         33.3±2μs     1.21  boolean.TimeLogicalOps.time_and_scalar
+      14.5±0.4μs       17.4±0.4μs     1.20  boolean.TimeLogicalOps.time_xor_array
+      34.8±0.6μs         41.8±2μs     1.20  series_methods.Isna.time_isna_no_nans('Int64')
+      29.2±0.4μs         34.7±1μs     1.19  boolean.TimeLogicalOps.time_xor_scalar
+      27.2±0.3μs         31.0±3μs     1.14  array.IntegerArray.time_from_integer_array
-      66.9±0.6μs       58.6±0.7μs     0.88  series_methods.NanOps.time_func('skew', 1000, 'Int64')
-        78.0±2μs         66.9±1μs     0.86  series_methods.NanOps.time_func('std', 1000, 'boolean')
-      67.4±0.9μs       57.6±0.8μs     0.85  series_methods.NanOps.time_func('kurt', 1000, 'boolean')
-      66.9±0.8μs       56.9±0.7μs     0.85  series_methods.NanOps.time_func('kurt', 1000, 'Int64')
-        67.6±2μs       57.4±0.6μs     0.85  series_methods.NanOps.time_func('var', 1000, 'Int64')
-      10.8±0.2μs       9.12±0.5μs     0.85  series_methods.NanOps.time_func('sum', 1000, 'boolean')
-        75.4±5μs         63.5±1μs     0.84  series_methods.NanOps.time_func('std', 1000, 'int32')
-            5000             4016     0.80  array.NullableArrayMemory.track_array_size('Float32')
-        150±20μs        119±0.6μs     0.80  series_methods.NanOps.time_func('sem', 1000, 'int8')
-      8.81±0.2μs       6.92±0.2μs     0.79  series_methods.NanOps.time_func('max', 1000, 'boolean')
-     6.25±0.06μs       4.83±0.1μs     0.77  series_methods.All.time_all(1000, 'slow', 'boolean')
-     9.78±0.08μs       7.44±0.2μs     0.76  series_methods.NanOps.time_func('prod', 1000, 'Int64')
-     1.40±0.02ms      1.03±0.08ms     0.73  series_methods.NanOps.time_func('prod', 1000000, 'boolean')
-      9.44±0.1μs       6.83±0.1μs     0.72  series_methods.NanOps.time_func('min', 1000, 'Int64')
-      12.9±0.2μs         9.12±2μs     0.70  series_methods.NanOps.time_func('mean', 1000, 'Int64')
-     4.26±0.06ms       2.97±0.4ms     0.70  series_methods.NanOps.time_func('std', 1000000, 'boolean')
-        170±10μs          108±8μs     0.64  algos.isin.IsIn.time_isin('boolean')
-      10.2±0.3μs       6.50±0.1μs     0.63  series_methods.NanOps.time_func('max', 1000, 'Int64')
-      4.14±0.2ms       2.60±0.2ms     0.63  series_methods.NanOps.time_func('var', 1000000, 'boolean')
-      14.4±0.2μs       8.83±0.1μs     0.61  series_methods.NanOps.time_func('mean', 1000, 'boolean')
-        6.27±1ms       3.70±0.1ms     0.59  series_methods.NanOps.time_func('std', 1000000, 'Int64')
-     1.01±0.01ms         557±10μs     0.55  series_methods.NanOps.time_func('sum', 1000000, 'boolean')
-     1.27±0.04ms         696±10μs     0.55  series_methods.NanOps.time_func('prod', 1000000, 'Int64')
-            2000             1016     0.51  array.NullableArrayMemory.track_array_size('Int8')
-            2000             1016     0.51  array.NullableArrayMemory.track_array_size('boolean')
-         116±5μs         58.2±2μs     0.50  algos.isin.IsIn.time_isin_mismatched_dtype('boolean')
-       925±400μs          338±6μs     0.37  series_methods.NanOps.time_func('sum', 1000000, 'Int64')
-     1.03±0.06ms         371±30μs     0.36  series_methods.NanOps.time_func('mean', 1000000, 'Int64')
-      1.95±0.2ms          329±5μs     0.17  series_methods.NanOps.time_func('max', 1000000, 'Int64')
-      1.98±0.2ms          324±3μs     0.16  series_methods.NanOps.time_func('min', 1000000, 'Int64')
-         454±7μs         69.6±1μs     0.15  series_methods.All.time_all(1000000, 'slow', 'boolean')
-        454±10μs         64.4±2μs     0.14  series_methods.Any.time_any(1000000, 'slow', 'boolean')
-         430±6μs         43.7±2μs     0.10  series_methods.Any.time_any(1000000, 'fast', 'boolean')
-         431±7μs         42.2±1μs     0.10  series_methods.All.time_all(1000000, 'fast', 'boolean')
-         505±4μs         42.4±3μs     0.08  series_methods.NanOps.time_func('min', 1000000, 'boolean')
-        496±10μs       13.6±0.6μs     0.03  series_methods.NanOps.time_func('max', 1000000, 'boolean')

  • The decrease in the isna benchmark was somewhat surprising, but happens because of the following:
In [12]: mask = np.zeros(100000, dtype=np.bool_)

In [13]: %timeit mask.copy()  # what happens in master
2 µs ± 19.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [14]: %timeit np.zeros(100000, dtype=np.bool_)  # essentially what happens in this pr
2.7 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  • The argmax performance decrease is caused by the isna slowdown because there are a bunch of isna.any() calls. I think this decrease could be prevented (and performance increased overall) by adding argmax to our masked reductions (or avoiding the repeated isna.any() some other way), but that seems better left for a different pr.

  • The TimeLogicalOps benchmark slowdown is occurring because of the mask.any() check added in the BooleanArray constructor - this slowdown could be avoided by more aggressively figuring out when mask can (or won't) be None, but that would complicate the logic somewhat, so would plan to leave as a potential followup.

  • Finally, the slowdown in IntegerArray and BooleanArray is also due to this added mask.any() check. This could be mitigated by determining earlier when missing values are not possible in the coerce_to_array functions, but that adds reasonable complexity to already complex functions.

@mzeitlin11 mzeitlin11 added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Jun 15, 2021
@@ -404,20 +453,17 @@ def isin(self, values) -> BooleanArray: # type: ignore[override]
from pandas.core.arrays import BooleanArray

result = isin(self._data, values)
if self._hasna:
if self._mask is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple changes like this where _mask is not None was used instead of _hasna. _hasna seems more resistant to any future refactoring in how the mask works, but then we need an additional assert self._mask is not None afterwards anyway for mypy.

@jorisvandenbossche
Copy link
Member

@mzeitlin11 thanks a lot for looking into this!

There is one problem I recently realized while thinking this through (but didn't yet comment about that on the issue), and that is: how to ensure "views" still work?

So for example, assume I create an array with no NAs (the mask will be None), then I create a view of this array (with a slice, or the view() method), and then it is expected that modifications in the one array will be reflected in the other array. But when introducing missing values, the mask gets created on the view, and this doesn't propagate to the parent (or vice versa):

In [7]: arr = pd.array([1, 2, 3, 4])

In [8]: arr2 = arr[:]

In [9]: arr2[0] = pd.NA

In [10]: arr2
Out[10]: 
<IntegerArray>
[<NA>, 2, 3, 4]
Length: 4, dtype: Int64

In [11]: arr   # <-- not mutated, but should have been
Out[11]: 
<IntegerArray>
[1, 2, 3, 4]
Length: 4, dtype: Int64

One possible solution would be to create a small Mask class that is stored on the array instead of np.ndarray | None, and this object could then actually be shared between views. And it's this object that could "hide" whether the mask is present as an ndarray or not, and if the object gets modified to add an ndarray as mask (eg when setting a NA), then the change can be reflected in the all arrays that share this Mask object.

Such a class could also have some helper methods then like has_na(), maybe_has_na() (only check if the mask is not None), or get_ndarray_mask(), etc.

However, one problem with this (that I only realize now while writing it down), is that if using such a shared Mask object, we would need to keep track of the offset/length on the Array class in case the array is sliced (because the mask would present the full mask of the parent array, and when asking for the mask array or whether there are NAs present, we of course only need the sliced part), which seems like a considerable complication ..

Another option would be to only support "read-only views" (so that if an array is a view of another, it can't be mutated), but that's certainly something that would need to be discussed.

cc @jbrockmendel @TomAugspurger

@mzeitlin11
Copy link
Member Author

There is one problem I recently realized while thinking this through (but didn't yet comment about that on the issue), and that is: how to ensure "views" still work?

Thanks for bringing this up, that's a serious complication that never crossed my mind :)

For now I'll try messing around with some workarounds like you mention, though I worry they might complicate the implementation too much to be worth it.

@TomAugspurger
Copy link
Contributor

I think the only way to handle this properly is for the second array to be a view on the original IntegerArray object, rather than it's component ndarrays like we do in __getitem__ today. That seems like the only way to ensure that the state is actually ensured to be consistent.

I don't know exactly what this would look like in code, but maybe some kind of .base property with a type Optional[IntegerArray]. Then any references to ._data or ._mask would go through .base if it's present.

@mzeitlin11
Copy link
Member Author

Thanks for the insight @TomAugspurger and @jorisvandenbossche! Will close this for now until I make any progress on a new approach.

@mzeitlin11 mzeitlin11 closed this Jun 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH/PERF: allow mask to be optional in our masked ExtensionArrays
3 participants