Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/PERF: move np.errstate out of core array_ops up to higher level #40396

Merged
merged 3 commits into from
Mar 17, 2021

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Mar 12, 2021

Currently, we suppress numpy warnings with np.errstate(all="ignore") inside the evaluate function in core.computation.expressions, which is is only used in core.ops.array_ops._na_arithmetic_op, which in itself is only used in core.ops.array_ops arithmetic_op() and comparison_op() (where we actually called np.errstate again, duplicatively).

So, in summary, we suppress the warnings at the level of the "array op". For the ArrayManager, we call this array op many times for each column, and repeatedly calling np.errstate(all="ignore") gives a big overhead. Luckily, it is easy to suppress the warnings once at a higher level, at the DataFrame/Series level, where those array ops are called.

That's what this PR is doing: removing np.errstate(all="ignore") in the actual array ops, and adding it in all places where we currently call the array ops.

With the benchmark case of an arithmetic op with two dataframes, this gives a considerable improvement:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 1000))
df_am = df._as_manager("array")
In [2]: %timeit df_am + df_am
18.1 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   <-- master
8.57 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

For BM it doesn't matter for this case, since that's a single block and np.errstate would be called only once anyway. But for certain cases of df+s ops where we potentially also work column-wise for BM, it should benefit there as well.

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 12, 2021
@@ -433,7 +433,8 @@ def f(self, other, axis=default_axis, level=None, fill_value=None):

if isinstance(other, ABCDataFrame):
# Another DataFrame
new_data = self._combine_frame(other, na_op, fill_value)
with np.errstate(all="ignore"):
new_data = self._combine_frame(other, na_op, fill_value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt we get this one for free bc it goes through _dispatch_frame_op which you've got setting np.errstaet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, indeed _combine_frame eventually always goes through _dispatch_frame as well (I got here because it's a place that uses get_array_op)

@@ -265,8 +268,7 @@ def comparison_op(left: ArrayLike, right: Any, op) -> ArrayLike:
with warnings.catch_warnings():
# suppress warnings from numpy about element-wise comparison
warnings.simplefilter("ignore", DeprecationWarning)
with np.errstate(all="ignore"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the catch_warnings here be done at a higher level too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but that might need to be done more targeted for the specific type of op (since we don't do anything similar in arithmetic_op), and thus this will be more complicated. So would like to start with np.errstate.

@jbrockmendel
Copy link
Member

This is unfortunate, since it is both a) uglier to do this at a higher level and b) a really nice perf improvement

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Mar 12, 2021

Is it that much uglier? I removed it in 5 places, and I added it in 4 places in _dispatch_frame_op (which in theory could be combined into a single one) and 2 places in Series ops. So it's not that big of a difference.

@@ -376,7 +376,8 @@ def _cmp_method(self, other, op):
other = other._ndarray

pd_op = ops.get_array_op(op)
result = pd_op(self._ndarray, other)
with np.errstate(all="ignore"):
result = pd_op(self._ndarray, other)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we probably don't need to add this one here, since when going through a DataFrame/Series op, this will also already be catched on that level (and be don't consistently catch it in other EA ops). But this keeps current behaviour (and this not used in practice anyway)

@@ -6556,7 +6556,8 @@ def _dispatch_frame_op(self, right, func, axis: Optional[int] = None):
right = lib.item_from_zerodim(right)
if not is_list_like(right):
# i.e. scalar, faster than checking np.ndim(right) == 0
bm = self._mgr.apply(array_op, right=right)
with np.errstate(all="ignore"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just do the np.errstate once at the top of the method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is possible (as mentioned in #40396 (comment)). Personally, I think I prefer putting it just around the mgr operate call / array_op calls, but in the end it doesn't matter much (it's not that the other checks here could potentially gives such warning that would be incorrectly suppressed). So either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im fine with this, we can clean it up later if needbe

@@ -5087,7 +5088,8 @@ def _arith_method(self, other, op):

lvalues = self._values
rvalues = extract_array(other, extract_numpy=True)
result = ops.arithmetic_op(lvalues, rvalues, op)
with np.errstate(all="ignore"):
result = ops.arithmetic_op(lvalues, rvalues, op)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not need this for _logical_method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we don't suppress any warnings in ops.logical_op, so I left the current behaviour as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable

@jbrockmendel
Copy link
Member

LGTM, merging this afternoon unless someone chimes in to stop me

@jreback jreback added this to the 1.3 milestone Mar 17, 2021
@jreback jreback merged commit 8588651 into pandas-dev:master Mar 17, 2021
@jreback
Copy link
Contributor

jreback commented Mar 17, 2021

thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the ops-refactor-seterr branch March 18, 2021 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants