-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add try_unary, binary, try_binary kernels ~90% faster #2666
Conversation
0815487
to
882a655
Compare
arrow/src/compute/kernels/arity.rs
Outdated
use std::sync::Arc; | ||
|
||
fn try_for_each_valid<F: FnMut(usize) -> Result<()>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This internal iteration style optimises much better than the external style when there are multiple variants, I may create a HybridBitIndexIterator
that special cases for_each
and try_fold
🤔
@@ -198,6 +259,80 @@ where | |||
} | |||
} | |||
|
|||
pub fn binary<A, B, F, O>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
#[inline] | ||
fn into_primitive_array_data<I: ArrowPrimitiveType, O: ArrowPrimitiveType>( | ||
array: &PrimitiveArray<I>, | ||
unsafe fn build_primitive_array<O: ArrowPrimitiveType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new name makes more sense than before
} | ||
try_binary(left, right, |a, b| { | ||
op(a, b).ok_or_else(|| { | ||
ArrowError::ComputeError(format!("Overflow happened on: {:?}, {:?}", a, b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we point out the op
which cause the overflow?
@@ -24,8 +24,51 @@ use super::{ | |||
PrimitiveArray, | |||
}; | |||
|
|||
/// an iterator that returns Some(T) or None, that can be used on any [`ArrayAccessor`] | |||
// Note: This implementation is based on std's [Vec]s' [IntoIter]. | |||
/// An iterator that returns Some(T) or None, that can be used on any [`ArrayAccessor`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explanation for this API. 👍
let null_count = array.null_count(); | ||
|
||
let mut buffer = BufferBuilder::<O::Native>::new(len); | ||
buffer.append_n_zeroed(array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do we need zero-initialized values for the buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is UB if we don't initialize all values in the buffer, even the null slots. We must therefore zero out the nulls, in the past I have found it is faster to zero initialize everything, and override the valid indexes, than to interleave appending nulls and values.
arrow/src/compute/kernels/arity.rs
Outdated
if selectivity > 0.8 { | ||
BitSliceIterator::new(nulls.unwrap(), 0, len) | ||
.flat_map(|(start, end)| start..end) | ||
.try_for_each(f) | ||
} else { | ||
BitIndexIterator::new(nulls.unwrap(), 0, len).try_for_each(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I read this incorrectly? I think higher selectivity
here is more null values in the array? So I guess BitIndexIterator
is more performant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, i think you're right. I copied this from the filter kernel and forgot to reverse this
@@ -198,6 +259,80 @@ where | |||
} | |||
} | |||
|
|||
pub fn binary<A, B, F, O>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need to add docs for binary
and try_binary
.
return Ok(PrimitiveArray::from(ArrayData::new_empty(&O::DATA_TYPE))); | ||
} | ||
|
||
let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len).unwrap(); | |
let null_buffer = combine_option_bitmap(&[a.data(), b.data()], len)?; |
op: F, | ||
) -> Result<PrimitiveArray<T>> | ||
) -> Result<PrimitiveArray<LT>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the return type same as the left type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the style, the type of left and right can be different.
Do we need to support this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is required to allow adding a duration to a time
90e12b1
to
fb9ae62
Compare
Final benchmarks
So anything from 50% to 99% faster for the checked kernels 🎉 . The unchecked kernels regularly vary by +-10% but we're talking single-digit microseconds, so this isn't all that surprising. |
fdd3040
to
f05a3e6
Compare
Benchmark runs are scheduled for baseline = d88ed6a and contender = 2d28010. 2d28010 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Rationale for this change
ArrayIter is great from an ergonomics perspective and definitely something we should continue to provide to users, however, the way it interleaves null masks severely makes for pretty poor performance. For primitive arrays, this can result in orders of magnitude more time spent handling the null mask than the values themselves.
What changes are included in this PR?
Are there any user-facing changes?
FYI @viirya @liukun4515