Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize max_boolean by operating on u64 chunks #6098

Merged
merged 2 commits into from
Jul 22, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 28 additions & 5 deletions arrow-arith/src/aggregate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -371,11 +371,26 @@ pub fn max_boolean(array: &BooleanArray) -> Option<bool> {
}

// Note the max bool is true (1), so short circuit as soon as we see it
array
.iter()
.find(|&b| b == Some(true))
.flatten()
.or(Some(false))
match array.nulls() {
None => array
.values()
.bit_chunks()
.iter_padded()
// We found a true if any bit is set
.map(|x| x != 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! I wonder if we can do similar to min_boolean where the condition is .map(|x| x != u64::MAX)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely! I think I'll do it in a follow-up PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, what is the "penalty" for supporting short circuiting?

.find(|b| *b)
Comment on lines +380 to +381
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to use find_map instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like this? .find_map(|x| if x != 0 { Some(true) } else { None })

It's a personal preference, but I prefer the existing map + find

Copy link
Member

@Xuanwo Xuanwo Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about (x!=0).then_some(true). But I do agree that it's just personal preference. It's ok for me to keep existing map + find.

.or(Some(false)),
Some(nulls) => {
let validity_chunks = nulls.inner().bit_chunks().iter_padded();
let value_chunks = array.values().bit_chunks().iter_padded();
value_chunks
.zip(validity_chunks)
// We found a true if the value bit is 1, AND the validity bit is 1 for any bits in the chunk
.map(|(value_bits, validity_bits)| (value_bits & validity_bits) != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same question.

.find(|b| *b)
.or(Some(false))
}
}
}

/// Helper to compute min/max of [`ArrayAccessor`].
Expand Down Expand Up @@ -1318,6 +1333,14 @@ mod tests {
let a = BooleanArray::from(vec![Some(false), Some(true), None, Some(false), None]);
assert_eq!(Some(false), min_boolean(&a));
assert_eq!(Some(true), max_boolean(&a));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add some tests that have more than 64 values to exercise this new code?

I am thinking:

  1. all true in the first 64 bits, false in the second
  2. all false in the first 64 bits, true in the second
  3. all true (more than 64 bits, less than 128)
  4. all false (more than 64 its, less than 128)

Both with/without nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.
I added some tests in 5c40f28.
Is this what you were thinking of?

let a = BooleanArray::from(vec![Some(true), None]);
assert_eq!(Some(true), min_boolean(&a));
assert_eq!(Some(true), max_boolean(&a));

let a = BooleanArray::from(vec![Some(false), None]);
assert_eq!(Some(false), min_boolean(&a));
assert_eq!(Some(false), max_boolean(&a));
}

#[test]
Expand Down
47 changes: 47 additions & 0 deletions arrow/benches/aggregate_kernels.rs
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,53 @@ fn add_benchmark(c: &mut Criterion) {
.bench_function("min nullable", |b| b.iter(|| min_string(&nullable_strings)))
.bench_function("max nullable", |b| b.iter(|| max_string(&nullable_strings)));
}

{
let nonnull_bools_mixed = create_boolean_array(BATCH_SIZE, 0.0, 0.5);
let nonnull_bools_all_false = create_boolean_array(BATCH_SIZE, 0.0, 0.0);
let nonnull_bools_all_true = create_boolean_array(BATCH_SIZE, 0.0, 1.0);
let nullable_bool_mixed = create_boolean_array(BATCH_SIZE, 0.5, 0.5);
let nullable_bool_all_false = create_boolean_array(BATCH_SIZE, 0.5, 0.0);
let nullable_bool_all_true = create_boolean_array(BATCH_SIZE, 0.5, 1.0);
c.benchmark_group("bool")
.throughput(Throughput::Elements(BATCH_SIZE as u64))
.bench_function("min nonnull mixed", |b| {
b.iter(|| min_boolean(&nonnull_bools_mixed))
})
.bench_function("max nonnull mixed", |b| {
b.iter(|| max_boolean(&nonnull_bools_mixed))
})
.bench_function("min nonnull false", |b| {
b.iter(|| min_boolean(&nonnull_bools_all_false))
})
.bench_function("max nonnull false", |b| {
b.iter(|| max_boolean(&nonnull_bools_all_false))
})
.bench_function("min nonnull true", |b| {
b.iter(|| min_boolean(&nonnull_bools_all_true))
})
.bench_function("max nonnull true", |b| {
b.iter(|| max_boolean(&nonnull_bools_all_true))
})
.bench_function("min nullable mixed", |b| {
b.iter(|| min_boolean(&nullable_bool_mixed))
})
.bench_function("max nullable mixed", |b| {
b.iter(|| max_boolean(&nullable_bool_mixed))
})
.bench_function("min nullable false", |b| {
b.iter(|| min_boolean(&nullable_bool_all_false))
})
.bench_function("max nullable false", |b| {
b.iter(|| max_boolean(&nullable_bool_all_false))
})
.bench_function("min nullable true", |b| {
b.iter(|| min_boolean(&nullable_bool_all_true))
})
.bench_function("max nullable true", |b| {
b.iter(|| max_boolean(&nullable_bool_all_true))
});
}
}

criterion_group!(benches, add_benchmark);
Expand Down
Loading