
perf(expr): add initial microbenchmark for expressions #6856

Merged · 8 commits · Dec 12, 2022

Conversation

wangrunji0408 (Contributor) commented on Dec 12, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR introduces an initial microbenchmark for expressions.

Here are the benchmark results of some simple operations on two I32 chunks of size 1024:

expr/add/i32            time:   [3.3661 µs 3.3726 µs 3.3803 µs]
expr/mul/i32            time:   [3.4544 µs 3.4628 µs 3.4707 µs]
expr/eq/i32             time:   [3.7170 µs 3.7255 µs 3.7346 µs]
expr/sum/i32            time:   [1.2756 µs 1.2800 µs 1.2860 µs]

... and these are the corresponding results on raw Vec<i32>s:

raw/add/i32             time:   [89.131 ns 89.174 ns 89.231 ns]
raw/sum/i32             time:   [54.860 ns 54.937 ns 55.032 ns]

The comparison looks terrible. Even accounting for the fact that our evaluation performs null and overflow checks while the raw version does not, the gap should not be this large. This result shows that our implementation is far from optimal and has a lot of room for improvement.
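For context, the raw baseline is just an element-wise add over two plain Vec<i32>s. The PR's benchmarks use criterion; the sketch below is a std-only stand-in (the raw_add name and timing harness are illustrative, not the PR's actual code), only to show what the raw path does:

```rust
use std::time::Instant;

// Raw baseline: element-wise add over two plain Vec<i32>s.
// The loop body is pure (no branches, no Option, no error handling),
// so the compiler is free to auto-vectorize it.
fn raw_add(lhs: &[i32], rhs: &[i32]) -> Vec<i32> {
    lhs.iter().zip(rhs).map(|(a, b)| a + b).collect()
}

fn main() {
    let lhs: Vec<i32> = (0..1024).collect();
    let rhs: Vec<i32> = (1..=1024).collect();
    let start = Instant::now();
    let out = raw_add(&lhs, &rhs);
    println!("raw/add/i32: {:?} (out[0] = {})", start.elapsed(), out[0]);
}
```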

To further show where the overhead comes from, this PR introduces more benchmarks layered on top of the raw add:

raw/add/i32             time:   [91.600 ns 93.113 ns 94.833 ns]
raw/add/i32/zip_eq      time:   [585.07 ns 586.54 ns 588.01 ns]
raw/add/Option<i32>/zip_eq
                        time:   [938.88 ns 940.36 ns 942.10 ns]
raw/add/Option<i32>/zip_eq,checked
                        time:   [2.1017 µs 2.1104 µs 2.1201 µs]
raw/add/Option<i32>/zip_eq,cast,checked
                        time:   [2.3200 µs 2.3598 µs 2.4268 µs]
raw/add/Option<i32>/zip_eq,checked,cast,collect_array
                        time:   [3.1290 µs 3.1444 µs 3.1605 µs]
expr/add/i32            time:   [3.3661 µs 3.3726 µs 3.3803 µs]

We can see that the main costs come from:

  • zip_eq
  • option
  • error handling
  • building array
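These layers can be reproduced outside the expression framework. The sketch below (hypothetical code, with itertools::zip_eq emulated by an explicit length check on top of std's zip) adds the listed costs one at a time:

```rust
// 1. zip with a length check: what zip_eq adds on top of plain zip.
fn add_zip_eq(lhs: &[i32], rhs: &[i32]) -> Vec<i32> {
    assert_eq!(lhs.len(), rhs.len(), "zip_eq panics on length mismatch");
    lhs.iter().zip(rhs).map(|(a, b)| a + b).collect()
}

// 2. + Option: every element may be null, as in a nullable array.
fn add_option(lhs: &[Option<i32>], rhs: &[Option<i32>]) -> Vec<Option<i32>> {
    assert_eq!(lhs.len(), rhs.len());
    lhs.iter()
        .zip(rhs)
        .map(|(a, b)| match (a, b) {
            (Some(a), Some(b)) => Some(a + b),
            _ => None,
        })
        .collect()
}

// 3. + error handling: overflow becomes a fallible branch per element.
fn add_checked(
    lhs: &[Option<i32>],
    rhs: &[Option<i32>],
) -> Result<Vec<Option<i32>>, &'static str> {
    assert_eq!(lhs.len(), rhs.len());
    lhs.iter()
        .zip(rhs)
        .map(|(a, b)| match (a, b) {
            (Some(a), Some(b)) => a.checked_add(*b).map(Some).ok_or("numeric overflow"),
            _ => Ok(None),
        })
        .collect()
}

fn main() {
    let lhs = [Some(1), None, Some(3)];
    let rhs = [Some(10), Some(20), Some(30)];
    println!("{:?}", add_checked(&lhs, &rhs));
}
```

Each step introduces control flow into the loop body, which is exactly what defeats auto-vectorization.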

What blows my mind is the cost of the identical type cast i32.try_into::<i32>(). It seems the compiler does not realize the cast is infallible and therefore fails to eliminate the error-handling path. On the other hand, when the loop body is pure computation, the compiler can auto-vectorize it with SIMD. That's why the raw add on Vec<i32> is so fast.

This analysis points out the direction for the next optimization: for primitive types, we can first apply the operation to all values without caring about null values or overflow. The benefit from SIMD far outweighs the computation saved by skipping null values or bailing out early on errors. This conclusion has also been validated in RisingLight: risinglightdb/risinglight#700.
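The proposed strategy can be sketched as follows: run the operation over all values with branch-free overflowing arithmetic so the loop can vectorize, accumulate an overflow flag on the side, and only fall back to a slow per-element check when that flag fires, ignoring overflow in null slots. This is a hypothetical illustration (simd_friendly_add and the bool-slice validity bitmap are assumptions, not the PR's implementation):

```rust
// Branch-free add over all values, nulls included; the hot loop has no
// control flow, so the compiler can auto-vectorize it.
// `valid` is a validity bitmap: false marks a null slot.
fn simd_friendly_add(
    lhs: &[i32],
    rhs: &[i32],
    valid: &[bool],
) -> Result<Vec<i32>, &'static str> {
    assert_eq!(lhs.len(), rhs.len());
    let mut overflow = false;
    let out: Vec<i32> = lhs
        .iter()
        .zip(rhs)
        .map(|(a, b)| {
            let (sum, o) = a.overflowing_add(*b);
            overflow |= o; // fold the flag instead of branching per element
            sum
        })
        .collect();
    // Slow path, taken only when some lane overflowed: an overflow in a
    // null slot does not matter, so re-check only the valid lanes.
    if overflow {
        for ((a, b), v) in lhs.iter().zip(rhs).zip(valid) {
            if *v && a.checked_add(*b).is_none() {
                return Err("numeric overflow");
            }
        }
    }
    Ok(out)
}

fn main() {
    println!("{:?}", simd_friendly_add(&[1, 2, 3], &[10, 20, 30], &[true; 3]));
}
```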

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#3524

Signed-off-by: Runji Wang <wangrunji0408@163.com>
codecov bot commented on Dec 12, 2022

Codecov Report

Merging #6856 (91db0e5) into main (ae398b3) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #6856      +/-   ##
==========================================
- Coverage   73.25%   73.24%   -0.01%     
==========================================
  Files        1032     1032              
  Lines      164648   164648              
==========================================
- Hits       120610   120600      -10     
- Misses      44038    44048      +10     
Flag Coverage Δ
rust 73.24% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/batch/src/executor/group_top_n.rs 68.42% <0.00%> (-6.44%) ⬇️
src/connector/src/source/filesystem/file_common.rs 80.44% <0.00%> (-0.45%) ⬇️
src/stream/src/executor/aggregation/minput.rs 96.49% <0.00%> (+0.10%) ⬆️
...frontend/src/scheduler/hummock_snapshot_manager.rs 58.79% <0.00%> (+0.50%) ⬆️


BugenZhao (Member) commented:
for primitive types, we can first apply the operation on all values without caring about null values or overflow. The benefits from SIMD are much more than saving computations from null values or errors

I once thought this could be done by the compiler. 🥵 However, considering the control flow (side effect) introduced by checked arithmetic, it sounds reasonable that it is not optimized.
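The control flow in question is visible in the two std intrinsics: checked_add returns an Option, forcing a branch per element, while wrapping_add is a plain instruction the vectorizer can handle. An illustrative comparison (not code from the PR):

```rust
// checked_add: Option forces a branch per element, which blocks
// auto-vectorization of the loop.
fn sum_checked(xs: &[i32]) -> Option<i32> {
    xs.iter().try_fold(0i32, |acc, &x| acc.checked_add(x))
}

// wrapping_add: branch-free, so the loop can be vectorized.
fn sum_wrapping(xs: &[i32]) -> i32 {
    xs.iter().fold(0i32, |acc, &x| acc.wrapping_add(x))
}

fn main() {
    println!("{:?} {}", sum_checked(&[1, 2, 3]), sum_wrapping(&[1, 2, 3]));
}
```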

(Two resolved review threads on src/expr/benches/expr.rs)
lmatz (Contributor) left a comment:

LGTM!

4 participants