Simplify the looping structure of bitmap scanning #2952
Conversation
> at the cost of unrolling the inner loop once

Benchmarking this (cf. #2967) gives the following raw counts:
```
[nix-shell:~/motoko]$ tail compacting_perf.csv compacting-improvement2952_perf.csv
==> compacting_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29548225,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29572634,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29318871,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29305196,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29326984,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29186706,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29178490,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29159337,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29173610,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,32425028,224,200,16

==> compacting-improvement2952_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28796716,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28820437,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28565764,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28551354,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28572445,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28431455,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28422158,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28402346,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28415946,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,31666572,224,200,16
```
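For example, comparing the cycle counts in the first row of each file with the same kind of `ghc -e` one-liner used for #2967 below prints roughly 2.54, i.e. about a 2.5% cycle reduction (the expression is illustrative; any calculator works):

```
[nix-shell:~/motoko]$ ghc -e "100-28796716/29548225*100"
```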
Added some comments, but I think @osa1 should get a chance to review too.
Nice work!
```
- // Inner loop iterates bits in the current word
- while self.current_word != 0 {
+ // Inner conditional examines the least significant bit(s) in the current word
+ if self.current_word != 0 {
```
I wonder if you can (carefully) refactor this:
```
// Inner conditional examines the least significant bit(s) in the current word
if self.current_word != 0 {
    if self.current_word & 0b1 != 0 {
        let bit_idx = self.current_bit_idx;
        self.current_word >>= 1;
        self.current_bit_idx += 1;
        return bit_idx;
    } else {
        let shift_amt = self.current_word.trailing_zeros();
        self.current_word >>= shift_amt;
        self.current_word >>= 1;
        let bit_idx = self.current_bit_idx + shift_amt;
        self.current_bit_idx = bit_idx + 1;
        return bit_idx;
    }
}
```
to this, pulling out the common continuation:
```
// Inner conditional examines the least significant bit(s) in the current word
if self.current_word != 0 {
    let mut bit_idx = self.current_bit_idx;
    if self.current_word & 0b1 == 0 {
        let shift_amt = self.current_word.trailing_zeros();
        self.current_word >>= shift_amt;
        bit_idx += shift_amt;
    }
    self.current_word >>= 1;
    self.current_bit_idx = bit_idx + 1;
    return bit_idx;
}
```
The next evolution will be:
```
diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 0856533ed..172e34e43 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,21 +167,14 @@ impl BitmapIter {
         // Outer loop iterates 64-bit words
         loop {
-            // Inner conditional examines the least significant bit(s) in the current word
+            // Examine the least significant bit(s) in the current word
             if self.current_word != 0 {
-                if self.current_word & 0b1 != 0 {
-                    let bit_idx = self.current_bit_idx;
-                    self.current_word >>= 1;
-                    self.current_bit_idx += 1;
-                    return bit_idx;
-                } else {
-                    let shift_amt = self.current_word.trailing_zeros();
-                    self.current_word >>= shift_amt;
-                    self.current_word >>= 1;
-                    let bit_idx = self.current_bit_idx + shift_amt;
-                    self.current_bit_idx = bit_idx + 1;
-                    return bit_idx;
-                }
+                let shift_amt = self.current_word.trailing_zeros();
+                self.current_word >>= shift_amt;
+                self.current_word >>= 1;
+                let bit_idx = self.current_bit_idx + shift_amt;
+                self.current_bit_idx = bit_idx + 1;
+                return bit_idx;
             }
             // Move on to next word (always 64-bit boundary)
```
which is even simpler than your solution. But let's merge this.
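For illustration, here is a self-contained sketch of that final single-branch form; `WordIter`, `next_bit`, and `main` are hypothetical scaffolding, and only the field names and shift logic follow the diff above:

```
// Minimal mock of the iterator state from the diff above; the real
// BitmapIter also tracks the bitmap pointer and advances through words.
struct WordIter {
    current_word: u64,
    current_bit_idx: u32,
}

impl WordIter {
    // Index of the next set bit in `current_word`, or None once the word
    // is exhausted (the real code would then load the next 64-bit word).
    fn next_bit(&mut self) -> Option<u32> {
        if self.current_word != 0 {
            let shift_amt = self.current_word.trailing_zeros();
            self.current_word >>= shift_amt; // swallow the trailing zeros
            self.current_word >>= 1; // drop the set bit (two shifts, see below)
            let bit_idx = self.current_bit_idx + shift_amt;
            self.current_bit_idx = bit_idx + 1;
            return Some(bit_idx);
        }
        None
    }
}

fn main() {
    let mut it = WordIter { current_word: 0b1010_0100, current_bit_idx: 0 };
    let mut bits = Vec::new();
    while let Some(b) = it.next_bit() {
        bits.push(b);
    }
    assert_eq!(bits, vec![2, 5, 7]);
}
```

Note that the shift is split in two so that a word whose only set bit is bit 63 never triggers a 64-bit shift; the PR description below explains this.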
I have rewritten the code which finds the next set bit in the marking bitmap (`BitmapIter::next`) to eliminate the inner `if`. This was suggested in #2952. Branchless execution indeed gives nice wins in measured cycles:

```
[nix-shell:~/motoko]$ ghc -e "100-27424377/28288331*100"
3.0541002931562105
```

About 3% cycle reduction on GC-heavy code. More benchmarks in the PR.
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64 bits) is unset, we now swallow all the trailing zeros, updating the bit position and the current word, and return the former. This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word. Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner loop altogether, which is a win especially for sparse bitmaps.

Benchmarks
This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile-creation benchmark is available, and the GC-relevant cycle-count improvement is about 2.5% compared to the baseline 0.6.16 release. More benchmark data and a graph are added to #2952.
Implementation concerns
We have to use two shifts (one by a dynamic count, then one by a static count) because adding one to the dynamic count could result in a shift by 64 bits, which is undefined behaviour, and Rust traps on that.
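A minimal sketch of that edge case (a standalone snippet, not the RTS code):

```
fn main() {
    // A word whose only set bit is the top bit: trailing_zeros() == 63.
    let mut w: u64 = 1 << 63;
    let shift_amt = w.trailing_zeros(); // 63

    // A single combined shift would be `w >> (shift_amt + 1)`, i.e. `w >> 64`,
    // which overflows: Rust panics on it (with overflow checks enabled).
    // let bad = w >> (shift_amt + 1);

    // Two shifts keep each count within 0..=63:
    w >>= shift_amt; // w == 1
    w >>= 1;         // w == 0, word exhausted
    assert_eq!(w, 0);
}
```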
We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse generated code by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!

N.B.: For the branchless optimisation the benchmarks look promising, but I prefer to merge this first.