Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify the looping structure of bitmap scanning #2952

Merged
merged 5 commits into from
Dec 9, 2021
Merged

Conversation

ggreif
Copy link
Contributor

@ggreif ggreif commented Dec 1, 2021

This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64-bits) is unset, we now swallow all the trailing zeros, updating the bit position and the current word, and return the former.
This amounts to unrolling the inner loop once, because we know that we'll encounter another bit in the current word.
Unrolling comes at the cost of a few more instructions in the else leg, but since we return, we can eliminate the inner loop altogether, which is a win especially for sparse bitmaps.

Benchmarks

This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the cancan profile creation benchmark is available, and the GC-relevant cycle-count improvement is

[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153

about 2.5% compared to the baseline 0.6.16 release. More benchmark data and a graph is added to #2952.

Implementation concerns

We have to use two shifts (once dynamic, and once static counts) because adding one to the dynamic count could result in a shift of 64 bits which is undefined behaviour, and Rust traps on that.

We could also refrain from testing the lowest bit and go for the ctz directly, but that could result in worse generated code by wasmtime (?). OTOH that would probably eliminate the if, and branchless code is good!
N.B.: For the branchless optimisation the benchmarks look promising but I prefer to merge this first.

at the cost of unrolling the inner loop once
@github-actions
Copy link

github-actions bot commented Dec 1, 2021

Comparing from fe09dea to cf36f8b:
In terms of gas, no changes are observed in 3 tests.
In terms of size, 3 tests regressed and the mean change is +0.0%.

@ggreif
Copy link
Contributor Author

ggreif commented Dec 9, 2021

Benchmarking this gives (cf. #2967):
2952-improvement
... a nice reduction of cycle counts by ~2.5%.


Here are the raw counts:

  • compacting_perf.csv is baseline (v0.6.16)
  • compacting-improvement2952_perf.csv is this PR
[nix-shell:~/motoko]$ tail compacting_perf.csv compacting-improvement2952_perf.csv 
==> compacting_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29548225,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29572634,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29318871,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29305196,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29326984,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29186706,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29178490,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29159337,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29173610,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,32425028,224,200,16

==> compacting-improvement2952_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28796716,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28820437,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28565764,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28551354,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28572445,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28431455,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28422158,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28402346,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28415946,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,31666572,224,200,16

@ggreif ggreif changed the title WIP: simplify the looping structure Simplify the looping structure of bitmap scanning Dec 9, 2021
@ggreif ggreif marked this pull request as ready for review December 9, 2021 08:09
@ggreif ggreif requested review from crusso, osa1 and nomeata December 9, 2021 08:10
Copy link
Contributor

@crusso crusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, but I think @osa1 should get a chance to review too.

Nice work!

// Inner loop iterates bits in the current word
while self.current_word != 0 {
// Inner conditional examines the least significant bit(s) in the current word
if self.current_word != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you can (carefully) refactor this to:

           // Inner conditional examines the least significant bit(s) in the current word
            if self.current_word != 0 {
                if self.current_word & 0b1 != 0 {
                    let bit_idx = self.current_bit_idx;
                    self.current_word >>= 1;
                    self.current_bit_idx += 1;
                    return bit_idx;
                } else {
                    let shift_amt = self.current_word.trailing_zeros();
                    self.current_word >>= shift_amt;
                    self.current_word >>= 1;
                    let bit_idx = self.current_bit_idx + shift_amt;
                    self.current_bit_idx = bit_idx + 1;
                    return bit_idx;
                }
            }

to this, pulling out the common continuation.

           // Inner conditional examines the least significant bit(s) in the current word
            if self.current_word != 0 {
                let bit_idx = self.current_bit_idx;
                if self.current_word & 0b1 == 0 {
                   let shift_amt = self.current_word.trailing_zeros();
                   self.current_word >>= shift_amt;
                   self.current_word >>= 1;
                   bit_idx += shift_amt;
                }
                self.current_word >>= 1;
                self.current_bit_idx += 1;
                return bit_idx;		
            }
 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next evolution will be:

diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 0856533ed..172e34e43 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,21 +167,14 @@ impl BitmapIter {
 
         // Outer loop iterates 64-bit words
         loop {
-            // Inner conditional examines the least significant bit(s) in the current word
+            // Examine the least significant bit(s) in the current word
             if self.current_word != 0 {
-                if self.current_word & 0b1 != 0 {
-                    let bit_idx = self.current_bit_idx;
-                    self.current_word >>= 1;
-                    self.current_bit_idx += 1;
-                    return bit_idx;
-                } else {
-                    let shift_amt = self.current_word.trailing_zeros();
-                    self.current_word >>= shift_amt;
-                    self.current_word >>= 1;
-                    let bit_idx = self.current_bit_idx + shift_amt;
-                    self.current_bit_idx = bit_idx + 1;
-                    return bit_idx;
-                }
+                let shift_amt = self.current_word.trailing_zeros();
+                self.current_word >>= shift_amt;
+                self.current_word >>= 1;
+                let bit_idx = self.current_bit_idx + shift_amt;
+                self.current_bit_idx = bit_idx + 1;
+                return bit_idx;
             }
 
             // Move on to next word (always 64-bit boundary)

which is even simpler than your solution. But let's merge this.

@ggreif ggreif requested a review from crusso December 9, 2021 10:36
@ggreif ggreif added the automerge-squash When ready, merge (using squash) label Dec 9, 2021
@mergify mergify bot merged commit 9bd0c7d into master Dec 9, 2021
@mergify mergify bot deleted the gabor/inner-loop branch December 9, 2021 12:47
@mergify mergify bot removed the automerge-squash When ready, merge (using squash) label Dec 9, 2021
mergify bot pushed a commit that referenced this pull request Dec 9, 2021
I have rewritten the code which finds the next set bit in the marking bitmap (`BitmapIter::next`) to eliminate the inner `if`. This was suggested in #2952.

Branchless execution indeed gives nice wins in measured cycles:
```
[nix-shell:~/motoko]$ ghc -e "100-27424377/28288331*100"
3.0541002931562105
```
About 3% cycle reduction on GC-heavy code. More benchmarks in the PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants