Simplify the looping structure of bitmap scanning #2952
Conversation
> at the cost of unrolling the inner loop once

Benchmarking this (cf. #2967) gives the following raw counts:
```
[nix-shell:~/motoko]$ tail compacting_perf.csv compacting-improvement2952_perf.csv
==> compacting_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29548225,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29572634,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29318871,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29305196,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29326984,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29186706,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29178490,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29159337,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,29173610,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,32425028,224,200,16

==> compacting-improvement2952_perf.csv <==
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28796716,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28820437,222,54,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28565764,224,37,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28551354,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28572445,223,36,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28431455,223,26,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28422158,225,23,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28402346,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,28415946,224,22,16
rwlgt-iiaaa-aaaaa-aaaaa-cai,canister_update createProfile,31666572,224,200,16
```
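For example, comparing the cycle counts in the first row of each file with the same kind of `ghc -e` one-liner used for #2967 below prints roughly 2.54, i.e. about a 2.5% cycle reduction (the expression is illustrative; any calculator works):

```
[nix-shell:~/motoko]$ ghc -e "100-28796716/29548225*100"
```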
Added some comments, but I think @osa1 should get a chance to review too.
Nice work!
```
- // Inner loop iterates bits in the current word
- while self.current_word != 0 {
+ // Inner conditional examines the least significant bit(s) in the current word
+ if self.current_word != 0 {
```
I wonder if you can (carefully) refactor this:
```
// Inner conditional examines the least significant bit(s) in the current word
if self.current_word != 0 {
    if self.current_word & 0b1 != 0 {
        let bit_idx = self.current_bit_idx;
        self.current_word >>= 1;
        self.current_bit_idx += 1;
        return bit_idx;
    } else {
        let shift_amt = self.current_word.trailing_zeros();
        self.current_word >>= shift_amt;
        self.current_word >>= 1;
        let bit_idx = self.current_bit_idx + shift_amt;
        self.current_bit_idx = bit_idx + 1;
        return bit_idx;
    }
}
```
to this, pulling out the common continuation:
```
// Inner conditional examines the least significant bit(s) in the current word
if self.current_word != 0 {
    let mut bit_idx = self.current_bit_idx;
    if self.current_word & 0b1 == 0 {
        let shift_amt = self.current_word.trailing_zeros();
        self.current_word >>= shift_amt;
        bit_idx += shift_amt;
    }
    self.current_word >>= 1;
    self.current_bit_idx = bit_idx + 1;
    return bit_idx;
}
```
The next evolution will be:
```
diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 0856533ed..172e34e43 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,21 +167,14 @@ impl BitmapIter {
         // Outer loop iterates 64-bit words
         loop {
-            // Inner conditional examines the least significant bit(s) in the current word
+            // Examine the least significant bit(s) in the current word
             if self.current_word != 0 {
-                if self.current_word & 0b1 != 0 {
-                    let bit_idx = self.current_bit_idx;
-                    self.current_word >>= 1;
-                    self.current_bit_idx += 1;
-                    return bit_idx;
-                } else {
-                    let shift_amt = self.current_word.trailing_zeros();
-                    self.current_word >>= shift_amt;
-                    self.current_word >>= 1;
-                    let bit_idx = self.current_bit_idx + shift_amt;
-                    self.current_bit_idx = bit_idx + 1;
-                    return bit_idx;
-                }
+                let shift_amt = self.current_word.trailing_zeros();
+                self.current_word >>= shift_amt;
+                self.current_word >>= 1;
+                let bit_idx = self.current_bit_idx + shift_amt;
+                self.current_bit_idx = bit_idx + 1;
+                return bit_idx;
             }
             // Move on to next word (always 64-bit boundary)
```
which is even simpler than your solution. But let's merge this.
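For illustration, here is a self-contained sketch of that final single-branch form; `WordIter`, `next_bit`, and `main` are hypothetical scaffolding, and only the field names and shift logic follow the diff above:

```
// Minimal mock of the iterator state from the diff above; the real
// BitmapIter also tracks the bitmap pointer and advances through words.
struct WordIter {
    current_word: u64,
    current_bit_idx: u32,
}

impl WordIter {
    // Index of the next set bit in `current_word`, or None once the word
    // is exhausted (the real code would then load the next 64-bit word).
    fn next_bit(&mut self) -> Option<u32> {
        if self.current_word != 0 {
            let shift_amt = self.current_word.trailing_zeros();
            self.current_word >>= shift_amt; // swallow the trailing zeros
            self.current_word >>= 1; // drop the set bit (two shifts, see below)
            let bit_idx = self.current_bit_idx + shift_amt;
            self.current_bit_idx = bit_idx + 1;
            return Some(bit_idx);
        }
        None
    }
}

fn main() {
    let mut it = WordIter { current_word: 0b1010_0100, current_bit_idx: 0 };
    let mut bits = Vec::new();
    while let Some(b) = it.next_bit() {
        bits.push(b);
    }
    assert_eq!(bits, vec![2, 5, 7]);
}
```

Note that the shift is split in two so that a word whose only set bit is bit 63 never triggers a 64-bit shift; the PR description below explains this.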
I have rewritten the code which finds the next set bit in the marking bitmap (`BitmapIter::next`) to eliminate the inner `if`. This was suggested in #2952. Branchless execution indeed gives nice wins in measured cycles:

```
[nix-shell:~/motoko]$ ghc -e "100-27424377/28288331*100"
3.0541002931562105
```

About 3% cycle reduction on GC-heavy code. More benchmarks in the PR.
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64 bits) is unset, we now swallow all the trailing zeros, updating the bit position and the current word, and return the former. This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word. Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner loop altogether, which is a win especially for sparse bitmaps.

Benchmarks
This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile-creation benchmark is available, and the GC-relevant cycle-count improvement is about 2.5% compared to the baseline 0.6.16 release. More benchmark data and a graph are added to #2952.
Implementation concerns
We have to use two shifts (one by a dynamic count, then one by a static count) because adding one to the dynamic count could result in a shift by 64 bits, which is undefined behaviour, and Rust traps on that.
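A minimal sketch of that edge case (a standalone snippet, not the RTS code):

```
fn main() {
    // A word whose only set bit is the top bit: trailing_zeros() == 63.
    let mut w: u64 = 1 << 63;
    let shift_amt = w.trailing_zeros(); // 63

    // A single combined shift would be `w >> (shift_amt + 1)`, i.e. `w >> 64`,
    // which overflows: Rust panics on it (with overflow checks enabled).
    // let bad = w >> (shift_amt + 1);

    // Two shifts keep each count within 0..=63:
    w >>= shift_amt; // w == 1
    w >>= 1;         // w == 0, word exhausted
    assert_eq!(w, 0);
}
```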
We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse generated code by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!

N.B.: For the branchless optimisation the benchmarks look promising, but I prefer to merge this first.