Use a translated bitmap to enable more efficient marking #2927
Conversation
As far as I understand, the idea here is to move the bitmap address back so that, when we compute the bit/byte index of an object, we can use its absolute address directly. This way we don't need to allocate bits for the static heap, as suggested in #2892. It's faster and doesn't come with a space overhead. I like it.
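A minimal sketch of my reading of the scheme (the `TranslatedBitmap` type, its field, and the 32-byte granularity are illustrative assumptions, not the actual RTS code): the bitmap pointer is moved back by `heap_base / 32` bytes, so indexing with the object's absolute address needs neither a `heap_base` subtraction nor any bits for the static heap.

```rust
// Sketch only: one mark bit per 4-byte heap word, so one bitmap byte covers
// 32 bytes of heap; assumes the heap base is 32-byte aligned.
struct TranslatedBitmap {
    // Points `heap_base / 32` bytes *before* the real allocation, so an
    // absolute object address can be used as the index without translation.
    base: *mut u8,
}

impl TranslatedBitmap {
    unsafe fn set(&self, obj_addr: u32) {
        *self.base.add((obj_addr / 32) as usize) |= 1 << ((obj_addr / 4) % 8);
    }

    unsafe fn get(&self, obj_addr: u32) -> bool {
        (*self.base.add((obj_addr / 32) as usize) & (1 << ((obj_addr / 4) % 8))) != 0
    }
}
```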
Currently I am fighting with the heap's alignment when doing the tests. Here is the principal problem:
The latter will cause the sweep phase to reclaim the wrong objects (or reach into the middle of them, looking for heap tags). The former could easily be compensated for by bumping the mark blob's size by one. One could take the misalignment into account when finding the addressed objects from the bitmap, but that would add computational overhead to the sweep phase, and we don't want that.

UPDATE: I finally settled the matter by allocating a bigger vector, so that we can find a comfortable location for the intended heap inside it that is, by construction, properly aligned.
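A sketch of that over-allocation trick, for posterity (the helper name and constants are hypothetical, not the actual test harness): reserve more bytes than the simulated heap needs and start the heap at the first suitably aligned offset inside the backing vector.

```rust
/// Hypothetical test-side helper: back the simulated heap with `size + align`
/// bytes and return the backing vector together with the first offset at
/// which an `align`-aligned heap of `size` bytes fits.
fn make_aligned_heap(size: usize, align: usize) -> (Vec<u8>, usize) {
    let backing = vec![0u8; size + align];
    let addr = backing.as_ptr() as usize;
    let offset = (align - addr % align) % align;
    (backing, offset)
}

fn main() {
    let (backing, offset) = make_aligned_heap(1024 * 1024, 32);
    // The intended heap occupies backing[offset .. offset + 1024 * 1024]
    // and starts at a 32-byte boundary by construction.
    assert_eq!((backing.as_ptr() as usize + offset) % 32, 0);
}
```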
Looking good! Thanks for grasping the nettle!
rts/motoko-rts/src/memory/ic.rs (Outdated)

@@ -24,13 +24,18 @@ pub(crate) static mut LAST_HP: u32 = 0;
 // Provided by generated code
 extern "C" {
-    pub(crate) fn get_heap_base() -> u32;
+    fn get_heap_base() -> u32;
@osa1 which macro generates this function? Maybe we can generate `get_aligned_heap_base` as well?
This function is generated by the codegen.
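Regardless of where it ends up being generated, the aligned base is just the heap base rounded up; a sketch of what such a wrapper could look like on the RTS side (the 32-byte granularity is my assumption, matching one bitmap byte per eight 4-byte words; this is not necessarily the committed code):

```rust
extern "C" {
    // Provided by generated code.
    fn get_heap_base() -> u32;
}

/// Sketch: round the heap base up to a 32-byte boundary so that it maps to a
/// whole bitmap byte and the translated bitmap indexing stays byte-exact.
pub(crate) unsafe fn get_aligned_heap_base() -> u32 {
    (get_heap_base() + 31) & !31
}
```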
Thanks @ggreif, I really like this idea.

Since the whole point here is performance, it would be good to see benchmark results.

One thing that is worse in this version than in the previous one is the iteration of bitmap words. It can be improved as explained in my inline comments. I think the performance may still be better than master as it is, but the bitmap can be a few MiBs (1 MiB of bitmap addresses 32 MiB of heap). For a nearly full heap the bitmap will be near 134 MB, which requires ~16M 64-bit word iterations. So I think improving cycles there may be worth it.
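To make the sizing concrete, a back-of-the-envelope check (assuming one mark bit per 4-byte word, which is what yields the 1 MiB-per-32 MiB ratio above, and taking a 4 GiB heap as the "near full" case):

```rust
fn main() {
    let heap_bytes: u64 = 4 << 30;         // assumed near-full 4 GiB heap
    let bitmap_bytes = heap_bytes / 4 / 8;  // one bit per 4-byte word
    let words_64 = bitmap_bytes / 8;        // 64-bit words the iterator visits
    println!("bitmap: {} bytes, {} x 64-bit words", bitmap_bytes, words_64);
    // Prints: bitmap: 134217728 bytes, 16777216 x 64-bit words (~16.7M)
}
```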
Force-pushed from b8f7f46 to eaec04d
(eventually) fixes #2892
THIS PASSES TESTS!
and instead fill the heap array such that it gets realigned
Force-pushed from 52da308 to 9872e4d
@ulan can you have a look?
Thanks @ggreif. Added inline comments.
Co-authored-by: Ömer Sinan Ağacan <omeragacan@gmail.com>
Co-authored-by: Ömer Sinan Ağacan <omeragacan@gmail.com>
Co-authored-by: Ömer Sinan Ağacan <omeragacan@gmail.com>
Co-authored-by: Ömer Sinan Ağacan <omeragacan@gmail.com>
Co-authored-by: Ömer Sinan Ağacan <omeragacan@gmail.com>
review feedback
Force-pushed from 2c75bf6 to cf98977
LGTM, but it would be good to see benchmarks. Marking should be more efficient now, but I can't tell whether the new bitmap iterator will be faster or slower, and what its effect on the overall GC will be.

It would be helpful to see the changes in the PR description: for example, the number of instructions in marking or in one iteration of the bitmap iterator, or, even better, actual benchmark results.

What happens to the TODOs in the PR description? If they're for another PR, should we remove them? They will end up in the commit message when this is merged.
                    return bit_idx;
                } else {
                    let shift_amt = self.current_word.trailing_zeros();
                    self.current_word >>= shift_amt;
                    self.bits_left -= shift_amt;
                    self.current_bit_idx += shift_amt;
Wow, we missed an optimisation! No inner loop necessary, just an `if`, since at this place we know that the next bit is set (and `self.current_word != 0`), so we can anticipate the next decisions. We can just advance and return.
Here is a quick implementation of the concept, for posterity.
diff --git a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
index 2607a4baa..234f74de8 100644
--- a/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
+++ b/rts/motoko-rts/src/gc/mark_compact/bitmap.rs
@@ -167,8 +167,8 @@ impl BitmapIter {
         // Outer loop iterates 64-bit words
         loop {
-            // Inner loop iterates bits in the current word
-            while self.current_word != 0 {
+            // Inner conditional examines the least significant bit(s) in the current word
+            if self.current_word != 0 {
                 if self.current_word & 0b1 != 0 {
                     let bit_idx = self.current_bit_idx;
                     self.current_word >>= 1;
@@ -177,7 +177,10 @@ impl BitmapIter {
             } else {
                 let shift_amt = self.current_word.trailing_zeros();
                 self.current_word >>= shift_amt;
-                self.current_bit_idx += shift_amt;
+                self.current_word >>= 1;
+                let bit_idx = self.current_bit_idx + shift_amt;
+                self.current_bit_idx = bit_idx + 1;
+                return bit_idx;
             }
         }
I won't commit this, as the double shift looks ugly. Maybe I can come up with a more elegant way. But this already passes the tests.
NB: the double shift is necessary, because it can happen that `ctz` gives 63, and shifting by 63 + 1 == 64 bits is undefined behaviour. Rust traps on this.
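A standalone illustration of that hazard (a hypothetical helper mirroring the else leg, not the RTS code): splitting the shift into a dynamic part and a constant `>> 1` keeps every individual shift in the 0..=63 range.

```rust
/// Hypothetical helper: consume the lowest set bit of a non-zero word and
/// report its position relative to `base_idx`.
fn take_lowest_set_bit(word: u64, base_idx: u32) -> (u64, u32) {
    debug_assert!(word != 0);
    let shift_amt = word.trailing_zeros(); // 0..=63 for a non-zero word
    // `word >> (shift_amt + 1)` would shift by 64 when shift_amt == 63, which
    // overflows (Rust panics with overflow checks enabled); two shifts avoid it.
    let rest = (word >> shift_amt) >> 1;
    (rest, base_idx + shift_amt)
}

fn main() {
    // Only the highest bit set: shift_amt is 63, still fine with two shifts.
    assert_eq!(take_lowest_set_bit(1 << 63, 0), (0, 63));
    assert_eq!(take_lowest_set_bit(0b1010, 100), (0b10, 101));
}
```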
Yes, there can be surprises, although unlikely. Benchmarks will be added to the MR. I thought I could do that today, but unfortunately I had a canine emergency.
Added instruction balances.
I nuked the TODOs, as these were meant to track my progress and are now irrelevant.
As promised, here come the performance counters for the change
A few quick percentages:

[nix-shell:~/motoko]$ ghc -e '33511297/32425028*100'
103.35009425435192
[nix-shell:~/motoko]$ ghc -e '30258762/29173610*100'
103.71963565702016
[nix-shell:~/motoko]$ ghc -e '30624396/29548225*100'
103.6420834077174

We would regress more than 3% (on performing pure GC) with this PR reverted! Here is a more intuitive graph (with the methodology from #2967).
This PR aims to improve the scanning of the marking bitmap. When the lowest bit in the current word (64 bits) is unset, we now swallow all the trailing zeros, updating the bit position and the current word, and return the former. This amounts to unrolling the inner loop once, because we know that we'll encounter another set bit in the current word. Unrolling comes at the cost of a few more instructions in the else leg, but since we `return`, we can eliminate the inner `loop` altogether, which is a win especially for sparse bitmaps.

## Benchmarks

This optimisation was hinted at in #2927, but not implemented there due to the lack of benchmarking data. Now the `cancan` profile creation benchmark is available, and the GC-relevant cycle-count improvement is

``` shell
[nix-shell:~/motoko]$ ghc -e "100-28402346/29159337*100"
2.5960501090954153
```

about 2.6% compared to the baseline 0.6.16 release. More benchmark data and a graph are added to #2952.

## Implementation concerns

We have to use two shifts (one with a dynamic count and one with a static count) because adding one to the dynamic count could result in a shift of 64 bits, which is undefined behaviour, and Rust traps on that. We could also refrain from testing the lowest bit and go for the `ctz` directly, but that could result in worse generated code by `wasmtime` (?). OTOH that would probably eliminate the `if`, and branchless code is good!

N.B.: For the branchless optimisation the benchmarks look promising, but I prefer to merge this first.
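For reference, a sketch of my reading of that branchless idea (illustrative, not committed code): skip the `& 0b1` test and always go through `trailing_zeros`, which returns 0 when the lowest bit is already set, so both cases collapse into one straight-line path.

```rust
/// Hypothetical branchless step over one non-zero 64-bit bitmap word: returns
/// the absolute index of the lowest set bit, the word with that bit and the
/// zeros below it consumed, and the next starting index.
fn next_set_bit(current_word: u64, current_bit_idx: u32) -> (u32, u64, u32) {
    debug_assert!(current_word != 0);
    let shift_amt = current_word.trailing_zeros(); // 0 if the lowest bit is set
    let word = (current_word >> shift_amt) >> 1;   // two shifts, as explained above
    let bit_idx = current_bit_idx + shift_amt;
    (bit_idx, word, bit_idx + 1)
}

fn main() {
    // A word with bits 2 and 5 set, starting at absolute bit index 64.
    let (idx, word, next) = next_set_bit(0b10_0100, 64);
    assert_eq!((idx, next), (66, 67));
    let (idx2, _, _) = next_set_bit(word, next);
    assert_eq!(idx2, 69);
}
```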
Since the mark-and-compact GC is currently less performant than the copying one, it makes sense to tweak its performance. Here I take the idea from @ulan and run with it. Basically, we now pass the object's absolute word number to `get_bit`/`set_bit` and thus avoid some arithmetic and passing the `heap_base` along.

The marking was sped up by 4 instructions. This is compounded by the fact that we no longer pass the heap base to the marking routine `mark_object`:

- `mark_object`: 120 instructions now vs. the old 123
- each call to `mark_object`: 1 instruction less

This also speeds up the iteration over the bitmap after marking (`BitmapIter::next` now has 92 instructions compared to formerly 110). In particular, the inner loop is now more compact.

Fixes #2892.
Benchmarks will be included in #2927 when available.
Further optimisation opportunities:

- The inner loop in `BitmapIter::next` can be eliminated.