-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314
Conversation
…_attesting_indices(...) and switch inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization
Carrying over a still-relevant comment, https://github.com/status-im/nim-confutils/blame/2f9598611598c2351458635865e92bc408170037/confutils.nim#L272-L276 explains the CI failures. |
…col; fix mainnet MIN_ATTESTATION_INCLUSION_DELAY preset; update get_attestation_deltas(...) to 0.8.0; for 512 validators, get 90x speedup for get_attestation_deltas(...) from 179s to 2s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I suppose the main gain was due to lots of seq allocations in inner loops?
There are a couple of GC profiling information available according to https://github.com/nim-lang/Nim/blob/devel/lib/system/gc.nim:
-d:memProfiler
GC_getStatistics()
It would be interesting to understand how we were stressing th GC before and after this change.
filterIt(attestations, it.data.crosslink == candidate_crosslink), | ||
stateCache) | ||
when false: | ||
let crosslink_balance_uncached = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have a comment on the purpose of this when false
, so we known that we can remove it if it's accomplished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in b8618f1
There are some distinct cases: d8f63d2 I don't think is mostly working by memory allocations per se (though, it does help). Rather, there were lots of 61f388f is similar:
wasn't matching most attestations most times, but it was scanning them, and it was O(n^2) overall, because number of attestations is proportional to number of crosslinks. So this turns that into O(n) by avoiding repetitious scans. The 776833e actually adds memory allocation/churn. I put it there because the spec does and if it turns out to be problematic or unnecessary, it's an obvious target for removal. In particular, I don't see how it could ever be doing much useful -- it's basically deduplicating a list, but that list is internally generated by us, in a way that if there are duplicates, is already a bug. Because it's a set, order doesn't matter for the spec, so that shouldn't be a problem. I think 4590b69 is, yes, to your point, substantially operating by reducing memory churn. It's not especially accidentally-quadratic, just, slow, because of the memory issues. So it's nuanced, and what you point out is part of it, and would indeed be interesting to check, but isn't the whole explanation. |
Successor PR to #313 (issues with rebasing conflicts).
This switches inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization.
This takes
process_crosslinks(...)
processing times down in my testing from 15-16s or so to 1s and from 22-24s to 1.5s.get_crosslink_deltas(...)
sees similar speedups, since it was also bottlenecked onget_winning_crosslink_and_attesting_indices(...)
.The other big epoch time-consumer is
get_attestation_deltas(...)
, which 4590b69 decreases by 90x or so for me for 512 validators, from 178s to 2s.Timings I get with 512 validators: