Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

tersec · 2019-07-08T08:17:23Z

Successor PR to #313 (issues with rebasing conflicts).

This switches inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization.

This takes process_crosslinks(...) processing times down in my testing from 15-16s or so to 1s and from 22-24s to 1.5s. get_crosslink_deltas(...) sees similar speedups, since it was also bottlenecked on get_winning_crosslink_and_attesting_indices(...).

The other big epoch time-consumer is get_attestation_deltas(...), which 4590b69 decreases by 90x or so for me for 512 validators, from 178s to 2s.

Timings I get with 512 validators:

All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
     197.915,       46.600,      118.005,      309.630,          128, Process non-epoch slot with block
    3224.563,     2258.988,     1627.217,     4821.908,            2, Process epoch slot with block
       1.997,        1.181,        0.027,        4.247,          130, Tree-hash block
       8.371,        0.291,        7.516,        9.636,          130, Retrieve committee once using get_crosslink_committee
      64.856,       18.607,       31.645,      113.677,         8320, Combine committee attestations

…_attesting_indices(...) and switch inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization

tersec · 2019-07-08T08:18:46Z

Carrying over a still-relevant comment, https://github.com/status-im/nim-confutils/blame/2f9598611598c2351458635865e92bc408170037/confutils.nim#L272-L276 explains the CI failures.

…col; fix mainnet MIN_ATTESTATION_INCLUSION_DELAY preset; update get_attestation_deltas(...) to 0.8.0; for 512 validators, get 90x speedup for get_attestation_deltas(...) from 179s to 2s

mratsim

So I suppose the main gain was due to lots of seq allocations in inner loops?

There are a couple of GC profiling information available according to https://github.com/nim-lang/Nim/blob/devel/lib/system/gc.nim:

-d:memProfiler
GC_getStatistics()

It would be interesting to understand how we were stressing th GC before and after this change.

mratsim · 2019-07-08T12:28:42Z

beacon_chain/spec/state_transition_epoch.nim

-        filterIt(attestations, it.data.crosslink == candidate_crosslink),
-        stateCache)
+    when false:
+      let crosslink_balance_uncached =


We should have a comment on the purpose of this when false, so we known that we can remove it if it's accomplished.

Done in b8618f1

…ccomplished

tersec · 2019-07-08T13:19:08Z

So I suppose the main gain was due to lots of seq allocations in inner loops?

There are a couple of GC profiling information available according to https://github.com/nim-lang/Nim/blob/devel/lib/system/gc.nim:
* `-d:memProfiler`

* `GC_getStatistics()`
It would be interesting to understand how we were stressing th GC before and after this change.

There are some distinct cases:

d8f63d2 I don't think is mostly working by memory allocations per se (though, it does help). Rather, there were lots of filterIt scans with very sparse results in an inner loop.

61f388f is similar:

        get_attesting_balance(
          state,
          filterIt(attestations, it.data.crosslink == candidate_crosslink),
          stateCache)

wasn't matching most attestations most times, but it was scanning them, and it was O(n^2) overall, because number of attestations is proportional to number of crosslinks. So this turns that into O(n) by avoiding repetitious scans.

The seq to openarray change in that commit is nice, but perf-wise negligible.

776833e actually adds memory allocation/churn. I put it there because the spec does and if it turns out to be problematic or unnecessary, it's an obvious target for removal. In particular, I don't see how it could ever be doing much useful -- it's basically deduplicating a list, but that list is internally generated by us, in a way that if there are duplicates, is already a bug. Because it's a set, order doesn't matter for the spec, so that shouldn't be a problem.

I think 4590b69 is, yes, to your point, substantially operating by reducing memory churn. It's not especially accidentally-quadratic, just, slow, because of the memory issues.

So it's nuanced, and what you point out is part of it, and would indeed be interesting to check, but isn't the whole explanation.

tersec added 3 commits July 8, 2019 10:06

avoid memory allocation with an openarray rather than seq

61f388f

update get_crosslink_deltas(...) to 0.8.0

776833e

tersec requested review from arnetheduck and mratsim July 8, 2019 08:17

tersec mentioned this pull request Jul 8, 2019

Speed up process_crosslinks(...) by 10x - 15x in state_sim #313

Closed

move non-spec ValidatorSetDeltaFlags from datatypes.nim to sync_proto…

4590b69

…col; fix mainnet MIN_ATTESTATION_INCLUSION_DELAY preset; update get_attestation_deltas(...) to 0.8.0; for 512 validators, get 90x speedup for get_attestation_deltas(...) from 179s to 2s

mratsim approved these changes Jul 8, 2019

View reviewed changes

explain purpose of when false portions, so they can be removed once a…

b8618f1

…ccomplished

tersec merged commit 0a2d09e into master Jul 8, 2019

delete-merged-branch bot deleted the pas branch July 8, 2019 13:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

tersec commented Jul 8, 2019 •

edited

Loading

tersec commented Jul 8, 2019

mratsim left a comment

mratsim Jul 8, 2019

tersec Jul 8, 2019

tersec commented Jul 8, 2019

Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

Conversation

tersec commented Jul 8, 2019 • edited Loading

tersec commented Jul 8, 2019

mratsim left a comment

Choose a reason for hiding this comment

mratsim Jul 8, 2019

Choose a reason for hiding this comment

tersec Jul 8, 2019

Choose a reason for hiding this comment

tersec commented Jul 8, 2019

tersec commented Jul 8, 2019 •

edited

Loading