Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up process_crosslinks(...) and get_crosslink_deltas(...) by 10x - 15x in state_sim #314

Merged
merged 5 commits into from
Jul 8, 2019

Conversation

tersec
Copy link
Contributor

@tersec tersec commented Jul 8, 2019

Successor PR to #313 (issues with rebasing conflicts).

This switches inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization.

This takes process_crosslinks(...) processing times down in my testing from 15-16s or so to 1s and from 22-24s to 1.5s. get_crosslink_deltas(...) sees similar speedups, since it was also bottlenecked on get_winning_crosslink_and_attesting_indices(...).

The other big epoch time-consumer is get_attestation_deltas(...), which 4590b69 decreases by 90x or so for me for 512 validators, from 178s to 2s.

Timings I get with 512 validators:

All time are ms
     Average,       StdDev,          Min,          Max,      Samples,         Test
     197.915,       46.600,      118.005,      309.630,          128, Process non-epoch slot with block
    3224.563,     2258.988,     1627.217,     4821.908,            2, Process epoch slot with block
       1.997,        1.181,        0.027,        4.247,          130, Tree-hash block
       8.371,        0.291,        7.516,        9.636,          130, Retrieve committee once using get_crosslink_committee
      64.856,       18.607,       31.645,      113.677,         8320, Combine committee attestations

tersec added 3 commits July 8, 2019 10:06
…_attesting_indices(...) and switch inner/outer loop nesting order to get 10-15x function speedup for 128 and 512 validator cases by avoiding accidentally quadratic behavior, while keeping function signature unchanged and allowing easy ongoing verification of correctness of optimization
@tersec
Copy link
Contributor Author

tersec commented Jul 8, 2019

Carrying over a still-relevant comment, https://github.com/status-im/nim-confutils/blame/2f9598611598c2351458635865e92bc408170037/confutils.nim#L272-L276 explains the CI failures.

…col; fix mainnet MIN_ATTESTATION_INCLUSION_DELAY preset; update get_attestation_deltas(...) to 0.8.0; for 512 validators, get 90x speedup for get_attestation_deltas(...) from 179s to 2s
Copy link
Contributor

@mratsim mratsim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I suppose the main gain was due to lots of seq allocations in inner loops?

There are a couple of GC profiling information available according to https://github.com/nim-lang/Nim/blob/devel/lib/system/gc.nim:

  • -d:memProfiler
  • GC_getStatistics()

It would be interesting to understand how we were stressing th GC before and after this change.

filterIt(attestations, it.data.crosslink == candidate_crosslink),
stateCache)
when false:
let crosslink_balance_uncached =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a comment on the purpose of this when false, so we known that we can remove it if it's accomplished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in b8618f1

@tersec
Copy link
Contributor Author

tersec commented Jul 8, 2019

So I suppose the main gain was due to lots of seq allocations in inner loops?

There are a couple of GC profiling information available according to https://github.com/nim-lang/Nim/blob/devel/lib/system/gc.nim:

* `-d:memProfiler`

* `GC_getStatistics()`

It would be interesting to understand how we were stressing th GC before and after this change.

There are some distinct cases:

d8f63d2 I don't think is mostly working by memory allocations per se (though, it does help). Rather, there were lots of filterIt scans with very sparse results in an inner loop.

61f388f is similar:

        get_attesting_balance(
          state,
          filterIt(attestations, it.data.crosslink == candidate_crosslink),
          stateCache)

wasn't matching most attestations most times, but it was scanning them, and it was O(n^2) overall, because number of attestations is proportional to number of crosslinks. So this turns that into O(n) by avoiding repetitious scans.

The seq to openarray change in that commit is nice, but perf-wise negligible.

776833e actually adds memory allocation/churn. I put it there because the spec does and if it turns out to be problematic or unnecessary, it's an obvious target for removal. In particular, I don't see how it could ever be doing much useful -- it's basically deduplicating a list, but that list is internally generated by us, in a way that if there are duplicates, is already a bug. Because it's a set, order doesn't matter for the spec, so that shouldn't be a problem.

I think 4590b69 is, yes, to your point, substantially operating by reducing memory churn. It's not especially accidentally-quadratic, just, slow, because of the memory issues.

So it's nuanced, and what you point out is part of it, and would indeed be interesting to check, but isn't the whole explanation.

@tersec tersec merged commit 0a2d09e into master Jul 8, 2019
@delete-merged-branch delete-merged-branch bot deleted the pas branch July 8, 2019 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants