Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Revert reversion of authority-discovery enabling #1544

Closed
tomaka opened this issue Aug 5, 2020 · 6 comments · Fixed by #1807
Closed

Revert reversion of authority-discovery enabling #1544

tomaka opened this issue Aug 5, 2020 · 6 comments · Fixed by #1807
Assignees

Comments

@tomaka
Copy link
Contributor

tomaka commented Aug 5, 2020

#1532 should be reverted again.
I don't know under which criteria, though.

@andresilva
Copy link
Contributor

andresilva commented Aug 5, 2020

I don't really have a suggestion on expected absolute numbers, but certainly spawning 4k sockets so that we can connect to an extra 10 authorities is excessive (I know it's because of the DHT querying). With authority discovery enabled the lower bound for the number of sockets on my local node was about 1.4k sockets. We should try to reduce this number before re-enabling it.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 14, 2020

cc paritytech/substrate#5612

@mxinden
Copy link
Contributor

mxinden commented Aug 14, 2020

Documenting insights looking at the metrics of a single sentry node running with --enable-authority-discovery still without libp2p/rust-libp2p#1698:

  • Successful lookup ups are fast. They are either served right from the local cache (< 2 secs) or quickly looked up (~< 20 secs). See yellow line below.

  • Eventually failing lookups are the slow ones. The 95th percentile is around 1 minute. The 1 minute likely correlates with the default query timeout. See blue line below.

  • For the reference number of TCP connections in green (scaled down by / 100).

image

@mxinden
Copy link
Contributor

mxinden commented Aug 21, 2020

With libp2p/rust-libp2p#1698 included in Substrate via paritytech/substrate#6891 idle connections are now properly cleaned up.

Below you see graphs of two nodes, kusama-sentry-ae1-0 running with --enable-authority-discovery and kusama-sentry-ae1-1 without --enable-authority-discovery.

image

While the connection count still spikes periodically (every 10 minutes) to around ~1600 connetions, the connection count returns back to the same level (~220) of node kusama-sentry-ae1-1 within less than 2 minutes. Withinin Substrate we set the soft file descriptor limit to the OS hard one, thus 1600 connections should not be a problem in itself.

image

The CPU load statistics of the two nodes do not show a correlation between periodic lookups and CPU usage.

With the above in mind do you feel comfortable re-enabling authority discovery by default @andresilva @tomaka? Instead of reverting the revert right away I would suggest enabling it on all of Parity's Kusama nodes explicitly for a week and only then rolling it out to all Kusama and Polkadot nodes.

In the future the authority discovery module will need to lookup addresses of past session validators as well. At this point we likely need to spread the lookups over the entire 10 minutes instead of doing them all at once.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 21, 2020

I don't really see how fixing the keep-alive mechanism would significantly reduce the number of TCP connections?
To me it is quite likely that the connections that we were accidentally keeping alive would be reused for the next authority discovery round, and thus don't contribute to increasing the peak of connections.

In the same release where we disabled authority-discovery by default, we also shipped paritytech/substrate#6549 and paritytech/substrate#6822. These two PRs alone might have fixed the problem entirely, but we can't know.

Unfortunately I don't know what the threshold is before validators start complaining about reaching the limit. The Parity nodes in particular didn't reach the fdlimit.
Since the Parity nodes also don't all run the same version and not always the latest version, it's also very painful to find that information on Grafana.

@mxinden
Copy link
Contributor

mxinden commented Aug 24, 2020

Unfortunately I don't know what the threshold is before validators start complaining about reaching the limit.

I have asked the validator community what the file descriptor limit on their nodes is. Sadly only 2 reported back. Both of them had a hard limit of 1_048_576 and a soft limit of 1024. Given that Substrate sets the soft limit to the hard limit (see #1544 (comment)) the latter can be ignored.

In addition I reached out to our internal operations team. Our nodes are running with a hard limit of 524_288.

With the above in mind and the fact that the amount of connections initiated through authority discovery is below 10_000 I am having difficulties understanding how the authority discovery module exhausted the file descriptor limit of a node.


As suggested above I would like to enable the authority discovery module on all Parity nodes once a new Polkadot version with libp2p 0.24.0 (paritytech/substrate#6891) is released. In case those nodes are running without issues I will prepare a pull request reverting #1532 thus the authority discovery module being enabled by default. With paritytech/substrate#6946 Prometheus will alert us once the number of file descriptors on our nodes is high.

This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants