
sc-consensus-beefy: pump gossip engine while waiting for initialization conditions #3435

Merged: 15 commits into paritytech:master from fix-beefy-gossip-error-message, Feb 22, 2024

Conversation

acatangiu (Contributor)

As part of BEEFY worker/voter initialization, the task waits for certain chain and backend conditions to be fulfilled:

  • BEEFY consensus is enabled on-chain and the GRANDPA best finalized block is higher than the on-chain BEEFY genesis block,
  • the backend has synced headers for BEEFY mandatory blocks between best BEEFY and best GRANDPA.

During this waiting time, any messages gossiped on the BEEFY topic for the current chain are enqueued in the gossip engine, bloating RAM and emitting warning/error log messages whenever the wait is non-negligible (e.g. during a clean sync).

This PR adds logic to pump the gossip engine while waiting on these conditions, ensuring gossiped messages are consumed (and effectively discarded until the worker is fully initialized), as in the sketch below.
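For illustration only, a minimal sketch of the pattern under assumed names (`wait_for_initialization_conditions`, the comms fields, and the executor setup are hypothetical stand-ins, not the actual sc-consensus-beefy API):

```rust
use futures::{executor::block_on, future, stream, FutureExt, StreamExt};

// Hypothetical stand-in for the real chain/backend checks.
async fn wait_for_initialization_conditions() { /* ... */ }

fn main() {
    block_on(async {
        let init = wait_for_initialization_conditions().fuse();
        futures::pin_mut!(init);

        // Stand-ins for the gossip report stream and the gossip engine future.
        let mut gossip_report_stream = stream::pending::<()>().fuse();
        let mut gossip_engine = future::pending::<()>().fuse();

        loop {
            futures::select! {
                // Initialization conditions met: proceed to the voter loop.
                _ = init => break,
                // Pump peer reports so they do not accumulate while waiting.
                _ = gossip_report_stream.next() => continue,
                // Pump the gossip engine so enqueued messages are consumed
                // (and effectively discarded) instead of bloating RAM.
                _ = gossip_engine => {
                    eprintln!("gossip engine has unexpectedly terminated");
                    return;
                },
            }
        }
        // ... full worker initialization and the main loop would follow ...
    });
}
```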

It also raises the warning threshold for enqueued messages from 10k to 100k, in line with the other gossip protocols on the node.

Fixes #3390

This new threshold is in line with the thresholds of the other gossip
protocols. On Kusama, where ~1000 validators each gossip at least one
message per round, the 10k threshold is hit during node restarts because
of the expected lag between gossip engine initialization and BEEFY worker
initialization. During this lag, more than 10k messages build up and an
error message is emitted. With this commit, that spurious error goes away.
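Purely as an illustration of the kind of change involved (the constant name and logging below are hypothetical, not the actual gossip code):

```rust
// Hypothetical warn-threshold check; the real constant lives in the
// node's gossip code and was raised from 10_000 to 100_000.
const ENQUEUED_MESSAGES_WARN_THRESHOLD: usize = 100_000;

fn maybe_warn_on_queue_size(queued: usize) {
    if queued > ENQUEUED_MESSAGES_WARN_THRESHOLD {
        eprintln!("gossip: unexpectedly large message queue: {queued}");
    }
}

fn main() {
    // With ~1000 Kusama validators gossiping at least one message per
    // round, a restart lag of a few rounds easily exceeded the old 10k
    // threshold; 100k leaves comfortable headroom.
    maybe_warn_on_queue_size(15_000); // would have warned at the old 10k limit
}
```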

Signed-off-by: Adrian Catangiu <adrian@parity.io>
@acatangiu added the T0-node label (“This PR/Issue is related to the topic ‘node’.”) Feb 21, 2024
@acatangiu requested review from a team February 21, 2024 17:33
@acatangiu self-assigned this Feb 21, 2024
@svyatonik (Contributor) left a comment

Changes LGTM, but perhaps it is worth leaving a couple of "why?" comments everywhere we use the new Action::Ignore. Something like: "The worker is not initialized yet, so messages may still be valid and we should not punish senders". Or even better (though it may result in worse code): add an explicit fn is_worker_active() to Filter; if the worker is not active, return Action::Ignore, otherwise change reputation.

Without that, we may accidentally revert this in the future.
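A rough sketch of the suggested shape (Action, Filter, and the method below are illustrative stand-ins, not the crate's actual types):

```rust
// Illustrative stand-ins; variant names and reputation values are assumptions.
enum Action {
    Ignore,
    DiscardAndPunish(i32), // reputation change applied to the sender
}

struct Filter {
    worker_initialized: bool,
}

impl Filter {
    fn is_worker_active(&self) -> bool {
        self.worker_initialized
    }

    fn handle_invalid_vote(&self) -> Action {
        if !self.is_worker_active() {
            // The worker is not initialized yet, so messages may still be
            // valid and we should not punish senders.
            Action::Ignore
        } else {
            Action::DiscardAndPunish(-10)
        }
    }
}

fn main() {
    let filter = Filter { worker_initialized: false };
    assert!(matches!(filter.handle_invalid_vote(), Action::Ignore));
}
```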

Comment on lines +565 to +573
// Pump peer reports
_ = &mut beefy_comms.gossip_report_stream.next() => {
continue
},
// Pump gossip engine.
_ = &mut beefy_comms.gossip_engine => {
error!(target: LOG_TARGET, "🥩 Gossip engine has unexpectedly terminated.");
return
}
Contributor
I was wondering if we could initialize the BeefyComms, including the gossip engine and gossip report stream, only after BeefyWorkerBuilder::async_initialize finishes, in order to avoid pumping them.

acatangiu (Contributor, Author)

I spent a few days trying to make that happen a few months ago. It should work in theory, but in practice it breaks integration tests.

I've not seen any problems when running local nodes, but integration tests that use sc_network_test fail to find local peers when I initialize the GossipEngine after waiting for BEEFY genesis. I couldn't get to the bottom of it and dropped the idea.

I'd be very happy if someone could make it work :)
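For reference, a rough sketch of the deferred-construction idea discussed above (all names hypothetical; as noted, this ordering broke sc_network_test-based integration tests in practice):

```rust
// Hypothetical shapes; the real BeefyWorkerBuilder API differs.
struct PersistedState;
struct BeefyComms;

async fn async_initialize() -> PersistedState {
    // Wait for on-chain BEEFY genesis and synced mandatory headers.
    PersistedState
}

fn build_comms() -> BeefyComms {
    // Construct the gossip engine + report stream only now, so nothing
    // can queue up during the wait above. In practice, delaying this
    // construction broke peer discovery in integration tests.
    BeefyComms
}

async fn main_loop(_state: PersistedState, _comms: BeefyComms) { /* ... */ }

fn main() {
    futures::executor::block_on(async {
        let state = async_initialize().await;
        let comms = build_comms();
        main_loop(state, comms).await;
    });
}
```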

Contributor

Oh, I see. Would it make sense to open a separate issue for this?

@acatangiu added this pull request to the merge queue Feb 22, 2024
Merged via the queue into paritytech:master with commit 31546c8 Feb 22, 2024
125 of 129 checks passed
@acatangiu deleted the fix-beefy-gossip-error-message branch February 22, 2024 12:11
EgorPopelyaev pushed a commit that referenced this pull request Feb 22, 2024: sc-consensus-beefy: pump gossip engine while waiting for initialization conditions (#3435)
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024: sc-consensus-beefy: pump gossip engine while waiting for initialization conditions (paritytech#3435)
Labels: T0-node

Linked issue (closed by this PR): Kusama: Beefy gossip validator is processing messages too slowly (#3390)