Beefy mpsc channels clogging after warp-sync #3945
I see a similar issue has already happened in the past: #3390
There's no bandwidth in the bridges team to investigate this right now. I suggest someone from @paritytech/sdk-node take it.
Extra data points here:
2024-04-05 01:21:55.019 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol` exceeded 100000.
The channel was created at:
0: sc_utils::mpsc::tracing_unbounded
1: sc_network::protocol::notifications::service::notification_service
2: sc_network::config::NonDefaultSetConfig::new
3: sc_consensus_beefy::communication::beefy_peers_set_config
4: polkadot_service::new_full
5: polkadot_service::build_full
6: polkadot_cli::command::run_node_inner::{{closure}}::{{closure}}
7: tokio::runtime::park::CachedParkThread::block_on
8: tokio::runtime::runtime::Runtime::block_on
9: polkadot_cli::command::run
10: polkadot::main
11: std::sys_common::backtrace::__rust_begin_short_backtrace
12: std::rt::lang_start::{{closure}}
13: std::rt::lang_start_internal
14: main
15: <unknown>
16: __libc_start_main
17: _start
Last message was sent from:
0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
1: alloc::vec::Vec<T,A>::retain
2: sc_network::protocol::notifications::service::ProtocolHandle::report_notification_received
3: sc_network::service::NetworkWorker<B,H>::handle_swarm_event
4: <sc_network::service::NetworkWorker<B,H> as sc_network::service::traits::NetworkBackend<B,H>>::run::{{closure}}
5: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
6: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
7: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
8: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
9: tokio::runtime::park::CachedParkThread::block_on
10: tokio::runtime::task::core::Core<T,S>::poll
11: tokio::runtime::task::harness::Harness<T,S>::poll
12: tokio::runtime::blocking::pool::Inner::run
13: std::sys_common::backtrace::__rust_begin_short_backtrace
14: core::ops::function::FnOnce::call_once{{vtable.shim}}
15: std::sys::pal::unix::thread::Thread::new::thread_start
16: <unknown>
17: <unknown>
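For context, the warning above is produced by Substrate's tracing wrapper around an unbounded channel: it counts queued-but-unprocessed messages and logs once a configured threshold (100000 here) is crossed. Below is a minimal, self-contained sketch of that pattern, not the actual `sc_utils::mpsc` implementation; the `TracingSender`/`TracingReceiver` names, the demo threshold, and the use of `std::sync::mpsc` are illustrative only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};

/// Illustrative wrapper: counts queued messages and warns once a threshold
/// is crossed, similar in spirit to `sc_utils::mpsc::tracing_unbounded`.
struct TracingSender<T> {
    name: &'static str,
    warning_threshold: usize,
    queued: Arc<AtomicUsize>,
    inner: mpsc::Sender<T>,
}

struct TracingReceiver<T> {
    queued: Arc<AtomicUsize>,
    inner: mpsc::Receiver<T>,
}

fn tracing_unbounded<T>(
    name: &'static str,
    warning_threshold: usize,
) -> (TracingSender<T>, TracingReceiver<T>) {
    let (tx, rx) = mpsc::channel();
    let queued = Arc::new(AtomicUsize::new(0));
    (
        TracingSender { name, warning_threshold, queued: queued.clone(), inner: tx },
        TracingReceiver { queued, inner: rx },
    )
}

impl<T> TracingSender<T> {
    fn unbounded_send(&self, msg: T) -> Result<(), mpsc::SendError<T>> {
        self.inner.send(msg)?;
        // If the receiving side is not being polled (e.g. the worker is busy
        // elsewhere), this counter keeps growing and eventually trips the warning.
        if self.queued.fetch_add(1, Ordering::Relaxed) + 1 > self.warning_threshold {
            eprintln!(
                "The number of unprocessed messages in channel `{}` exceeded {}.",
                self.name, self.warning_threshold
            );
        }
        Ok(())
    }
}

impl<T> TracingReceiver<T> {
    fn recv(&self) -> Result<T, mpsc::RecvError> {
        let msg = self.inner.recv()?;
        // Processing a message brings the counter back down.
        self.queued.fetch_sub(1, Ordering::Relaxed);
        Ok(msg)
    }
}

fn main() {
    // Tiny threshold so the warning fires quickly in this demo.
    let (tx, rx) = tracing_unbounded::<u32>("mpsc-notification-to-protocol", 3);
    for i in 0..5 {
        tx.unbounded_send(i).unwrap(); // warns once more than 3 messages are queued
    }
    drop(tx);
    // Draining the receiver consumes the backlog.
    while rx.recv().is_ok() {}
}
```

The clogging reported in this issue is exactly the case where the receiving side stops being drained while senders keep pushing.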
The stream was only used to communicate the peer reports from the validator back to the gossip engine. Internally, the gossip engine just forwards these reports to the networking engine, so we can do this directly. The reporting stream was also pumped [in the worker behind the engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939). This means that if a lot of data was coming in over the engine, the reporting stream was almost never processed, so it could start to grow, and we have seen issues around this. Closes: #3945
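In other words, the fix removes one hop: instead of the validator pushing reports into an unbounded stream that the worker only drained between other work, reports go straight to the component that acts on them. Here is a rough, self-contained sketch of the before/after shape; the `Network`, `PeerReport`, and validator names are illustrative stand-ins, not the actual BEEFY types.

```rust
use std::sync::mpsc;

/// Illustrative stand-in for a peer reputation change.
#[derive(Clone, Copy)]
struct PeerReport { peer: u64, cost_benefit: i32 }

/// Illustrative stand-in for the networking handle that applies reports.
trait Network {
    fn report_peer(&self, report: PeerReport);
}

struct LoggingNetwork;
impl Network for LoggingNetwork {
    fn report_peer(&self, report: PeerReport) {
        println!("reporting peer {} with {}", report.peer, report.cost_benefit);
    }
}

// Before: the validator queued reports on an unbounded channel; the worker had
// to keep polling the receiving end, or the queue would grow without bound.
struct ValidatorWithStream {
    report_tx: mpsc::Sender<PeerReport>,
}
impl ValidatorWithStream {
    fn on_bad_message(&self, peer: u64) {
        let _ = self.report_tx.send(PeerReport { peer, cost_benefit: -10 });
    }
}

// After: the validator holds a handle to the network and reports directly,
// so there is no intermediate queue that can clog when the worker is busy.
struct ValidatorDirect<N: Network> {
    network: N,
}
impl<N: Network> ValidatorDirect<N> {
    fn on_bad_message(&self, peer: u64) {
        self.network.report_peer(PeerReport { peer, cost_benefit: -10 });
    }
}

fn main() {
    let (tx, _rx) = mpsc::channel();
    ValidatorWithStream { report_tx: tx }.on_bad_message(1);
    ValidatorDirect { network: LoggingNetwork }.on_bad_message(2);
}
```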
Should be fixed by this: #4015. I think we can do even better and not have the gossip engine running at all while we are syncing/waiting for the pallet. The protocol should only be registered when the node is ready to handle the messages.
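The lazy-start idea, sketched out: keep the gossip engine unset while the node is still syncing or waiting for the BEEFY pallet, and only construct it (and register the protocol) once the worker is actually ready to consume messages. The names below (`GossipEngine`, `BeefyWorker`, the placeholder protocol string) are purely illustrative; the real initialization lives in the BEEFY worker/gadget, not in a struct like this.

```rust
/// Illustrative stand-in for the real gossip engine.
struct GossipEngine {
    protocol: &'static str,
}
impl GossipEngine {
    fn new(protocol: &'static str) -> Self {
        println!("registering notification protocol `{protocol}`");
        GossipEngine { protocol }
    }
    fn gossip(&self, msg: &str) {
        println!("gossiping `{msg}` on `{}`", self.protocol);
    }
}

/// The worker only instantiates the engine once it is ready; before that,
/// no protocol is registered and no notification stream exists to clog.
struct BeefyWorker {
    gossip: Option<GossipEngine>,
}

impl BeefyWorker {
    fn new() -> Self {
        BeefyWorker { gossip: None }
    }

    /// Called once warp sync is done and the BEEFY pallet is available.
    fn on_ready(&mut self) {
        if self.gossip.is_none() {
            // "/beefy/example" is a placeholder, not the real protocol name.
            self.gossip = Some(GossipEngine::new("/beefy/example"));
        }
    }

    fn on_finality_vote(&mut self, vote: &str) {
        match &self.gossip {
            Some(engine) => engine.gossip(vote),
            // Still syncing: nothing is registered, so nothing can queue up.
            None => println!("not ready yet, ignoring `{vote}`"),
        }
    }
}

fn main() {
    let mut worker = BeefyWorker::new();
    worker.on_finality_vote("vote-1"); // ignored while syncing
    worker.on_ready();                 // protocol registered only now
    worker.on_finality_vote("vote-2"); // gossiped normally
}
```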
I tried doing this at some point, but had problems with the tests (spawning multiple BEEFY workers in the same test): when registering the gossip engine “later”, the workers would not see each other and gossip correctly. Again, this was only a test problem; in the real case, with a single BEEFY worker task per node instance, it worked fine.
I just realized that @dmitry-markin posted two error messages above 🙈 So, yeah, my PR only solves the second one. For the first one, we probably need to do what I said above about activating the networking parts of BEEFY only when we need them.
I would be interested in taking a look at this when I have time. I will add myself to the issue, but if anyone else can prioritize it, please feel free to take it.
Looked a bit at this. I thought it was only about creating the
Anyway, linking the branch with my experiments in order to not forget about it: https://github.com/serban300/polkadot-sdk/tree/beefy-gossip-lazy-init
Hey @serban300 - any new findings? I see a similar issue when running a subtensor node on version
Hi @alpay-falconx! No news since the last update. I didn't have the bandwidth to prioritize this issue.
When warp-syncing a kusama node I hit `mpsc-notification-to-protocol-2-beefy` clogging just after gap sync was finished. In 15 mins `mpsc_beefy_gossip_validator` was also clogged. Complete logs: beefy_clogging.txt.
The node was started as:
./polkadot -d=. --pruning=1000 --sync=warp --chain=kusama 2> stderr.txt
CC @acatangiu