Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beefy mpsc channels clogging after warp-sync #3945

Open
dmitry-markin opened this issue Apr 2, 2024 · 10 comments · Fixed by #4015
Open

Beefy mpsc channels clogging after warp-sync #3945

dmitry-markin opened this issue Apr 2, 2024 · 10 comments · Fixed by #4015
Assignees
Labels
D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Comments

@dmitry-markin
Copy link
Contributor

When warp-syncing kusama node I hit mpsc-notification-to-protocol-2-beefy clogging just after gap sync was finished:

2024-04-02 02:44:19 ⏩ Block history, #22547232 (9 peers), best: #22554774 (0x5c30…bba3), finalized #22554771 (0xae0d…61be), ⬇ 14.3MiB/s ⬆ 832.3kiB/s    
2024-04-02 02:44:24 Notification block pinning limit reached. Unpinning block with hash = 0x6c7f3c009b2a70b0b3a56c5ff2eb2ba136647bc1024a93f74684b8cfb2fb277c    
2024-04-02 02:44:24 ✨ Imported #22554775 (0x344d…d5ed)    
2024-04-02 02:44:24 ⏩ Block history, #22549344 (9 peers), best: #22554775 (0x344d…d5ed), finalized #22554772 (0xe6c4…bd9a), ⬇ 7.6MiB/s ⬆ 793.1kiB/s    
2024-04-02 02:44:24 ✨ Imported #22554775 (0x941a…868f)    
2024-04-02 02:44:25 Report 12D3KooWJhiXUbyqf4r6UrEZ4NM5JwvUYmGX8P7qiN7mnYSCcKGE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-04-02 02:44:26 Block history download is complete.    
2024-04-02 02:44:29 💤 Idle (9 peers), best: #22554775 (0x344d…d5ed), finalized #22554773 (0xec9b…c833), ⬇ 3.0MiB/s ⬆ 843.4kiB/s    
2024-04-02 02:44:30 Notification block pinning limit reached. Unpinning block with hash = 0xc4cb2b08f438b2263ef7e2ae5dd1d8dced7ef874cd76009b040c219d0d2246a6    
2024-04-02 02:44:30 ✨ Imported #22554776 (0x882a…1925)    
2024-04-02 02:44:34 💤 Idle (9 peers), best: #22554776 (0x882a…1925), finalized #22554773 (0xec9b…c833), ⬇ 1.3MiB/s ⬆ 834.8kiB/s    
2024-04-02 02:44:36 Notification block pinning limit reached. Unpinning block with hash = 0x4f3238009ca9ee8b4a9f46356148c58b438853d00a4b431de629aa4ed00fd05b    
2024-04-02 02:44:36 ✨ Imported #22554777 (0x47b2…94f5)    
2024-04-02 02:44:39 💤 Idle (9 peers), best: #22554777 (0x47b2…94f5), finalized #22554775 (0x344d…d5ed), ⬇ 1.4MiB/s ⬆ 1.0MiB/s    
2024-04-02 02:44:42 Notification block pinning limit reached. Unpinning block with hash = 0x18503eb048b773a2ffa6571af372c924591fcb1e513c2324eb5e7f24091b8673    
2024-04-02 02:44:42 ✨ Imported #22554778 (0x2bd0…737f)    
2024-04-02 02:44:43 🥩 Loading BEEFY voter state from genesis on what appears to be first startup. Starting voting rounds at block 21943872, genesis validator set ValidatorSet { <redacted out for brevity> }
2024-04-02 02:44:43 🥩 run BEEFY worker, best grandpa: #22550151.    
2024-04-02 02:44:43 🥩 Error: ConsensusReset. Restarting voter.    
2024-04-02 02:44:43 🥩 BEEFY gadget waiting for BEEFY pallet to become available...    
2024-04-02 02:44:43 🥩 BEEFY pallet available: block 22553555 beefy genesis 21943872    
2024-04-02 02:44:44 💤 Idle (9 peers), best: #22554778 (0x2bd0…737f), finalized #22554775 (0x344d…d5ed), ⬇ 1.4MiB/s ⬆ 635.6kiB/s    
2024-04-02 02:44:48 Notification block pinning limit reached. Unpinning block with hash = 0xb2846c1a20be7885b15fe22a2265f5f21151d397ef004f6f3f20e72724a1b46a    
2024-04-02 02:44:48 ✨ Imported #22554779 (0x7d3e…b57b)    
2024-04-02 02:44:49 💤 Idle (9 peers), best: #22554779 (0x7d3e…b57b), finalized #22554776 (0x882a…1925), ⬇ 1.4MiB/s ⬆ 684.1kiB/s    
2024-04-02 02:44:54 Notification block pinning limit reached. Unpinning block with hash = 0x762848ae55d60fdad297acba72208bf75ac0f1a3759dcaca303bc8ee92268870    
2024-04-02 02:44:54 ✨ Imported #22554780 (0x89f5…8836)    
2024-04-02 02:44:54 Report 12D3KooWJhiXUbyqf4r6UrEZ4NM5JwvUYmGX8P7qiN7mnYSCcKGE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-04-02 02:44:54 💤 Idle (10 peers), best: #22554780 (0x89f5…8836), finalized #22554777 (0x47b2…94f5), ⬇ 1.6MiB/s ⬆ 719.7kiB/s    
2024-04-02 02:44:59 💤 Idle (10 peers), best: #22554780 (0x89f5…8836), finalized #22554777 (0x47b2…94f5), ⬇ 1.3MiB/s ⬆ 359.4kiB/s    
2024-04-02 02:45:00 Notification block pinning limit reached. Unpinning block with hash = 0xf1a59c790a810679810ca53cc510ccad6492f69c09b414b4c2a797b635e191f7    
2024-04-02 02:45:00 ✨ Imported #22554781 (0xbca3…a18d)    
2024-04-02 02:45:01 ✨ Imported #22554781 (0xe665…8722)    
2024-04-02 02:45:04 💤 Idle (10 peers), best: #22554781 (0xbca3…a18d), finalized #22554779 (0x7d3e…b57b), ⬇ 1.6MiB/s ⬆ 1.2MiB/s    
2024-04-02 02:45:06 Notification block pinning limit reached. Unpinning block with hash = 0xd752290b0a114e5ecb2af6d871bc26a2c8e9c0198d4cb143a81e79b344d8d39c    
2024-04-02 02:45:06 ✨ Imported #22554782 (0xc1ff…8619)    
2024-04-02 02:45:09 💤 Idle (10 peers), best: #22554782 (0xc1ff…8619), finalized #22554779 (0x7d3e…b57b), ⬇ 2.8MiB/s ⬆ 631.5kiB/s    
2024-04-02 02:45:12 Notification block pinning limit reached. Unpinning block with hash = 0x6fd202b3fb03d79e79fdea8b8f2dc67a446af9cda3c9091a4d9cc9487936ab17    
2024-04-02 02:45:12 ✨ Imported #22554783 (0xd8ed…8d2f)    
2024-04-02 02:45:12 ✨ Imported #22554783 (0x09bf…fe08)    
2024-04-02 02:45:14 💤 Idle (10 peers), best: #22554783 (0xd8ed…8d2f), finalized #22554780 (0x89f5…8836), ⬇ 1.7MiB/s ⬆ 756.9kiB/s    
2024-04-02 02:45:18 ✨ Imported #22554784 (0x741a…b481)    
2024-04-02 02:45:19 💤 Idle (10 peers), best: #22554784 (0x741a…b481), finalized #22554780 (0x89f5…8836), ⬇ 1.5MiB/s ⬆ 669.2kiB/s    
2024-04-02 02:45:24 Notification block pinning limit reached. Unpinning block with hash = 0x7157ec115f0d86b416144a311ca6c4b38a0ec2f1ab3f797a1d8756ddaca327cb    
2024-04-02 02:45:24 ✨ Imported #22554785 (0xc631…c9d9)    
2024-04-02 02:45:24 💤 Idle (11 peers), best: #22554785 (0xc631…c9d9), finalized #22554781 (0xbca3…a18d), ⬇ 1.7MiB/s ⬆ 896.0kiB/s    
2024-04-02 02:45:25 Report 12D3KooWJhiXUbyqf4r6UrEZ4NM5JwvUYmGX8P7qiN7mnYSCcKGE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-04-02 02:45:29 Notification block pinning limit reached. Unpinning block with hash = 0x0171fa638feb19b2961c840e1d38a997326790976f3bad7a7076d61646fa0206    
2024-04-02 02:45:29 💤 Idle (10 peers), best: #22554785 (0xc631…c9d9), finalized #22554783 (0xd8ed…8d2f), ⬇ 1.6MiB/s ⬆ 480.1kiB/s    
2024-04-02 02:45:30 Notification block pinning limit reached. Unpinning block with hash = 0x8308c089d30c4a3c45c07da4185355a1c3f0bfea293b79d6611ae9422ecc4b81    
2024-04-02 02:45:30 ✨ Imported #22554786 (0x04e6…f3ee)    
2024-04-02 02:45:34 💤 Idle (10 peers), best: #22554786 (0x04e6…f3ee), finalized #22554784 (0x741a…b481), ⬇ 1.4MiB/s ⬆ 716.4kiB/s    
2024-04-02 02:45:36 Notification block pinning limit reached. Unpinning block with hash = 0x18b0c1a190eb92b33f06360bd03ee572ad0f7ea5a0b6c3a77bae1078c8768f0a    
2024-04-02 02:45:36 ✨ Imported #22554787 (0x8de0…b2ad)    
2024-04-02 02:45:39 💤 Idle (10 peers), best: #22554787 (0x8de0…b2ad), finalized #22554784 (0x741a…b481), ⬇ 1.2MiB/s ⬆ 51.1kiB/s    
2024-04-02 02:45:42 Notification block pinning limit reached. Unpinning block with hash = 0x9a042edb5808132c49e70b502f312f883bd2fe0f69c39192850c4967d23d6dde    
2024-04-02 02:45:42 ✨ Imported #22554788 (0xd539…b864)    
2024-04-02 02:45:44 💤 Idle (9 peers), best: #22554788 (0xd539…b864), finalized #22554785 (0xc631…c9d9), ⬇ 1.6MiB/s ⬆ 772.1kiB/s    
2024-04-02 02:45:48 Notification block pinning limit reached. Unpinning block with hash = 0xc1af19ca9fbcb4b21e35610cf1e7388e0f4ad86d0d88cc0c00e7528d298fa2c1    
2024-04-02 02:45:48 ✨ Imported #22554789 (0xa340…5985)    
2024-04-02 02:45:49 💤 Idle (10 peers), best: #22554789 (0xa340…5985), finalized #22554786 (0x04e6…f3ee), ⬇ 1.5MiB/s ⬆ 902.6kiB/s
2024-04-02 02:45:50 The number of unprocessed messages in channel `mpsc-notification-to-protocol-2-beefy` exceeded 100000.
The channel was created at:
   0: sc_utils::mpsc::tracing_unbounded
   1: sc_network::protocol::notifications::service::notification_service
   2: sc_network::config::NonDefaultSetConfig::new
   3: sc_consensus_beefy::communication::beefy_peers_set_config
   4: polkadot_service::new_full
   5: polkadot_service::build_full
   6: polkadot_cli::command::run_node_inner::{{closure}}::{{closure}}
   7: tokio::runtime::runtime::Runtime::block_on
   8: polkadot_cli::command::run
   9: polkadot::main
  10: std::sys_common::backtrace::__rust_begin_short_backtrace
  11: std::rt::lang_start::{{closure}}
  12: std::rt::lang_start_internal
  13: main
  14: <unknown>
  15: __libc_start_main
  16: _start


 Last message was sent from:
   0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
   1: alloc::vec::Vec<T,A>::retain
   2: sc_network::protocol::notifications::service::ProtocolHandle::report_notification_received
   3: sc_network::service::NetworkWorker<B,H>::handle_swarm_event
   4: <futures_util::future::future::fuse::Fuse<Fut> as core::future::future::Future>::poll
   5: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
   6: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   7: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
   8: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
   9: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  10: tokio::runtime::context::runtime::enter_runtime
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
  15: std::sys_common::backtrace::__rust_begin_short_backtrace
  16: core::ops::function::FnOnce::call_once{{vtable.shim}}
  17: std::sys::pal::unix::thread::Thread::new::thread_start
  18: <unknown>
  19: <unknown>

In 15 mins mpsc_beefy_gossip_validator was also clogged:

2024-04-02 03:03:06 ✨ Imported #22554962 (0x26e6…b5cc)    
2024-04-02 03:03:06 ♻️  Reorg on #22554962,0x26e6…b5cc to #22554962,0x364c…474a, common ancestor #22554961,0x1809…a491    
2024-04-02 03:03:06 ✨ Imported #22554962 (0x364c…474a)    
2024-04-02 03:03:09 💤 Idle (10 peers), best: #22554962 (0x364c…474a), finalized #22554958 (0xc80c…ef33), ⬇ 1.4MiB/s ⬆ 733.2kiB/s    
2024-04-02 03:03:12 ✨ Imported #22554963 (0x6033…fc54)    
2024-04-02 03:03:14 💤 Idle (12 peers), best: #22554963 (0x6033…fc54), finalized #22554960 (0x484c…3721), ⬇ 1.4MiB/s ⬆ 634.1kiB/s    
2024-04-02 03:03:17 Report 12D3KooWJhiXUbyqf4r6UrEZ4NM5JwvUYmGX8P7qiN7mnYSCcKGE: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3485 with session_start 22550420    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3486 with session_start 22551018    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3487 with session_start 22551617    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3488 with session_start 22552215    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3489 with session_start 22552815    
2024-04-02 03:03:17 🥩 New Rounds for validator set id: 3490 with session_start 22553413    
2024-04-02 03:03:17 🥩 run BEEFY worker, best grandpa: #22553689.    
2024-04-02 03:03:17 🥩 Error: ConsensusReset. Restarting voter.
2024-04-02 03:03:18 The number of unprocessed messages in channel `mpsc_beefy_gossip_validator` exceeded 100000.
The channel was created at:
   0: sc_utils::mpsc::tracing_unbounded
   1: sc_consensus_beefy::communication::gossip::GossipValidator<B>::new
   2: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
   3: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   4: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
   5: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
   6: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
   7: tokio::runtime::context::runtime::enter_runtime
   8: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
   9: tokio::runtime::task::core::Core<T,S>::poll
  10: tokio::runtime::task::harness::Harness<T,S>::poll
  11: tokio::runtime::blocking::pool::Inner::run
  12: std::sys_common::backtrace::__rust_begin_short_backtrace
  13: core::ops::function::FnOnce::call_once{{vtable.shim}}
  14: std::sys::pal::unix::thread::Thread::new::thread_start
  15: <unknown>
  16: <unknown>


 Last message was sent from:
   0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
   1: <sc_consensus_beefy::communication::gossip::GossipValidator<B> as sc_network_gossip::validator::Validator<B>>::validate
   2: sc_network_gossip::state_machine::ConsensusGossip<B>::on_incoming
   3: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
   4: sc_consensus_beefy::start_beefy_gadget::{{closure}}::{{closure}}::{{closure}}
   5: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   6: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
   7: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   8: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
   9: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
  10: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  11: tokio::runtime::context::runtime::enter_runtime
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
  16: std::sys_common::backtrace::__rust_begin_short_backtrace
  17: core::ops::function::FnOnce::call_once{{vtable.shim}}
  18: std::sys::pal::unix::thread::Thread::new::thread_start
  19: <unknown>
  20: <unknown>

Complete logs: beefy_clogging.txt.
The node was started as: ./polkadot -d=. --pruning=1000 --sync=warp --chain=kusama 2> stderr.txt.

CC @acatangiu

@dmitry-markin dmitry-markin added T0-node This PR/Issue is related to the topic “node”. I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Apr 2, 2024
@dmitry-markin
Copy link
Contributor Author

I see similar issue has already happened in the past: #3390

@dmitry-markin dmitry-markin changed the title Beefy mpsc channels clogging Beefy mpsc channels clogging after warp-sync Apr 2, 2024
@acatangiu
Copy link
Contributor

There’s no bandwidth to investigate this in bridges team now. I suggest someone from @paritytech/sdk-node take it.
(Maybe they can fix it for good as they have more gossip protocol experience)

@lexnv
Copy link
Contributor

lexnv commented Apr 5, 2024

Extra data points here

2024-04-05 01:21:55.019 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol` exceeded 100000.
The channel was created at:
   0: sc_utils::mpsc::tracing_unbounded
   1: sc_network::protocol::notifications::service::notification_service
   2: sc_network::config::NonDefaultSetConfig::new
   3: sc_consensus_beefy::communication::beefy_peers_set_config
   4: polkadot_service::new_full
   5: polkadot_service::build_full
   6: polkadot_cli::command::run_node_inner::{{closure}}::{{closure}}
   7: tokio::runtime::park::CachedParkThread::block_on
   8: tokio::runtime::runtime::Runtime::block_on
   9: polkadot_cli::command::run
  10: polkadot::main
  11: std::sys_common::backtrace::__rust_begin_short_backtrace
  12: std::rt::lang_start::{{closure}}
  13: std::rt::lang_start_internal
  14: main
  15: <unknown>
  16: __libc_start_main
  17: _start


 Last message was sent from:
   0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
   1: alloc::vec::Vec<T,A>::retain
   2: sc_network::protocol::notifications::service::ProtocolHandle::report_notification_received
   3: sc_network::service::NetworkWorker<B,H>::handle_swarm_event
   4: <sc_network::service::NetworkWorker<B,H> as sc_network::service::traits::NetworkBackend<B,H>>::run::{{closure}}
   5: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
   6: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   7: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
   8: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
   9: tokio::runtime::park::CachedParkThread::block_on
  10: tokio::runtime::task::core::Core<T,S>::poll
  11: tokio::runtime::task::harness::Harness<T,S>::poll
  12: tokio::runtime::blocking::pool::Inner::run
  13: std::sys_common::backtrace::__rust_begin_short_backtrace
  14: core::ops::function::FnOnce::call_once{{vtable.shim}}
  15: std::sys::pal::unix::thread::Thread::new::thread_start
  16: <unknown>
  17: <unknown>

bkchr added a commit that referenced this issue Apr 6, 2024
The stream was just used to communicate from the validator the peer
reports back to the gossip engine. Internally the gossip engine just
forwards these reports to the networking engine. So, we can just do this
directly.

The reporting stream was also pumped [in the worker behind the
engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939).
This means if there was a lot of data incoming over the engine, the
reporting stream was almost never processed and thus, it could have
started to grow and we have seen issues around this.

Closes: #3945
@bkchr
Copy link
Member

bkchr commented Apr 6, 2024

Should be fixed by this: #4015

I think we can do even better and not even have the gossip engine running at all when we are syncing/waiting for the pallet. The protocol should only be registered when the node is ready to handle the messages.

@acatangiu
Copy link
Contributor

Should be fixed by this: #4015

I think we can do even better and not even have the gossip engine running at all when we are syncing/waiting for the pallet. The protocol should only be registered when the node is ready to handle the messages.

I tried doing this at some point, but had problems with the tests (spawning multiple BEEFY workers in same test) - registering the gossip engine “later” the workers would not see each other and gossip correctly - again, this was only a tests problem, in real case with single BEEFY worker task per node instance it worked fine.

@bkchr
Copy link
Member

bkchr commented Apr 8, 2024

I just realized that @dmitry-markin posted above to error messages 🙈 So, yeah, my pr is only solving the second one. For the first one we probably need to do what I said above about activating the networking stuff of beefy only when we need it.

github-merge-queue bot pushed a commit that referenced this issue Apr 8, 2024
The stream was just used to communicate from the validator the peer
reports back to the gossip engine. Internally the gossip engine just
forwards these reports to the networking engine. So, we can just do this
directly.

The reporting stream was also pumped [in the worker behind the
engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939).
This means if there was a lot of data incoming over the engine, the
reporting stream was almost never processed and thus, it could have
started to grow and we have seen issues around this.

Partly Closes: #3945
@lexnv lexnv reopened this Apr 8, 2024
s0me0ne-unkn0wn pushed a commit that referenced this issue Apr 8, 2024
The stream was just used to communicate from the validator the peer
reports back to the gossip engine. Internally the gossip engine just
forwards these reports to the networking engine. So, we can just do this
directly.

The reporting stream was also pumped [in the worker behind the
engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939).
This means if there was a lot of data incoming over the engine, the
reporting stream was almost never processed and thus, it could have
started to grow and we have seen issues around this.

Partly Closes: #3945
@bkchr bkchr added the D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. label Apr 8, 2024
Ank4n pushed a commit that referenced this issue Apr 9, 2024
The stream was just used to communicate from the validator the peer
reports back to the gossip engine. Internally the gossip engine just
forwards these reports to the networking engine. So, we can just do this
directly.

The reporting stream was also pumped [in the worker behind the
engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939).
This means if there was a lot of data incoming over the engine, the
reporting stream was almost never processed and thus, it could have
started to grow and we have seen issues around this.

Partly Closes: #3945
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this issue Apr 9, 2024
The stream was just used to communicate from the validator the peer
reports back to the gossip engine. Internally the gossip engine just
forwards these reports to the networking engine. So, we can just do this
directly.

The reporting stream was also pumped [in the worker behind the
engine](https://github.com/paritytech/polkadot-sdk/blob/9d6261892814fa27c97881c0321c008d7340b54b/substrate/client/consensus/beefy/src/worker.rs#L939).
This means if there was a lot of data incoming over the engine, the
reporting stream was almost never processed and thus, it could have
started to grow and we have seen issues around this.

Partly Closes: paritytech#3945
@serban300
Copy link
Contributor

I would be interested in taking a look at this when I have time. I will add myself to the issue, but if anyone else can prioritize it, please feel free to take it.

@serban300
Copy link
Contributor

Looked a bit on this. I thought it was only about creating the GossipEngine at a later stage, but it's more complicated than that. We would need to register/activate the beefy gossiping protocol after the networks starts. This would involve some changes to the networking code.

Anyway, linking the branch with my experiments in order to not forget about it: https://github.com/serban300/polkadot-sdk/tree/beefy-gossip-lazy-init

@alpay-falconx
Copy link

Hey @serban300 - any new findings? I see a similar issue when running a subtensor node on version v1.1.4.

@serban300
Copy link
Contributor

serban300 commented Nov 14, 2024

Hi @alpay-falconx ! No news since the last update. Didn't have bandwidth to prioritize this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D1-medium Can be fixed by a coder with good Rust knowledge but little knowledge of the codebase. I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known. T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
Status: Todo
Development

Successfully merging a pull request may close this issue.

6 participants