This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

sc-consensus-slots tries to claim slot before fully synced #11510

Closed
2 tasks done
nazar-pc opened this issue May 23, 2022 · 9 comments · Fixed by #11547 or #12114

Comments

@nazar-pc
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

After upgrading Substrate from c4f3d02 to 5d3e7c4 I have noticed that sc-consensus-slots tries to claim a slot before the node is fully synced (this might be unrelated to the upgrade; I just haven't seen it before).

In particular, this happens when the node is ~2000 blocks from the sync target:

2022-05-23 16:39:09 [PrimaryChain] ⚙️  Syncing  2.7 bps, target=#6399 (6 peers), best: #4373 (0x67e2…e5ab), finalized #0 (0xe6ec…2830), ⬇ 6.4MiB/s ⬆ 1.8kiB/s    
2022-05-23 16:39:14 [PrimaryChain] ⚙️  Syncing  2.8 bps, target=#6404 (6 peers), best: #4387 (0xbf3d…d0e0), finalized #0 (0xe6ec…2830), ⬇ 6.6MiB/s ⬆ 1.7kiB/s    
2022-05-23 16:39:19 [PrimaryChain] 🚜 Claimed block at slot 1653313159    
2022-05-23 16:39:19 [PrimaryChain] 🙌 Starting consensus session on top of parent 0x2dbdaa63e4e35943398f26034226215483b5305dac70a6346157652c6fed79c6    
2022-05-23 16:39:19 [PrimaryChain] 🎁 Prepared block for proposing at 4402 (0 ms) [hash: 0xa8cfe6a69fb710943b2fe4e43d67d2b7f514e6a6baef7fa5d316beebe5f6e1ef; parent_hash: 0x2dbd…79c6; extrinsics (1): [0xccc7…4f6f]]    
2022-05-23 16:39:19 [PrimaryChain] 🔖 Pre-sealed block for proposal at 4402. Hash now 0x52229917d4f78624128b2c664b32430655599814254c6ed68c62aaa8f6c3b740, previously 0xa8cfe6a69fb710943b2fe4e43d67d2b7f514e6a6baef7fa5d316beebe5f6e1ef.    
2022-05-23 16:39:19 [PrimaryChain] ✨ Imported #4402 (0x5222…b740)    
2022-05-23 16:39:19 [PrimaryChain] ♻️  Reorg on #4402,0x5222…b740 to #4403,0x3a78…a34e, common ancestor #4401,0x2dbd…79c6    
2022-05-23 16:39:19 [PrimaryChain] ✨ Imported #4403 (0x3a78…a34e)    
2022-05-23 16:39:19 [PrimaryChain] 💤 Idle (6 peers), best: #4403 (0x3a78…a34e), finalized #0 (0xe6ec…2830), ⬇ 5.5MiB/s ⬆ 16.9kiB/s    
2022-05-23 16:39:20 [PrimaryChain] 🚜 Claimed block at slot 1653313160    
2022-05-23 16:39:20 [PrimaryChain] 🙌 Starting consensus session on top of parent 0x3a784924b83bad99b213ac6d67fa34a7754fce689439a76d7cf347ffd6a5a34e    
2022-05-23 16:39:20 [PrimaryChain] 🎁 Prepared block for proposing at 4404 (0 ms) [hash: 0x7be8f6c26c515e5aa73c120ffc7036c4d72b26443d76c39bbd6a9e4c09ea956f; parent_hash: 0x3a78…a34e; extrinsics (1): [0xd569…606d]]    
2022-05-23 16:39:20 [PrimaryChain] 🔖 Pre-sealed block for proposal at 4404. Hash now 0x2586106376330740f3369d12ba6588e5bdd47088aa09af676cf3730a5948ceaf, previously 0x7be8f6c26c515e5aa73c120ffc7036c4d72b26443d76c39bbd6a9e4c09ea956f.    
2022-05-23 16:39:20 [PrimaryChain] ✨ Imported #4404 (0x2586…ceaf)    
2022-05-23 16:39:20 [PrimaryChain] ♻️  Reorg on #4404,0x2586…ceaf to #4404,0xf711…022d, common ancestor #4403,0x3a78…a34e    
2022-05-23 16:39:20 [PrimaryChain] ✨ Imported #4404 (0xf711…022d)

Whatever the reason for this is, it seems wrong, and I'm not convinced right now that the root cause is in our consensus. I have not yet tried this with just Babe and Substrate's node template; this is our custom consensus protocol, but it uses sc-consensus-slots in a similar way to the Babe consensus.

The node was started with --validator --pruning archive; forced authoring wasn't enabled.

Steps to reproduce

  • Have a few thousand blocks to sync
  • Wait for the node to sync to within ~2000 blocks of the tip
  • Observe the node trying to claim slots
@bkchr
Member

bkchr commented May 24, 2022

@nazar-pc
Contributor Author

nazar-pc commented May 24, 2022

Hm, but it does seem to skip already, doesn't it?

Ah, I see what you mean. Yeah, it doesn't make sense to me either 🤔

@nazar-pc
Contributor Author

According to @liuchengxu, we found .is_major_syncing() to be unreliable in other contexts as well, so there might be something there to investigate.

@nazar-pc
Contributor Author

I think the tricky part is that is_major_syncing is determined like this:

let is_major_syncing =
    match this.network_service.behaviour_mut().user_protocol_mut().sync_state().state {
        SyncState::Idle => false,
        SyncState::Downloading => true,
    };

While the sync state is determined like this:

// A chain is classified as downloading if the provided best block is
// more than `MAJOR_SYNC_BLOCKS` behind the best queued block.
if n > self.best_queued_number && n - self.best_queued_number > MAJOR_SYNC_BLOCKS.into()
{
    SyncState::Downloading
} else {
    SyncState::Idle
}

And it turns out the maximum number of blocks queued for importing is about the number I saw:

/// Maximum blocks to store in the import queue.
const MAX_IMPORTING_BLOCKS: usize = 2048;

So is_major_syncing should probably be derived from min_queued_number rather than best_queued_number to work properly.
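A minimal sketch of what that could look like, assuming a hypothetical min_queued_number that tracks the lowest block number still sitting in the import queue (an illustration only, not an existing Substrate API):

enum SyncState {
    Idle,
    Downloading,
}

// Hypothetical sketch only: classify the chain as downloading while the best block
// seen on the network is still far ahead of the floor of the import queue, so that
// blocks which are queued but not yet imported no longer hide an ongoing major sync.
fn sync_state(best_seen: u32, min_queued_number: u32) -> SyncState {
    const MAJOR_SYNC_BLOCKS: u32 = 5;
    if best_seen > min_queued_number && best_seen - min_queued_number > MAJOR_SYNC_BLOCKS {
        SyncState::Downloading
    } else {
        SyncState::Idle
    }
}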

@bkchr
Member

bkchr commented May 24, 2022

MAJOR_SYNC_BLOCKS is 5. So, if there is a peer more than 5 blocks ahead of us, we switch to major sync mode.

@nazar-pc
Contributor Author

I'm reading it as follows:

  • is_major_sync = best_peer_number - best_queued_number > 5
  • best_queued_number - client.info().best_number <= 2048

Hence best_number can be up to 2048+5 blocks away from the actual network tip, yet the node would already try to claim slots.
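To make that concrete, a tiny worked example with made-up numbers:

// Numbers chosen for illustration only.
const MAJOR_SYNC_BLOCKS: u32 = 5;
const MAX_IMPORTING_BLOCKS: u32 = 2048;

fn main() {
    let best_peer_number = 6405u32;   // best block advertised by peers
    let best_queued_number = 6400u32; // best block already queued for import
    let local_best_number = 4352u32;  // best block actually imported

    // The gap to the peers' best is only 5, so the node is classified as Idle...
    assert!(best_peer_number - best_queued_number <= MAJOR_SYNC_BLOCKS);
    // ...even though the import queue may still hold up to 2048 unimported blocks.
    assert!(best_queued_number - local_best_number <= MAX_IMPORTING_BLOCKS);
    // Worst case, the imported best lags the network by 2048 + 5 = 2053 blocks.
    println!("worst-case lag: {}", MAX_IMPORTING_BLOCKS + MAJOR_SYNC_BLOCKS);
}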

@liuchengxu
Contributor

Yeah, I can provide more evidence. Use this subspace branch with the following debugging patch: let the first node produce a number of blocks and then spin up another node. As you can see, the second node is connected to the first node, the best seen block is 183, and the local best number is 1, yet the sync state is Idle and it's not major syncing, so it's all confusing to me.

diff --git a/cumulus/client/cirrus-executor/src/bundle_processor.rs b/cumulus/client/cirrus-executor/src/bundle_processor.rs
index e30d52fc5..3d0da16d0 100644
--- a/cumulus/client/cirrus-executor/src/bundle_processor.rs
+++ b/cumulus/client/cirrus-executor/src/bundle_processor.rs
@@ -297,6 +297,19 @@ where
                        return Ok(())
                }

+               let network_status = self.primary_network.status().await.expect("Failed to get network status");
+
+               println!(
+                       "==== sync_state: {:?}, best_seen_block: {:?}, num_sync_peers: {:?}, num_connected_peers: {:?}, num_active_peers: {:?}, best_number: {:?}, is_major_syncing: {:?}",
+                       network_status.sync_state,
+                       network_status.best_seen_block,
+                       network_status.num_sync_peers,
+                       network_status.num_connected_peers,
+                       network_status.num_active_peers,
+                       self.primary_chain_client.info().best_number,
+                       self.primary_network.is_major_syncing()
+               );
+
                // Ideally, the receipt of current block will be included in the next block, i.e., no
                // missing receipts.
                if header_number == best_execution_chain_number + One::one() {
2022-05-25 08:38:57.263  INFO main sc_cli::runner: Subspace    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ✌️  version 0.1.0-106fd2f1a30    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ❤️  by Subspace Labs <https://subspace.network>, 2021-2022    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 📋 Chain specification: Subspace development    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 🏷  Node name: adventurous-basket-1009    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 👤 Role: FULL    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 💾 Database: RocksDb at tmp/chains/subspace_dev/db/full    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ⛓  Native runtime: subspace-0 (subspace-0.tx0.au0)    
2022-05-25 08:38:57.459  INFO main sc_service::client::client: [PrimaryChain] 🔨 Initializing Genesis block/state (state: 0x40f2…5ea7, header-hash: 0xe841…1563)    
2022-05-25 08:38:57.623  INFO main subspace: [PrimaryChain] Starting archiving from genesis    
2022-05-25 08:38:57.648  INFO main subspace: [PrimaryChain] Archiving already produced blocks 0..=0    
2022-05-25 08:38:57.717  WARN main sc_service::config: [PrimaryChain] Using default protocol ID "sup" because none is configured in the chain specs    
2022-05-25 08:38:57.717  INFO main sub-libp2p: [PrimaryChain] 🏷  Local node identity is: 12D3KooWCjz3CbDY65FiJiJqQNtTzudeNDGvKbvBgQu2YE4JWBMp    
2022-05-25 08:38:57.719  INFO main subspace: [PrimaryChain] 🧑‍🌾 Starting Subspace Authorship worker    
2022-05-25 08:38:57.719  INFO main sc_sysinfo: [PrimaryChain] 💻 Operating system: macos    
2022-05-25 08:38:57.719  INFO main sc_sysinfo: [PrimaryChain] 💻 CPU architecture: aarch64    
2022-05-25 08:38:57.719  INFO main sc_service::builder: [PrimaryChain] 📦 Highest known block at #0    
2022-05-25 08:38:57.721  INFO main sc_rpc_server: [PrimaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:58617, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])    
2022-05-25 08:38:57.721  INFO main sc_rpc_server: [PrimaryChain] Running JSON-RPC WS server: addr=127.0.0.1:9000, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])    
2022-05-25 08:38:57.721  WARN main sc_cli::commands::run_cmd: [SecondaryChain] Running in --dev mode, RPC CORS has been disabled.    
2022-05-25 08:38:57.825  INFO main sc_service::client::client: [SecondaryChain] 🔨 Initializing Genesis block/state (state: 0xb40d…c52f, header-hash: 0x750d…9987)    
2022-05-25 08:38:57.826  WARN main sc_service::config: [SecondaryChain] Using default protocol ID "sup" because none is configured in the chain specs    
2022-05-25 08:38:57.826  INFO main sub-libp2p: [SecondaryChain] 🏷  Local node identity is: 12D3KooWSYPG5HxkWJXoCPFCzShhU2BrFqNzxzdbwNV1EUrUqtze    
2022-05-25 08:38:57.915  INFO main sc_sysinfo: [SecondaryChain] 💻 Operating system: macos    
2022-05-25 08:38:57.915  INFO main sc_sysinfo: [SecondaryChain] 💻 CPU architecture: aarch64    
2022-05-25 08:38:57.915  INFO main sc_service::builder: [SecondaryChain] 📦 Highest known block at #0    
2022-05-25 08:38:57.915  INFO tokio-runtime-worker substrate_prometheus_endpoint: [SecondaryChain] 〽️ Prometheus exporter started at 127.0.0.1:9616    
2022-05-25 08:38:57.915  INFO                 main sc_rpc_server: [SecondaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:8845, allowed origins=None    
2022-05-25 08:38:57.915  INFO                 main sc_rpc_server: [SecondaryChain] Running JSON-RPC WS server: addr=127.0.0.1:8846, allowed origins=None    
2022-05-25 08:38:57.917  INFO tokio-runtime-worker libp2p_mdns::behaviour::iface: [PrimaryChain] creating instance on iface 192.168.0.39    
==== sync_state: Idle, best_seen_block: Some(183), num_sync_peers: 1, num_connected_peers: 1, num_active_peers: 0, best_number: 1, is_major_syncing: false
==== sync_state: Idle, best_seen_block: Some(183), num_sync_peers: 1, num_connected_peers: 1, num_active_peers: 0, best_number: 2, is_major_syncing: false

@nazar-pc
Contributor Author

We actually hit a fairly pathological case today with the launch of the new network. We had a few bootstrap nodes with full history available (some 32k blocks) and hundreds of other nodes trying to sync from them.

Not only are nodes trying to solve prematurely as described above when they approach the tip; I think there is another issue, which may or may not need solving, that depends on sync speed and connectivity:

  • node A is connected to an archival node with all the history and is at best block N with target M (M >> N)
  • another node B is not connected to any archival nodes directly and is syncing from other nodes as they get blocks (at least this is how I understand it right now)
  • then node B, which is ~2k blocks away from N (N is its target, since it is not connected to anyone advertising M), tries to solve too
  • if node C is only connected to nodes like B, it will try to solve even earlier than that, making things even worse

So I'm wondering if a node should somehow advertise its target as well, to tell other nodes that it itself isn't done syncing yet; that should also be accounted for somehow during the calculation of is_major_syncing.
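Purely as an illustration of that idea (as far as I know Substrate block announcements carry no such field today; advertised_target here is hypothetical), it could look roughly like this:

// Hypothetical sketch: each peer advertises the target it is itself still syncing
// towards in addition to its best block, and the "best seen" used for the major-sync
// check is taken over those targets rather than over best blocks alone.
struct PeerSyncInfo {
    best_number: u32,
    advertised_target: Option<u32>, // hypothetical field, not part of Substrate today
}

fn best_seen_for_major_sync(peers: &[PeerSyncInfo]) -> Option<u32> {
    peers
        .iter()
        .map(|p| p.best_number.max(p.advertised_target.unwrap_or(0)))
        .max()
}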

@nazar-pc
Contributor Author

On the latest Substrate it still considers major sync to be over when 2k blocks are queued, among other issues 😞

I believe this is because of median > self.best_queued_number in #11547; for the purposes of is_major_syncing at least, it should have been just the best imported number.
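A minimal sketch of the check I mean, with best_imported_number standing in for client.info().best_number (a sketch under that assumption, not the actual #11547 code):

// Hypothetical sketch: compare the median best block seen on the network against the
// best *imported* block rather than the best *queued* block, so queued-but-unimported
// blocks still count as "not synced yet".
fn is_major_syncing(median_seen: u32, best_imported_number: u32) -> bool {
    const MAJOR_SYNC_BLOCKS: u32 = 5;
    median_seen > best_imported_number
        && median_seen - best_imported_number > MAJOR_SYNC_BLOCKS
}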

Current master:

2022-08-26 01:28:40 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:41 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:41 [PrimaryChain] ⚙️  Syncing 186.2 bps, target=#16494 (2 peers), best: #14266 (0x1f1f…43db), finalized #14166 (0x9dc2…9893), ⬇ 2.1MiB/s ⬆ 1.4kiB/s    
2022-08-26 01:28:42 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:43 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:44 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:45 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:46 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:46 [PrimaryChain] 💤 Idle (2 peers), best: #15065 (0x882a…bac5), finalized #14964 (0x4f5e…4989), ⬇ 1.3MiB/s ⬆ 2.2kiB/s    
2022-08-26 01:28:47 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:48 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:49 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:49 [PrimaryChain] ✨ Imported #15544 (0x1726…3a69)    
2022-08-26 01:28:49 [PrimaryChain] ✨ Imported #15545 (0xadd3…f7c2)    

Should I submit a PR leaving the behavior of SyncStatus::best_seen_block as is, but tweaking is_major_syncing?

Also, I believe this should be reopened, since the issue is clearly not fixed: the node will try to claim slots long before it is actually synced.
