This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

sc-consensus-slots tries to claim slot before fully synced #11510

Closed
2 tasks done
nazar-pc opened this issue May 23, 2022 · 9 comments · Fixed by #11547 or #12114

Comments

@nazar-pc
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

After upgrading Substrate from c4f3d02 to 5d3e7c4 I have noticed that sc-consensus-slots tries to claim a slot before the node is fully synced (this might be unrelated to the upgrade; I just haven't seen it before).

In particular, this happens when the node is ~2000 blocks from the sync target:

2022-05-23 16:39:09 [PrimaryChain] ⚙️  Syncing  2.7 bps, target=#6399 (6 peers), best: #4373 (0x67e2…e5ab), finalized #0 (0xe6ec…2830), ⬇ 6.4MiB/s ⬆ 1.8kiB/s    
2022-05-23 16:39:14 [PrimaryChain] ⚙️  Syncing  2.8 bps, target=#6404 (6 peers), best: #4387 (0xbf3d…d0e0), finalized #0 (0xe6ec…2830), ⬇ 6.6MiB/s ⬆ 1.7kiB/s    
2022-05-23 16:39:19 [PrimaryChain] 🚜 Claimed block at slot 1653313159    
2022-05-23 16:39:19 [PrimaryChain] 🙌 Starting consensus session on top of parent 0x2dbdaa63e4e35943398f26034226215483b5305dac70a6346157652c6fed79c6    
2022-05-23 16:39:19 [PrimaryChain] 🎁 Prepared block for proposing at 4402 (0 ms) [hash: 0xa8cfe6a69fb710943b2fe4e43d67d2b7f514e6a6baef7fa5d316beebe5f6e1ef; parent_hash: 0x2dbd…79c6; extrinsics (1): [0xccc7…4f6f]]    
2022-05-23 16:39:19 [PrimaryChain] 🔖 Pre-sealed block for proposal at 4402. Hash now 0x52229917d4f78624128b2c664b32430655599814254c6ed68c62aaa8f6c3b740, previously 0xa8cfe6a69fb710943b2fe4e43d67d2b7f514e6a6baef7fa5d316beebe5f6e1ef.    
2022-05-23 16:39:19 [PrimaryChain] ✨ Imported #4402 (0x5222…b740)    
2022-05-23 16:39:19 [PrimaryChain] ♻️  Reorg on #4402,0x5222…b740 to #4403,0x3a78…a34e, common ancestor #4401,0x2dbd…79c6    
2022-05-23 16:39:19 [PrimaryChain] ✨ Imported #4403 (0x3a78…a34e)    
2022-05-23 16:39:19 [PrimaryChain] 💤 Idle (6 peers), best: #4403 (0x3a78…a34e), finalized #0 (0xe6ec…2830), ⬇ 5.5MiB/s ⬆ 16.9kiB/s    
2022-05-23 16:39:20 [PrimaryChain] 🚜 Claimed block at slot 1653313160    
2022-05-23 16:39:20 [PrimaryChain] 🙌 Starting consensus session on top of parent 0x3a784924b83bad99b213ac6d67fa34a7754fce689439a76d7cf347ffd6a5a34e    
2022-05-23 16:39:20 [PrimaryChain] 🎁 Prepared block for proposing at 4404 (0 ms) [hash: 0x7be8f6c26c515e5aa73c120ffc7036c4d72b26443d76c39bbd6a9e4c09ea956f; parent_hash: 0x3a78…a34e; extrinsics (1): [0xd569…606d]]    
2022-05-23 16:39:20 [PrimaryChain] 🔖 Pre-sealed block for proposal at 4404. Hash now 0x2586106376330740f3369d12ba6588e5bdd47088aa09af676cf3730a5948ceaf, previously 0x7be8f6c26c515e5aa73c120ffc7036c4d72b26443d76c39bbd6a9e4c09ea956f.    
2022-05-23 16:39:20 [PrimaryChain] ✨ Imported #4404 (0x2586…ceaf)    
2022-05-23 16:39:20 [PrimaryChain] ♻️  Reorg on #4404,0x2586…ceaf to #4404,0xf711…022d, common ancestor #4403,0x3a78…a34e    
2022-05-23 16:39:20 [PrimaryChain] ✨ Imported #4404 (0xf711…022d)

Whatever the reason for this is, it seems wrong, and I'm not convinced right now that the root cause is in our consensus. I have not yet tried this with just Babe and Substrate's node template; this is our custom consensus protocol, but it uses sc-consensus-slots in a similar way to the Babe consensus.

The node was started with --validator --pruning archive; forced authoring wasn't enabled.

Steps to reproduce

  • Have a few thousand blocks to sync
  • Wait for the node to sync to within ~2000 blocks of the tip
  • Observe the node trying to claim slots
@bkchr
Member

bkchr commented May 24, 2022

@nazar-pc
Contributor Author

nazar-pc commented May 24, 2022

Hm, but it does seem to skip already, doesn't it?

Ah, I see what you mean. Yeah, it doesn't make sense to me either 🤔

@nazar-pc
Contributor Author

According to @liuchengxu, we found .is_major_syncing() to be unreliable in other contexts as well, so there might be something there to investigate.

@nazar-pc
Contributor Author

I think the tricky part is that is_major_syncing is determined like this:

let is_major_syncing =
    match this.network_service.behaviour_mut().user_protocol_mut().sync_state().state {
        SyncState::Idle => false,
        SyncState::Downloading => true,
    };

While the sync state is determined like this:

// A chain is classified as downloading if the provided best block is
// more than `MAJOR_SYNC_BLOCKS` behind the best queued block.
if n > self.best_queued_number && n - self.best_queued_number > MAJOR_SYNC_BLOCKS.into()
{
    SyncState::Downloading
} else {
    SyncState::Idle
}

And it turns out the maximum number of blocks queued for importing is about the number I saw:

/// Maximum blocks to store in the import queue.
const MAX_IMPORTING_BLOCKS: usize = 2048;

So is_major_syncing should probably be derived from min_queued_number rather than best_queued_number to work properly.
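A minimal sketch of what that could look like, assuming a hypothetical min_queued_number that tracks the lowest block number still sitting in the import queue (an illustration only, not an existing Substrate API):

enum SyncState {
    Idle,
    Downloading,
}

// Hypothetical sketch only: classify the chain as downloading while the best block
// seen on the network is still far ahead of the floor of the import queue, so that
// blocks which are queued but not yet imported no longer hide an ongoing major sync.
fn sync_state(best_seen: u32, min_queued_number: u32) -> SyncState {
    const MAJOR_SYNC_BLOCKS: u32 = 5;
    if best_seen > min_queued_number && best_seen - min_queued_number > MAJOR_SYNC_BLOCKS {
        SyncState::Downloading
    } else {
        SyncState::Idle
    }
}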

@bkchr
Member

bkchr commented May 24, 2022

MAJOR_SYNC_BLOCKS is 5. So, if there is a peer more than 5 blocks ahead of us, we switch to major sync mode.

@nazar-pc
Contributor Author

I'm reading it as follows:

  • is_major_sync = best_peer_number - best_queued_number > 5
  • best_queued_number - client.info().best_number <= 2048

Hence best_number can be up to 2048+5 blocks away from the actual network tip, yet the node would already try to claim slots.
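To make that concrete, a tiny worked example with made-up numbers:

// Numbers chosen for illustration only.
const MAJOR_SYNC_BLOCKS: u32 = 5;
const MAX_IMPORTING_BLOCKS: u32 = 2048;

fn main() {
    let best_peer_number = 6405u32;   // best block advertised by peers
    let best_queued_number = 6400u32; // best block already queued for import
    let local_best_number = 4352u32;  // best block actually imported

    // The gap to the peers' best is only 5, so the node is classified as Idle...
    assert!(best_peer_number - best_queued_number <= MAJOR_SYNC_BLOCKS);
    // ...even though the import queue may still hold up to 2048 unimported blocks.
    assert!(best_queued_number - local_best_number <= MAX_IMPORTING_BLOCKS);
    // Worst case, the imported best lags the network by 2048 + 5 = 2053 blocks.
    println!("worst-case lag: {}", MAX_IMPORTING_BLOCKS + MAJOR_SYNC_BLOCKS);
}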

@liuchengxu
Contributor

Yeah, I can provide more evidence. Use this subspace branch with the following debugging patch: let the first node produce a number of blocks and then spin up another node. As you can see, the second node is connected to the first node, the best seen block is 183, and the local best number is 1, yet the sync state is Idle and it's not major syncing, so it's all confusing to me.

diff --git a/cumulus/client/cirrus-executor/src/bundle_processor.rs b/cumulus/client/cirrus-executor/src/bundle_processor.rs
index e30d52fc5..3d0da16d0 100644
--- a/cumulus/client/cirrus-executor/src/bundle_processor.rs
+++ b/cumulus/client/cirrus-executor/src/bundle_processor.rs
@@ -297,6 +297,19 @@ where
                        return Ok(())
                }

+               let network_status = self.primary_network.status().await.expect("Failed to get network status");
+
+               println!(
+                       "==== sync_state: {:?}, best_seen_block: {:?}, num_sync_peers: {:?}, num_connected_peers: {:?}, num_active_peers: {:?}, best_number: {:?}, is_major_syncing: {:?}",
+                       network_status.sync_state,
+                       network_status.best_seen_block,
+                       network_status.num_sync_peers,
+                       network_status.num_connected_peers,
+                       network_status.num_active_peers,
+                       self.primary_chain_client.info().best_number,
+                       self.primary_network.is_major_syncing()
+               );
+
                // Ideally, the receipt of current block will be included in the next block, i.e., no
                // missing receipts.
                if header_number == best_execution_chain_number + One::one() {
2022-05-25 08:38:57.263  INFO main sc_cli::runner: Subspace    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ✌️  version 0.1.0-106fd2f1a30    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ❤️  by Subspace Labs <https://subspace.network>, 2021-2022    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 📋 Chain specification: Subspace development    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 🏷  Node name: adventurous-basket-1009    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 👤 Role: FULL    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: 💾 Database: RocksDb at tmp/chains/subspace_dev/db/full    
2022-05-25 08:38:57.263  INFO main sc_cli::runner: ⛓  Native runtime: subspace-0 (subspace-0.tx0.au0)    
2022-05-25 08:38:57.459  INFO main sc_service::client::client: [PrimaryChain] 🔨 Initializing Genesis block/state (state: 0x40f2…5ea7, header-hash: 0xe841…1563)    
2022-05-25 08:38:57.623  INFO main subspace: [PrimaryChain] Starting archiving from genesis    
2022-05-25 08:38:57.648  INFO main subspace: [PrimaryChain] Archiving already produced blocks 0..=0    
2022-05-25 08:38:57.717  WARN main sc_service::config: [PrimaryChain] Using default protocol ID "sup" because none is configured in the chain specs    
2022-05-25 08:38:57.717  INFO main sub-libp2p: [PrimaryChain] 🏷  Local node identity is: 12D3KooWCjz3CbDY65FiJiJqQNtTzudeNDGvKbvBgQu2YE4JWBMp    
2022-05-25 08:38:57.719  INFO main subspace: [PrimaryChain] 🧑‍🌾 Starting Subspace Authorship worker    
2022-05-25 08:38:57.719  INFO main sc_sysinfo: [PrimaryChain] 💻 Operating system: macos    
2022-05-25 08:38:57.719  INFO main sc_sysinfo: [PrimaryChain] 💻 CPU architecture: aarch64    
2022-05-25 08:38:57.719  INFO main sc_service::builder: [PrimaryChain] 📦 Highest known block at #0    
2022-05-25 08:38:57.721  INFO main sc_rpc_server: [PrimaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:58617, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])    
2022-05-25 08:38:57.721  INFO main sc_rpc_server: [PrimaryChain] Running JSON-RPC WS server: addr=127.0.0.1:9000, allowed origins=Some(["http://localhost:*", "http://127.0.0.1:*", "https://localhost:*", "https://127.0.0.1:*", "https://polkadot.js.org"])    
2022-05-25 08:38:57.721  WARN main sc_cli::commands::run_cmd: [SecondaryChain] Running in --dev mode, RPC CORS has been disabled.    
2022-05-25 08:38:57.825  INFO main sc_service::client::client: [SecondaryChain] 🔨 Initializing Genesis block/state (state: 0xb40d…c52f, header-hash: 0x750d…9987)    
2022-05-25 08:38:57.826  WARN main sc_service::config: [SecondaryChain] Using default protocol ID "sup" because none is configured in the chain specs    
2022-05-25 08:38:57.826  INFO main sub-libp2p: [SecondaryChain] 🏷  Local node identity is: 12D3KooWSYPG5HxkWJXoCPFCzShhU2BrFqNzxzdbwNV1EUrUqtze    
2022-05-25 08:38:57.915  INFO main sc_sysinfo: [SecondaryChain] 💻 Operating system: macos    
2022-05-25 08:38:57.915  INFO main sc_sysinfo: [SecondaryChain] 💻 CPU architecture: aarch64    
2022-05-25 08:38:57.915  INFO main sc_service::builder: [SecondaryChain] 📦 Highest known block at #0    
2022-05-25 08:38:57.915  INFO tokio-runtime-worker substrate_prometheus_endpoint: [SecondaryChain] 〽️ Prometheus exporter started at 127.0.0.1:9616    
2022-05-25 08:38:57.915  INFO                 main sc_rpc_server: [SecondaryChain] Running JSON-RPC HTTP server: addr=127.0.0.1:8845, allowed origins=None    
2022-05-25 08:38:57.915  INFO                 main sc_rpc_server: [SecondaryChain] Running JSON-RPC WS server: addr=127.0.0.1:8846, allowed origins=None    
2022-05-25 08:38:57.917  INFO tokio-runtime-worker libp2p_mdns::behaviour::iface: [PrimaryChain] creating instance on iface 192.168.0.39    
==== sync_state: Idle, best_seen_block: Some(183), num_sync_peers: 1, num_connected_peers: 1, num_active_peers: 0, best_number: 1, is_major_syncing: false
==== sync_state: Idle, best_seen_block: Some(183), num_sync_peers: 1, num_connected_peers: 1, num_active_peers: 0, best_number: 2, is_major_syncing: false

@nazar-pc
Contributor Author

We actually hit a fairly pathological case today with the launch of the new network. We had a few bootstrap nodes with full history available (some 32k blocks) and hundreds of other nodes trying to sync from them.

Not only are nodes trying to solve prematurely as described above when they approach the tip; I think there is another issue, which may or may not need solving, that depends on sync speed and connectivity:

  • node A is connected to an archival node with all the history and is at best block N with target M (M >> N)
  • another node B is not connected to any archival nodes directly and is syncing from other nodes as they get blocks (at least this is how I understand it right now)
  • then node B, which is ~2k blocks away from N (N is its target, since it is not connected to anyone advertising M), tries to solve too
  • if node C is only connected to nodes like B, it will try to solve even earlier than that, making things even worse

So I'm wondering if a node should somehow advertise its target as well, to tell other nodes that it itself isn't done syncing yet; that should also be accounted for somehow during the calculation of is_major_syncing.
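Purely as an illustration of that idea (as far as I know Substrate block announcements carry no such field today; advertised_target here is hypothetical), it could look roughly like this:

// Hypothetical sketch: each peer advertises the target it is itself still syncing
// towards in addition to its best block, and the "best seen" used for the major-sync
// check is taken over those targets rather than over best blocks alone.
struct PeerSyncInfo {
    best_number: u32,
    advertised_target: Option<u32>, // hypothetical field, not part of Substrate today
}

fn best_seen_for_major_sync(peers: &[PeerSyncInfo]) -> Option<u32> {
    peers
        .iter()
        .map(|p| p.best_number.max(p.advertised_target.unwrap_or(0)))
        .max()
}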

@nazar-pc
Contributor Author

On the latest Substrate it still considers major sync to be over when 2k blocks are queued, among other issues 😞

I believe this is because of median > self.best_queued_number in #11547; for the purposes of is_major_syncing at least, it should have been just the best imported number.
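A minimal sketch of the check I mean, with best_imported_number standing in for client.info().best_number (a sketch under that assumption, not the actual #11547 code):

// Hypothetical sketch: compare the median best block seen on the network against the
// best *imported* block rather than the best *queued* block, so queued-but-unimported
// blocks still count as "not synced yet".
fn is_major_syncing(median_seen: u32, best_imported_number: u32) -> bool {
    const MAJOR_SYNC_BLOCKS: u32 = 5;
    median_seen > best_imported_number
        && median_seen - best_imported_number > MAJOR_SYNC_BLOCKS
}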

Current master:

2022-08-26 01:28:40 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:41 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:41 [PrimaryChain] ⚙️  Syncing 186.2 bps, target=#16494 (2 peers), best: #14266 (0x1f1f…43db), finalized #14166 (0x9dc2…9893), ⬇ 2.1MiB/s ⬆ 1.4kiB/s    
2022-08-26 01:28:42 [PrimaryChain] Is major syncing: true    
2022-08-26 01:28:43 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:44 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:45 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:46 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:46 [PrimaryChain] 💤 Idle (2 peers), best: #15065 (0x882a…bac5), finalized #14964 (0x4f5e…4989), ⬇ 1.3MiB/s ⬆ 2.2kiB/s    
2022-08-26 01:28:47 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:48 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:49 [PrimaryChain] Is major syncing: false    
2022-08-26 01:28:49 [PrimaryChain] ✨ Imported #15544 (0x1726…3a69)    
2022-08-26 01:28:49 [PrimaryChain] ✨ Imported #15545 (0xadd3…f7c2)    

Should I submit a PR leaving the behavior of SyncStatus::best_seen_block as is, but tweaking is_major_syncing?

Also, I believe this should be reopened, since the issue is clearly not fixed: the node will try to claim slots long before it is actually synced.
