Cumulus sync performance #9360
I think we might be hitting this bug.

Are you experiencing this only recently? We did some refactoring of the syncing implementation, and those changes don't play too well with some other open issues.
Actually, we often get such reports, but we couldn't reproduce them, so we assumed it was a hardware or network issue. Normally we restore the relay chain DB from a snapshot. Until recently, my colleague and I both synced new nodes from block 0, and we both hit the issue. My colleague's rough debugging:

As for me, I don't see a visible issue on my local server, but my Azure VM has the same problem. I guess my local server's Dell flagship RAID card helps mitigate it, but no such luck for the Azure VM: the sync speed dropped from ~500 bps to ~0 bps.

I tried syncing Polkadot first, then Phala, on that Azure VM (4-core Xeon 2288G, 16 GB RAM, 3 TB premium SSD). The sync was fast at first, and I saw Phala syncing at > 800 bps, but a few hours later it slowed down.
We have gotten a lot of reports like this and have likely identified the issue. It should be fixed in the next release.
Does "the next release" mean 0.9.42? Could you share the related PR?

Yes, #13829
I tried patching that PR into our node, but it doesn't seem to help the syncing speed. Maybe there are other issues?
Can you run the node with trace-level logging enabled?

Uploaded. It seems there are a lot of "too many blocks in the queue" messages.
It looks like there is something wrong with block import. It says "too many blocks in the queue", which prevents syncing from requesting further blocks. If I give you a patch with some more logging, are you able to apply it to your node and run it again with trace logging?
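The backpressure described above can be sketched in a few lines. This is a hypothetical simplification, not Substrate's actual code: the names `MAX_BLOCKS_IN_QUEUE` and `can_request_more` are illustrative, and the limit value is assumed. The point is only that once the import queue is congested, syncing backs off instead of scheduling more block requests.

```rust
// Assumed queue limit for illustration; not the real Substrate constant.
const MAX_BLOCKS_IN_QUEUE: usize = 2048;

// Hypothetical check: syncing only issues new block requests while the
// import queue has room. When it reports "too many blocks in the queue",
// no further requests are scheduled until import drains the queue.
fn can_request_more(blocks_in_queue: usize) -> bool {
    blocks_in_queue < MAX_BLOCKS_IN_QUEUE
}

fn main() {
    assert!(can_request_more(100));   // queue has room, keep requesting
    assert!(!can_request_more(5000)); // "too many blocks in the queue"
    println!("ok");
}
```

This is why slow block import shows up as a sync stall: the queue never drains, so no new requests go out.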
No problem! BTW, our code is based on Polkadot v0.9.41, and I cherry-picked #13829 and #13941.
Here's the patch: logging.txt, cheers! It should reveal whether the communication between the import queue and syncing is lagging. I'll see if I can reproduce this locally as well. There also seems to be something wrong with the code that decides whether a block request is sent to a peer:

If locally I think the peer's best block is 17643313, then I should be able to download block 3491983 from them. @arkpar does this look wrong to you, or am I misunderstanding something about the block request scheduler?
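The invariant being questioned above can be stated as a one-line check. This is a sketch, not the actual scheduler code; `peer_can_serve` is a hypothetical name for the decision the comment describes: a block at height `n` should be requestable from any peer whose advertised best block is at least `n`.

```rust
// Hypothetical form of the check discussed above: a peer can serve a
// block request if its advertised best block is at or past that height.
fn peer_can_serve(peer_best: u64, wanted: u64) -> bool {
    wanted <= peer_best
}

fn main() {
    // Numbers from the thread: peer best 17643313, wanted block 3491983.
    assert!(peer_can_serve(17_643_313, 3_491_983));
    println!("ok");
}
```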
This probably means that block import is slow for whatever reason. I'm not familiar with Phala and how computationally heavy their blocks are supposed to be. Profiling would help, or at least a stack dump. Btw, this has nothing to do with the original issue description.
I was looking at the logs again, and it looks like the relay chain block import is getting stuck (or at least taking a long time). It schedules 64 blocks for import and then imports a few of them, but once it has imported block 3499005 and starts importing the next block, it stalls.
Here's the new log, please take a look.
I can try another sync; the performance regresses once the progress reaches ~50%. Even when the node doesn't trigger the critical problem, it still drops to < 50 bps. I suspect the pruning setting.

Could you try syncing again without that flag?

Sure. I'll upload a new log from node start, then remove the flag. Should I dump the log after the sync speed starts slowing?
This is the log from the start.

Looking at the logs you provided, the issue doesn't seem to be the communication between syncing and the import queue but something else, so I'm not sure it makes a difference whether you capture them from the start or only after the sync has started slowing down.
Another fresh sync on the Azure VM, ParityDB with default settings.

Slightly better than before. Any idea how to profile the node?

After 2 days, the node's syncing speed dropped to ~0 bps again, so the pruning mode is not the cause. Any ideas on how to debug/profile this case?
Can you update that issue instead: paritytech/polkadot-sdk#13? Somebody there can probably help guide you through profiling.
When a cumulus node starts syncing from genesis, or is significantly behind the tip of the network, it syncs two chains in parallel. The best block and finality of the parachain depend on the state of the relay chain, so behaviour during sync depends on which of the two chains syncs faster.

When the parachain syncs and blocks are imported with the NetworkInitialSync origin, the head is moved to the latest imported block regardless of the relay chain state. But there is also logic that sets the head based on the relay chain block once it is imported. So if relay chain import runs behind the parachain, the head is moved backwards and forwards by thousands of blocks, leading to huge DB updates.

When the relay chain falls behind the parachain, the parachain's finality also falls behind. Long unfinalized chains cause further performance issues in the database: on each import, the tree route to the last finalized block is queried from the DB, and when there are 100000 such blocks, import slows down dramatically.
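The cost blow-up described above can be made concrete with a rough model. Assuming, as stated, that each import walks the tree route back to the last finalized block, the per-import cost grows linearly with the unfinalized chain length, so importing N blocks with no finality costs on the order of N² route-walk steps in total. The function name is hypothetical and the model ignores constant factors:

```rust
// Rough cost model: importing block i with finality stuck at block 0
// requires walking ~i ancestors, so total work over N imports is the
// sum 0 + 1 + ... + (N-1), i.e. O(N^2).
fn total_route_steps(unfinalized_blocks: u64) -> u64 {
    (0..unfinalized_blocks).sum()
}

fn main() {
    assert_eq!(total_route_steps(10), 45);
    // 100000 unfinalized blocks => ~5 * 10^9 steps in this model,
    // which is why import grinds toward 0 bps as finality lags.
    assert_eq!(total_route_steps(100_000), 4_999_950_000);
    println!("ok");
}
```

This matches the observed symptom in the thread: sync starts fast and degrades progressively as the unfinalized backlog grows.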