Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Cumulus sync performance #9360

Open
arkpar opened this issue Jul 16, 2021 · 24 comments
Open

Cumulus sync performance #9360

arkpar opened this issue Jul 16, 2021 · 24 comments
Labels
I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon.

Comments

@arkpar
Copy link
Member

arkpar commented Jul 16, 2021

When the cumulus node start syncing from genesis or significantly behind the tip of the network, it syncs two chains in parallel. Best block and finality of the parachain depends on the state of the relay chain, so behaviour during the sync depends on which of the two chains syncs faster.

  1. When the parachain syncs and blocks are imported with NetworkInitialSync origin, the head is moved to the latest imported block regardless of the relay chain state. But then, there's also some logic to set the head based on the relay chain block once it is imported. So if the relay chain is importing behind the parachain, the head is moved backwards and forwards by thousands of blocks, leading to huge DB updates.

  2. When the relay chain is falling behind the parachain, finality of the parachain is also falling behind. Long unfinalized chains lead to more performance issues in the database. On each import the tree route to the last finalized block is queried from the db. When there are 100000 such blocks, it slows down import dramatically.

@arkpar arkpar added the I3-bug The node fails to follow expected behavior. label Jul 16, 2021
@jasl
Copy link
Contributor

jasl commented Apr 12, 2023

I think we might meet this bug,
TD; LR, we try to sync a full new node, when syncing progress reaches about half, the speed (bps) drops to 0.x bps,
but we never see this when we restore the relay chain from a snapshot

@altonen
Copy link
Contributor

altonen commented Apr 12, 2023

Are you experiencing this only recently? We did some refactoring for the syncing implementation and those changes don't play too well with other open issues with sc-network which could possibly contribute to lower performance. I will be looking at this issue soon.

@altonen altonen added the U2-some_time_soon Issue is worth doing soon. label Apr 12, 2023
@jasl
Copy link
Contributor

jasl commented Apr 12, 2023

Are you experiencing this only recently? We did some refactoring for the syncing implementation and those changes don't play too well with other open issues with sc-network which could possibly contribute to lower performance. I will be looking at this issue soon.

actually, we often got the reports, but we can't reproduce them... so we just assume this is a hardware or network issue

Normally we restore Relaychain DB from a snapshot

Until recently, my colleague and I both sync new nodes, all from 0, and we meet the issue.

My colleague rough debug:

but when nearly 50% (i think 45 or some) syncing just dead halts

15442 read ops, 58 write op

so its wrting ~30MB/s but kill the io buffer

if i randomly write from the system, i can get 18GB/s easily

so its writing 30MB/s, thats 105GB/hours

after 1 hour only 41GB was written

so 60% of the writes were wasted

also using ANY kind of raid amplifies the issue

e.g. raid 5/6

but raid 0, 1, 10 is fine .... with raid 0 (striping) the issues are less worrysome

As for me, I don't see visible issue on my local server, but my Azure VM has the same problem, I guess my local server has Dell flagship RAID card that helps me mitigate the problem, but unluck for the Azure VM, the sync speed from ~500bps to ~0bps

@jasl
Copy link
Contributor

jasl commented Apr 18, 2023

I tried to sync Polkadot first, then Phala on that Azure VM (4 cores Xeon 2288G, 16G, and 3T premium SSD),

the sync was fast, I can see Phala sync > 800 bps, but a few hours later

it slow now

2023-04-18 08:49:22 [Parachain] ⚙️  Syncing  0.4 bps, target=#3720177 (40 peers), best: #886790 (0x727f…2c41), finalized #0 (0xd435…be8d), ⬇ 33.4kiB/s ⬆ 6.0kiB/s
2023-04-18 08:49:22 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17527415 (40 peers), best: #17296990 (0x85f1…9b9e), finalized #17296896 (0xf29b…4d53), ⬇ 362.6kiB/s ⬆ 219.1kiB/s

@altonen
Copy link
Contributor

altonen commented Apr 18, 2023

We have gotten a lot of reports like this and have likely identified the issue. It should be fixed in the next release.

@jasl
Copy link
Contributor

jasl commented Apr 18, 2023

We have gotten a lot of reports like this and have likely identified the issue. It should be fixed in the next release.

Does the next release mean 0.9.42? Could you share the related PR?

@altonen
Copy link
Contributor

altonen commented Apr 18, 2023

Yes, #13829

@jasl
Copy link
Contributor

jasl commented Apr 25, 2023

Yes, #13829

I tried patching that PR to our node, but it seems no help for syncing speed, maybe there are other issues?

2023-04-25 21:08:43 [Parachain] ⚙️  Syncing  0.4 bps, target=#3770937 (40 peers), best: #819461 (0x3a7f…97a2), finalized #0 (0xd435…be8d), ⬇ 11.8kiB/s ⬆ 5.7kiB/s
2023-04-25 21:08:47 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17641352 (40 peers), best: #3339776 (0x0994…abc9), finalized #3339776 (0x0994…abc9), ⬇ 387.3kiB/s ⬆ 361.9kiB/s
2023-04-25 21:08:48 [Parachain] ⚙️  Syncing  0.0 bps, target=#3770937 (40 peers), best: #819461 (0x3a7f…97a2), finalized #0 (0xd435…be8d), ⬇ 36.4kiB/s ⬆ 5.1kiB/s
2023-04-25 21:08:52 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17641353 (40 peers), best: #3339776 (0x0994…abc9), finalized #3339776 (0x0994…abc9), ⬇ 394.3kiB/s ⬆ 355.7kiB/s
2023-04-25 21:08:53 [Parachain] ⚙️  Syncing  0.8 bps, target=#3770937 (40 peers), best: #819465 (0xd333…9efb), finalized #0 (0xd435…be8d), ⬇ 32.2kiB/s ⬆ 8.5kiB/s

@altonen
Copy link
Contributor

altonen commented Apr 25, 2023

Can you run the node with -lsync=trace

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

Can you run the node with -lsync=trace

khala-node.log.zip

Uploaded

It seems too many

2023-04-26 00:29:53.162 DEBUG tokio-runtime-worker sync: [Relaychain] Too many full nodes, rejecting 12D3KooWLqUHk2Xj1rW6A4CGDnoP3mFBKnDrpzh7SjJBZ68UsS8D
2023-04-26 00:29:53.162 DEBUG tokio-runtime-worker sync: [Relaychain] Failed to register peer PeerId("12D3KooWLqUHk2Xj1rW6A4CGDnoP3mFBKnDrpzh7SjJBZ68UsS8D"): BlockAnnouncesHandshake { roles: FULL, best_number: 17643353, best_hash: 0xa8fcec5c8ef0679cb52e3b9ac22ee49533023b8524a7a237869c5e517cf91535, genesis_hash: 0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe }
2023-04-26 00:29:53.162 TRACE tokio-runtime-worker sync: [Relaychain] Disconnected peer which had earlier been refused by on_sync_peer_connected 12D3KooWGFneh4LCQi4gVUSPcwzWLcdz3a9gBFcMh2xeicB3PAt9
2023-04-26 00:29:53.168 TRACE tokio-runtime-worker sync: [Parachain] 12D3KooWDwrsuC2ZCbJt5BPaFtLpCcNiWpa8y1aGNxJgfAcLboR9 Ignoring transactions while major syncing
2023-04-26 00:29:53.168 TRACE tokio-runtime-worker sync: [Parachain] 12D3KooWRRMxmEoUJmqqPJr7PbDy1XyTzSQDm8JK3SVhUwVqBQkh Ignoring transactions while major syncing
2023-04-26 00:29:53.170 TRACE tokio-runtime-worker sync: [Relaychain] New peer 12D3KooWABrnESXvjBok2hfPv7SoPrRFpfbfzWftePT76QMV22Sh BlockAnnouncesHandshake { roles: AUTHORITY, best_number: 17643353, best_hash: 0xa8fcec5c8ef0679cb52e3b9ac22ee49533023b8524a7a237869c5e517cf91535, genesis_hash: 0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe }
2023-04-26 00:29:53.170 DEBUG tokio-runtime-worker sync: [Relaychain] Too many full nodes, rejecting 12D3KooWABrnESXvjBok2hfPv7SoPrRFpfbfzWftePT76QMV22Sh
2023-04-26 00:29:53.170 DEBUG tokio-runtime-worker sync: [Relaychain] Failed to register peer PeerId("12D3KooWABrnESXvjBok2hfPv7SoPrRFpfbfzWftePT76QMV22Sh"): BlockAnnouncesHandshake { roles: AUTHORITY, best_number: 17643353, best_hash: 0xa8fcec5c8ef0679cb52e3b9ac22ee49533023b8524a7a237869c5e517cf91535, genesis_hash: 0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe }

@altonen
Copy link
Contributor

altonen commented Apr 26, 2023

It looks like there is something wrong with block import. It says "too many blocks in the queue" which prevents ChainSync from downloading new blocks. Interestingly both the relaychain and parachain import blocks so it looks like ChainSync::on_blocks_processed() is not getting called which prevents it from making progress.

If I give you a patch with some more logging, are you able to apply that to your node and run it again with trace logging?

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

It looks like there is something wrong with block import. It says "too many blocks in the queue" which prevents ChainSync from downloading new blocks. Interestingly both the relaychain and parachain import blocks so it looks like ChainSync::on_blocks_processed() is not getting called which prevents it from making progress.

If I give you a patch with some more logging, are you able to apply that to your node and run it again with trace logging?

No problem!

BTW, our code based on Polkadot v0.9.41, and I cherry-picked #13829 and #13941

@altonen
Copy link
Contributor

altonen commented Apr 26, 2023

Here's the patch: logging.txt, cheers! It should reveal if the communication between the import queue and syncing is lagging. I'll see if I can reproduce this locally as well.

But there seems to be something wrong with the code that decides if a block request is sent to peer:

2023-04-26 00:25:49.112 sync: [Relaychain] Finished block announce validation: from PeerId("12D3KooWJBv3FnJyGbqdkzRD3eAkckNkhE2wo4Up3D7tQj6dwdkF"): AnnouncementSummary { block_hash: 0xfe252d7f72be5f9c86bb4b18ac64edd12d486dd10e97260fe2477ea58ece8dd3, number: 17643313, parent_hash: 0x91afc0c125fc26ac549081a204db733bceab525698268ce74612095b318cbe56, state: Some(Normal) }. local_best=false    
...
2023-04-26 00:25:49.283 sync: [Relaychain] Too far ahead for peer 12D3KooWJBv3FnJyGbqdkzRD3eAkckNkhE2wo4Up3D7tQj6dwdkF (3491983)    

If locally I think the peer's best block is 17643313, then I should be able to download block 3491983 from them.

@arkpar Does this look wrong to you or am I misunderstanding something about the block request scheduler?

@arkpar
Copy link
Member Author

arkpar commented Apr 26, 2023

It looks like there is something wrong with block import. It says "too many blocks in the queue" which prevents ChainSync from downloading new blocks. Interestingly both the relaychain and parachain import blocks so it looks like ChainSync::on_blocks_processed() is not getting called which prevents it from making progress.

This probably means that the block import is slow for whatever reason. I'm not familiar with phala and how computationaly heavy their blocks supposed to be. Profiling would help. Or at least a stack dump taken with eu-stack or similar tool.

Btw, this has nothing to do with the original issue description.

@altonen
Copy link
Contributor

altonen commented Apr 26, 2023

I was looking at the logs again and it looks like the relay chain block import is getting stuck (or at least taking a long time).

It schedules 64 blocks for import and then imports a few of them

2023-04-26 00:34:42.191 sync::import-queue: [Relaychain] Scheduling 64 blocks for import    

but when it has imported block 3499005 and starts importing the block with header 0x1eda..., it stops and for the rest of the logs (~20 seconds), import queue doesn't log anything for relay chain and it looks like it's stuck.

2023-04-26 00:34:42.562 sync::import-queue: [Relaychain] Block imported successfully Some(3499005) (0xc1b2…ede4)    
2023-04-26 00:34:42.562 sync::import-queue: [Relaychain] Header 0x1eda…31cb has 2 logs    

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

Here's the patch: logging.txt, cheers! It should reveal if the communication between the import queue and syncing is lagging. I'll see if I can reproduce this locally as well.

But there seems to be something wrong with the code that decides if a block request is sent to peer:

2023-04-26 00:25:49.112 sync: [Relaychain] Finished block announce validation: from PeerId("12D3KooWJBv3FnJyGbqdkzRD3eAkckNkhE2wo4Up3D7tQj6dwdkF"): AnnouncementSummary { block_hash: 0xfe252d7f72be5f9c86bb4b18ac64edd12d486dd10e97260fe2477ea58ece8dd3, number: 17643313, parent_hash: 0x91afc0c125fc26ac549081a204db733bceab525698268ce74612095b318cbe56, state: Some(Normal) }. local_best=false    
...
2023-04-26 00:25:49.283 sync: [Relaychain] Too far ahead for peer 12D3KooWJBv3FnJyGbqdkzRD3eAkckNkhE2wo4Up3D7tQj6dwdkF (3491983)    

If locally I think the peer's best block is 17643313, then I should be able to download block 3491983 from them.

@arkpar Does this look wrong to you or am I misunderstanding something about the block request scheduler?

khala-node.log.zip

Here's the new log, please take a look

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

I can try another sync, the performance will regress when the progress is reach ~50%, even the node doesn't trigger the critical problem, it still will reduce to < 50bps

I'm suspecting --state-pruning archive-canonical casue the issue, our usage requires the node must not prune states

@altonen
Copy link
Contributor

altonen commented Apr 26, 2023

Could you try syncing again without the flag?

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

Sure, I'll try upload a new log from the node start, then remove --state-pruning archive-canonical and retry

Should I dump log after the sync speed start slowing?

@jasl
Copy link
Contributor

jasl commented Apr 26, 2023

khala-node.log

This is the log from start

@altonen
Copy link
Contributor

altonen commented Apr 26, 2023

Sure, I'll try upload a new log from the node start, then remove --state-pruning archive-canonical and retry

Should I dump log after the sync speed start slowing?

Looking at the logs you provided, the issue doesn't seem to be communication between syncing and import queue but something else so I'm not sure it makes a difference whether you capture them from the start or only after the sync has started slowing down.

@jasl
Copy link
Contributor

jasl commented Apr 27, 2023

Another new sync on Azure VM, paritydb with default state-pruning and block-pruning, after nearly a day sync,
the speed reduces to

2023-04-27 11:51:49 [Parachain] ⚙️  Syncing 17.2 bps, target=#3780877 (40 peers), best: #2481802 (0xb67c…a1fb), finalized #0 (0xd435…be8d), ⬇ 36.3kiB/s ⬆ 11.2kiB/s
2023-04-27 11:51:50 [Relaychain] ⚙️  Syncing  0.0 bps, target=#17664467 (40 peers), best: #7540943 (0x8cf2…4e73), finalized #7540736 (0x2100…f186), ⬇ 280.0kiB/s ⬆ 277.3kiB/s

Slight better than archive-canonical but definitely not right

Any idea how to profile the node?

@jasl
Copy link
Contributor

jasl commented Apr 28, 2023

After 2 days, the node syncing speed reduce to ~0 bps again, so pruning mode is not related

Any idea that how to debug / profile this case?

@altonen
Copy link
Contributor

altonen commented Apr 30, 2023

Can you update this issue instead: paritytech/polkadot-sdk#13

Somebody is probably able to help guide you with profiling

@arkpar arkpar changed the title Sync performance Cumulus sync performance May 3, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I3-bug The node fails to follow expected behavior. U2-some_time_soon Issue is worth doing soon.
Projects
None yet
Development

No branches or pull requests

3 participants