Cumulus sync performance #9360
I think we might be hitting this bug.

Are you experiencing this only recently? We did some refactoring of the syncing implementation, and those changes don't play too well with some other open issues.
Actually, we often get such reports, but we couldn't reproduce them, so we assumed it was a hardware or network issue. Normally we restore the relay chain DB from a snapshot. Until recently, my colleague and I both synced new nodes from block 0, and we both hit the issue. My colleague's rough debugging:

As for me, I don't see a visible issue on my local server, but my Azure VM has the same problem. I guess my local server's Dell flagship RAID card helps mitigate it, but no such luck for the Azure VM: the sync speed dropped from ~500 bps to ~0 bps.

I tried syncing Polkadot first, then Phala, on that Azure VM (4-core Xeon 2288G, 16 GB RAM, 3 TB premium SSD). The sync was fast at first, and I saw Phala syncing at > 800 bps, but a few hours later it slowed down.
We have gotten a lot of reports like this and have likely identified the issue. It should be fixed in the next release.
Does "the next release" mean 0.9.42? Could you share the related PR?

Yes, #13829
I tried patching that PR into our node, but it doesn't seem to help the syncing speed. Maybe there are other issues?
Can you run the node with trace-level logging enabled?

Uploaded. It seems there are a lot of "too many blocks in the queue" messages.
It looks like there is something wrong with block import. It says "too many blocks in the queue", which prevents syncing from requesting further blocks. If I give you a patch with some more logging, are you able to apply it to your node and run it again with trace logging?
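The backpressure described above can be sketched in a few lines. This is a hypothetical simplification, not Substrate's actual code: the names `MAX_BLOCKS_IN_QUEUE` and `can_request_more` are illustrative, and the limit value is assumed. The point is only that once the import queue is congested, syncing backs off instead of scheduling more block requests.

```rust
// Assumed queue limit for illustration; not the real Substrate constant.
const MAX_BLOCKS_IN_QUEUE: usize = 2048;

// Hypothetical check: syncing only issues new block requests while the
// import queue has room. When it reports "too many blocks in the queue",
// no further requests are scheduled until import drains the queue.
fn can_request_more(blocks_in_queue: usize) -> bool {
    blocks_in_queue < MAX_BLOCKS_IN_QUEUE
}

fn main() {
    assert!(can_request_more(100));   // queue has room, keep requesting
    assert!(!can_request_more(5000)); // "too many blocks in the queue"
    println!("ok");
}
```

This is why slow block import shows up as a sync stall: the queue never drains, so no new requests go out.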
No problem! BTW, our code is based on Polkadot v0.9.41, and I cherry-picked #13829 and #13941.
Here's the patch: logging.txt, cheers! It should reveal whether the communication between the import queue and syncing is lagging. I'll see if I can reproduce this locally as well. There also seems to be something wrong with the code that decides whether a block request is sent to a peer:

If locally I think the peer's best block is 17643313, then I should be able to download block 3491983 from them. @arkpar does this look wrong to you, or am I misunderstanding something about the block request scheduler?
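The invariant being questioned above can be stated as a one-line check. This is a sketch, not the actual scheduler code; `peer_can_serve` is a hypothetical name for the decision the comment describes: a block at height `n` should be requestable from any peer whose advertised best block is at least `n`.

```rust
// Hypothetical form of the check discussed above: a peer can serve a
// block request if its advertised best block is at or past that height.
fn peer_can_serve(peer_best: u64, wanted: u64) -> bool {
    wanted <= peer_best
}

fn main() {
    // Numbers from the thread: peer best 17643313, wanted block 3491983.
    assert!(peer_can_serve(17_643_313, 3_491_983));
    println!("ok");
}
```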
This probably means that block import is slow for whatever reason. I'm not familiar with Phala and how computationally heavy their blocks are supposed to be. Profiling would help, or at least a stack dump. Btw, this has nothing to do with the original issue description.
I was looking at the logs again, and it looks like the relay chain block import is getting stuck (or at least taking a long time). It schedules 64 blocks for import and then imports a few of them, but once it has imported block 3499005 and starts importing the next block, it stalls.
Here's the new log, please take a look.
I can try another sync; the performance regresses once the progress reaches ~50%. Even when the node doesn't trigger the critical problem, it still drops to < 50 bps. I suspect the pruning setting.

Could you try syncing again without that flag?

Sure. I'll upload a new log from node start, then remove the flag. Should I dump the log after the sync speed starts slowing?
This is the log from the start.

Looking at the logs you provided, the issue doesn't seem to be the communication between syncing and the import queue but something else, so I'm not sure it makes a difference whether you capture them from the start or only after the sync has started slowing down.
Another fresh sync on the Azure VM, ParityDB with default settings.

Slightly better than before. Any idea how to profile the node?

After 2 days, the node's syncing speed dropped to ~0 bps again, so the pruning mode is not the cause. Any ideas on how to debug/profile this case?
Can you update that issue instead: paritytech/polkadot-sdk#13? Somebody there can probably help guide you through profiling.
When a cumulus node starts syncing from genesis, or is significantly behind the tip of the network, it syncs two chains in parallel. The best block and finality of the parachain depend on the state of the relay chain, so behaviour during sync depends on which of the two chains syncs faster.

When the parachain syncs and blocks are imported with the NetworkInitialSync origin, the head is moved to the latest imported block regardless of the relay chain state. But there is also logic that sets the head based on the relay chain block once it is imported. So if relay chain import runs behind the parachain, the head is moved backwards and forwards by thousands of blocks, leading to huge DB updates.

When the relay chain falls behind the parachain, the parachain's finality also falls behind. Long unfinalized chains cause further performance issues in the database: on each import, the tree route to the last finalized block is queried from the DB, and when there are 100000 such blocks, import slows down dramatically.
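The cost blow-up described above can be made concrete with a rough model. Assuming, as stated, that each import walks the tree route back to the last finalized block, the per-import cost grows linearly with the unfinalized chain length, so importing N blocks with no finality costs on the order of N² route-walk steps in total. The function name is hypothetical and the model ignores constant factors:

```rust
// Rough cost model: importing block i with finality stuck at block 0
// requires walking ~i ancestors, so total work over N imports is the
// sum 0 + 1 + ... + (N-1), i.e. O(N^2).
fn total_route_steps(unfinalized_blocks: u64) -> u64 {
    (0..unfinalized_blocks).sum()
}

fn main() {
    assert_eq!(total_route_steps(10), 45);
    // 100000 unfinalized blocks => ~5 * 10^9 steps in this model,
    // which is why import grinds toward 0 bps as finality lags.
    assert_eq!(total_route_steps(100_000), 4_999_950_000);
    println!("ok");
}
```

This matches the observed symptom in the thread: sync starts fast and degrades progressively as the unfinalized backlog grows.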