Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parachain block skipped by validators on Kusama #950

Open
crystalin opened this issue Sep 13, 2021 · 11 comments
Open

Parachain block skipped by validators on Kusama #950

crystalin opened this issue Sep 13, 2021 · 11 comments

Comments

@crystalin
Copy link

crystalin commented Sep 13, 2021

This ticket is to start thinking of improving the current situation on Kusama.
It happens from time to time that a parachain block is not included in the relaychain for many consecutive blocks.

On average, it happens of 1 minutes to have 1 slot missed (the block is ingested in the relay 18s after instead of the normal 12s).
Also, around once every hour, it misses many slots, taking around 1 minutes to ingest 1 block.

20:18:16.809 INFO tokio-runtime-worker substrate: [🌗] 💤 Idle (88 peers), best: #524119 (0x0af3…7ed4), finalized #524118 (0xabfa…560a), ⬇ 179.5kiB/s ⬆ 300.9kiB/s    
20 18:05.181  INFO tokio-runtime-workersubstrate: [🌗] ✨ Imported #524119 (0x7a18…3b05)    
20:18:04.559 INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #524119 (0x0af3…7ed4)    
20:17:56.220 INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #524119 (0x6388…e541)    
20:17:49.175 INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #524119 (0x8bd2…72ab)    
20:17:38.786 INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #524119 (0xca63…1280)    
20:17:33.258 INFO tokio-runtime-worker substrate: [🌗] ✨ Imported #524119 (0x603b…f23f)    

image
Description:
moonriver Block 1 [ 2%][Ext: 3(Eth: 4)(Z: ...)][Pool: 5(+ 6)(Z ...)][ 7s](hash: 8) by 9

  1. block number
  2. weight percentage (sum of all ExtrinsicSuccess/Failure events)
  3. Number of extrinsics in the block (including inherents and ethereum)
  4. Number of ethereum from those extrinsics
  5. Number of extrinsics in the pool snapshotted at the moment the block is considered best (this data is not from the block itself)
  6. Number of new extrinsics in the pool compare to previous snapshot
  7. Time between this block and the previous one
  8. Hash of the block
  9. Author of the block

I believe this is due to the validators not doing a good job including those blocks.
For information, all those blocks were produced by a different collator each time.

I checked the PoV and those blocks have around ~300kb (reported by PoV log line, so I think it is uncompressed)
It happens more when the blocks are containing many transaction, but the execution time is never over 300ms on decent hardware.

One possible reason could be that the validators are not fast enough to validate the parachain block but with the 300ms time to execute (and it usually takes less time to validate), I kind of doubt it

@crystalin
Copy link
Author

Another example of occurrence to show it happens on heavier block but not always.
image

@bkchr
Copy link
Member

bkchr commented Sep 13, 2021

On average, it happens of 1 minutes to have 1 slot missed (the block is ingested in the relay 18s after instead of the normal 12s).

I think this is approval voting, as the parachain block can only be backed in the relay chain block that builds on the same relay chain block as the parachain block. So, I think this is probably networking related, cc @eskimor

@bkchr
Copy link
Member

bkchr commented Sep 13, 2021

@crystalin can you describe the columns of your images?

@crystalin
Copy link
Author

crystalin commented Sep 13, 2021

moonriver Block 1 [ 2%][Ext: 3(Eth: 4)(Z: ...)][Pool: 5(+ 6)(Z ...)][ 7s](hash: 8) by 9

  1. block number
  2. weight percentage (sum of all ExtrinsicSuccess/Failure events)
  3. Number of extrinsics in the block (including inherents and ethereum)
  4. Number of ethereum from those extrinsics
  5. Number of extrinsics in the pool snapshotted at the moment the block is considered best (this data is not from the block itself)
  6. Number of new extrinsics in the pool compare to previous snapshot
  7. Time between this block and the previous one
  8. Hash of the block
  9. Author of the block

@DICO-TEAM
Copy link

We should notify all node operators to perform certain performance upgrades on their nodes, including network upgrades. Then come to see if the problem is solved.

@eskimor
Copy link
Member

eskimor commented Sep 14, 2021

Yes, scaling up hardware/network should definitely help. Longer term we obviously want to improve performance & get contextual execution in.

I think this is approval voting, as the parachain block can only be backed in the relay chain block that builds on the same relay chain block as the parachain block. So, I think this is probably networking related, cc @eskimor

More availability distribution. Yes could be networking issues or heavy load, in case of the latter distribution taking longer is a good thing as it prevents more severe consequences (blocks not getting on chain at all for example).

That being said, we do see networking problems on Kusama and it is quite likely that those are indeed related to this issue. They are still being investigated.

@crystalin
Copy link
Author

crystalin commented Sep 14, 2021

@eskimor and @bkchr , I think for this case it might not be directly related to the load.
We checked our internal (permanent) networks and we are seeing the same behavior on a non-used (no tx sent) network with permanent connections and validator groups.
This behavior started with our parachain client upgrade to polkadot-v0.9.9 (note that the runtime hasn't changed, and relay were in 9080 (client 0.9.8), but are now in 9090 (client 0.9.9.) and we still see this issue)

I can enable some logs on our validators if you tell me which one would be interesting to track

@eskimor
Copy link
Member

eskimor commented Sep 14, 2021

Can you check the following metrics on your internal network:

avg(polkadot_parachain_peer_count{chain="kusama", protocol="/polkadot/validation/1"})

in comparison to:

avg(polkadot_parachain_desired_peer_count{chain="kusama", protocol="/polkadot/validation/1"})

That would be interesting, also how many nodes are running on that internal network?

@crystalin
Copy link
Author

crystalin commented Sep 14, 2021

There are cases were a full validator group rotation (2 minutes) didn't include any block:

 moonriver Block 529199  [ 7.61%][Ext: 24(Eth: 21)(Z:  4)][Pool:  23(+ 20)(Z   2)][ 11.8s](hash: 0xe6bd5..ddb0) by 0xbe7af..4738
 moonriver Block 529200  [10.52%][Ext:  8(Eth:  5)(Z:  0)][Pool: 130(+109)(Z  27)][144.8s](hash: 0x1d4b9..382b) by 0x6eeb4..b616
 moonriver Block 529201  [28.55%][Ext: 26(Eth: 23)(Z: 18)][Pool: 120(+ 13)(Z  10)][ 11.3s](hash: 0x0bd58..1991) by 0x587e5..f029

The block 529200 was quite big, PoV was ranging from 500kb to 1000kb for that block (different collators):

15:23:06.732  INFO [🌗] PoV size { header: 0.9951171875kb, extrinsics: 14.12890625kb, storage_proof: 938.017578125kb }
15:22:55.055  INFO [🌗] PoV size { header: 0.4326171875kb, extrinsics: 5.7412109375kb, storage_proof: 515.857421875kb }
15:22:55.000  INFO [🌗] PoV size { header: 0.4013671875kb, extrinsics: 5.3583984375kb, storage_proof: 447.1220703125kb }
15:22:54.938  INFO [🌗] PoV size { header: 0.4013671875kb, extrinsics: 5.3583984375kb, storage_proof: 447.1220703125kb }
15:22:54.898  INFO [🌗] PoV size { header: 0.4013671875kb, extrinsics: 5.3583984375kb, storage_proof: 447.1220703125kb }
15:22:31.045  INFO [🌗] PoV size { header: 0.5576171875kb, extrinsics: 6.552734375kb, storage_proof: 713.98828125kb }

@eskimor I'll provide that information quickly

@eskimor
Copy link
Member

eskimor commented Nov 26, 2021

Does this issue still exist?

@crystalin
Copy link
Author

It is , and is also related to paritytech/substrate#10359 (and others)

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants