Parachain runtime upgrade is triggering stall due to peer reputation issues #888
Comments
@eskimor (Basti told me to mention you :) )
Also, to add to the hypothesis: I think the impact is smaller on Alphanet because, with 80 collators, the time before a collator is selected to produce the next block (and so gets punished again if it is already "banned") is longer, which perhaps allows reputation to recover faster than it is lost... maybe?
Thanks @crystalin - we will look into it!
On our Alphanet network (which was impacted by this same issue), we deployed a version of the relay that does not punish for collation fetching issues: moonbeam-foundation/polkadot@dfd86c0. The collators were all able to recover a few hours later, as you can see from the block production time chart. I'm not sure this is what fixed it, but it could be (I suppose the peer reputation took a few hours to fully recover).
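For illustration, the idea behind that change is roughly the following sketch. All names and the cost value here are hypothetical; they do not reflect the actual code changed in the linked commit:

```rust
// Purely illustrative sketch of "do not punish for collation fetching issues".
// The names and the cost value are hypothetical, not the actual
// collator-protocol code touched by moonbeam-foundation/polkadot@dfd86c0.
const COST_FETCH_TIMEOUT: i32 = -100; // hypothetical reputation cost

fn on_fetch_timed_out(reputation: &mut i32, punish_on_timeout: bool) {
    if punish_on_timeout {
        // Stock behaviour: dock reputation; on a small network this can
        // accumulate until the collator is effectively banned.
        *reputation += COST_FETCH_TIMEOUT;
    }
    // Patched behaviour (punish_on_timeout == false): drop the request and let
    // the collator retry, so slow-but-honest collators are not progressively banned.
}

fn main() {
    let mut rep = 0;
    on_fetch_timed_out(&mut rep, false); // patched relay: reputation unchanged
    assert_eq!(rep, 0);
}
```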
You also increased the timeout, which would also help in getting the block in eventually. If you restart the validator nodes, peer reputations would be reset, so I don't really see how the peer reputation change alone would explain this.

The thing is, if you are hitting the timeout, you are already pushing the boundaries of the system a lot. If the block was big because of a runtime upgrade, then the actual backing is really only half the work, if that: after the backing group has validated the candidate, they distribute their statements, and those statements, which need to reach the block producer, will contain that runtime upgrade! If the runtime upgrade is megabytes in size and you fully exhausted the new timeout of 1.5 seconds, then assuming it only takes 0.1 seconds to validate the candidate and to produce it, we would have a total of 0.3 seconds to distribute megabytes of data to a thousand validators (on Kusama).

I think the reason why this works eventually is that we don't really need to reach all validators, we only need to reach the block producer. So, by chance, the block producer will sometimes be close by, or maybe even one of the backing validators themselves, and that is when the block inclusion finally succeeds.

That's why a larger timeout does not really solve the problem, although it should help in getting the candidate in at least at some point. The good news is that the PR reducing the number of required votes is already on its way and should help with these issues (it also increases the timeout a bit).
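To put rough numbers on that budget, here is a small back-of-envelope sketch; the 0.3-second window and the thousand-validator count come from the reasoning above, while the 2 MiB PoV size is just an assumed example:

```rust
// Back-of-envelope sketch of the statement-distribution budget described above.
fn main() {
    let pov_bytes: f64 = 2.0 * 1024.0 * 1024.0; // assumed PoV size for a runtime-upgrade block
    let window_secs: f64 = 0.3;                 // time left after PoV fetch + validation
    let validators: f64 = 1000.0;               // Kusama-scale validator set

    // Throughput needed just to push one copy of the payload within the window.
    let per_peer_mbit = pov_bytes * 8.0 / window_secs / 1_000_000.0;
    println!("single hop: ~{per_peer_mbit:.0} Mbit/s");

    // Naive upper bound if one backer had to reach every validator directly.
    println!("all validators: ~{:.0} Mbit/s", per_peer_mbit * validators);
    // In practice gossip fans out and only the next block producer must be
    // reached, which is why inclusion does eventually succeed "by chance".
}
```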
@eskimor yes, I increased the timeout because it helped to get the PoV across without getting punished for the timeout (the collation exchange takes multiple TCP round trips before the PoV is sent, so with 200 ms+ latency it is impossible to send the PoV in less than 1000 ms with the default TCP configuration of most Linux distributions). However, the issue here is not that the runtime upgrade (or any other big block) doesn't get in. It eventually does. It is usually followed by a lot of small blocks, but those can only get included by a small set of collators (those not banned). All the collators that were not able to include that block stay "banned" forever (at least on small networks). The reason is that they keep getting punished while trying to send new blocks: if you are "banned" and you try to connect, you get punished more, or something similar.
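A rough sketch of that latency arithmetic follows. The 200 ms RTT is the figure from the comment and the 10-segment initial congestion window is the common Linux default; the 2 MiB PoV size and the number of extra protocol round trips are assumptions:

```rust
// Why a large PoV struggles to cross a 200 ms RTT link in under 1 s with
// default TCP settings (initial congestion window ~10 MSS, slow start
// doubling each round trip). PoV size and protocol round trips are assumed.
fn main() {
    let rtt_ms = 200.0_f64;        // observed collator <-> validator latency
    let mss = 1460.0_f64;          // typical TCP maximum segment size, bytes
    let init_cwnd_segments = 10.0; // common Linux default initial congestion window
    let pov_bytes = 2.0 * 1024.0 * 1024.0; // assumed runtime-upgrade PoV size

    let mut sent = 0.0;
    let mut cwnd = init_cwnd_segments * mss;
    let mut rtts = 0u32;
    while sent < pov_bytes {
        sent += cwnd;
        cwnd *= 2.0; // slow start: window doubles every RTT (no loss assumed)
        rtts += 1;
    }
    // Assumed extra round trips for connection setup and the collation
    // request/response exchange before the PoV body is streamed.
    let protocol_rtts = 3;
    let total_ms = (rtts + protocol_rtts) as f64 * rtt_ms;
    println!("~{rtts} RTTs for the payload, ~{total_ms:.0} ms total > 1000 ms timeout");
}
```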
In any case, an update is on its way which increases the timeout a bit and reduces the number of required backing votes. This means that if validators on slower connections ban the collator, it should matter less: the collator only needs to deliver its collation to two validators out of five, so up to 3 slow validators can ban a collator without causing issues. We will also be looking into getting rid of banning for timeouts completely, as an interim solution.
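A minimal sketch of the tolerance implied above: with a backing group of 5 and only 2 required votes, up to 3 validators can have banned the collator and the candidate can still be backed. The numbers come from the comment; the function itself is illustrative, not the actual runtime logic:

```rust
// Illustrative check: can a candidate still be backed given how many of the
// backing group have banned (or are too slow to fetch from) the collator?
fn can_back(group_size: u32, required_votes: u32, banning_validators: u32) -> bool {
    group_size.saturating_sub(banning_validators) >= required_votes
}

fn main() {
    assert!(can_back(5, 2, 3));  // 3 slow/banning validators: still fine
    assert!(!can_back(5, 2, 4)); // 4 banning validators: backing fails
}
```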
I am not sure how a restart of the collators would fix an issue with reputations. Do the collators get a new `PeerId` on restart?
@eskimor yes, I'm also surprised by the restart, and it is not always consistent. What we usually do is restart the collators, and if that doesn't work we restart the validators. It usually ends up fixing it. On Alphanet we don't control the collators, so we couldn't do a full restart; that is why I removed the peer reputation penalty on collation-fetch errors.
This is happening on multiple networks (including the public Moonbase Alphanet, but the impact is lower there). I'll focus on our internal Moonsilver network, composed of 4 validators (v0.9.13, runtime 9140, max 2 validators per core) and 8 collators (in 2 different world regions).
When we performed a runtime upgrade, the parachain stopped producing blocks for 5-10 minutes; then, slowly, 2 collators started producing blocks again. The other 6 collators never produced a block again (for at least 1 hour) until we restarted them.
Here is my hypothesis:
Graph of the connections:
Logs of the 8 collators + 4 validators (1h) at the time of the upgrade:
https://drive.google.com/file/d/1d6ig76i1Cg6NAKcAKI4e5tqxpC7WwhvE/view?usp=sharing
(I think the first block having issues is `[🌗] ✨ Imported #35970`.)
(This same scenario happens on Moonbase Alphanet, but the stall is usually shorter (due to the diversity of collator locations) and fewer collators are impacted (because the stall is shorter, I suppose).)