Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parachain runtime upgrade is triggering stall due to peer reputation issues #888

Closed
crystalin opened this issue Dec 28, 2021 · 9 comments
Closed

Comments

@crystalin
Copy link

crystalin commented Dec 28, 2021

This is happening in multiple networks (including public Moonbase Alphanet, but the impact is lower there). I'll focus on our internal Moonsilver network composed of 4 validators (v0.9.13, runtime 9140, 2 max validators per core) and 8 collators (in 2 different world regions).

When performing a Runtime upgrade, the parachain stops producing blocks for 5-10 minutes, and then slowly 2 collators started to produce blocks. the 6 other collators never produced block again (for at least 1 hour) until we restarted them.

Here is my hypothesis :

  • Runtime upgrade is creating a special condition (higher than usual PoV, + maybe some extra work on the collator/validator.
  • That block is not able to be fetched in time by the validators. After many, many tries, one of the collator succeeds
  • Validators are reducing peer reputation of the collator because of "Timeout"
  • Collators end up being "banned" but are still trying to produce blocks
  • "Banned" collators are trying to connect again after producing a block, which gets refused by the validator, which reduce even more their reputation
  • This snowballs and the collators are never able to connect again and produce blocks

Graph of the connections:
image (2)

Logs of the 8 collators + 4 validators (1h) at the time of the upgrade:
https://drive.google.com/file/d/1d6ig76i1Cg6NAKcAKI4e5tqxpC7WwhvE/view?usp=sharing
(I think the first block having issue is [🌗] ✨ Imported #35970)

(This same scenario happens on Moonbase Alphanet, however the stall is usually less long (due to diversity of collator locations) and less collators are impacted (because the stall is less long I suppose))

@crystalin
Copy link
Author

@eskimor (Basti told me to mention you :) )

@crystalin
Copy link
Author

Also, to add to the hypothesis, I think the impact is less on alphanet, because having 80 collators, the time to get selected to produce the next block (and so get punished again if you are already "banned") is longer and so allows to recover reputation faster than you lose it.... maybe ?

@eskimor
Copy link
Member

eskimor commented Dec 28, 2021

Thanks @crystalin - we will look into it!

@crystalin
Copy link
Author

crystalin commented Dec 29, 2021

On our Alphanet network, (that was impacted by this same issue), we deployed a version of the relay that is not punishing for collation fetching issues: moonbeam-foundation/polkadot@dfd86c0

The collators were all able to recover few hours after as you can see from this block production time chart:
image

I'm not sure it is what fixed it but could be (I suppose the peer reputation took few hours to fully recover)

@eskimor
Copy link
Member

eskimor commented Dec 29, 2021

You also increased the timeout, which would also help in getting the block in eventually. If you restart the validator nodes, peer reputations would be reset - so I don't really see how the peer reputation change alone would explain this.

The thing is, if you are hitting the timeout, you are already pushing the boundaries of the system a lot. If the block was big because of a runtime upgrade, then the actual backing is actually really half the work, if at all: After the backing group validated the candidate, they distribute their statements and those statements, which need to reach the block producer, will contain that runtime upgrade! If the runtime upgrade is megabytes in size, and you fully exhausted the new timeout of 1.5 seconds, assuming it only takes 0.1 seconds to validate the candidate and to produce it, then we would have a total of 0.3 seconds to distribute megabytes of data to a thousand validators (on Kusama).

I think the reason why this will work eventually is, because we don't really need to reach all validators, we only need to reach the block producer, so by chance the block producer will be close or maybe one of the backing validators themselves, which will be the time when the block inclusion finally succeeds.

That's why a larger timeout does not really solve the problem, although it should help in getting the candidate in at least at some point.

Good news is, the PR of reducing the number of required votes is already on its way and should help with those issues (it also increases the timeout a bit).

@crystalin
Copy link
Author

crystalin commented Dec 29, 2021

@eskimor yes I increased the timeout as it helped to pass the PoV without getting punished for the timeout (The collation takes multiple tcp rounds before sending the PoV, making 200ms+ latency impossible to send the PoV in less than 1000ms in default tcp configuration for most linux dist).

However the issue here is not the fact the runtime upgrade (or any other big block) doesn't get in. It eventually does. It is usually followed by a lot of small blocks, but those can only get included by a small set of collators (those not banned). All the collators that were not able to include that block are "banned" forever (at least in small networks). The reason is because they keep getting punished while trying to send new blocks, because if you are "banned" and you try to connect, you get punished more or something similar

@eskimor
Copy link
Member

eskimor commented Dec 30, 2021

In any case an update is on its way, which increases the timeout a bit and reduces the number of required backing votes. This means, if validators on slower connections ban the collator it should matter less: If it is able to deliver its collation at least to two validators out of five - up to 3 slow validators can ban a collator, without causing issues.

We will also be looking into getting rid of banning for timeouts completely, as an interim solution.

@eskimor
Copy link
Member

eskimor commented Dec 30, 2021

When performing a Runtime upgrade, the parachain stops producing blocks for 5-10 minutes, and then slowly 2 collators started to produce blocks. the 6 other collators never produced block again (for at least 1 hour) until we restarted them.

I am not sure how a restart of the collators would fix an issue with reputations. Do the collators get a new PeerId when they are restarted or have they been shut down for a longer period of time?

@crystalin
Copy link
Author

crystalin commented Dec 30, 2021

@eskimor yes, I'm also surprised by the restart, and it is not always consistent. What we usually do is to restart the collators and if it doesn't work we restart the validators. It usually end up fixing it.

On Alphanet, we don't control the collators so we couldn't do a full restart that is why I removed the peer reputation penalty on collation fetch errors

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
Signed-off-by: koushiro <koushiro.cqx@gmail.com>

Signed-off-by: koushiro <koushiro.cqx@gmail.com>
@bkchr bkchr closed this as completed Apr 2, 2024
bkchr pushed a commit that referenced this issue Apr 10, 2024
* Move estimate-fee

* Add parsing test.

* Address review comments.

* Fix compilation.

* Move things around.

Co-authored-by: Svyatoslav Nikolsky <svyatonik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants