Parachain runtime upgrade is triggering stall due to peer reputation issues #888
Comments
@eskimor (Basti told me to mention you :) )
Also, to add to the hypothesis: I think the impact is smaller on Alphanet because, with 80 collators, the time before a collator is selected to produce the next block (and so gets punished again if it is already "banned") is longer, which perhaps allows reputation to recover faster than it is lost... maybe?
Thanks @crystalin - we will look into it!
On our Alphanet network (which was impacted by this same issue), we deployed a version of the relay that does not punish for collation fetching issues: moonbeam-foundation/polkadot@dfd86c0. The collators were all able to recover a few hours later, as you can see from the block production time chart. I'm not sure this is what fixed it, but it could be (I suppose the peer reputation took a few hours to fully recover).
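For illustration, the idea behind that change is roughly the following sketch. All names and the cost value here are hypothetical; they do not reflect the actual code changed in the linked commit:

```rust
// Purely illustrative sketch of "do not punish for collation fetching issues".
// The names and the cost value are hypothetical, not the actual
// collator-protocol code touched by moonbeam-foundation/polkadot@dfd86c0.
const COST_FETCH_TIMEOUT: i32 = -100; // hypothetical reputation cost

fn on_fetch_timed_out(reputation: &mut i32, punish_on_timeout: bool) {
    if punish_on_timeout {
        // Stock behaviour: dock reputation; on a small network this can
        // accumulate until the collator is effectively banned.
        *reputation += COST_FETCH_TIMEOUT;
    }
    // Patched behaviour (punish_on_timeout == false): drop the request and let
    // the collator retry, so slow-but-honest collators are not progressively banned.
}

fn main() {
    let mut rep = 0;
    on_fetch_timed_out(&mut rep, false); // patched relay: reputation unchanged
    assert_eq!(rep, 0);
}
```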
You also increased the timeout, which would also help in getting the block in eventually. If you restart the validator nodes, peer reputations would be reset, so I don't really see how the peer reputation change alone would explain this.

The thing is, if you are hitting the timeout, you are already pushing the boundaries of the system a lot. If the block was big because of a runtime upgrade, then the actual backing is really only half the work, if that: after the backing group has validated the candidate, they distribute their statements, and those statements, which need to reach the block producer, will contain that runtime upgrade! If the runtime upgrade is megabytes in size and you fully exhausted the new timeout of 1.5 seconds, then assuming it only takes 0.1 seconds to validate the candidate and to produce it, we would have a total of 0.3 seconds to distribute megabytes of data to a thousand validators (on Kusama).

I think the reason why this works eventually is that we don't really need to reach all validators, we only need to reach the block producer. So, by chance, the block producer will sometimes be close by, or maybe even one of the backing validators themselves, and that is when the block inclusion finally succeeds.

That's why a larger timeout does not really solve the problem, although it should help in getting the candidate in at least at some point. The good news is that the PR reducing the number of required votes is already on its way and should help with these issues (it also increases the timeout a bit).
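To put rough numbers on that budget, here is a small back-of-envelope sketch; the 0.3-second window and the thousand-validator count come from the reasoning above, while the 2 MiB PoV size is just an assumed example:

```rust
// Back-of-envelope sketch of the statement-distribution budget described above.
fn main() {
    let pov_bytes: f64 = 2.0 * 1024.0 * 1024.0; // assumed PoV size for a runtime-upgrade block
    let window_secs: f64 = 0.3;                 // time left after PoV fetch + validation
    let validators: f64 = 1000.0;               // Kusama-scale validator set

    // Throughput needed just to push one copy of the payload within the window.
    let per_peer_mbit = pov_bytes * 8.0 / window_secs / 1_000_000.0;
    println!("single hop: ~{per_peer_mbit:.0} Mbit/s");

    // Naive upper bound if one backer had to reach every validator directly.
    println!("all validators: ~{:.0} Mbit/s", per_peer_mbit * validators);
    // In practice gossip fans out and only the next block producer must be
    // reached, which is why inclusion does eventually succeed "by chance".
}
```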
@eskimor yes, I increased the timeout because it helped to get the PoV across without getting punished for the timeout (the collation exchange takes multiple TCP round trips before the PoV is sent, so with 200 ms+ latency it is impossible to send the PoV in less than 1000 ms with the default TCP configuration of most Linux distributions). However, the issue here is not that the runtime upgrade (or any other big block) doesn't get in. It eventually does. It is usually followed by a lot of small blocks, but those can only get included by a small set of collators (those not banned). All the collators that were not able to include that block stay "banned" forever (at least on small networks). The reason is that they keep getting punished while trying to send new blocks: if you are "banned" and you try to connect, you get punished more, or something similar.
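A rough sketch of that latency arithmetic follows. The 200 ms RTT is the figure from the comment and the 10-segment initial congestion window is the common Linux default; the 2 MiB PoV size and the number of extra protocol round trips are assumptions:

```rust
// Why a large PoV struggles to cross a 200 ms RTT link in under 1 s with
// default TCP settings (initial congestion window ~10 MSS, slow start
// doubling each round trip). PoV size and protocol round trips are assumed.
fn main() {
    let rtt_ms = 200.0_f64;        // observed collator <-> validator latency
    let mss = 1460.0_f64;          // typical TCP maximum segment size, bytes
    let init_cwnd_segments = 10.0; // common Linux default initial congestion window
    let pov_bytes = 2.0 * 1024.0 * 1024.0; // assumed runtime-upgrade PoV size

    let mut sent = 0.0;
    let mut cwnd = init_cwnd_segments * mss;
    let mut rtts = 0u32;
    while sent < pov_bytes {
        sent += cwnd;
        cwnd *= 2.0; // slow start: window doubles every RTT (no loss assumed)
        rtts += 1;
    }
    // Assumed extra round trips for connection setup and the collation
    // request/response exchange before the PoV body is streamed.
    let protocol_rtts = 3;
    let total_ms = (rtts + protocol_rtts) as f64 * rtt_ms;
    println!("~{rtts} RTTs for the payload, ~{total_ms:.0} ms total > 1000 ms timeout");
}
```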
In any case, an update is on its way which increases the timeout a bit and reduces the number of required backing votes. This means that if validators on slower connections ban the collator, it should matter less: the collator only needs to deliver its collation to two validators out of five, so up to 3 slow validators can ban a collator without causing issues. We will also be looking into getting rid of banning for timeouts completely, as an interim solution.
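A minimal sketch of the tolerance implied above: with a backing group of 5 and only 2 required votes, up to 3 validators can have banned the collator and the candidate can still be backed. The numbers come from the comment; the function itself is illustrative, not the actual runtime logic:

```rust
// Illustrative check: can a candidate still be backed given how many of the
// backing group have banned (or are too slow to fetch from) the collator?
fn can_back(group_size: u32, required_votes: u32, banning_validators: u32) -> bool {
    group_size.saturating_sub(banning_validators) >= required_votes
}

fn main() {
    assert!(can_back(5, 2, 3));  // 3 slow/banning validators: still fine
    assert!(!can_back(5, 2, 4)); // 4 banning validators: backing fails
}
```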
I am not sure how a restart of the collators would fix an issue with reputations. Do the collators get a new `PeerId` on restart?
@eskimor yes, I'm also surprised by the restart, and it is not always consistent. What we usually do is restart the collators, and if that doesn't work we restart the validators. It usually ends up fixing it. On Alphanet we don't control the collators, so we couldn't do a full restart; that is why I removed the peer reputation penalty on collation-fetch errors.
This is happening on multiple networks (including the public Moonbase Alphanet, but the impact is lower there). I'll focus on our internal Moonsilver network, composed of 4 validators (v0.9.13, runtime 9140, max 2 validators per core) and 8 collators (in 2 different world regions).
When we performed a runtime upgrade, the parachain stopped producing blocks for 5-10 minutes; then, slowly, 2 collators started producing blocks again. The other 6 collators never produced a block again (for at least 1 hour) until we restarted them.
Here is my hypothesis:
Graph of the connections:
Logs of the 8 collators + 4 validators (1h) at the time of the upgrade:
https://drive.google.com/file/d/1d6ig76i1Cg6NAKcAKI4e5tqxpC7WwhvE/view?usp=sharing
(I think the first block having issues is `[🌗] ✨ Imported #35970`.)
(This same scenario happens on Moonbase Alphanet, but the stall is usually shorter (due to the diversity of collator locations) and fewer collators are impacted (because the stall is shorter, I suppose).)