Lost peers due to state-cache #10130
Comments
The connection to another three bootnodes is refused, but it seems no reason was given. I'd really like to know why it's refused: is it because of a bad reputation? How can I check this? Thanks.
`substrate/client/network/src/protocol.rs`, lines 1480 to 1483 at `5e2b0f0`
@tomaka sorry to ping you, but we would like to fix this issue as it's happening on a live network. Do you mind having a look?
Just discovered it's back now with 17 peers, so this error lasted for a few hours.
This error happened again on one of our collators: https://telemetry.polkadot.io/#list/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b
How do we check the reputation? The latter shouldn't be our case, because it's our own node and we haven't found any potential DDoS attack yet.
You can use …
OK, let me test.
Reputations start being "bad" below approximately …
@tomaka so you are saying that we should increase the VPS's size, right? Currently we are using 8 cores + 16 GB and 10 Gbps for our collators. Do you have suggestions?
This should be more than enough, as far as I know. In my opinion, something (i.e. some event) happened and caused the nodes to act very slowly for a few hours. It would be nice to figure out what this "something" is, but unfortunately we don't have any clues here.
@tomaka OK, we will investigate it and let you know. Thanks a lot for the help.
@tomaka we didn't find any attack, so I believe that part is fine, and we don't have any error logs yet, but we constantly have one collator disconnected from the others: https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b This time it's …
Hi @tomaka, I think there is a bug: restarting the nodes helps for some hours, but then they start to lose peers again. Do you have any advice we can apply? Until now we don't have any error logs; the number of peers just goes down periodically.
Please check the CPU usage of the nodes and see if it correlates with the disconnections.
I think it looks totally fine. Actually we didn't find anything strange except …
Hi @tomaka, we are still suffering from this issue; we are restarting our collators every day :) It helps, but not much: https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b As you can see from the following, …
After adding …
For the last three days our collators have been working fine after adjusting …
I see the same issue on my network with a few nodes. Nothing special was happening there; a node was working fine for a few days and suddenly this happened. Shouldn't it "just work" without the need to disable the state cache? I think this should be reopened for further investigation; there is clearly something wrong if it happens on a simple network with no misbehavior.
With what cache? The state cache?
We have a simple and reliable way to reproduce it (though inconvenient), with a blockchain size of (…). Are there some parameters right now that can be tweaked to fix this behavior?
@nazar-pc is this still relevant?
Likely yes. What we did on our side is fork Substrate to decrease the amount of data that a block range response can contain, and together with #12146 it got much better, but I don't believe it is solved fundamentally.
What limit did you decrease?
…
I'm afraid it is no longer there, as we are in the middle between the previous and current testnet right now. I can ask the team to try to reproduce it, but it may take a while. The gist of our observation is the following: the node was making a request, but the response was so big (initially significantly bigger than the default …).

This was primarily reproducible in slow network conditions, like pulling something from behind the Chinese firewall from a server in the EU, as one of the easier-to-reproduce examples. Those who had strong connectivity didn't have issues, but we support a diverse set of users in our protocol and need to be prepared for these less-than-optimal conditions.
This would be much appreciated. @dmitry-markin @melekes could you take a look at this?
Ty! Yeah, maybe our timeout mechanism isn't implemented properly. I don't know how it works exactly, but it should detect that progress is being made and not require some kind of ping-pong (I think that is what we are doing) in these cases.
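A progress-based timeout of the kind described here could be sketched like this. The type, field names, and the idle limit are all hypothetical illustrations, not Substrate's actual networking code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical progress-aware timeout: instead of timing a request out when
/// the *whole* response has not arrived within a fixed window, the deadline
/// is pushed forward every time a chunk of the response arrives.
struct ProgressTimeout {
    idle_limit: Duration,  // max time allowed with no progress at all
    last_progress: Instant,
    bytes_received: u64,
}

impl ProgressTimeout {
    fn new(idle_limit: Duration) -> Self {
        Self {
            idle_limit,
            last_progress: Instant::now(),
            bytes_received: 0,
        }
    }

    /// Call whenever a chunk of response data arrives.
    fn on_chunk(&mut self, len: u64) {
        if len > 0 {
            self.bytes_received += len;
            self.last_progress = Instant::now(); // progress resets the clock
        }
    }

    /// The request is only considered dead once no data has arrived for
    /// `idle_limit`, however long the full transfer takes.
    fn timed_out(&self) -> bool {
        self.last_progress.elapsed() > self.idle_limit
    }
}

fn main() {
    let mut t = ProgressTimeout::new(Duration::from_secs(10));
    t.on_chunk(4096);
    println!("received {} bytes, timed out: {}", t.bytes_received, t.timed_out());
}
```

A slow link that keeps delivering data would then never be cut off, while a truly stalled peer still gets dropped after `idle_limit`.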
I can have a look at this and try reproducing the issue with timeouts on slow networks after merging some in-progress PRs.
Possibly a connected issue: #12105
So far it looks like the theoretical bandwidth requirements based on …

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., a slowloris attack?).
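As a rough illustration of the bandwidth floor that a whole-request timeout implies: if the full response must arrive before the timeout fires, the minimum sustained throughput is simply the response size divided by the timeout. The 8 MiB response size and 20 s timeout below are assumed example values, not necessarily the defaults of any particular Substrate release:

```rust
/// Minimum sustained bandwidth (bits per second) needed to transfer
/// `response_bytes` before a whole-request timeout of `timeout_secs` fires.
fn min_bandwidth_bps(response_bytes: u64, timeout_secs: u64) -> u64 {
    response_bytes * 8 / timeout_secs
}

fn main() {
    // Example: an 8 MiB block-range response with a 20 s request timeout
    // requires roughly 3.4 Mbit/s of sustained throughput; any peer on a
    // slower or lossier link gets disconnected even though it is making
    // progress on the transfer.
    let bps = min_bandwidth_bps(8 * 1024 * 1024, 20);
    println!("{} bit/s (~{:.1} Mbit/s)", bps, bps as f64 / 1_000_000.0);
}
```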
Yes, we have decreased the requirements in our protocol to ~1 Mbps. The unfortunate thing is that those are constants, so we had to fork Substrate to make those changes.
But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.
In theory we could, but adaptive congestion control at the application layer would be more complicated than just calculating the bandwidth once and setting the block count in the request. We would also have to account for changing network conditions, the influence of parallel transmission of data by other protocols / parallel downloads from multiple peers (not sure if we have this), and control-stability issues. I.e., the congestion control algorithms in TCP are there for a reason. We can implement some PoC, but the algorithm must be extremely dumb and straightforward so as not to introduce new issues (I especially worry about the stability of the feedback loop).
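An "extremely dumb" PoC in the spirit discussed here could borrow the additive-increase / multiplicative-decrease (AIMD) shape from TCP, which is known to keep the feedback loop stable. Everything below (names, constants) is a hypothetical sketch, not Substrate code:

```rust
/// Hypothetical adaptive request sizing: start with a small number of blocks
/// per request, grow linearly while responses arrive in time, and back off
/// hard on a timeout (TCP-style AIMD).
struct AdaptiveRequestSize {
    blocks: u32,
    min_blocks: u32,
    max_blocks: u32,
}

impl AdaptiveRequestSize {
    fn new() -> Self {
        Self { blocks: 4, min_blocks: 1, max_blocks: 128 }
    }

    /// Additive increase: the previous response arrived within the timeout.
    fn on_success(&mut self) {
        self.blocks = (self.blocks + 4).min(self.max_blocks);
    }

    /// Multiplicative decrease: the previous request timed out, so halve
    /// the next request instead of dropping the peer immediately.
    fn on_timeout(&mut self) {
        self.blocks = (self.blocks / 2).max(self.min_blocks);
    }

    /// How many blocks to ask for in the next request.
    fn next_request(&self) -> u32 {
        self.blocks
    }
}

fn main() {
    let mut size = AdaptiveRequestSize::new();
    size.on_success(); // response came back in time -> grow
    size.on_timeout(); // request timed out -> back off
    println!("next request: {} blocks", size.next_request());
}
```

Even this still has the imperfect-information problem raised below: other multiplexed protocols share the same connection, so a timeout is not necessarily caused by the block request being too large.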
I agree with @dmitry-markin here. There's already so much abstraction below the request-response protocol that implementing adaptive streaming all the way up there sounds quite difficult to get working properly, except for the dumb case. The request-response protocol would have imperfect information about what is happening in the stream, because other protocols are also being multiplexed and consume some of the available bandwidth. Also, once the amount of bandwidth is over-estimated and too much data is sent, the connection is closed, which is quite a bit harsher than TCP dropping packets, adjusting the window size, and resending those packets.

Maybe we could make those knobs more accessible, so people wouldn't have to fork Substrate.
Good points! That was a random idea I just wanted to write down; I hadn't put that much thought into it!
Yes! This wasn't really possible up to now, but now, with sync almost being independent, it should work!
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/april-updates-for-substrate-and-polkadot-devs/2764/1 |
After upgrading to polkadot-v0.9.10 & polkadot-v0.9.11, we started to observe a `no peers` issue on some of our collators. While turning on `sync=trace`, I found the following logs: