This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Lost peers due to state-cache #10130

Closed
GopherJ opened this issue Oct 31, 2021 · 45 comments · Fixed by #13824

@GopherJ
Contributor

GopherJ commented Oct 31, 2021

After upgrading to polkadot-v0.9.10 & polkadot-v0.9.11, we started to observe a no-peers issue on some of our collators. With sync=trace turned on, I found the following logs:

[screenshot: sync=trace logs]

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

I'd really like to know why it's refused. Is it because of a bad reputation? How can I check this? Thanks.

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

The connections to another three bootnodes are refused:

2021-10-31 02:04:46.972 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz
2021-10-31 02:04:46.973 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz") failed: Refused.
2021-10-31 02:04:46.974 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz disconnected
2021-10-31 02:06:32.190 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd") failed: Refused.
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd disconnected
2021-10-31 02:07:03.711 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt
2021-10-31 02:07:03.810 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt") failed: Refused.

but it seems no reason was given

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

// On a refused request, the peer is reported to the peerset (reputation
// penalty) and disconnected from the sync peer set.
RequestFailure::Refused => {
    self.peerset_handle.report_peer(*id, rep::REFUSED);
    self.behaviour.disconnect_peer(id, HARDCODED_PEERSETS_SYNC);
},

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

@tomaka sorry to ping you, we would like to fix this issue as it's affecting a live network. Do you mind having a look?

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

Just discovered it's back now with 17 peers, so this error lasted a few hours.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka
Contributor

tomaka commented Nov 1, 2021

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and is incapable of processing more block requests (to prevent DoS attacks).
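
For illustration only, a minimal sketch of those two refusal conditions (this is not the actual Substrate code; the function, constant names and values below are hypothetical):

// Hypothetical sketch of the two conditions under which a node would refuse
// a block request; names and values are illustrative, not Substrate's.
const BANNED_THRESHOLD: i32 = -(1 << 28); // reputation cut-off
const MAX_CONCURRENT_REQUESTS: usize = 20; // overload / DoS-protection limit

fn should_refuse(requester_reputation: i32, requests_in_flight: usize) -> bool {
    let reputation_too_low = requester_reputation < BANNED_THRESHOLD;
    let overloaded = requests_in_flight >= MAX_CONCURRENT_REQUESTS;
    reputation_too_low || overloaded
}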

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and is incapable of processing more block requests (to prevent DoS attacks).

How do I check the reputation? The latter shouldn't be our case because it's our own node and we haven't found any sign of a DoS attack yet.

@tomaka
Contributor

tomaka commented Nov 1, 2021

You can use -l peerset=trace to see all reputation changes. If there's no error in the logs, it is likely that this doesn't relate to reputations. Instead, chances are that the CPU or disk the nodes are running on isn't powerful enough.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

ok let me test

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

[screenshot: peerset trace output]

I saw this. From the disconnected collator's perspective, the other nodes have a pretty bad reputation.

If I check the reputation from the other nodes' perspective, the disconnected node has a reputation of 0.

@tomaka
Contributor

tomaka commented Nov 1, 2021

Reputations start being "bad" below approximately -2^28 if I remember correctly.
The reputations displayed here shouldn't have any consequence.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka so you are saying that we should increase the VPS size, right? Currently we are using 8 cores + 16 GB RAM and 10 Gbps for our collators. Do you have suggestions?

@tomaka
Contributor

tomaka commented Nov 1, 2021

This should be more than enough, as far as I know.

In my opinion, something (i.e. some event) happened and caused the nodes to act very slowly for a few hours. It would be nice to figure out what this "something" is, but we don't have any clue here unfortunately.
Getting a more powerful machine would mitigate the problem if/when it happens again.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka OK, we will investigate it and let you know. Thanks a lot for the help.

@GopherJ
Contributor Author

GopherJ commented Nov 3, 2021

@tomaka we didn't find any attack, so I believe it's OK on that part. We also don't have any error logs yet, but we constantly have one collator disconnected from the others.

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

this time it's heiko-collator-0

@GopherJ
Contributor Author

GopherJ commented Nov 4, 2021

Hi @tomaka, I think there is a bug. Restarting the nodes helps for some hours, but then they start losing peers again.

Do you have any advice we can apply? Until now we don't have any error logs; the number of peers just goes down periodically.

@GopherJ
Contributor Author

GopherJ commented Nov 8, 2021

[screenshot]

This really doesn't look good. What kind of information can I post to help fix this issue?

@tomaka
Contributor

tomaka commented Nov 8, 2021

Please check the CPU usage of the nodes and see if it correlates with the disconnections.

@GopherJ
Contributor Author

GopherJ commented Nov 8, 2021

@tomaka
[screenshot: CPU usage]

I think it looks totally fine. Actually we didn't find anything strange except the Request to peer ... failed: Refused messages.

@GopherJ
Contributor Author

GopherJ commented Nov 11, 2021

Hi @tomaka, we are still suffering from this issue; we are restarting our collators every day :) It helps, but not much.

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

As you can see from the following, heiko-collator-4 and heiko-collator-7 got refused by the others today:
[screenshot: telemetry]

@GopherJ
Contributor Author

GopherJ commented Nov 18, 2021

I tried using reserved-nodes, but it keeps disconnecting no matter what I do.
[screenshot]

@GopherJ
Contributor Author

GopherJ commented Nov 22, 2021

After adding --state-cache-size 1 it looks much better now, but I'm not sure yet if it's resolved.

@GopherJ
Contributor Author

GopherJ commented Nov 23, 2021

For the last three days our collators have been working fine after adjusting --state-cache-size to 1, so I think this issue can be closed.

@GopherJ GopherJ closed this as completed Nov 23, 2021
@nazar-pc
Contributor

I see the same issue on my network with a few nodes. There was nothing special happening there; the node was working fine for a few days and suddenly this happened. Shouldn't it "just work" without needing to disable the state cache?

I think this should be reopened for further investigation; there is clearly something wrong if it happens on a simple network with no misbehavior.

@GopherJ GopherJ reopened this Dec 16, 2021
@GopherJ
Contributor Author

GopherJ commented Dec 16, 2021

@tomaka @bkchr @nazar-pc I'm reopening this because there are definitely some issues with the cache.

@GopherJ GopherJ changed the title from "Request to peerId failed: Refused" to "Lose peers due to state-cache" Dec 16, 2021
@GopherJ GopherJ changed the title from "Lose peers due to state-cache" to "Lost peers due to state-cache" Dec 16, 2021
@bkchr
Member

bkchr commented Dec 16, 2021

With what cache? The state cache?

@nazar-pc
Contributor

We have a simple and reliable way to reproduce it (though inconvenient).

With a blockchain size (chains directory) of ~200 GB and 2 nodes available, a third node joining the network is fairly quickly banned by the other nodes. This happens every time, long before the 100 GB mark. The last attempt failed after just a few gigabytes.

Are there some parameters right now that can be tweaked to fix this behavior?

@bkchr
Member

bkchr commented Feb 13, 2023

@nazar-pc is this still relevant?

@nazar-pc
Contributor

Likely yes. What we did on our side is fork Substrate to decrease the amount of data a block range response can contain, and together with #12146 it got much better, but I don't believe it is solved fundamentally.

@bkchr
Member

bkchr commented Feb 13, 2023

What limit did you decrease?

@nazar-pc
Contributor

MAX_BODY_BYTES: 5f851f1
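
For context, the fork boils down to lowering a constant in the block request handler, roughly along these lines (the reduced value here is illustrative and not necessarily the one used in commit 5f851f1; the 8 MB default is mentioned further down in this thread):

// Upper bound on the size of a block response body. Substrate's default is
// 8 MB; lowering it caps the worst-case response size so it can be
// transferred within the request timeout on slow links.
// The value below is illustrative only.
const MAX_BODY_BYTES: usize = 2 * 1024 * 1024;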

@bkchr
Member

bkchr commented Feb 14, 2023

CC @altonen the title here is probably misleading, as this should not be related to the state-cache (which also no longer exists today); it is maybe some networking issue, given that it improves when the mentioned max body size is decreased.

@nazar-pc could you provide your mentioned reproduction?

@nazar-pc
Contributor

I'm afraid it is no longer there as we are in the middle between previous and current testnet right now. I can request the team to try and reproduce it, but it may take a while.

The gist of our observation is the following: the node was making a request, but the response was so big (initially significantly bigger than the default MAX_BODY_BYTES due to old implementation logic) that the node wasn't always able to pull it within the keepalive timeout (I believe that is what it is called). With #12146 and the MAX_BODY_BYTES decrease we lowered the maximum bandwidth expected from nodes before the timeout kicks in, and that way the connection wasn't dropped. So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway) it had the same effect as if the connection was idling for a long time and timing out.

This was primarily reproducible in slow network conditions, like pulling data from behind the Chinese firewall from a server in the EU, as one of the easier-to-reproduce examples. Those who had strong connectivity didn't have issues, but we support a diverse set of users in our protocol and need to be prepared for these less-than-optimal conditions.

@altonen
Contributor

altonen commented Feb 15, 2023

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

@bkchr
Member

bkchr commented Feb 15, 2023

So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway) it had the same effect as if the connection was idling for a long time and timing out.

Ty! Yeah, maybe our timeout mechanism isn't implemented properly. I don't know how it works exactly, but it should detect that progress is being made and not require some kind of ping-pong (I think that is what we are doing) in these cases.

@dmitry-markin
Contributor

dmitry-markin commented Feb 15, 2023

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

I can have a look at this and try reproducing the issue with timeouts on slow networks after merging some in-progress PRs.

@dmitry-markin dmitry-markin self-assigned this Feb 16, 2023
@dmitry-markin
Contributor

Possibly a related issue: #12105

@dmitry-markin
Contributor

dmitry-markin commented Feb 27, 2023

So far it looks like the theoretical bandwidth requirement, based on a MAX_BODY_BYTES of 8 MB and the block request protocol timeout of 20 s, is 8 MB / 20 s = 3.4 Mbit/s. In my (arbitrary) tests syncing the Polkadot chain, 1 Mbit/s was enough to download blocks around block number 14357815, but a 512 kbit/s connection limit led to the block request protocol timing out after 20 seconds, and the peer we are syncing from being disconnected and backed off. Things can theoretically be improved client-side by reducing the number of blocks requested in a single request via MAX_BLOCKS_TO_REQUEST. Reducing it from 64 to 32 made sync work in my setup on 512 kbit/s.
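
Spelling out the arithmetic (the 3.4 Mbit/s figure corresponds to reading 8 MB as 8 MiB; constant names are the ones quoted above):

fn main() {
    // Worst-case block response size and the block request protocol timeout.
    const MAX_BODY_BYTES: f64 = 8.0 * 1024.0 * 1024.0; // 8 MiB
    const TIMEOUT_SECS: f64 = 20.0;

    // Minimum sustained bandwidth needed to pull a full response before the timeout.
    let required_mbit_s = MAX_BODY_BYTES * 8.0 / TIMEOUT_SECS / 1_000_000.0;
    println!("required: {:.1} Mbit/s", required_mbit_s); // ~3.4 Mbit/s

    // Halving MAX_BLOCKS_TO_REQUEST (64 -> 32) roughly halves the worst-case
    // response size, and therefore the bandwidth needed before the timeout hits.
    println!("with half the blocks: {:.1} Mbit/s", required_mbit_s / 2.0); // ~1.7 Mbit/s
}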

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)

@nazar-pc
Contributor

Yes, we have decreased the requirements in our protocol to ~1 Mbps; the unfortunate thing is that those are constants, so we had to fork Substrate to make those changes.

@bkchr
Member

bkchr commented Feb 27, 2023

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.
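
A minimal sketch of that idea, purely for illustration (the names, constants and policy below are hypothetical and not Substrate's sync implementation): start with a small request and grow or shrink the per-request block count based on how long the previous response took relative to the timeout.

use std::time::Duration;

// Hypothetical bounds and timeout; real values would come from the sync config.
const MIN_BLOCKS: u32 = 4;
const MAX_BLOCKS: u32 = 64;
const TIMEOUT: Duration = Duration::from_secs(20);

// AIMD-style adjustment: grow additively while responses come back quickly,
// back off multiplicatively once a response gets close to the timeout.
fn next_request_size(current: u32, last_response_time: Duration) -> u32 {
    if last_response_time < TIMEOUT / 4 {
        (current + 4).min(MAX_BLOCKS)
    } else if last_response_time > TIMEOUT / 2 {
        (current / 2).max(MIN_BLOCKS)
    } else {
        current
    }
}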

@dmitry-markin
Contributor

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.

In theory we could, but adaptive congestion control on the application layer would be more complicated than just calculating the bandwidth once and setting the block count in the request. We would also have to account for changing network conditions, the influence of parallel transmission of data by other protocols / parallel downloads from multiple peers (not sure if we have this), and control stability issues. I.e., the congestion control algorithms in TCP are there for a reason.

We can implement some POC, but the algorithm must be extremely dumb and straightforward to not introduce new issues (I especially worry about the stability of the feedback loop).

@altonen
Contributor

altonen commented Feb 28, 2023

I agree with @dmitry-markin here

There's already so much abstraction below the request-response protocol that implementing adaptive streaming all the way up there sounds quite difficult to get working properly, except for the dumb case. The request-response protocol would have imperfect information about what is happening in the stream, because other protocols are also being multiplexed and consume some amount of the available bandwidth. Also, once the amount of bandwidth is over-estimated and too much data is sent, the connection is closed, which is quite a bit harsher than TCP dropping packets, adjusting the window size and resending those packets.

Maybe we could make those knobs more available so people wouldn't have to fork Substrate.

@bkchr
Member

bkchr commented Mar 1, 2023

Good points! It was a random idea of mine that I just wanted to write down. I hadn't put that much thought into it!

Maybe we could make those knobs more available so people wouldn't have to fork Substrate.

Yes! This wasn't really possible up to now, but now with sync almost being independent it should work!

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/april-updates-for-substrate-and-polkadot-devs/2764/1
