This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Lost peers due to state-cache #10130

Closed
GopherJ opened this issue Oct 31, 2021 · 45 comments · Fixed by #13824

@GopherJ
Contributor

GopherJ commented Oct 31, 2021

After upgrading to polkadot-v0.9.10 & polkadot-v0.9.11, we started to observe a no-peers issue on some of our collators. With sync=trace turned on, I found the following logs:

[screenshot: sync=trace logs]

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

I'd really like to know why it's refused. Is it because of a bad reputation? How can I check this? Thanks.

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

The connections to another three bootnodes are refused:

2021-10-31 02:04:46.972 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz
2021-10-31 02:04:46.973 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz") failed: Refused.
2021-10-31 02:04:46.974 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz disconnected
2021-10-31 02:06:32.190 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd") failed: Refused.
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd disconnected
2021-10-31 02:07:03.711 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt
2021-10-31 02:07:03.810 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt") failed: Refused.

but it seems no reason was given

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

// On a refused request, the peer is reported to the peerset (reputation
// penalty) and disconnected from the sync peer set.
RequestFailure::Refused => {
    self.peerset_handle.report_peer(*id, rep::REFUSED);
    self.behaviour.disconnect_peer(id, HARDCODED_PEERSETS_SYNC);
},

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

@tomaka sorry to ping you, we would like to fix this issue as it's affecting a live network. Do you mind having a look?

@GopherJ
Contributor Author

GopherJ commented Oct 31, 2021

Just discovered it's back now with 17 peers, so this error lasted a few hours.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka
Contributor

tomaka commented Nov 1, 2021

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and is incapable of processing more block requests (to prevent DoS attacks).
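
For illustration only, a minimal sketch of those two refusal conditions (this is not the actual Substrate code; the function, constant names and values below are hypothetical):

// Hypothetical sketch of the two conditions under which a node would refuse
// a block request; names and values are illustrative, not Substrate's.
const BANNED_THRESHOLD: i32 = -(1 << 28); // reputation cut-off
const MAX_CONCURRENT_REQUESTS: usize = 20; // overload / DoS-protection limit

fn should_refuse(requester_reputation: i32, requests_in_flight: usize) -> bool {
    let reputation_too_low = requester_reputation < BANNED_THRESHOLD;
    let overloaded = requests_in_flight >= MAX_CONCURRENT_REQUESTS;
    reputation_too_low || overloaded
}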

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and is incapable of processing more block requests (to prevent DoS attacks).

How do I check the reputation? The latter shouldn't be our case because it's our own node and we haven't found any sign of a DoS attack yet.

@tomaka
Contributor

tomaka commented Nov 1, 2021

You can use -l peerset=trace to see all reputation changes. If there's no error in the logs, it is likely that this doesn't relate to reputations. Instead, chances are that the CPU or disk the nodes are running on isn't powerful enough.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

ok let me test

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

[screenshot: peerset trace output]

I saw this. From the disconnected collator's perspective, the other nodes have a pretty bad reputation.

If I check the reputation from the other nodes' perspective, the disconnected node has a reputation of 0.

@tomaka
Contributor

tomaka commented Nov 1, 2021

Reputations start being "bad" below approximately -2^28 if I remember correctly.
The reputations displayed here shouldn't have any consequence.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka so you are saying that we should increase the VPS size, right? Currently we are using 8 cores + 16 GB RAM and 10 Gbps for our collators. Do you have suggestions?

@tomaka
Contributor

tomaka commented Nov 1, 2021

This should be more than enough, as far as I know.

In my opinion, something (i.e. some event) happened and caused the nodes to act very slowly for a few hours. It would be nice to figure out what this "something" is, but we don't have any clue here unfortunately.
Getting a more powerful machine would mitigate the problem if/when it happens again.

@GopherJ
Contributor Author

GopherJ commented Nov 1, 2021

@tomaka OK, we will investigate it and let you know. Thanks a lot for the help.

@GopherJ
Contributor Author

GopherJ commented Nov 3, 2021

@tomaka we didn't find any attack, so I believe it's OK on that part. We also don't have any error logs yet, but we constantly have one collator disconnected from the others.

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

this time it's heiko-collator-0

@GopherJ
Contributor Author

GopherJ commented Nov 4, 2021

Hi @tomaka, I think there is a bug. Restarting the nodes helps for some hours, but then they start losing peers again.

Do you have any advice we can apply? Until now we don't have any error logs; the number of peers just goes down periodically.

@GopherJ
Contributor Author

GopherJ commented Nov 8, 2021

[screenshot]

This really doesn't look good. What kind of information can I post to help fix this issue?

@tomaka
Contributor

tomaka commented Nov 8, 2021

Please check the CPU usage of the nodes and see if it correlates with the disconnections.

@GopherJ
Contributor Author

GopherJ commented Nov 8, 2021

@tomaka
[screenshot: CPU usage]

I think it looks totally fine. Actually we didn't find anything strange except the Request to peer ... failed: Refused messages.

@GopherJ
Contributor Author

GopherJ commented Nov 11, 2021

Hi @tomaka, we are still suffering from this issue; we are restarting our collators every day :) It helps, but not much.

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

As you can see from the following, heiko-collator-4 and heiko-collator-7 got refused by the others today:
[screenshot: telemetry]

@GopherJ
Contributor Author

GopherJ commented Nov 18, 2021

I tried using reserved-nodes, but it keeps disconnecting no matter what I do.
[screenshot]

@GopherJ
Contributor Author

GopherJ commented Nov 22, 2021

After adding --state-cache-size 1 it looks much better now, but I'm not sure yet if it's resolved.

@GopherJ
Contributor Author

GopherJ commented Nov 23, 2021

For the last three days our collators have been working fine after adjusting --state-cache-size to 1, so I think this issue can be closed.

@GopherJ GopherJ closed this as completed Nov 23, 2021
@nazar-pc
Contributor

I see the same issue on my network with a few nodes. There was nothing special happening there; the node was working fine for a few days and suddenly this happened. Shouldn't it "just work" without needing to disable the state cache?

I think this should be reopened for further investigation; there is clearly something wrong if it happens on a simple network with no misbehavior.

@GopherJ GopherJ reopened this Dec 16, 2021
@GopherJ
Contributor Author

GopherJ commented Dec 16, 2021

@tomaka @bkchr @nazar-pc I'm reopening this because there are definitely some issues with the cache.

@GopherJ GopherJ changed the title from "Request to peerId failed: Refused" to "Lose peers due to state-cache" Dec 16, 2021
@GopherJ GopherJ changed the title from "Lose peers due to state-cache" to "Lost peers due to state-cache" Dec 16, 2021
@bkchr
Member

bkchr commented Dec 16, 2021

With what cache? The state cache?

@nazar-pc
Contributor

We have a simple and reliable way to reproduce it (though inconvenient).

With a blockchain size (chains directory) of ~200 GB and 2 nodes available, a third node joining the network is fairly quickly banned by the other nodes. This happens every time, long before the 100 GB mark. The last attempt failed after just a few gigabytes.

Are there some parameters right now that can be tweaked to fix this behavior?

@bkchr
Member

bkchr commented Feb 13, 2023

@nazar-pc is this still relevant?

@nazar-pc
Contributor

Likely yes. What we did on our side is fork Substrate to decrease the amount of data a block range response can contain, and together with #12146 it got much better, but I don't believe it is solved fundamentally.

@bkchr
Member

bkchr commented Feb 13, 2023

What limit did you decrease?

@nazar-pc
Contributor

MAX_BODY_BYTES: 5f851f1
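
For context, the fork boils down to lowering a constant in the block request handler, roughly along these lines (the reduced value here is illustrative and not necessarily the one used in commit 5f851f1; the 8 MB default is mentioned further down in this thread):

// Upper bound on the size of a block response body. Substrate's default is
// 8 MB; lowering it caps the worst-case response size so it can be
// transferred within the request timeout on slow links.
// The value below is illustrative only.
const MAX_BODY_BYTES: usize = 2 * 1024 * 1024;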

@bkchr
Member

bkchr commented Feb 14, 2023

CC @altonen the title here is probably misleading, as this should not be related to the state-cache (which also no longer exists today); it is maybe some networking issue, given that it improves when the mentioned max body size is decreased.

@nazar-pc could you provide your mentioned reproduction?

@nazar-pc
Contributor

I'm afraid it is no longer there as we are in the middle between previous and current testnet right now. I can request the team to try and reproduce it, but it may take a while.

The gist of our observation is the following: the node was making a request, but the response was so big (initially significantly bigger than the default MAX_BODY_BYTES due to old implementation logic) that the node wasn't always able to pull it within the keepalive timeout (I believe that is what it is called). With #12146 and the MAX_BODY_BYTES decrease we lowered the maximum bandwidth expected from nodes before the timeout kicks in, and that way the connection wasn't dropped. So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway) it had the same effect as if the connection was idling for a long time and timing out.

This was primarily reproducible in slow network conditions, like pulling data from behind the Chinese firewall from a server in the EU, as one of the easier-to-reproduce examples. Those who had strong connectivity didn't have issues, but we support a diverse set of users in our protocol and need to be prepared for these less-than-optimal conditions.

@altonen
Contributor

altonen commented Feb 15, 2023

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

@bkchr
Member

bkchr commented Feb 15, 2023

So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway) it had the same effect as if the connection was idling for a long time and timing out.

Ty! Yeah, maybe our timeout mechanism isn't implemented properly. I don't know how it works exactly, but it should detect that progress is being made and not require some kind of ping-pong (I think that is what we are doing) in these cases.

@dmitry-markin
Contributor

dmitry-markin commented Feb 15, 2023

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

I can have a look at this and try reproducing the issue with timeouts on slow networks after merging some in-progress PRs.

@dmitry-markin dmitry-markin self-assigned this Feb 16, 2023
@dmitry-markin
Contributor

Possibly a related issue: #12105

@dmitry-markin
Contributor

dmitry-markin commented Feb 27, 2023

So far it looks like the theoretical bandwidth requirement, based on a MAX_BODY_BYTES of 8 MB and the block request protocol timeout of 20 s, is 8 MB / 20 s = 3.4 Mbit/s. In my (arbitrary) tests syncing the Polkadot chain, 1 Mbit/s was enough to download blocks around block number 14357815, but a 512 kbit/s connection limit led to the block request protocol timing out after 20 seconds, and the peer we are syncing from being disconnected and backed off. Things can theoretically be improved client-side by reducing the number of blocks requested in a single request via MAX_BLOCKS_TO_REQUEST. Reducing it from 64 to 32 made sync work in my setup on 512 kbit/s.
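
Spelling out the arithmetic (the 3.4 Mbit/s figure corresponds to reading 8 MB as 8 MiB; constant names are the ones quoted above):

fn main() {
    // Worst-case block response size and the block request protocol timeout.
    const MAX_BODY_BYTES: f64 = 8.0 * 1024.0 * 1024.0; // 8 MiB
    const TIMEOUT_SECS: f64 = 20.0;

    // Minimum sustained bandwidth needed to pull a full response before the timeout.
    let required_mbit_s = MAX_BODY_BYTES * 8.0 / TIMEOUT_SECS / 1_000_000.0;
    println!("required: {:.1} Mbit/s", required_mbit_s); // ~3.4 Mbit/s

    // Halving MAX_BLOCKS_TO_REQUEST (64 -> 32) roughly halves the worst-case
    // response size, and therefore the bandwidth needed before the timeout hits.
    println!("with half the blocks: {:.1} Mbit/s", required_mbit_s / 2.0); // ~1.7 Mbit/s
}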

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)

@nazar-pc
Contributor

Yes, we have decreased the requirements in our protocol to ~1 Mbps; the unfortunate thing is that those are constants, so we had to fork Substrate to make those changes.

@bkchr
Member

bkchr commented Feb 27, 2023

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.
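
A minimal sketch of that idea, purely for illustration (the names, constants and policy below are hypothetical and not Substrate's sync implementation): start with a small request and grow or shrink the per-request block count based on how long the previous response took relative to the timeout.

use std::time::Duration;

// Hypothetical bounds and timeout; real values would come from the sync config.
const MIN_BLOCKS: u32 = 4;
const MAX_BLOCKS: u32 = 64;
const TIMEOUT: Duration = Duration::from_secs(20);

// AIMD-style adjustment: grow additively while responses come back quickly,
// back off multiplicatively once a response gets close to the timeout.
fn next_request_size(current: u32, last_response_time: Duration) -> u32 {
    if last_response_time < TIMEOUT / 4 {
        (current + 4).min(MAX_BLOCKS)
    } else if last_response_time > TIMEOUT / 2 {
        (current / 2).max(MIN_BLOCKS)
    } else {
        current
    }
}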

@dmitry-markin
Contributor

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.

In theory we could, but adaptive congestion control on the application layer would be more complicated than just calculating the bandwidth once and setting the block count in the request. We would also have to account for changing network conditions, the influence of parallel transmission of data by other protocols / parallel downloads from multiple peers (not sure if we have this), and control stability issues. I.e., the congestion control algorithms in TCP are there for a reason.

We can implement some POC, but the algorithm must be extremely dumb and straightforward to not introduce new issues (I especially worry about the stability of the feedback loop).

@altonen
Contributor

altonen commented Feb 28, 2023

I agree with @dmitry-markin here

There's already so much abstraction below the request-response protocol that implementing adaptive streaming all the way up there sounds quite difficult to get working properly, except for the dumb case. The request-response protocol would have imperfect information about what is happening in the stream, because other protocols are also being multiplexed and consume some amount of the available bandwidth. Also, once the amount of bandwidth is over-estimated and too much data is sent, the connection is closed, which is quite a bit harsher than TCP dropping packets, adjusting the window size and resending those packets.

Maybe we could make those knobs more available so people wouldn't have to fork Substrate.

@bkchr
Member

bkchr commented Mar 1, 2023

Good points! It was a random idea of mine that I just wanted to write down. I hadn't put that much thought into it!

Maybe we could make those knobs more available so people wouldn't have to fork Substrate.

Yes! This wasn't really possible up to now, but now with sync almost being independent it should work!

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/april-updates-for-substrate-and-polkadot-devs/2764/1
