rococo asset-hub and others don't always advertise their collations #3519
Checked a few occasions; it seems the collators can't dial the validator they are assigned to, some examples: Collator was assigned to validator The validators seem to be receiving collations when they get assigned to other parachains, e.g.: https://grafana.teleport.parity.io/goto/rM-xC9TIR?orgId=1
@dmitry-markin @altonen @BulatSaif any thoughts on whether this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes? It seems like
I was looking into
There's a I looked at some of the dial failures and I don't think they're a node problem, except maybe #498, which I also saw in some of the dial failures. But I agree that they look more like a network problem.
Port is incorrect, it should be
Full CLI:
These logs look interesting: the node sees the correct port:
but later gets a new external address with an incorrect port.
With subp2p-explorer this is what I get:
It seems it is advertising everything, although all of them seem to be tried:
Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and the other where it seems that even if the collator is connected we aren't able to advertise it.
@BulatSaif could you also enable, thank you!
One reason for a collation not being advertised in this case is a missing PeerView update from the validator.
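For readers less familiar with the collator protocol, here is a minimal, purely illustrative sketch of that gating (hypothetical `PeerId`/`Hash` stand-ins, not the actual polkadot-sdk types): a collation is only advertised to a validator once that validator's view, containing the relay parent, has been received, so a lost view update silently suppresses the advertisement.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real collator-protocol types.
type PeerId = String;
type Hash = [u8; 32];

/// Latest relay-chain view received from each connected validator.
#[derive(Default)]
struct PeerViews {
    views: HashMap<PeerId, HashSet<Hash>>,
}

impl PeerViews {
    fn on_view_update(&mut self, peer: PeerId, view: HashSet<Hash>) {
        self.views.insert(peer, view);
    }

    /// A collation built on `relay_parent` is only advertised to a validator
    /// whose last known view contains that relay parent; if the view update
    /// never arrives, the advertisement is silently skipped.
    fn can_advertise(&self, peer: &PeerId, relay_parent: &Hash) -> bool {
        self.views
            .get(peer)
            .map_or(false, |view| view.contains(relay_parent))
    }
}

fn main() {
    let mut peer_views = PeerViews::default();
    let validator: PeerId = "validator-1".into();
    let relay_parent: Hash = [1u8; 32];

    // No view update received yet: nothing gets advertised.
    assert!(!peer_views.can_advertise(&validator, &relay_parent));

    // Once the update arrives, the advertisement can go out.
    peer_views.on_view_update(validator.clone(), HashSet::from([relay_parent]));
    assert!(peer_views.can_advertise(&validator, &relay_parent));
}
```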
Regarding the connectivity, something definitely seems to be off; for example, I can telnet to some of the ports here:
But I can't telnet here:
Done, enabled on
My bad, I wanted it on the asset-hub-collators; could you please add it there as well? Thank you
Done ✔️
This was spot-on, it seems that the collator and validator are not in sync when it comes to You can see that the collator Connects and Disconnects every time it has something to send to the validator, but the validator bridge doesn't get the same signals. The validator keeps thinking the collator is connected while the collator thinks it is disconnected from the validator; that happens for a few seconds and it recovers after a few blocks. So, I think we've got two problems in this flow:
One possible fix would be to keep connections to the backing group open as long as it doesn't change. From the validator's perspective, inactive collators are only disconnected after 4 blocks, so we should be fine.
The disconnects on the collator side happen because of this: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/src/protocol_controller.rs#L520
The collator removes all the peers from reserved_peers every RECONNECT_TIMEOUT. I guess a quick workaround to prevent this condition would be to increase
Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?
I agree that we should keep them as reserved; I was just looking for a quick solution to test that, if collators don't disconnect from validators so often, we prevent them from getting out of sync and missing collation advertisements. My question for you @altonen: is the out-of-sync behaviour I describe above just a flavour of the known issue you described here #2995 (comment)?
Yeah it could be the same issue, at least by the sound of it. There is actually a plan to finally fix it, here's some information: https://github.com/paritytech/devops/issues/3164#issuecomment-1968623373 If you want/can, you could try deploying #3426 on one of the rococo collators/validators and see if the issue disappears. If the issue is the one discussed in #2995, it should disappear with #3426
Looking at rococo-asset-hub there seem to be a lot of instances where collators did not advertise their collations. While there are multiple problems there, one of them is that we are connecting and disconnecting to our assigned validators every block, because on reconnect_timeout (every 4s) we call connect_to_validators, and that will produce 0 validators when all went well, so set_reserved_peers called from validator discovery will disconnect all our peers. More details here: #3519 (comment) Now, this shouldn't be a problem, but it stacks with an existing bug in our network stack where, if we disconnect from a peer, the peer might not notice it, so it won't detect the reconnect either and it won't send us the necessary view updates, so we won't advertise the collation to it; more details here: #3519 (comment) To avoid hitting this condition that often, let's keep the peers in the reserved set for the entire duration we are allocated to a backing group. Backing group sizes (1 on rococo, 3 on kusama, 5 on polkadot) are really small, so this shouldn't lead to that many connections. Additionally, the validators would disconnect us anyway if we don't advertise anything for 4 blocks. Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
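To make the before/after concrete, here is a small self-contained sketch of the two strategies (hypothetical `AuthorityId` type and deliberately simplified logic; the real code lives in the collator protocol's validator discovery and sc-network's protocol controller):

```rust
use std::collections::HashSet;

// Hypothetical stand-in for an authority identifier.
type AuthorityId = u32;

/// Simplified model of the collator-side reserved-peer handling.
struct ReservedSet {
    group: HashSet<AuthorityId>,
}

impl ReservedSet {
    /// Old behaviour (simplified): every reconnect tick the reserved set is
    /// recomputed from scratch; when the request resolves to an empty set
    /// (because everything already went well), every reserved peer is
    /// dropped and disconnected, only to be re-added a few seconds later.
    fn recompute_every_tick(&mut self, requested: HashSet<AuthorityId>) -> Vec<AuthorityId> {
        let dropped = self.group.difference(&requested).copied().collect();
        self.group = requested;
        dropped // these peers get disconnected
    }

    /// Direction of the fix: keep the backing group reserved for as long as
    /// the group assignment itself is unchanged, so a transiently empty
    /// request no longer churns the connections.
    fn keep_while_group_unchanged(&mut self, new_group: HashSet<AuthorityId>) -> Vec<AuthorityId> {
        if new_group.is_empty() || new_group == self.group {
            return Vec::new();
        }
        let dropped = self.group.difference(&new_group).copied().collect();
        self.group = new_group;
        dropped
    }
}

fn main() {
    let mut set = ReservedSet { group: HashSet::from([1, 2, 3]) };
    // With the old behaviour an empty per-tick request drops the whole group.
    assert_eq!(set.recompute_every_tick(HashSet::new()).len(), 3);

    let mut set = ReservedSet { group: HashSet::from([1, 2, 3]) };
    // With the sketched fix nothing is dropped until the group really changes.
    assert!(set.keep_while_group_unchanged(HashSet::new()).is_empty());
    assert_eq!(set.keep_while_group_unchanged(HashSet::from([4])).len(), 3);
}
```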
@BulatSaif: Can you help me with deploying asset-hub on rococo from https://github.com/paritytech/polkadot-sdk/pull/3544/files, so I can check if it improves the situation?
Over the weekend I saw a lot of dial failures because of this
It's a peer ID mismatch, which can happen because the DHT contains stale peer records: when the local node attempts to dial a peer with a stale record, the dial may succeed, but the Noise handshake fails because the peer is using a new ID.
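As a toy illustration of that failure mode (made-up types and addresses, not the actual libp2p/Noise code): the dial itself can succeed at the transport level, and the connection is only rejected afterwards because the authenticated identity does not match the one stored in the stale record.

```rust
/// Hypothetical sketch of why a stale DHT record leads to a failed dial.
#[derive(Debug, PartialEq, Eq, Clone)]
struct PeerId(String);

struct DhtRecord {
    address: String,
    expected_peer: PeerId,
}

/// The connection is rejected after the handshake because the authenticated
/// peer ID does not match the ID pinned in the (stale) DHT record.
fn accept_connection(record: &DhtRecord, authenticated_peer: &PeerId) -> Result<(), String> {
    if &record.expected_peer == authenticated_peer {
        Ok(())
    } else {
        Err(format!(
            "peer id mismatch for {}: expected {:?}, got {:?}",
            record.address, record.expected_peer, authenticated_peer
        ))
    }
}

fn main() {
    let stale = DhtRecord {
        address: "/ip4/10.0.0.1/tcp/30333".into(), // hypothetical address
        expected_peer: PeerId("old-peer-id".into()),
    };
    // The node behind the address has restarted with a new identity.
    let current = PeerId("new-peer-id".into());
    assert!(accept_connection(&stale, &current).is_err());
}
```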
Looked a bit at this error: https://grafana.teleport.parity.io/goto/6Nz6bIASR?orgId=1
When I look with subp2p-explorer, there are multiple peers reporting the same IP and port. I wouldn't expect that to happen, e.g.
Is this something expected?
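To double-check that observation, one quick way to surface it from subp2p-explorer-style output is to group the discovered (address, peer id) pairs by address and flag anything claimed by more than one peer id. A throwaway sketch with made-up data:

```rust
use std::collections::HashMap;

/// Group discovered (address, peer id) pairs by address and return every
/// address that is reported by more than one peer id.
fn shared_addresses(records: &[(&str, &str)]) -> Vec<(String, Vec<String>)> {
    let mut by_addr: HashMap<String, Vec<String>> = HashMap::new();
    for (addr, peer) in records {
        by_addr
            .entry((*addr).to_string())
            .or_default()
            .push((*peer).to_string());
    }
    by_addr
        .into_iter()
        .filter(|(_, peers)| peers.len() > 1)
        .collect()
}

fn main() {
    // Hypothetical entries standing in for real subp2p-explorer output.
    let discovered = [
        ("1.2.3.4:30333", "12D3KooW...A"),
        ("1.2.3.4:30333", "12D3KooW...B"),
        ("5.6.7.8:30333", "12D3KooW...C"),
    ];
    for (addr, peers) in shared_addresses(&discovered) {
        println!("{addr} is reported by {} different peer ids: {peers:?}", peers.len());
    }
}
```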
…up (#3544) Looking at rococo-asset-hub #3519 there seem to be a lot of instances where collators did not advertise their collations. While there are multiple problems there, one of them is that we are connecting and disconnecting to our assigned validators every block, because on reconnect_timeout (every 4s) we call connect_to_validators, and that will produce 0 validators when all went well, so set_reserved_peers called from validator discovery will disconnect all our peers. More details here: #3519 (comment) Now, this shouldn't be a problem, but it stacks with an existing bug in our network stack where, if we disconnect from a peer, the peer might not notice it, so it won't detect the reconnect either and it won't send us the necessary view updates, so we won't advertise the collation to it; more details here: #3519 (comment) To avoid hitting this condition that often, let's keep the peers in the reserved set for the entire duration we are allocated to a backing group. Backing group sizes (1 on rococo, 3 on kusama, 5 on polkadot) are really small, so this shouldn't lead to that many connections. Additionally, the validators would disconnect us anyway if we don't advertise anything for 4 blocks. ## TODO - [x] More testing. - [x] Confirm on rococo that this is improving the situation. (It doesn't, but only because other things are going wrong there.) --------- Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Updates: The current state of affairs is that validators seem to be publishing in the DHT ports and IPs that are not meant to be used from outside, so collators will randomly fail to connect to validators when using these ports and IPs: The addresses the collator sees for that node, these are all the IPs that the validator
The fix here should be to find a way not to expose these addresses externally; @altonen suggested something like this:
@BulatSaif @PierreBesson @altonen: Feel free to add more context.
Maybe we need to prune old addresses more eagerly. However, I don't get how the collator is not able to connect to the validator. Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.
I'm really a noob around that part, so I'm fine with anything that would get us out of this flaky state.
In this case none of those ports are actually meant to be reachable from outside, and the dials fail with because that IP is allocated somewhere else.
However, in the past few days I did find cases where the dial was failing even when trying an expected port.
There are three things we should try:
Bullet 3 I understand the least, as the node should report at least one correct address. This one is IMO the most important, as it directly affects connectivity.
libp2p concurrently dials 8 known addresses: In substrate, we just log such errors once the connection is finally established: polkadot-sdk/substrate/client/network/src/service.rs Lines 1513 to 1514 in b0f34e4
If we expect nodes to have more than 8 addresses in the DHT, it probably makes sense to increase the concurrency factor, as we currently don't set the transport timeout for p2p connections (and with default TCP SYN retries it's about 2 mins on Linux). I would also set the transport timeout to some defined smaller value, like 15-20 secs.
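This is not the actual libp2p transport configuration, but a plain-std sketch of why the missing timeout hurts: without an explicit bound, a dial to a dead address is only abandoned once the OS runs out of TCP SYN retries, whereas a bounded dial fails fast and frees the slot for the next known address. The address below is hypothetical.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Dial a TCP address with an explicit upper bound instead of relying on the
/// OS-level SYN retry schedule (roughly two minutes on Linux by default).
fn dial_with_bound(addr: SocketAddr, timeout: Duration) -> std::io::Result<TcpStream> {
    TcpStream::connect_timeout(&addr, timeout)
}

fn main() {
    // Hypothetical stale address taken from a DHT record; nothing listens here.
    let stale: SocketAddr = "10.255.255.1:30333".parse().unwrap();
    match dial_with_bound(stale, Duration::from_secs(15)) {
        Ok(_) => println!("connected"),
        Err(e) => println!("dial abandoned after the 15s bound: {e}"),
    }
}
```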
Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?
The arguments these nodes are started with are:
From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed on the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.
Extending Alex's reply, my understanding of the issue is that the pod-private port 30333 is accidentally exposed instead of the public address port, 32685. The private port could be exposed either by port reuse, whereby it binds to 30333 and the remote nodes somehow observe the node dialing from this address, or through address translation (link, link), where it translates the observed address to its listen address. To verify what is happening, you may need libp2p trace logs from both the listener and the dialer. FWIW, I think there were two different issues: peer ID mismatch and connection refused.
Remote nodes learn about our addresses via the Identify protocol and add the reported addresses to the DHT. rust-libp2p combines listen addresses with external addresses and sends all of them in the Identify message: polkadot-sdk/substrate/client/network/src/service.rs Lines 1415 to 1419 in b0f34e4
What we can try doing to ensure remote nodes do not learn our listen addresses (and do not add them to the DHT) is to not report them to the Identify behaviour here:
In this case it's not (only) an accident, but the current logic of Identify, which always publishes our listen addresses in addition to observed/set external addresses.
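A minimal sketch of the proposed filtering, using plain std socket addresses rather than the actual multiaddr/Identify plumbing in sc-network: before our addresses are handed to Identify (and from there to the DHT), drop anything that is clearly not reachable from outside, such as loopback, RFC 1918 and link-local ranges. The pod-internal address in the example is hypothetical; the public one is the NodePort address mentioned later in this thread.

```rust
use std::net::IpAddr;

/// Keep only addresses that are plausibly reachable from outside, so
/// pod-internal listen addresses never end up in remote address books.
fn publishable(addrs: &[(IpAddr, u16)]) -> Vec<(IpAddr, u16)> {
    addrs
        .iter()
        .copied()
        .filter(|(ip, _port)| match ip {
            IpAddr::V4(v4) => {
                !v4.is_loopback()
                    && !v4.is_private()
                    && !v4.is_link_local()
                    && !v4.is_unspecified()
            }
            IpAddr::V6(v6) => !v6.is_loopback() && !v6.is_unspecified(),
        })
        .collect()
}

fn main() {
    let ours: Vec<(IpAddr, u16)> = vec![
        ("10.0.5.23".parse().unwrap(), 30333),     // hypothetical pod-internal listen address
        ("35.204.131.85".parse().unwrap(), 32685), // public NodePort address
    ];
    // Only the public address survives the filter.
    let expected: Vec<(IpAddr, u16)> = vec![("35.204.131.85".parse().unwrap(), 32685)];
    assert_eq!(publishable(&ours), expected);
}
```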
@BulatSaif is our chief burn-in officer; can you please deploy asset-hub with the above PR?
Looking a bit at the logs Alex provided above:
Advertising DHT addresses
TL;DR: I don't think #3657 solves the issue here. @altonen @dmitry-markin could you double-check please? 🙏 Different PeerID but Same Address
In this case, the DHT record contains the address I'd have to look again at subp2p-explorer to make sure I report the data correctly there. Although Alex mentioned in #3519 (comment) that the This could happen when the nodes that we manage share the external address. And they go down this code path:
From this line, if the I think this could happen if the nodes are placed under the same network (likely)?
Here is how you can reproduce the issue locally:
The second discovered address is incorrect; 212.227.198.212 is your ISP IP. This mainly affects testnets, since they are usually deployed within a container environment such as Kubernetes. However, I've also seen cases where community mainnet RPC bootnodes were deployed in a private network and exposed with port forwarding on the edge router; they have the same issue.
@lexnv My understanding of the issue is that the validators are run on spot instances, which have both:
This ephemeral address is reported via Identify to remote nodes, and they add it to the DHT because it's global. This ephemeral address changes periodically and gets added to the DHT again and again. As a result, we have plenty of obsolete addresses in the DHT. And as we can see from #3519 (comment), all those addresses are global: they are the addresses the pods had at some point.
@alexggh @dmitry-markin #3656 is deployed on asset-hub-rococo
Looking at the logs from grafana:
@alexggh just to clarify, how is the listen address mapped to the public address (i.e., why would dialing 35.204.131.85:32685 reach the node)? Is there some DNAT / port forwarding rule in the firewall?
@BulatSaif should be able to help you here ^^. I'm not from the infra team, so I'm not able to explain our networking setup to you.
We use Kubernetes NodePort, which has various implementations, but the simplest one involves iptables rules. Any traffic received on the specific port by any VM in the cluster will be forwarded to the correct pod.
We had a discussion with @BulatSaif about validators' networking in kubernetes on spot instances, but it's still not completely clear how multiple external addresses end up in the authority discovery DHT record. For sure, there are two issues:
I'm working on a fix, #3657, that will resolve issue 1 once extended to authority discovery DHT records and all addresses, including external addresses not specified with
@BulatSaif can we deploy up-to-date master to one of the validators to see what #3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record.
@bkchr root-caused parts of this issue here: #3673 (comment) I think because of this, we are going to truncate out the good addresses. polkadot-sdk/substrate/client/authority-discovery/src/worker.rs Lines 571 to 572 in a6713c5
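A sketch of the ordering that would avoid that (hypothetical helper, not the actual authority-discovery worker code): put the operator-specified public addresses first and deduplicate before applying the address limit, so the truncation drops the noisy auto-discovered entries rather than the good ones. The addresses and the limit below are made up for illustration.

```rust
/// Order addresses for publishing in the DHT record: explicitly set public
/// addresses first, then the auto-discovered ones, with duplicates removed,
/// and only then truncate to the maximum record size.
fn order_for_publishing(
    explicit_public: Vec<String>,
    discovered: Vec<String>,
    max_addresses: usize,
) -> Vec<String> {
    let mut out = Vec::new();
    for addr in explicit_public.into_iter().chain(discovered) {
        if !out.contains(&addr) {
            out.push(addr);
        }
    }
    out.truncate(max_addresses);
    out
}

fn main() {
    let record = order_for_publishing(
        // Hypothetical operator-specified public address.
        vec!["/dns/validator.example/tcp/30333".into()],
        vec![
            "/ip4/10.0.5.23/tcp/30333".into(), // ephemeral pod addresses picked up over time
            "/ip4/10.0.7.91/tcp/30333".into(),
            "/dns/validator.example/tcp/30333".into(), // duplicate of the explicit address
        ],
        2,
    );
    assert_eq!(record.len(), 2);
    // The explicitly set public address always survives the truncation.
    assert_eq!(record[0], "/dns/validator.example/tcp/30333");
}
```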
…ritytech#3757) Make sure public addresses explicitly set by the operator go first in the authority discovery DHT records. Also update `Discovery` behavior to eliminate duplicates in the returned addresses. This PR should improve the situation with paritytech#3519. Obsoletes paritytech#3657.
Looking at block times for rococo-asset-hub and others, we can see that there are moments when they decide not to advertise their collation to validators; the logs point to that:
On rococo the validator group size is 1, so there isn't a lot of redundancy; however, this seems to happen way too often to be ignored. Additionally, the asset-hub has async backing enabled, but that does not seem to be related, because other parachains seem to be affected as well.
Looking at the logs it seems like sometimes the collators can't connect to their assigned validators.
Next steps