rococo asset-hub and others don't always advertise their collations #3519

Open
alexggh opened this issue Feb 29, 2024 · 48 comments

alexggh (Contributor) commented Feb 29, 2024

Looking at block times for rococo-asset-hub and others, we can see that there are moments when they decide not to advertise their collations to validators; the logs point to this:


2024-02-21 22:29:11.154 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x4d5b87ffa8d9cba09462a460539fcedbdf2bb876ebfc736a283978a0383348c8 candidate_hash=0xeb25ac57d43f3eccad16b0a5cae23054e72207c669da69972665462e56ce5cd4 peer_id=12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn traceID=312564189526210266554051532702187925588
2024-02-21 22:29:11.154 DEBUG tokio-runtime-worker parachain::collator-protocol: [Relaychain] Advertising collation. relay_parent=0x244c7422a2e3e542fbe96e1d3d0ee3c53af16278516294dd0094e7f8b3bba9b2 candidate_hash=0xaad65ae4deaa25590db854854d95480b03118f0864e1b66fa11053b26cf3321c peer_id=12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn traceID=227081754360958899348741667046089377803
2024-02-21 22:29:06.168  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x0fdff979355ef2974910c0cdb9c1736dc2cab86e56e1fc7a760f0b5295fb1fde pov_hash=0x65327906ba3fbb2144ab19cb4eb9c91a9c09d39652f2a4e5f0ab5a83bd82f415 traceID=21101362059343468297627011593342645101
2024-02-21 22:29:00.160  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xd0a234db3548163f63b8736ff0592e77a655c551249f4e2aa1e6ed35869ca893 pov_hash=0xe7d9cf62f1163a261a9cc5b455118394c51e8dc8d54c4b6c52d563268211a141 traceID=277321647267101977391502449035806584439
2024-02-21 22:28:54.162  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x10d8dbe219d596fe4179e98d133ae619a52ba5e4134fd5beee4b2b42d5b79db5 pov_hash=0x159b57ed3f55b321ddcd1eb0e5d3b219de08e3d46dcb6cdf7fe6e490878addf4 traceID=22393643815265444164061667538671167001
2024-02-21 22:28:48.140  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x476bc46903194d9d49c425875dbea01daa02183f921b0a4dc873dbe0ffd995c1 pov_hash=0x8046229d7c9b29f713a1e5a91a9a721853dd9ecc6d02302400911e2c70b9d711 traceID=94934747136790677956209421564648529949
2024-02-21 22:28:48.140  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xdd491d0c0d7328968a652ba1f8d6557d3c173b590fbfcbb4fb8cd985b254abcc pov_hash=0x8dfc7ee468e94b0b6c4a8c44a0e57502c0416fb13521be4eaa492f97d5e530fd traceID=294139013883918429666814951162104927613
2024-02-21 22:28:47.996  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0xe29f3817fb510ea6eb22c1ec0720ce958c6d191a7deb0117e8897b8214556ef4 pov_hash=0xb4c770cf5ec29a2d85791dece202f00d445c15219ae24a1323f314a1a9751d85 traceID=301232239962862296839600241828773809813

On rococo the validator group size is 1, so there isn't a lot of redundancy; however, this seems to happen far too often to be ignored. Additionally, asset-hub has async backing enabled, but that does not seem to be related, because other parachains appear to be affected as well.

Looking at the logs it seems like sometimes the collators can't connect to their assigned validators.

Next steps

  • Run one or two collators with trace enabled:
sub-libp2p=trace
peerset=trace
@alexggh alexggh changed the title rococo asset-hub and other don't always advertise their collations rococo asset-hub and others don't always advertise their collations Feb 29, 2024
alexggh (Contributor Author) commented Feb 29, 2024

Checked a few occasions; it seems the collators can't dial the validator they are assigned to. Some examples:

Collator was assigned to validator rococo-validator-a-node-7 https://grafana.teleport.parity.io/goto/dmUVC9oIg?orgId=1
Collator was assigned to validator rococo-validator-h-node-5 https://grafana.teleport.parity.io/goto/nWjOjroIg?orgId=1

The validators seem to be receiving collations when they get assigned to other parachains, e.g: https://grafana.teleport.parity.io/goto/rM-xC9TIR?orgId=1

Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } }) 

@dmitry-markin @altonen @BulatSaif any thoughts if this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes.

It seems like rococo-rockmine-collator-experimental-node-0 https://grafana.teleport.parity.io/goto/BUtrj9TIg?orgId=1 can connect to the same validators, while the collators producing blocks cannot, so this leads me to think that it is just a network partition problem.

altonen (Contributor) commented Feb 29, 2024

I was looking into rococo-rockmine-collator-node-0 logs and there's something in the logs that I don't understand (logs are in reverse):

2024-02-29 10:28:06.117 parachain::network-bridge-rx: [Relaychain] action="PeerDisconnected" peer_set=Collation peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d")
2024-02-29 10:28:06.117 peerset: [Relaychain] Reserved peer 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d (Outbound) dropped from SetId(3).    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] External API <= Closed(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] PSM <= Dropped(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) <= Close(SetId(3))    
2024-02-29 10:28:06.117 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => CloseDesired(SetId(3))    
2024-02-29 10:28:04.481 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWQELiM7oVu7tQPgj51rWN733a1FoDEg2XVHQhzJyUCXYR, SetId(3), ConnectionId(9856)): Disabled.    
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:04.104 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0
2024-02-29 10:28:04.100 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWRsMJngwJwc614i8Rbik2nsm9d4NuKTcY8cvRJH1bupsE, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/23006" }, ConnectionId(9910)): Not requested by PSM, disabling.    
2024-02-29 10:28:03.022  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x0f105c1e361e7ddba3e93e02820f39d50b21f3a1c694cf3962bc9e8d94e1f9d8 pov_hash=0xf3d1343ed98b05a2b765556253d3ba7c64308920182b74103eeadeb59cc67c42 traceID=20023365061787759369622147911581120981
2024-02-29 10:28:01.555 sub-libp2p: [Relaychain] Libp2p => Disconnected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), ConnectionId(9907)): Disabled.    
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/33269" }, ConnectionId(9907)): Secondary connection. Leaving closed.    
2024-02-29 10:28:01.469 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWJ73BzHGy5WYG3tB6KWoT184Jtp8QhSuHQgjDWUWVMZy6, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/62370" }, ConnectionId(9906)): Not requested by PSM, disabling.    
2024-02-29 10:28:01.150 sub-libp2p: [Relaychain] Libp2p => Connected(12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2, SetId(3), Listener { local_addr: "/ip4/10.20.73.24/tcp/30333", send_back_addr: "/ip4/10.20.73.1/tcp/46197" }, ConnectionId(9905)): Not requested by PSM, disabling.    
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] External API <= Message(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:00.829 sub-libp2p: [Relaychain] Handler(ConnectionId(9881)) => Notification(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3), 38 bytes)    
2024-02-29 10:28:00.828 parachain::network-bridge-rx: [Relaychain] action="PeerConnected" peer_set=Collation version=2 peer=PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d") role=Authority
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] External API <= Open(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3))    
2024-02-29 10:28:00.828 sub-libp2p: [Relaychain] Handler(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, ConnectionId(9881)) => OpenResultOk(SetId(3))    
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] Handler(PeerId("12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d"), ConnectionId(9881)) <= Open(SetId(3))    
2024-02-29 10:28:00.827 sub-libp2p: [Relaychain] PSM => Connect(12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d, SetId(3)): Enabling connections.    
2024-02-29 10:28:00.826 peerset: [Relaychain] Connecting to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d on SetId(3) (0/0 num_out/max_out).    
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators resolved request peer_set=Collation num_peers=1 removed=0
2024-02-29 10:28:00.152 parachain::validator-discovery: [Relaychain] New ConnectToValidators request peer_set=Collation requested=1 failed_to_resolve=0

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

I looked at some of the dial failures and I don't think they're a node problem, except maybe #498 which I also saw in some of the dial failures. But I agree that they look more like a network problem.

BulatSaif (Contributor) commented Feb 29, 2024

Reach failure for Some(PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH")) through "/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH": Other(Custom { kind: Other, error: Custom { kind: Other, error: Timeout } }) 

@dmitry-markin @altonen @BulatSaif any thoughts if this is a bug in our connecting logic or just an infrastructure problem where we don't have full connectivity to some nodes.

The port is incorrect; it should be 32685, not 30333. Fixed ports are only used on bootnodes.
Also, the rococo validators are on spot instances and may restart and change VM (IP).
We use these two flags to set the address:

 --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

30333 is the internal port and should not be shared.

Full cli:

polkadot --name=rococo-validator-a-node-7 --base-path=/chain-data --keystore-path=/keystore --chain=rococo --validator --database=paritydb --state-pruning=1000 --prometheus-external --prometheus-port 9615 --unsafe-rpc-external --rpc-port=9944 --rpc-cors=all --rpc-methods=unsafe --node-key-file /keystore/node-key --log=parachain=debug --insecure-validator-i-know-what-i-do --enable-offchain-indexing true --public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

BulatSaif (Contributor) commented Feb 29, 2024

These logs look interesting:

The node sees the correct port:

2024-02-28 15:58:08.055  INFO main sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH    

but later it discovers a new external address with an incorrect port.

2024-02-28 15:58:08.652  INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/34.90.232.201/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH    

alexggh (Contributor Author) commented Feb 29, 2024

With subp2p-explorer this is what I get:

authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF" 
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)" 
version={
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"} 

It seems it is advertising everything, although all of them seem to be tried:

Failed to reach PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"): Failed to negotiate transport protocol(s): 
[(/ip4/35.204.235.25/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)
(/ip4/35.204.235.25/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH: : Timeout has been reached)]  

alexggh (Contributor Author) commented Feb 29, 2024

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and another where it seems that even if the collator is connected we aren't able to advertise the collation.

alexggh (Contributor Author) commented Feb 29, 2024

@BulatSaif could you also enable, thank you!

parachain=trace

sandreim (Contributor):

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and another where it seems that even if the collator is connected we aren't able to advertise the collation.

One reason for collation not being advertised in this case is a missing PeerView update from the validator.
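
To make that dependency concrete, here is a minimal, self-contained sketch (hypothetical names and simplified types, not the actual collator-protocol code) of why a validator that never sends a view update after (re)connecting receives no advertisement:

use std::collections::{HashMap, HashSet};

type Hash = [u8; 32];
type PeerId = String;

/// Simplified stand-in for the collator-side state kept per connected validator.
struct PeerState {
    /// Relay-parent hashes the peer has told us are in its active view.
    view: HashSet<Hash>,
}

/// A collation is only advertised to peers whose view contains the
/// candidate's relay parent; if no view update arrived after the
/// (re)connect, `view` stays empty and nothing is advertised.
fn peers_to_advertise(peers: &HashMap<PeerId, PeerState>, relay_parent: Hash) -> Vec<PeerId> {
    peers
        .iter()
        .filter(|(_, state)| state.view.contains(&relay_parent))
        .map(|(peer, _)| peer.clone())
        .collect()
}

fn main() {
    let mut peers = HashMap::new();
    // Validator connected, but no view update received yet.
    peers.insert("validator-1".to_string(), PeerState { view: HashSet::new() });
    // Empty result corresponds to "Collation wasn't advertised to any validator".
    assert!(peers_to_advertise(&peers, [0u8; 32]).is_empty());
}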

alexggh (Contributor Author) commented Feb 29, 2024

Regarding the connectivity, something definitely seems to be off; for example, I can telnet to some of the ports here:

authority="5C5HkAmbAGC7nqv7Uo2o3LFxBsFZ4Q62av6L7YDJnD56oGq1" 
peer_id=PeerId("12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-9)" 
version={"/ip4/35.204.163.222/tcp/30333/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3", 
"/ip4/35.204.163.222/tcp/30156/p2p/12D3KooWNrVUY3nUPeFdVRvQcHy8VPtneKLoTAAoz2RE911RLkx3"} 

But I can't telnet here:

authority="5C5jWPr5ygq4VgwfgsnrcHjp8H11xMT1gYXqR9a4oDJ7CeWF" 
peer_id=PeerId("12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH") 
addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-a-node-7)" 
version={"/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.91.211.207/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/30333/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH", 
"/ip4/34.32.166.239/tcp/32685/p2p/12D3KooWPBidoLJycPNyFALgdvg6bwzWyH8VXEctrQ4gak54EJtH"
} 

BulatSaif (Contributor):

@BulatSaif could you also enable, thank you!

parachain=trace

Done enabled on rococo-validator-a.*

alexggh (Contributor Author) commented Feb 29, 2024

@BulatSaif could you also enable, thank you!

parachain=trace

Done enabled on rococo-validator-a.*

My bad I wanted it on the asset-hub-collators, could you please add it there as well. Thank you

BulatSaif (Contributor):

My bad I wanted it on the asset-hub-collators, could you please add it there as well. Thank you

Done ✔️

alexggh (Contributor Author) commented Mar 1, 2024

There's a ConnectToValidators request, substream is opened successfully and the peer is reported to NetworkBridge but the collation is still not advertised. Why is the collation not getting advertised to 12D3KooWLDfH9mHRCidrd5NfQjp7rRMUcJSEUwSvEKyu7xU2cG3d?

Yeah, that seems to be a problem as well, so there are two patterns: one where we can't connect to the validator, and another where it seems that even if the collator is connected we aren't able to advertise the collation.

One reason for collation not being advertised in this case is a missing PeerView update from the validator.

This was spot-on: it seems that the collator and validator are not in sync when it comes to PeerConnected events for each other, and this is why, even when there is a connection between collator and validator, the collation is not advertised, e.g.:

You can see that the collator Connects and Disconnects every time it has something to send to the validator, but the validator bridge doesn't get the same signals.

The validator keeps thinking the collator is connected while the collator thinks it is disconnected from the validator, that happens for a few seconds and it recovers after a few blocks.

So, I think we've got two problems on this flow:

  1. Why are we disconnecting so often from the validators we are allocated to? The answer is this code: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/network/collator-protocol/src/collator_side/mod.rs#L1472 which gets called every 4 seconds; when the previous collation has already been advertised we emit ConnectToValidators with 0 peers, which will disconnect the currently connected ones (see the sketch after this list).

  2. Even with 1, why are they getting out of sync? It should still work, shouldn't it?
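
As a rough illustration of point 1, here is a minimal, self-contained sketch (hypothetical names; the real logic lives in collator_side/mod.rs) of how the periodic reconnect handling ends up requesting zero peers once everything has been advertised, which in turn clears the reserved set:

fn connect_to_validators(reserved: &[String]) {
    // Stand-in for the real ConnectToValidators request; an empty set
    // makes validator discovery call set_reserved_peers with no peers,
    // disconnecting every currently connected validator.
    println!("ConnectToValidators request with {} peers", reserved.len());
}

/// Runs on every reconnect timeout tick (roughly every 4 seconds).
fn handle_reconnect_timeout(unadvertised_collations: usize, backing_group: &[String]) {
    let wanted: Vec<String> = if unadvertised_collations > 0 {
        backing_group.to_vec()
    } else {
        // Everything already advertised => request zero peers.
        Vec::new()
    };
    connect_to_validators(&wanted);
}

fn main() {
    let group = vec!["validator-1".to_string()];
    handle_reconnect_timeout(1, &group); // right after building a collation: stay connected
    handle_reconnect_timeout(0, &group); // ~4s later, all advertised: 0 peers => disconnect
}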

sandreim (Contributor) commented Mar 1, 2024

One possible fix would be to keep connections to the backing group as long as it doesn't change. From the validator's perspective we would only disconnect inactive collators after 4 blocks, so we should be fine.

CollatorEvictionPolicy {
			inactive_collator: Duration::from_secs(24),
			undeclared: Duration::from_secs(1),
		}
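
A minimal sketch of that proposal (illustrative names only, not the actual patch): only touch the reserved set when the backing-group assignment changes, so the connection to the group survives between blocks.

use std::collections::HashSet;

struct ConnectionKeeper {
    current_group: HashSet<String>,
}

impl ConnectionKeeper {
    /// Called on every tick; issues a new reserved-peers request only
    /// when the assigned backing group actually changed.
    fn on_tick(&mut self, assigned_group: &[String]) {
        let assigned: HashSet<String> = assigned_group.iter().cloned().collect();
        if assigned != self.current_group {
            println!("ConnectToValidators request with {} peers", assigned.len());
            self.current_group = assigned;
        }
        // Same group: keep the existing reserved peers connected; the
        // validator side evicts inactive collators on its own anyway.
    }
}

fn main() {
    let mut keeper = ConnectionKeeper { current_group: HashSet::new() };
    let group = vec!["validator-1".to_string()];
    keeper.on_tick(&group); // group assigned: connect once
    keeper.on_tick(&group); // same group: no churn
}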

alexggh (Contributor Author) commented Mar 1, 2024

The disconnects on collator side happen because of this: https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/network/src/protocol_controller.rs#L520

			let disconnect = self.reserved_only ||
				match direction {
					Direction::Inbound => self.num_in >= self.max_in,
					Direction::Outbound => self.num_out >= self.max_out,
				};

			if disconnect {
				// Disconnect the node.
				trace!(
					target: LOG_TARGET,
					"Disconnecting previously reserved node {peer_id} ({direction:?}) on {:?}.",
					self.set_id,
				);
				self.drop_connection(peer_id);

This happens because the collator removes all the peers from reserved_peers every RECONNECT_TIMEOUT. I guess a quick workaround to prevent this condition would be to increase max_in and max_out for collators. @BulatSaif any idea what values we use there for asset-hub?

altonen (Contributor) commented Mar 1, 2024

Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?

alexggh (Contributor Author) commented Mar 1, 2024

Nothing prevents other nodes from taking those extra slots, so I'm not sure how effective that fix is. If you want to stay connected to the peer, why can't it be kept as a reserved peer?

I agree that we should keep them as reserved; I was just looking for a quick way to test that, if collators don't disconnect from validators so often, we prevent them from getting out of sync and missing collation advertisements.

My question for you @altonen: is the out-of-sync behaviour I describe above just a flavour of the known issue you described here #2995 (comment)?

altonen (Contributor) commented Mar 1, 2024

Yeah it could be the same issue, at least by the sound of it. There is actually a plan to finally fix it, here's some information: https://github.com/paritytech/devops/issues/3164#issuecomment-1968623373

If you want/can, you could try deploying #3426 on one of the rococo collators/validators and see if the issue disappears. If the issue is the one discussed in #2995, it should disappear with #3426

alexggh added a commit that referenced this issue Mar 1, 2024
Looking at rococo-asset-hub there seem to be a lot of instances where
collators did not advertise their collations. While there are multiple
problems there, one of them is that we are connecting and disconnecting to
our assigned validators every block, because on reconnect_timeout every
4s we call connect_to_validators and that will produce 0 validators when
all went well, so set_reserved_peers called from validator discovery
will disconnect all our peers.
More details here: #3519 (comment)

Now, this shouldn't be a problem, but it stacks with an existing bug in
our network stack where, if we disconnect from a peer, the peer might not
notice it, so it won't detect the reconnect either and it won't send us
the necessary view updates, so we won't advertise the collation to it.
More details here:
#3519 (comment)

To avoid hitting this condition that often, let's keep the peers in the
reserved set for the entire duration we are allocated to a backing
group. Backing group sizes (1 rococo, 3 kusama, 5 polkadot) are really small,
so this shouldn't lead to that many connections. Additionally, the
validators would disconnect us anyway if we don't advertise anything
for 4 blocks.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
alexggh added a commit that referenced this issue Mar 1, 2024
alexggh (Contributor Author) commented Mar 4, 2024

@BulatSaif: Can you help me with deploying asset-hub on rococo from https://github.com/paritytech/polkadot-sdk/pull/3544/files, so I can check if it improves the situation?

alexggh (Contributor Author) commented Mar 4, 2024

Over the weekend I saw a lot of dial failures because of this Unexpected peer ID error:

Libp2p => Failed to reach PeerId("12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k"): Dial error: Unexpected peer ID 12D3KooWP93Dzk8T7GWxyWw9jhLcz8Pksokk3R9vL2eEH337bNkT at Dialer { address: "/ip4/34.147.118.161/tcp/30333/p2p/12D3KooWRFeDwgX2BHz2jstWUFj61zQFfCZzed6ZbVqfM5K5Wo5k", role_override: Dialer }.   

altonen (Contributor) commented Mar 4, 2024

It's a peer ID mismatch which can happen because the DHT contains stale peer records and when the local node attempts to dial a peer with a stale record, the dial may succeed but the Noise handshake fails because the peer is using a new ID.

alexggh (Contributor Author) commented Mar 4, 2024

Looked a bit at this error: https://grafana.teleport.parity.io/goto/6Nz6bIASR?orgId=1

Connection attempt to PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav") 
failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"),
endpoint: Dialer { 
address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", 
role_override: Dialer } }.

When I look with subp2p-explorer, there are multiple peers reporting the same IP and port, which I wouldn't expect to happen; e.g. 34.32.137.4/tcp/30333 seems to be discovered on: rococo-validator-f-node-6, rococo-validator-d-node-7, rococo-validator-e-node-5, rococo-validator-d-node-5, rococo-validator-i-node-3

164:authority="5CyAwh1sbGPqneT1ZdpMArRMLEf69RL6iThLxAxt2ZAuCHnD" peer_id=PeerId("12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-6)" version={"/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/32012/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWKyWokpFdZXRzwP7XtfhXzQgK7q8BR8hnTWwLSAm39UHn"} 
167:authority="5DLx7NUaNEVKazF8vVCcXKMqYRdV5t9Hpaxn19FsM7YSzzEb" peer_id=PeerId("12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-7)" version={"/ip4/34.141.146.74/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp", "/ip4/35.204.89.111/tcp/30737/p2p/12D3KooWJbqY5k2JpuJnGcmum6CzP557arN1mHaGMZCuN2R4qcHp"} 
225:authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn" peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)" version={"/ip4/35.204.19.37/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.19.37/tcp/31532/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"} 
230:authority="5GYnjEgkfjG2XK52XuZ3dUqha1QqXPkhpGhBvfeQnrF8EZ6X" peer_id=PeerId("12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-e-node-5)" version={"/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.234.172.171/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/35.204.237.204/tcp/32356/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWLavhDKUDwtsAxny4KxGvzKudTJfNU1J9uPg5zGBeMVMs"} 
239:authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2" peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)" version={"/ip4/35.204.134.137/tcp/32315/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.211.207/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.205.11/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.90.84.170/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.59.8/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.147.83.199/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/35.204.134.137/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.91.151.87/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD", "/ip4/34.32.163.216/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"} 
249:authority="5HGraXWF7cRJatExhny7n2D4BQsLAddnngQnn5DhvDbkzAxS" peer_id=PeerId("12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA") addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-i-node-3)" version={"/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.237.204/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/35.204.10.176/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.147.63.230/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30792/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.91.196.154/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA", "/ip4/34.90.92.174/tcp/30333/p2p/12D3KooWBqg6MWUebHMPdQUKjhA7CMJsRojRZZcS1JPwpPiq6zPA"} 

Is this something expected ?

github-merge-queue bot pushed a commit that referenced this issue Mar 8, 2024
…up (#3544)

Looking at rococo-asset-hub
#3519 there seem to be
a lot of instances where collators did not advertise their collations.
While there are multiple problems there, one of them is that we are
connecting and disconnecting to our assigned validators every block,
because on reconnect_timeout every 4s we call connect_to_validators and
that will produce 0 validators when all went well, so set_reserved_peers
called from validator discovery will disconnect all our peers.
More details here:
#3519 (comment)

Now, this shouldn't be a problem, but it stacks with an existing bug in
our network stack where, if we disconnect from a peer, the peer might not
notice it, so it won't detect the reconnect either and it won't send us
the necessary view updates, so we won't advertise the collation to it.
More details here:

#3519 (comment)

To avoid hitting this condition that often, let's keep the peers in the
reserved set for the entire duration we are allocated to a backing
group. Backing group sizes (1 rococo, 3 kusama, 5 polkadot) are really
small, so this shouldn't lead to that many connections. Additionally,
the validators would disconnect us anyway if we don't advertise
anything for 4 blocks.

## TODO
- [x] More testing.
- [x] Confirm on rococo that this is improving the situation. (It
doesn't but just because other things are going wrong there).

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
alexggh (Contributor Author) commented Mar 11, 2024

Updates

The current state of affairs is that validators seem to be publishing in the DHT ports and IPs that are not meant to be used from outside, so collators will randomly fail to connect to validators when using these ports and IPs:
E.g: https://grafana.teleport.parity.io/goto/aMX9dQAIR?orgId=1

The addresses the collator sees for that node; these are all the IPs that the validator pod had allocated during its lifetime.

Addresses of PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7"): [
"/ip4/34.90.232.201/tcp/30032", "/ip4/34.147.63.230/tcp/30333", "/ip4/35.234.172.171/tcp/30333", "/ip4/34.90.232.201/tcp/30333", "/ip4/35.204.190.160/tcp/30333", "/ip4/34.147.36.17/tcp/30333", 
"/ip4/34.141.218.8/tcp/30333", "/ip4/35.234.160.139/tcp/30333", "/ip4/34.91.123.48/tcp/30333", 
"/ip4/34.90.210.63/tcp/30333", "/ip4/34.90.1.96/tcp/30333",
 "/ip4/34.90.1.96/tcp/30032", "/ip4/34.147.95.62/tcp/30333", 
"/ip4/34.147.95.62/tcp/30032", "/ip4/34.90.62.145/tcp/30333", 
"/ip4/34.34.35.247/tcp/30333", "/ip4/34.32.249.188/tcp/30032",
 "/ip4/34.32.249.188/tcp/30333"
] 

The fix here should be to find a way not to expose these addresses externally. @altonen suggested something like this:

We could introduce a CLI flag --no-external-address-discovery that would call
 [Swarm::remove_external_address()](https://docs.rs/libp2p/latest/libp2p/struct.Swarm.html#method.remove_external_address) [here](https://github.com/paritytech/polkadot-sdk/blob/1c435e91c117b877c803427a91c0ccd18c527382/substrate/client/network/src/discovery.rs#L642-L658) 
when a new address is discovered, unless it discovers one of the specified addresses. 

Based on the issue we saw in December with the misconfigured proxy, https://github.com/paritytech/polkadot-sdk/issues/2523, I'd assume people external to Parity might be interested in this feature as well.
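
A minimal, self-contained sketch of that suggestion (illustrative names and a plain String stand-in for Multiaddr; the real change would live in discovery.rs and call Swarm::remove_external_address()): discovered external addresses are kept only when the operator explicitly configured them.

type Multiaddr = String; // stand-in for libp2p's Multiaddr

struct Discovery {
    /// Addresses passed via --public-addr.
    configured_public_addrs: Vec<Multiaddr>,
    /// True when the proposed --no-external-address-discovery flag is set.
    no_external_address_discovery: bool,
}

impl Discovery {
    /// Returns true if a newly discovered external address should be kept;
    /// otherwise the caller would remove it from the swarm again.
    fn keep_discovered_address(&self, addr: &Multiaddr) -> bool {
        !self.no_external_address_discovery
            || self.configured_public_addrs.iter().any(|a| a == addr)
    }
}

fn main() {
    let discovery = Discovery {
        configured_public_addrs: vec!["/ip4/35.204.131.85/tcp/32685".to_string()],
        no_external_address_discovery: true,
    };
    // The configured public address survives discovery...
    assert!(discovery.keep_discovered_address(&"/ip4/35.204.131.85/tcp/32685".to_string()));
    // ...while an ephemeral pod address would be removed again.
    assert!(!discovery.keep_discovered_address(&"/ip4/34.90.232.201/tcp/30333".to_string()));
}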

@BulatSaif @PierreBesson @altonen : Feel free to add more context.

bkchr (Member) commented Mar 11, 2024

The addresses the collator sees for that node; these are all the IPs that the validator pod had allocated during its lifetime.

Maybe we need to prune more eagerly old addresses. However, I don't get how the collator is not able to connect to the validator? Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.

alexggh (Contributor Author) commented Mar 11, 2024

The addresses the collator sees for that node; these are all the IPs that the validator pod had allocated during its lifetime.

Maybe we need to prune more eagerly old addresses.

I'm really noob around that part, so I'm fine with anything that would get us out of this flaky state.

However, I don't get how the collator is not able to connect to the validator? Or do you want to tell me that none of the addresses in this list is actually the address of the validator right now? Otherwise I would expect that when I want to connect to the node, it will try all the addresses.

In this case none of those ports are actually meant to be reachable from outside and the dials fail because that IP is allocated somewhere else.

Connection attempt to PeerId("12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7") failed with WrongPeerId { obtained: PeerId("12D3KooWM56JbKWAXsDyWh313z73aKYVMp1Hj2nSnAKY3q6MnoC9"), endpoint: Dialer { address: "/ip4/34.90.210.63/tcp/30333/p2p/12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7", role_override: Dialer } }.

However, in the past days I did find cases where even when trying an expected port the dial was failing.

dmitry-markin (Contributor):

There are three things we should try:

  1. Check if [dnm] detect closed substream #3426 is safe for merging (real life tests are needed on Polkadot+Kusama+Westend, as in the past we had issues when there were not enough inbound slots on the network) and if it improves the situation with validators not noticing collator reconnections.
  2. Implement what was proposed above with --no-external-address-discovery to safeguard against incorrectly reported external addresses by remote nodes.
  3. Understand why none of the advertised addresses is the correct external address:

In this case none of those ports are actually meant to be reachable from outside and the dials fail because that IP is allocated somewhere else.

Bullet 3 I understand the least, as the node should report at least one correct address. This one is IMO the most important, as it directly affects connectivity.

dmitry-markin (Contributor) commented Mar 12, 2024

In this case none of those ports are actually meant to be reachable from outside and the dials fail because that IP is allocated somewhere else.

This sounds like the "underlying bug" to me. I mean given the way nodes collect these addresses, we should not stop trying to connect because the peerid behind some address is something different. We need to try to connect to all addresses until we have found the peer we are searching for or we tried all addresses.

libp2p concurrently dials 8 known addresses:
https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L56-L61
https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool.rs#L1132
And tries the rest as soon as some dial fails:
https://github.com/libp2p/rust-libp2p/blob/v0.51/swarm/src/connection/pool/concurrent_dial.rs#L91-L96

In substrate, we just log such errors once the connection is finally established:

if let Some(errors) = concurrent_dial_errors {
    debug!(target: "sub-libp2p", "Libp2p => Connected({:?}) with errors: {:?}", peer_id, errors);
}

If we expect nodes to have more than 8 addresses in DHT, it probably makes sense to increase the concurrency factor, as we currently don't set the transport timeout for p2p connections (and with default TCP SYN retries it's about 2 mins on linux). I would also set the transport timeout to some defined smaller value, like 15-20 secs.
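
Not the libp2p implementation, but a small self-contained illustration (using the futures and tokio crates, with a dummy dial function) of the behaviour described above: at most 8 dials in flight, the next address tried as soon as one fails, and an explicit per-dial timeout instead of waiting on TCP SYN retries.

use std::time::Duration;

use futures::stream::{self, StreamExt};
use tokio::time::timeout;

/// Dummy dial: in the real node this is the libp2p transport dial.
async fn dial(addr: String) -> Result<String, String> {
    if addr.ends_with("/32685") {
        Ok(addr)
    } else {
        Err(format!("{addr}: timeout / stale record"))
    }
}

#[tokio::main]
async fn main() {
    let addresses = vec![
        "/ip4/34.90.210.63/tcp/30333".to_string(),
        "/ip4/34.90.232.201/tcp/30333".to_string(),
        "/ip4/34.90.232.201/tcp/32685".to_string(),
    ];

    // Keep at most 8 dials in flight; new addresses are tried as earlier
    // dials complete, mirroring libp2p's concurrent dialing behaviour.
    let mut dials = stream::iter(addresses)
        .map(|addr| async move {
            // Bounded per-dial timeout instead of the ~2 min TCP SYN retry default.
            timeout(Duration::from_secs(20), dial(addr))
                .await
                .map_err(|e| e.to_string())?
        })
        .buffer_unordered(8);

    while let Some(result) = dials.next().await {
        match result {
            Ok(addr) => {
                println!("connected via {addr}");
                break;
            }
            Err(err) => println!("dial failed: {err}"),
        }
    }
}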

dmitry-markin (Contributor):

Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?

alexggh (Contributor Author) commented Mar 12, 2024

Why are the validators changing their public addresses so frequently in the first place? Is it because they run on spot instances? Maybe we shouldn't use spot instances for validators?

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

altonen (Contributor) commented Mar 12, 2024

Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed, instead of the public address port, 32685. The private port could be exposed either by port reuse, whereby it binds to 30333 and the remote nodes somehow observe the node dialing from this address, or through address translation (link, link), where it translates the observed address to its listen address. To verify what is happening, you may need libp2p trace logs from both the listener and dialer.

FWIW, I think there were two different issues: peer ID mismatch and connection refused.

dmitry-markin (Contributor) commented Mar 12, 2024

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

Remote nodes learn about our addresses via Identify protocol and add reported addresses to DHT. rust-libp2p combines listen addresses with external addresses and sends all of them via Identify message:
https://github.com/libp2p/rust-libp2p/blob/v0.51/protocols/identify/src/behaviour.rs#L341-L346
and we later add them to DHT:

for addr in listen_addrs {
    self.network_service
        .behaviour_mut()
        .add_self_reported_address_to_dht(&peer_id, &protocols, addr);
}

What we can try doing to ensure remote nodes do not learn our listen addresses (and do not add them to the DHT) is to not report them to the Identify behavior here:

self.identify.on_swarm_event(FromSwarm::NewListenAddr(e));
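
A self-contained sketch of that filtering idea (hypothetical types and names; Multiaddr is just a String stand-in here): listen addresses are handed to Identify only when they match an operator-configured public address, so ephemeral pod addresses never reach remote peers' DHT records.

type Multiaddr = String; // stand-in for libp2p's Multiaddr

struct Identify;

impl Identify {
    fn announce_listen_addr(&mut self, addr: &Multiaddr) {
        println!("Identify will now report {addr} to remote peers");
    }
}

struct DiscoveryBehaviour {
    identify: Identify,
    /// Addresses explicitly configured with --public-addr.
    public_addresses: Vec<Multiaddr>,
}

impl DiscoveryBehaviour {
    /// Forward a new listen address to Identify only if the operator configured it.
    fn on_new_listen_addr(&mut self, addr: Multiaddr) {
        if self.public_addresses.contains(&addr) {
            self.identify.announce_listen_addr(&addr);
        }
        // Otherwise the address is simply not reported, so remote peers
        // never add it to the DHT.
    }
}

fn main() {
    let mut discovery = DiscoveryBehaviour {
        identify: Identify,
        public_addresses: vec!["/ip4/35.204.131.85/tcp/32685".to_string()],
    };
    // Public NodePort address: reported.
    discovery.on_new_listen_addr("/ip4/35.204.131.85/tcp/32685".to_string());
    // Pod-internal listen address: kept out of Identify.
    discovery.on_new_listen_addr("/ip4/10.20.73.24/tcp/30333".to_string());
}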

dmitry-markin (Contributor) commented Mar 12, 2024

Extending Alex's reply, my understanding of the issue is that the pod private port 30333 is accidentally exposed, instead of the public address port, 32685.

In this case it's not (only) an accident, but the current logic of Identify, which always publishes our listen addresses in addition to observed/set external addresses.

dmitry-markin (Contributor) commented Mar 12, 2024

@alexggh Can you check if the issue is mitigated if a collator runs a patch from #3656?

We also have a fix that should remove rogue listen addresses from DHT for validators #3657, but need to run a burn-in for it to make sure it's safe.

alexggh (Contributor Author) commented Mar 12, 2024

@alexggh Can you check if the issue is mitigated if a collator runs a patch from #3656?

@BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR.
Thank you!

lexnv (Contributor) commented Mar 12, 2024

Looking a bit at the logs Alex provided above:

  • subp2p-explorer collects the addresses reported in the DHT here
    • These are the addresses from the logs above. The Identify protocol is only used to extract the agent_version (logs from Alex here)

Advertising DHT addresses

TLDR; I don't think #3657 solves the issue here. @altonen @dmitry-markin could you double check please? 🙏

Different PeerID but Same Address

225:authority="5GMtYUSu9Lf5DkFAj8XiuZ4GxJURu9xofkvfs4Kx1xzhQoYn"
   peer_id=PeerId("12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav")
   addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-f-node-1)" 
  "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav"


239:authority="5GukpQ8z2EeUf6oauwkAaZfHqkEewZryTeLAwAZnkbCkFch2"
   peer_id=PeerId("12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD")
   addresses="Parity Polkadot/v1.8.0-ec7817e5adc (rococo-validator-d-node-5)" 
  "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD"
  

In this case, the DHT record contains the address /ip4/34.32.137.4/tcp/30333/p2p/ for different peer IDs 12D3KooWFEZ26Sg5bjJRU4aBEu6HQMMditJ9Fc7zD1phqr8Ypkav and 12D3KooWARWudQtHxJ7589kJtxr3QD4fXpMDrQxiJ8okrbBGbtSD.

I'd have to look again at the subp2p-explorer to make sure I report the data correctly there, although Alex mentioned in #3519 (comment) that the WrongPeerId error was encountered in our grafana logs.

This could happen when the nodes that we manage share the external address. And they go on this code path:

a.with(multiaddr::Protocol::P2p(peer_id))

From this line, if the rococo-validator-f-node-1 and rococo-validator-d-node-5 share /ip4/34.32.137.4/tcp/30333, we'll add the peerID of the local node and advertise that address in the DHT. And so we'll end up with two peers pointed at the same address.

I think this could happen if the nodes are placed under the same network (likely)?
Would love to get your thoughts on this as I don't have that much context for the rococo validators here 🙏

BulatSaif (Contributor):

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

From what I understood, that listen-addr gets resolved to all those different IPs during the lifetime of the node, but those IPs should never have been exposed in the DHT because you can't use them for connecting. @BulatSaif @altonen feel free to add your more in-depth understanding of what is going on.

Here is how you can reproduce the issue locally:

  1. Start a node:
docker run -it --rm parity/polkadot:v1.8.0 --chain rococo  --public-addr=/ip4/35.204.131.85/tcp/32685   --listen-addr=/ip4/127.0.0.1/tcp/22
  2. After a few seconds you will see logs like:
2024-03-12 12:06:36 🔍 Discovered new external address for our node: /ip4/35.204.131.85/tcp/32685/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam
...
2024-03-12 12:06:37 🔍 Discovered new external address for our node: /ip4/212.227.198.212/tcp/22/p2p/12D3KooWLf25UasBJcozLYUQiN1dgZidPpzAi7kPTg4eoXt81gam 

The second discovered address is incorrect, 212.227.198.212 is your ISP IP.

This mainly affects testnets, since they are usually deployed within a container environment such as Kubernetes. However, I've also seen cases where community mainnet RPC bootnodes were deployed in a private network and exposed with port forwarding on the edge router; they have the same issue.

dmitry-markin (Contributor):

@lexnv My understanding of the issue is that the validators are run on spot instances, which have both:

  1. A configured external address (some snat/dnat configuration?)
  2. Some ephemeral global address of the spot instance the node is listening on.

This ephemeral address is reported via Identify to remote nodes, and they add it to DHT because it's global. And this ephemeral address changes periodically, and gets added to DHT again and again.

As a result, we have plenty of obsolete addresses in the DHT. And as we can see from #3519 (comment) all those addresses are global — they are the addresses the pods had at some point.

BulatSaif (Contributor):

@alexggh Can you check if the issue is mitigated if a collator runs a patch from #3656?

@BulatSaif is our chief burn-in officer, can you please deploy asset-hub with the above PR. Thank you!

@alexggh @dmitry-markin #3656 is deployed on asset-hub-rococo

lexnv (Contributor) commented Mar 12, 2024

Looking at the logs from grafana:

	
2024-03-12 15:38:04.239 DEBUG tokio-runtime-worker libp2p_swarm: [Relaychain] Connection attempt to PeerId("12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC") failed with WrongPeerId { obtained: PeerId("12D3KooWMh2imeAzsZKGQgm2cv6Uoep3GBYtwGfujt1bs5YfVzkH"), endpoint: Dialer { address: "/ip4/34.32.137.4/tcp/30333/p2p/12D3KooWJhwBW1UjxhxHaCUCWYmCRjsWx6ZEbwHTqfQmPjNx8GjC", role_override: Dialer } }. 

...

2024-03-12 15:26:24.212  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWRs9TVyiBTZwRBLaxogCHvysaBigod3TFU2LXEXn9BZP7 from SetId(3).    
	
2024-03-12 15:20:36.109  WARN tokio-runtime-worker parachain::collator-protocol: [Relaychain] Collation wasn't advertised to any validator. candidate_hash=0x1f91c6421302e9acbf6d5ecd02369e2a9e6787fbe9722d86f46e3fe9a9d2d77a pov_hash=0x069b0c4cd14d90105f911b6e0ca4d1a3209498441628baa835524b9bd9f8f8bc traceID=41962972065863927574614223903885008426

2024-03-12 15:20:24.223  WARN tokio-runtime-worker peerset: [Relaychain] Trying to remove unknown reserved node 12D3KooWPpC6rv56G2NiKKj4V6wHqQ1vMzymLsXrLm4wihDG4Acr from SetId(3).    

2024-03-12 15:17:33.053  WARN tokio-runtime-worker libp2p_kad::behaviour: [Relaychain] Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x83\x1e\xec\x1d\x1f<\x18\x88&\xf4\\9\xc7]\x9a<\x15\xd9\x9d0Xj\xda\xe2m1'c\xe6Kw\x1c"), success: [], quorum: 20 })    

2024-03-12 15:07:38.250  WARN tokio-runtime-worker peerset: [Relaychain] Report 12D3KooWGX1RBgFn1rqAJDr1X5gm1wBcFPziAqXR7eTHYhBTBBE2: -2147483648 to -2147483648. Reason: Same block request multiple times. Banned, disconnecting.    

2024-03-12 14:57:40.743  WARN                 main sc_service::config: [Parachain] Using default protocol ID "sup" because none is configured in the chain specs    

dmitry-markin (Contributor):

The arguments these nodes are started with are:

--public-addr=/ip4/35.204.131.85/tcp/32685 --listen-addr=/ip4/0.0.0.0/tcp/30333

@alexggh just to clarify, how is the listen address mapped to the public address (i.e., why would dialing 35.204.131.85:32685 reach the node)?

Is there some dnat / port forwarding rule in the firewall?

alexggh (Contributor Author) commented Mar 13, 2024

@BulatSaif should be able to help you here ^^. I'm not from the infra team, so I'm not able to explain our networking setup.

BulatSaif (Contributor):

We use Kubernetes NodePort, which has various implementations, but the simplest one involves iptables rules. Any traffic received on the specific port by any VM in the cluster will be forwarded to the correct pod.

dmitry-markin (Contributor) commented Mar 13, 2024

We had a discussion with @BulatSaif about validators' networking in kubernetes on spot instances, but it's still not completely clear how multiple external addresses end up in the authority discovery DHT record.

For sure, there are two issues:

  1. Address translation reported in rococo asset-hub and others don't always advertise their collations  #3519 (comment), which is the reason why private ports 30333 end up in the authority discovery record.
  2. Accumulation of temporary/ephemeral pod addresses in the authority discovery record. This shouldn't happen from our understanding of pod networking, but nevertheless happens. The authority discovery record in DHT should be overwritten by the most recent --public-addr used, but this doesn't happen for some reason. My wild guess is that DHT might be partitioned / put_record() requests do not propagate to all target peers, and there are multiple authority records present at the same time, all found by subp2p-explorer.

I'm working on a fix #3657 that will resolve issue 1 once extended on authority discovery DHT records and all addresses, including external addresses not specified with --public-addr. It could resolve issue 2 if the guess about DHT partitioning is wrong or something is off with the validator kubernetes networking.

dmitry-markin (Contributor) commented Mar 13, 2024

@BulatSaif can we deploy up to date master to one of the validators to see what #3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record.

-lsub-authority-discovery=debug flag is needed.

BulatSaif (Contributor):

@BulatSaif can we deploy up to date master to one of the validators to see what #3668 is reporting? This is needed to rule out that the validator is publishing rogue addresses into the DHT authority discovery record.

-lsub-authority-discovery=debug flag is needed.

rococo-validator-a* updated

alexggh (Contributor Author) commented Mar 20, 2024

@bkchr root-caused parts of this issue here: #3673 (comment)

I think because of this, we are going to truncate out the good addresses.

.flatten()
.take(MAX_ADDRESSES_PER_AUTHORITY)
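
As a self-contained illustration of why the ordering matters (the MAX_ADDRESSES_PER_AUTHORITY value and names here are assumed, not taken from the real code): if the operator-set public addresses are appended after the discovered ones, the take() can truncate them away, while putting them first keeps them in the published record.

const MAX_ADDRESSES_PER_AUTHORITY: usize = 10; // illustrative limit

fn published_record(public: Vec<String>, discovered: Vec<String>) -> Vec<String> {
    // Putting the explicitly configured public addresses first means they
    // survive the truncation even when many stale addresses were discovered.
    public
        .into_iter()
        .chain(discovered)
        .take(MAX_ADDRESSES_PER_AUTHORITY)
        .collect()
}

fn main() {
    let public = vec!["/ip4/35.204.131.85/tcp/32685".to_string()];
    // A long tail of ephemeral pod addresses accumulated over the node's lifetime.
    let discovered: Vec<String> = (0..20).map(|i| format!("/ip4/10.0.0.{i}/tcp/30333")).collect();

    let record = published_record(public, discovered);
    assert_eq!(record.len(), MAX_ADDRESSES_PER_AUTHORITY);
    assert_eq!(record[0], "/ip4/35.204.131.85/tcp/32685");
}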

github-merge-queue bot pushed a commit that referenced this issue Mar 22, 2024
)

Make sure public addresses explicitly set by the operator go first in
the authority discovery DHT records.

Also update `Discovery` behavior to eliminate duplicates in the returned
addresses.

This PR should improve the situation with
#3519.

Obsoletes #3657.
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this issue Mar 24, 2024
…ritytech#3757)
