
A large data payload in the response to a block request can't be sent back due to KeepAliveTimeout, and then the peer is disconnected #12105

liuchengxu opened this issue Aug 25, 2022 · 9 comments


@liuchengxu
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

When a node requests a chunk of blocks from another node and these blocks happen to be huge, the block request is handled successfully on the server node, but the response fails to be sent back to the requester due to the KeepAliveTimeout error:

2022-08-24 17:33:04.394 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection closed with error KeepAliveTimeout: Connected { endpoint: Listener { local_addr: "/ip4/192.168.1.2/tcp/30333", send_back_addr: "/ip4/85.41.34.50/tcp/55227" }, peer_id: PeerId("12D3KooWG87xHK71UEKkw4pLCAkaa2C4vCnAsk5hVYD6dzQdua6G") }; Total (peer): 0.    
2022-08-24 17:33:04.394 DEBUG tokio-runtime-worker sub-libp2p: [PrimaryChain] Libp2p => Disconnected(PeerId("12D3KooWG87xHK71UEKkw4pLCAkaa2C4vCnAsk5hVYD6dzQdua6G"), Some(KeepAliveTimeout)) 

Here is the server node log. The byte size of the first response is 299402, which is sent successfully; the second response is 10534306 bytes and fails to be sent back. The reason is that there are some data store transactions after block 64, which increase the block size significantly.

2022-08-24 17:28:48.407 DEBUG tokio-runtime-worker libp2p_dns: [PrimaryChain] Dialing /ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS    
2022-08-24 17:28:52.713 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection established: PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS") Listener { local_addr: "/ip4/192.168.1.2/tcp/30333", send_back_addr: "/ip4/125.121.39.96/tcp/34940" }; Total (peer): 1. Total non-banned (peer): 1    
2022-08-24 17:28:52.713 DEBUG tokio-runtime-worker sub-libp2p: [PrimaryChain] Libp2p => Connected(PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS"))    
2022-08-24 17:28:53.318 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:28:53.319 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:28:53.930 DEBUG tokio-runtime-worker sync: [PrimaryChain] Connected 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS    
2022-08-24 17:28:54.983 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:28:55.054 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(64), data: Some(299402)    
2022-08-24 17:28:55.055 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:28:57.831 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:28:57.837 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: Some(3), data: Some(10534306)    
2022-08-24 17:28:57.840 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:08.407 DEBUG tokio-runtime-worker libp2p_kad::behaviour: [PrimaryChain] Address '/ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS' removed from peer '12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS' due to error.    
2022-08-24 17:29:08.407 DEBUG tokio-runtime-worker libp2p_swarm: [PrimaryChain] Connection attempt to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS") failed with Transport([("/ip4/172.19.0.1/tcp/21000/p2p/12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS", Other(Custom { kind: Other, error: Timeout }))]).    
2022-08-24 17:29:09.122 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:11.989 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `1`, reputation_change: `None`, small_request `true`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`.    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(1), data: Some(4664)    
2022-08-24 17:29:26.115 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:26.760 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `None`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:29:26.767 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: Some(3), data: Some(10534306)    
2022-08-24 17:29:26.770 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:29:26.989 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:41.875 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:29:56.875 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping received from PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:30:01.218 DEBUG tokio-runtime-worker libp2p_ping: [PrimaryChain] Ping sent to PeerId("12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS")    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(64)` with maximum blocks of `1`, reputation_change: `None`, small_request `true`, direction `Ascending` and attributes `HEADER | JUSTIFICATION`.    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(64)`: blocks: Some(1), data: Some(4664)    
2022-08-24 17:30:01.721 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handling block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: Starting at `BlockId::Number(128)` with maximum blocks of `64`, reputation_change: `Some(ReputationChange { value: -2147483648, reason: "Same block request multiple times" })`, small_request `false`, direction `Descending` and attributes `HEADER | BODY | JUSTIFICATION`.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Sending result of block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS starting at `BlockId::Number(128)`: blocks: None, data: None    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker sync: [PrimaryChain] Handled block request from 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS.    
2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker peerset: [PrimaryChain] Report    
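
For a rough sense of scale: at an assumed effective throughput of 10 Mbit/s, the 10,534,306-byte response takes roughly 8.4 seconds to transfer, and at 5 Mbit/s about 17 seconds, which is already past a 10-second keep-alive window such as the hardcoded value discussed below.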

With slow networking and large blocks, which are normal for a storage chain, this issue can occur quite often. I see a few ways to help with this:

  • Allow increasing the timeouts, for both requests and responses. If a chain can have a lot of large blocks, a larger timeout should be used.
  • Send back just one block of data if the result is huge. The 10534306 bytes above are actually for 3 blocks, so reducing it to one will help a lot (see the sketch after this list). Furthermore, a chain has a natural block size limit; once that is configurable, the timeout could be adjusted based on this value.
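
As a rough illustration of the second idea, here is a minimal, self-contained Rust sketch of capping a block response at a byte budget. It is not Substrate's actual block-request handler; the constant, function, and sizes are hypothetical, chosen only to mirror the numbers in the logs above.

// Illustrative sketch only, not Substrate's actual block-request handler.
// The idea: stop adding blocks once an (assumed) byte budget is reached, so a
// single response never grows far beyond a configurable soft limit.

/// Hypothetical soft limit on the total encoded size of one block response.
const MAX_RESPONSE_BYTES: usize = 8 * 1024 * 1024;

/// Truncate a list of encoded blocks so the response stays within the budget,
/// always keeping at least one block so that sync can make progress.
fn limit_response(encoded_blocks: Vec<Vec<u8>>) -> Vec<Vec<u8>> {
    let mut total = 0usize;
    let mut out = Vec::new();
    for block in encoded_blocks {
        // Always include the first block, even if it alone exceeds the budget.
        if !out.is_empty() && total + block.len() > MAX_RESPONSE_BYTES {
            break;
        }
        total += block.len();
        out.push(block);
    }
    out
}

fn main() {
    // Three "blocks" mimicking the ~10.5 MB total seen in the logs above.
    let blocks = vec![vec![0u8; 3_500_000], vec![0u8; 3_500_000], vec![0u8; 3_534_306]];
    let kept = limit_response(blocks);
    println!("blocks kept: {}", kept.len()); // prints 2 with an 8 MiB budget
}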

I think paritytech/polkadot-sdk#531 is related: after the timeout, we see a flood of duplicated block requests, so maybe we can also do something to improve that.

Steps to reproduce

No response

@bkchr
Member

bkchr commented Aug 25, 2022

CC @arkpar

@arkpar
Member

arkpar commented Aug 25, 2022

@tomaka Could you please clarify when exactly KeepAliveTimeout is generated? The documentation says that the connection will be closed after being "idle" for the specified time. Is it considered idle when a request is in progress? Here we hardcode the keep-alive timeout to 10 seconds.

Also, does KeepAliveTimeout handling mean that the peer disconnects from the sync protocol as well?

@liuchengxu how fast is the network connection? Is it at least 100 Mbps?

@tomaka
Contributor

tomaka commented Aug 25, 2022

The keep-alive timeout only starts after there is no activity on a connection anymore. It's the mechanism that we use to kill inactive connections.

Each protocol decides what "no activity" means. In our situation, it means no notification protocols open and no request (in the request/response system) in progress.
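
As a very rough sketch of that rule (illustrative only; the struct, fields, and function below are hypothetical, not libp2p's or Substrate's actual types), the keep-alive countdown only matters once nothing is open or in flight on the connection:

use std::time::{Duration, Instant};

/// Hypothetical summary of one connection's activity (illustrative only).
struct ConnectionActivity {
    open_notification_substreams: usize,
    in_flight_requests: usize,
    idle_since: Instant,
}

/// Simplified version of the rule described above: the idle timer only counts
/// once there are no open notification substreams and no request/response
/// exchanges in progress.
fn should_close(conn: &ConnectionActivity, keep_alive_timeout: Duration) -> bool {
    let active = conn.open_notification_substreams > 0 || conn.in_flight_requests > 0;
    !active && conn.idle_since.elapsed() >= keep_alive_timeout
}

fn main() {
    let conn = ConnectionActivity {
        open_notification_substreams: 0,
        in_flight_requests: 0,
        idle_since: Instant::now() - Duration::from_secs(11),
    };
    // With the 10-second keep-alive mentioned earlier in this thread, a
    // connection that has been fully idle for 11 seconds would be closed.
    println!("close: {}", should_close(&conn, Duration::from_secs(10)));
}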

Looking at the issue, the logs are cropped, and I don't see any hint that the problem is caused by the KeepAliveTimeout error.

@arkpar
Member

arkpar commented Aug 25, 2022

@liuchengxu Could you please provide full logs for both nodes for the same period of time?

@nazar-pc
Contributor

Here are the logs of the node that Liu-Cheng's node tried to sync from (we had them added as reserved nodes of each other with --reserved-only): https://nextcloud.mokrynskyi.com/s/xNtYpAsZTFY8YN7
These were captured with debug log level for everything, so they are quite big.

@liuchengxu
Contributor Author

liuchengxu commented Aug 25, 2022

Thanks Nazar.

@arkpar For convenience, here is the server node log: https://files.slack.com/files-pri/T03LJ85UR5G-F03UZ31SHC3/untitled.txt and the client node log: https://files.slack.com/files-pri/T03LJ85UR5G-F03V6JZ7HQC/client_node.txt. Let me know if they are not sufficient. Both nodes are running from home; I doubt the network speed reaches 100 Mbps.

@melekes
Contributor

melekes commented Jan 6, 2023

I think paritytech/polkadot-sdk#531 is related: after the timeout, we see a flood of duplicated block requests, so maybe we can also do something to improve that.

Yeah, duplicated block requests lead to us disconnecting from that node:

2022-08-24 17:30:02.747 DEBUG tokio-runtime-worker peerset: [PrimaryChain] Report 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS: -2147483648 to -2147483648. Reason: Same block request multiple times, Disconnecting
2022-08-24 17:30:02.747  WARN tokio-runtime-worker sync: [PrimaryChain] Reserved peer 12D3KooW9rwH7iqBbynXERRrGHxYn4UbCieByXxnkPGJ3R7wShdS disconnected

@melekes
Contributor

melekes commented Jan 6, 2023

To me this looks like a duplicate of paritytech/polkadot-sdk#531.

@nazar-pc
Contributor

nazar-pc commented Jan 6, 2023

I'd say this one is caused by paritytech/polkadot-sdk#531, but they are not quite the same, since other slow network data transfers can potentially cause this timeout too.
