Kademlia. New inbound substream to PeerId exceeds inbound substream limit. No older substream waiting to be reused. #3048

Closed
shamil-gadelshin opened this issue Oct 20, 2022 · 18 comments · Fixed by autonomys/subspace#878 or #3152

@shamil-gadelshin
Contributor

Summary

I use the start_providing Kademlia API.

Here is my error after 32 requests:

2022-10-20T09:54:15.538792Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWNwexVxD22CAdxv5CAkxrD3QgUqhycYAE1e1NH2sLbZ7v") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    

This error is similar to the recent discussion: #2957

However, I have only 2 peers (local machine setup) and 3 seconds between start_providing requests.

Here is my local inbound_streams buffer:

 ["WaitingUser: id=UniqueConnecId(11)", "WaitingUser: id=UniqueConnecId(13)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(20)", "WaitingUser: id=UniqueConnecId(22)", "WaitingUser: id=UniqueConnecId(24)", "WaitingUser: id=UniqueConnecId(26)", "WaitingUser: id=UniqueConnecId(28)", "WaitingUser: id=UniqueConnecId(3)", "WaitingUser: id=UniqueConnecId(30)", "WaitingUser: id=UniqueConnecId(32)", "WaitingUser: id=UniqueConnecId(34)", "WaitingUser: id=UniqueConnecId(36)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(41)", "WaitingUser: id=UniqueConnecId(43)", "WaitingUser: id=UniqueConnecId(45)", "WaitingUser: id=UniqueConnecId(47)", "WaitingUser: id=UniqueConnecId(49)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(51)", "WaitingUser: id=UniqueConnecId(53)", "WaitingUser: id=UniqueConnecId(55)", "WaitingUser: id=UniqueConnecId(57)", "WaitingUser: id=UniqueConnecId(59)", "WaitingUser: id=UniqueConnecId(61)", "WaitingUser: id=UniqueConnecId(63)", "WaitingUser: id=UniqueConnecId(65)", "WaitingUser: id=UniqueConnecId(67)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    

It seems that I need to acknowledge some of the requests, but the API doesn't expect this. Also, I can send hundreds of get_closest_peers requests with no errors.
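Roughly, my calls look like the following sketch (the announce helper and piece_hash are illustrative, not my actual code; it assumes a Kademlia<MemoryStore> behaviour from libp2p 0.49):

```rust
use libp2p::kad::{record::Key, store::MemoryStore, Kademlia};
use libp2p::PeerId;

// Illustrative helper: `kademlia` is the behaviour inside the swarm,
// `piece_hash` stands in for the key being announced.
fn announce(kademlia: &mut Kademlia<MemoryStore>, piece_hash: &[u8]) {
    // start_providing stores the provider record locally and starts an
    // ADD_PROVIDER query; the QueryId can be matched against the
    // corresponding outbound-query event later.
    match kademlia.start_providing(Key::new(&piece_hash)) {
        Ok(query_id) => println!("started providing, query {:?}", query_id),
        Err(error) => eprintln!("record store error: {:?}", error),
    }

    // get_closest_peers only runs FIND_NODE queries and never triggers the warning.
    let _query_id = kademlia.get_closest_peers(PeerId::random());
}
```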

When the error begins to manifest, the requesting peer gets the following QueryStats:

QueryStats { requests: 1, success: 0, failure: 1, start: Some(Instant { t: 2021689972560 }), end: Some(Instant { t: 2021689984792 }) }

That indicates that the first of the two requests (start_providing seems to issue FindNode and AddProvider requests) produces an error, while numerous GetClosestPeers requests work fine. It doesn't make sense to me.

Also, when I increase the interval between start_providing API calls to 17 seconds, the error no longer appears. I tried reducing the Kademlia query timeout from the default 60 seconds to just 1 second (I suspected some pending process), but it doesn't make a difference.
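For reference, the query-timeout experiment looked roughly like this sketch (build_kademlia is an illustrative helper, not my actual code):

```rust
use std::time::Duration;

use libp2p::kad::{store::MemoryStore, Kademlia, KademliaConfig};
use libp2p::PeerId;

// Illustrative helper showing the query-timeout experiment.
fn build_kademlia(local_peer_id: PeerId) -> Kademlia<MemoryStore> {
    let mut config = KademliaConfig::default();
    // Default is 60 s; reducing it to 1 s makes no difference to the warning.
    config.set_query_timeout(Duration::from_secs(1));

    let store = MemoryStore::new(local_peer_id.clone());
    Kademlia::with_config(local_peer_id, store, config)
}
```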

Am I missing something?

Expected behaviour

I expect the local setup to handle multiple requests per second with no issues.

Actual behaviour

Debug Output

FindNodeReq: ConnectionId(2)
2022-10-20T11:43:57.299813Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: FindNode { num_closer_peers: 1 } }
2022-10-20T11:43:57.300924Z DEBUG yamux::connection::stream: 44d40e66/139: eof    
2022-10-20T11:43:57.301396Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301431Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301473Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:57.302384Z  INFO subspace_networking::behavior::custom_record_store: New provider record added: ProviderRecord { key: Key(b"\0 \x89\x06;^\xab\xa0hP\xc4\xb6\xc9W\x1cFV\xc8|\xce\xb6\x04&M%\xe9 y\xe3\x9a\x89\x81[\xa6"), provider: PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR"), expires: Some(Instant { t: 4087037477371 }), addresses: ["/ip4/127.0.0.1/tcp/60010"] }
2022-10-20T11:43:57.302474Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: AddProvider { record: None } }
2022-10-20T11:43:57.302482Z  INFO subspace_networking::node_runner: Add provider request received: None
2022-10-20T11:43:57.899984Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:57.900226Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:58.899019Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:58.899206Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.000034Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000098Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000151Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:59.000197Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    
2022-10-20T11:43:59.000496Z  WARN libp2p_kad::handler: Inbound streams: ["WaitingUser: id=UniqueConnecId(12)", "WaitingUser: id=UniqueConnecId(14)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(2)", "WaitingUser: id=UniqueConnecId(21)", "WaitingUser: id=UniqueConnecId(23)", "WaitingUser: id=UniqueConnecId(25)", "WaitingUser: id=UniqueConnecId(27)", "WaitingUser: id=UniqueConnecId(29)", "WaitingUser: id=UniqueConnecId(31)", "WaitingUser: id=UniqueConnecId(33)", "WaitingUser: id=UniqueConnecId(35)", "WaitingUser: id=UniqueConnecId(37)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(42)", "WaitingUser: id=UniqueConnecId(44)", "WaitingUser: id=UniqueConnecId(46)", "WaitingUser: id=UniqueConnecId(48)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(50)", "WaitingUser: id=UniqueConnecId(52)", "WaitingUser: id=UniqueConnecId(54)", "WaitingUser: id=UniqueConnecId(56)", "WaitingUser: id=UniqueConnecId(58)", "WaitingUser: id=UniqueConnecId(60)", "WaitingUser: id=UniqueConnecId(62)", "WaitingUser: id=UniqueConnecId(64)", "WaitingUser: id=UniqueConnecId(66)", "WaitingUser: id=UniqueConnecId(68)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    
2022-10-20T11:43:59.898505Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:59.898595Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.950796Z DEBUG subspace_networking::node_runner: Initiate connection to known peers local_peer_id=12D3L7AUynGJcx7Lb4SAo4bR1m78UYydB4TRy2Nre6vViLHPgEem connected_peers=1
2022-10-20T11:44:00.701763Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701828Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701882Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol  

Possible Solution

Version

  • I use 0.46.1 in my own branch, but the latest 0.49.0 produces the same result.

Would you like to work on fixing this bug?

No

@shamil-gadelshin
Contributor Author

I mentioned this issue on the last community call. @mxinden @thomaseizinger You already had a similar discussion, but this issue arises under a much smaller load.

@thomaseizinger
Contributor

Thanks for reporting this! This definitely doesn't look right. I am putting it on my list to work on for next week :)

@thomaseizinger thomaseizinger self-assigned this Oct 28, 2022
@thomaseizinger
Contributor

Are you polling the Swarm properly? Can you share the code that you are running?

I looked into the code and these substreams shouldn't really be sitting there idling for very long. They are waiting for the NetworkBehaviour to gather the data, which should be extremely quick, after which they can resume.

Maybe the main event loop of the behaviours is blocked for some reason? Can you reproduce the problem if you are using just the Kademlia behaviour and nothing else?
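For reference, an event loop along these lines keeps the handler progressing; a minimal sketch (the drive function and the logging are illustrative, not prescriptive):

```rust
use futures::StreamExt;
use libp2p::kad::{store::MemoryStore, Kademlia, KademliaEvent};
use libp2p::swarm::{Swarm, SwarmEvent};

// Keep the swarm making progress; if a loop like this stalls, inbound Kademlia
// substreams stay parked in the handler until the 32-stream limit is hit.
async fn drive(mut swarm: Swarm<Kademlia<MemoryStore>>) {
    loop {
        match swarm.select_next_some().await {
            SwarmEvent::Behaviour(KademliaEvent::InboundRequest { request }) => {
                println!("inbound request: {:?}", request);
            }
            other => {
                println!("swarm event: {:?}", other);
            }
        }
    }
}
```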

@thomaseizinger
Contributor

Also, can you try testing with #3074? I think that one is more correct in terms of task wake behaviour. Maybe that is the issue.

@thomaseizinger
Contributor

Friendly ping @shamil-gadelshin!

Does #3074 help at all?

@thomaseizinger thomaseizinger added the need/author-input Needs input from the original author label Nov 7, 2022
@shamil-gadelshin
Contributor Author

I tried that. The issue persists. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback; I'm going to do that this week.

Thank you for your attention to this bug. I appreciate it.

@thomaseizinger thomaseizinger removed the need/author-input Needs input from the original author label Nov 8, 2022
@thomaseizinger
Contributor

> I tried that. The issue persists. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback; I'm going to do that this week.
>
> Thank you for your attention to this bug. I appreciate it.

Ah damn. Any chance you can provide a minimal example that produces the problem?

@shamil-gadelshin
Contributor Author

Sure. I will debug my code again to rule out a silly mistake and then try to reproduce the issue with a minimal setup.

@shamil-gadelshin
Contributor Author

Here is a minimal Kademlia setup that shows the difference between start_providing and put_record or get_closest_peers. It's possible that I misuse the Kademlia API, so I'd appreciate a hint, @thomaseizinger.

@shamil-gadelshin
Contributor Author

@thomaseizinger
Contributor

Thank you! I'll have a look!

@shamil-gadelshin
Contributor Author

A friendly ping, @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

@thomaseizinger
Contributor

> A friendly ping, @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

Sorry, I haven't yet but I'll do so today!

@thomaseizinger
Contributor

Okay, I figured out what I think the issue is.

As per the spec, outbound substreams may be reused. Our implementation never does that (we only send one request per substream), but our inbound streams wait for additional messages on that stream and thus fill up the buffer.

There seems to be a bug where the implementation for the inbound stream does not detect that the other side has closed the stream, in which case it should stop waiting for a message.

@thomaseizinger
Contributor

Damn, that is not the issue ...

@thomaseizinger
Contributor

Okay, I have a fix.

The issue was that we had substreams waiting for a response from the behaviour even though, for AddProvider, there is no response. So instead of actually being reused, the substream sat in the WaitingUser state and never got answered.
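To illustrate the idea, here is a simplified, dependency-free sketch of the state transition; the enums and names below are illustrative, not the actual handler types:

```rust
/// Simplified model of an inbound Kademlia substream state (illustrative only).
#[derive(Debug)]
enum InboundState {
    /// Ready to read the next request from the remote.
    WaitingMessage,
    /// A request was handed to the behaviour; we wait for its answer
    /// (this is the `WaitingUser` state, renamed `WaitingBehaviour` in the fix).
    WaitingUser,
}

#[derive(Debug)]
enum Request {
    FindNode,
    GetProviders,
    /// Fire-and-forget: the spec defines no response for ADD_PROVIDER.
    AddProvider,
}

/// State the substream should move into after reading a request.
fn on_message(request: &Request) -> InboundState {
    match request {
        // These requests expect a response, so the substream has to wait for
        // the behaviour to produce one before it can be reused.
        Request::FindNode | Request::GetProviders => InboundState::WaitingUser,
        // Before the fix, AddProvider also ended up waiting for a response that
        // never comes, parking the substream until the 32-stream limit was
        // exceeded. With the fix, it goes back to reading the next message.
        Request::AddProvider => InboundState::WaitingMessage,
    }
}

fn main() {
    for request in [Request::FindNode, Request::AddProvider] {
        println!("{:?} -> {:?}", request, on_message(&request));
    }
}
```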

@thomaseizinger
Contributor

I opened a fix here: #3152

With this patch, the example you provided no longer issues the warnings. Thanks for providing that example; it was really helpful in the debugging process!

@shamil-gadelshin
Contributor Author

I will test the fix in the test project and in our main project as well. Thanks a lot!

@mergify mergify bot closed this as completed in #3152 Nov 23, 2022
mergify bot pushed a commit that referenced this issue Nov 23, 2022
Previously, we would erroneously always go into the `WaitingUser` (now called `WaitingBehaviour`) state after receiving a message on an inbound stream. However, the `AddProvider` message does not warrant a "response" from the behaviour. Thus, any incoming `AddProvider` message would result in a stale substream that would eventually be dropped as soon as more than 32 inbound streams have been opened.

With this patch, we inline the message handling into the upper match block and perform the correct state transition upon each message. For `AddProvider`, we go back into `WaitingMessage` because the spec mandates that we need to be ready to receive more messages on an inbound stream.

Fixes #3048.