Kademlia. New inbound substream to PeerId exceeds inbound substream limit. No older substream waiting to be reused. #3048

Closed
shamil-gadelshin opened this issue Oct 20, 2022 · 18 comments · Fixed by autonomys/subspace#878 or #3152

@shamil-gadelshin
Contributor

Summary

I use the start_providing Kademlia API.

Here is my error after 32 requests:

2022-10-20T09:54:15.538792Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWNwexVxD22CAdxv5CAkxrD3QgUqhycYAE1e1NH2sLbZ7v") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    

This error is similar to the recent discussion: #2957

However, I have only 2 peers (local machine setup) and 3 seconds between start_providing requests.

Here is my local inbound_streams buffer:

 ["WaitingUser: id=UniqueConnecId(11)", "WaitingUser: id=UniqueConnecId(13)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(20)", "WaitingUser: id=UniqueConnecId(22)", "WaitingUser: id=UniqueConnecId(24)", "WaitingUser: id=UniqueConnecId(26)", "WaitingUser: id=UniqueConnecId(28)", "WaitingUser: id=UniqueConnecId(3)", "WaitingUser: id=UniqueConnecId(30)", "WaitingUser: id=UniqueConnecId(32)", "WaitingUser: id=UniqueConnecId(34)", "WaitingUser: id=UniqueConnecId(36)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(41)", "WaitingUser: id=UniqueConnecId(43)", "WaitingUser: id=UniqueConnecId(45)", "WaitingUser: id=UniqueConnecId(47)", "WaitingUser: id=UniqueConnecId(49)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(51)", "WaitingUser: id=UniqueConnecId(53)", "WaitingUser: id=UniqueConnecId(55)", "WaitingUser: id=UniqueConnecId(57)", "WaitingUser: id=UniqueConnecId(59)", "WaitingUser: id=UniqueConnecId(61)", "WaitingUser: id=UniqueConnecId(63)", "WaitingUser: id=UniqueConnecId(65)", "WaitingUser: id=UniqueConnecId(67)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    

It seems that I need to acknowledge some of the requests, but the API doesn't expect this. Also, I can send hundreds of get_closest_peers requests with no errors.
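Roughly, my calls look like the following sketch (the announce helper and piece_hash are illustrative, not my actual code; it assumes a Kademlia<MemoryStore> behaviour from libp2p 0.49):

```rust
use libp2p::kad::{record::Key, store::MemoryStore, Kademlia};
use libp2p::PeerId;

// Illustrative helper: `kademlia` is the behaviour inside the swarm,
// `piece_hash` stands in for the key being announced.
fn announce(kademlia: &mut Kademlia<MemoryStore>, piece_hash: &[u8]) {
    // start_providing stores the provider record locally and starts an
    // ADD_PROVIDER query; the QueryId can be matched against the
    // corresponding outbound-query event later.
    match kademlia.start_providing(Key::new(&piece_hash)) {
        Ok(query_id) => println!("started providing, query {:?}", query_id),
        Err(error) => eprintln!("record store error: {:?}", error),
    }

    // get_closest_peers only runs FIND_NODE queries and never triggers the warning.
    let _query_id = kademlia.get_closest_peers(PeerId::random());
}
```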

When the error begins to manifest, the requesting peer gets the following QueryStats:

QueryStats { requests: 1, success: 0, failure: 1, start: Some(Instant { t: 2021689972560 }), end: Some(Instant { t: 2021689984792 }) }

That indicates that the first of the two requests (start_providing seems to issue FindNode and AddProvider requests) produces an error, while numerous GetClosestPeers requests work fine. It doesn't make sense to me.

Also, when I increase the interval between start_providing API calls to 17 seconds, the error no longer appears. I tried reducing the Kademlia query timeout from the default 60 seconds to just 1 second (I suspected some pending process), but it doesn't make a difference.
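For reference, the query-timeout experiment looked roughly like this sketch (build_kademlia is an illustrative helper, not my actual code):

```rust
use std::time::Duration;

use libp2p::kad::{store::MemoryStore, Kademlia, KademliaConfig};
use libp2p::PeerId;

// Illustrative helper showing the query-timeout experiment.
fn build_kademlia(local_peer_id: PeerId) -> Kademlia<MemoryStore> {
    let mut config = KademliaConfig::default();
    // Default is 60 s; reducing it to 1 s makes no difference to the warning.
    config.set_query_timeout(Duration::from_secs(1));

    let store = MemoryStore::new(local_peer_id.clone());
    Kademlia::with_config(local_peer_id, store, config)
}
```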

Am I missing something?

Expected behaviour

I expect the local setup to handle multiple requests per second with no issues.

Actual behaviour

Debug Output

FindNodeReq: ConnectionId(2)
2022-10-20T11:43:57.299813Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: FindNode { num_closer_peers: 1 } }
2022-10-20T11:43:57.300924Z DEBUG yamux::connection::stream: 44d40e66/139: eof    
2022-10-20T11:43:57.301396Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301431Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:57.301473Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:57.302384Z  INFO subspace_networking::behavior::custom_record_store: New provider record added: ProviderRecord { key: Key(b"\0 \x89\x06;^\xab\xa0hP\xc4\xb6\xc9W\x1cFV\xc8|\xce\xb6\x04&M%\xe9 y\xe3\x9a\x89\x81[\xa6"), provider: PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR"), expires: Some(Instant { t: 4087037477371 }), addresses: ["/ip4/127.0.0.1/tcp/60010"] }
2022-10-20T11:43:57.302474Z  INFO subspace_networking::node_runner: Kademlia event: InboundRequest { request: AddProvider { record: None } }
2022-10-20T11:43:57.302482Z  INFO subspace_networking::node_runner: Add provider request received: None
2022-10-20T11:43:57.899984Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:57.900226Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:58.899019Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:58.899206Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.000034Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000098Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:43:59.000151Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol    
2022-10-20T11:43:59.000197Z  WARN libp2p_kad::handler: New inbound substream to PeerId("12D3KooWSPnggmEVXnPYDjaueqRm5EjC1cVejyaRUg2V6BThVxpR") exceeds inbound substream limit. No older substream waiting to be reused. Dropping new substream.    
2022-10-20T11:43:59.000496Z  WARN libp2p_kad::handler: Inbound streams: ["WaitingUser: id=UniqueConnecId(12)", "WaitingUser: id=UniqueConnecId(14)", "WaitingUser: id=UniqueConnecId(16)", "WaitingUser: id=UniqueConnecId(18)", "WaitingUser: id=UniqueConnecId(2)", "WaitingUser: id=UniqueConnecId(21)", "WaitingUser: id=UniqueConnecId(23)", "WaitingUser: id=UniqueConnecId(25)", "WaitingUser: id=UniqueConnecId(27)", "WaitingUser: id=UniqueConnecId(29)", "WaitingUser: id=UniqueConnecId(31)", "WaitingUser: id=UniqueConnecId(33)", "WaitingUser: id=UniqueConnecId(35)", "WaitingUser: id=UniqueConnecId(37)", "WaitingUser: id=UniqueConnecId(39)", "WaitingUser: id=UniqueConnecId(42)", "WaitingUser: id=UniqueConnecId(44)", "WaitingUser: id=UniqueConnecId(46)", "WaitingUser: id=UniqueConnecId(48)", "WaitingUser: id=UniqueConnecId(5)", "WaitingUser: id=UniqueConnecId(50)", "WaitingUser: id=UniqueConnecId(52)", "WaitingUser: id=UniqueConnecId(54)", "WaitingUser: id=UniqueConnecId(56)", "WaitingUser: id=UniqueConnecId(58)", "WaitingUser: id=UniqueConnecId(60)", "WaitingUser: id=UniqueConnecId(62)", "WaitingUser: id=UniqueConnecId(64)", "WaitingUser: id=UniqueConnecId(66)", "WaitingUser: id=UniqueConnecId(68)", "WaitingUser: id=UniqueConnecId(7)", "WaitingUser: id=UniqueConnecId(9)"]    
2022-10-20T11:43:59.898505Z DEBUG libp2p_gossipsub::behaviour: Starting heartbeat    
2022-10-20T11:43:59.898595Z DEBUG libp2p_gossipsub::behaviour: Completed Heartbeat    
2022-10-20T11:43:59.950796Z DEBUG subspace_networking::node_runner: Initiate connection to known peers local_peer_id=12D3L7AUynGJcx7Lb4SAo4bR1m78UYydB4TRy2Nre6vViLHPgEem connected_peers=1
2022-10-20T11:44:00.701763Z DEBUG multistream_select::listener_select: Listener: confirming protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701828Z DEBUG multistream_select::listener_select: Listener: sent confirmed protocol: /subspace/kad/0.1.0    
2022-10-20T11:44:00.701882Z DEBUG libp2p_core::upgrade::apply: Successfully applied negotiated protocol  

Possible Solution

Version

  • I use 0.46.1 in my own branch, but the latest 0.49.0 produces the same result.

Would you like to work on fixing this bug?

No

@shamil-gadelshin
Contributor Author

I mentioned this issue on the last community call. @mxinden @thomaseizinger You already had a similar discussion, but this issue arises under a much smaller load.

@thomaseizinger
Contributor

Thanks for reporting this! This definitely doesn't look right. I am putting it on my list to work on for next week :)

@thomaseizinger thomaseizinger self-assigned this Oct 28, 2022
@thomaseizinger
Contributor

Are you polling the Swarm properly? Can you share the code that you are running?

I looked into the code and these substreams shouldn't really be sitting there idling for very long. They are waiting for the NetworkBehaviour to gather the data, which should be extremely quick, after which they can resume.

Maybe the main event loop of the behaviours is blocked for some reason? Can you reproduce the problem if you are using just the Kademlia behaviour and nothing else?
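For reference, an event loop along these lines keeps the handler progressing; a minimal sketch (the drive function and the logging are illustrative, not prescriptive):

```rust
use futures::StreamExt;
use libp2p::kad::{store::MemoryStore, Kademlia, KademliaEvent};
use libp2p::swarm::{Swarm, SwarmEvent};

// Keep the swarm making progress; if a loop like this stalls, inbound Kademlia
// substreams stay parked in the handler until the 32-stream limit is hit.
async fn drive(mut swarm: Swarm<Kademlia<MemoryStore>>) {
    loop {
        match swarm.select_next_some().await {
            SwarmEvent::Behaviour(KademliaEvent::InboundRequest { request }) => {
                println!("inbound request: {:?}", request);
            }
            other => {
                println!("swarm event: {:?}", other);
            }
        }
    }
}
```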

@thomaseizinger
Contributor

Also, can you try testing with #3074? I think that one is more correct in terms of task wake behaviour. Maybe that is the issue.

@thomaseizinger
Contributor

Friendly ping @shamil-gadelshin!

Does #3074 help at all?

@thomaseizinger thomaseizinger added the need/author-input Needs input from the original author label Nov 7, 2022
@shamil-gadelshin
Contributor Author

I tried that. The issue persists. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback; I'm going to do that this week.

Thank you for your attention to this bug. I appreciate it.

@thomaseizinger thomaseizinger removed the need/author-input Needs input from the original author label Nov 8, 2022
@thomaseizinger
Contributor

> I tried that. The issue persists. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback; I'm going to do that this week.
>
> Thank you for your attention to this bug. I appreciate it.

Ah damn. Any chance you can provide a minimal example that produces the problem?

@shamil-gadelshin
Contributor Author

Sure. I will debug my code again to rule out a silly mistake and then try to reproduce the issue with a minimal setup.

@shamil-gadelshin
Contributor Author

Here is a minimal Kademlia setup that shows the difference between start_providing and put_record or get_closest_peers. It's possible that I misuse the Kademlia API, so I'd appreciate a hint, @thomaseizinger.

@shamil-gadelshin
Contributor Author

@thomaseizinger
Contributor

Thank you! I'll have a look!

@shamil-gadelshin
Contributor Author

A friendly ping, @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

@thomaseizinger
Contributor

> A friendly ping, @thomaseizinger. Did you have a chance to look at the kad-example project? Does it reproduce the error?

Sorry, I haven't yet but I'll do so today!

@thomaseizinger
Contributor

Okay, I figured out what I think the issue is.

As per the spec, outbound substreams may be reused. Our implementation never does that (we only send one request per substream), but our inbound streams wait for additional messages on that stream and thus fill up the buffer.

There seems to be a bug where the implementation for the inbound stream does not detect that the other side has closed the stream, in which case it should stop waiting for a message.

@thomaseizinger
Contributor

Damn, that is not the issue ...

@thomaseizinger
Contributor

Okay, I have a fix.

The issue was that we had substreams waiting for a response from the behaviour even though, for AddProvider, there is no response. So instead of actually being reused, the substream sat in the WaitingUser state and never got answered.
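To illustrate the idea, here is a simplified, dependency-free sketch of the state transition; the enums and names below are illustrative, not the actual handler types:

```rust
/// Simplified model of an inbound Kademlia substream state (illustrative only).
#[derive(Debug)]
enum InboundState {
    /// Ready to read the next request from the remote.
    WaitingMessage,
    /// A request was handed to the behaviour; we wait for its answer
    /// (this is the `WaitingUser` state, renamed `WaitingBehaviour` in the fix).
    WaitingUser,
}

#[derive(Debug)]
enum Request {
    FindNode,
    GetProviders,
    /// Fire-and-forget: the spec defines no response for ADD_PROVIDER.
    AddProvider,
}

/// State the substream should move into after reading a request.
fn on_message(request: &Request) -> InboundState {
    match request {
        // These requests expect a response, so the substream has to wait for
        // the behaviour to produce one before it can be reused.
        Request::FindNode | Request::GetProviders => InboundState::WaitingUser,
        // Before the fix, AddProvider also ended up waiting for a response that
        // never comes, parking the substream until the 32-stream limit was
        // exceeded. With the fix, it goes back to reading the next message.
        Request::AddProvider => InboundState::WaitingMessage,
    }
}

fn main() {
    for request in [Request::FindNode, Request::AddProvider] {
        println!("{:?} -> {:?}", request, on_message(&request));
    }
}
```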

@thomaseizinger
Contributor

I opened a fix here: #3152

With this patch, the example you provided no longer issues the warnings. Thanks for providing that example; it was really helpful in the debugging process!

@shamil-gadelshin
Contributor Author

I will test the fix in the test project and in our main project as well. Thanks a lot!

@mergify mergify bot closed this as completed in #3152 Nov 23, 2022
mergify bot pushed a commit that referenced this issue Nov 23, 2022
Previously, we would erroneously always go into the `WaitingUser` (now called `WaitingBehaviour`) state after receiving a message on an inbound stream. However, the `AddProvider` message does not warrant a "response" from the behaviour. Thus, any incoming `AddProvider` message would result in a stale substream that would eventually be dropped as soon as more than 32 inbound streams have been opened.

With this patch, we inline the message handling into the upper match block and perform the correct state transition upon each message. For `AddProvider`, we go back into `WaitingMessage` because the spec mandates that we need to be ready to receive more messages on an inbound stream.

Fixes #3048.