Kademlia. New inbound substream to PeerId exceeds inbound substream limit. No older substream waiting to be reused. #3048
Comments
I mentioned this issue on the last community call. @mxinden @thomaseizinger You already had a similar discussion, but this issue arises under a much smaller load.
Thanks for reporting this! This definitely doesn't look right. I am putting it on my list to work on for next week :)
Are you polling the […]? I looked into the code, and these substreams shouldn't really be sitting there idling for very long. They are waiting for the […]. Maybe the main event loop of the […]
Also, can you try and test with #3074? I think this one is more correct in terms of task wake behaviour. Maybe that is the issue.
Friendly ping @shamil-gadelshin! Does #3074 help at all?
I tried that; the issue persists. Sorry for the delayed response. I was going to debug it again and provide you with extensive feedback. I'm going to do it this week. Thank you for your attention to this bug. I appreciate it.
Ah damn. Any chance you can provide a minimal example that reproduces the problem?
Sure. I will try debugging my code again to rule out a silly mistake and then try to reproduce the issue with a minimal setup.
Here is a minimal Kademlia setup that shows the difference between […]
Thank you! I'll have a look!
A friendly ping @thomaseizinger: did you have a chance to look at the kad-example project? Does it reproduce the error?
Sorry, I haven't yet, but I'll do so today!
Okay, I figured out what I think the issue is. As per the spec, outbound substreams may be reused. Our implementation never does that (we only send one request per substream), but our inbound streams wait for additional messages on that stream and thus fill up the buffer. There seems to be a bug where the implementation for the inbound stream does not detect that the other side closed the stream, in which case it should stop waiting for a message.
Damn, that is not the issue ... |
Okay, I have a fix. The issue was that we had substreams that were waiting for a response from the behaviour even though, for `AddProvider`, no response is ever expected.
I opened a fix here: #3152. With this patch, the example you provided no longer produces the warnings. Thanks for providing that example; it was really helpful in the debugging process!
I will test the fix in the test project and in our main project as well. Thanks a lot!
Previously, we would erroneously always go into the `WaitingUser` (now called `WaitingBehaviour`) state after receiving a message on an inbound stream. However, the `AddProvider` message does not warrant a "response" from the behaviour. Thus, any incoming `AddProvider` message would result in a stale substream that would eventually be dropped as soon as more than 32 inbound streams had been opened.

With this patch, we inline the message handling into the upper match block and perform the correct state transition upon each message. For `AddProvider`, we go back into `WaitingMessage` because the spec mandates that we must be ready to receive more messages on an inbound stream.

Fixes #3048.
Summary
I use the `start_providing` Kademlia API. Here is my error after 32 requests:

This error is similar to the recent discussion: #2957

However, I have only 2 peers (a local-machine setup) and 3 seconds between `start_providing` requests. Here is my local `inbound_streams` buffer:

It seems that I need to acknowledge some of the requests, but the API doesn't expect this. Also, I can send hundreds of `get_closest_peers` requests with no errors.

When the error begins to manifest, the requesting peer gets a QueryStat:

That indicates that the first of the two requests (it seems that `start_providing` issues FindNode and AddProvider requests) produces an error, while numerous GetClosestPeers requests work fine. It doesn't make sense to me.

Also, when I increase the interval between `start_providing` API calls to 17 seconds, it doesn't seem to produce an error. I tried setting the Kademlia query timeout from the default 60 seconds down to just 1 second (I suspected some pending process), but it doesn't make a difference.

Am I missing something?
Expected behaviour
I expect the local setup to handle multiple requests per second with no issues.
Actual behaviour
Debug Output
Possible Solution
Version
`0.46.1` in my own branch, but the latest `0.49.0` produces the same result.

Would you like to work on fixing this bug?
No