High warnings rate on Kusama after 0.9.31-rc6 update due to libp2p_yamux #758
Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit on requested but not yet accepted substreams is set to 25 here, and it seems to be part of a substream backpressure mechanism. It's likely that this is happening due to the overall high load of the node, and I don't quite see how to optimize it right now. @tomaka do you have any ideas on the matter?
This is really poor coding, but given the API constraints of libp2p there's no better way to implement it. The proper fix would be to rewrite the entire networking stack (including libp2p), but an alternative fix is to raise this limit. 25 seems extremely low to me no matter what. The struct seems to have a size of around 40-50 bytes, so limiting to 25 elements caps the memory usage at about 1.2 kiB. Even if we set the limit to 250000 it would still be reasonable. They'd probably accept a PR that makes this constant configurable.
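Making the constant configurable could look roughly like the following sketch. Note that `YamuxConfig`, the field name, and the setter are illustrative stand-ins, not the actual rust-libp2p API:

```rust
// Hypothetical sketch: replacing a hard-coded substream backlog limit
// with a configurable one. All names here are illustrative, not the
// real rust-libp2p types.
#[derive(Clone, Debug)]
pub struct YamuxConfig {
    /// Maximum number of inbound substreams that the remote has
    /// requested but the local node has not yet accepted. Streams
    /// beyond this limit are dropped, producing the warning above.
    max_buffered_inbound_streams: usize,
}

impl Default for YamuxConfig {
    fn default() -> Self {
        // The hard-coded value discussed above was 25; each buffered
        // entry is only ~40-50 bytes, so a far larger default would
        // still be cheap in memory.
        Self { max_buffered_inbound_streams: 25 }
    }
}

impl YamuxConfig {
    pub fn set_max_buffered_inbound_streams(&mut self, n: usize) -> &mut Self {
        self.max_buffered_inbound_streams = n;
        self
    }
}

fn main() {
    let mut cfg = YamuxConfig::default();
    // Raising the limit to 256: at ~50 bytes per entry this is still
    // only ~12.5 kiB of buffered bookkeeping.
    cfg.set_max_buffered_inbound_streams(256);
    assert_eq!(cfg.max_buffered_inbound_streams, 256);
}
```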
But why has it happened just after the upgrade? Was it a coincidence, or was it somehow caused by one of our changes (e.g. paritytech/substrate#12434)?
Probably a change that causes more substreams to be opened at the same time, or one that changes the CPU profile of the node in such a way that there is now a longer delay between two polls of the networking, or something like that.
We're observing a lot of these warnings recently on Kusama. We're also occasionally observing these errors:
Looking at the code, the latter seems to be a race condition with a peer disconnecting, possibly due to the same reason causing the warning. The reputation change logs from polkadot seem to be mostly about "Not interested in that parent hash" in bitfield-distribution, but those are nowhere near enough to cause a disconnect.
This is about multiplexed streams on a single TCP connection, right? Can we add some logging to see the protocol name of the stream that gets dropped?
Should this already be fixed? Maybe just an upgrade to 0.50 is necessary, @dmitry-markin?
The PR you linked was released in libp2p-yamux 0.41.1: https://github.com/libp2p/rust-libp2p/blob/v0.50.0/muxers/yamux/CHANGELOG.md, while the recent polkadot release https://github.com/paritytech/polkadot/blob/v0.9.36/Cargo.lock#L3727 has 0.41.0. So the fix is not included yet.
@mxinden is it possible for libp2p to require the latest versions of its deps when doing a new release? That way we would get these kinds of updates automatically.
The Versi burn-in of paritytech/polkadot#6464 looks promising so far. However, we are now occasionally seeing errors about invalid justifications.
@ordian can we get a new issue for this with more logs? Maybe also from the other side that sent these invalid justifications.
Sorry for the trouble the initial patch caused here. Great to see that you are using
Given that downstream projects like Substrate need some mechanism to upgrade to patch versions of direct dependencies, why not use the same mechanism for dependencies of dependencies? E.g. why not use dependabot to be notified of and upgrade to patch releases of all dependencies? We could release a new patch version.
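For background on why patch releases of transitive dependencies can flow in without manifest changes: Cargo's default (caret) version requirements already admit patch upgrades, and only `Cargo.lock` pins the exact version. A hypothetical manifest entry:

```toml
[dependencies]
# "0.41.0" is a caret requirement: it matches 0.41.0, 0.41.1, ...,
# but not 0.42.0. Running `cargo update -p libp2p-yamux` refreshes
# Cargo.lock to the newest matching release (e.g. 0.41.0 -> 0.41.1)
# without editing this file.
libp2p-yamux = "0.41.0"
```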
I didn't mean a patch version of
This is a known issue. It happens because the response queue in syncing may contain "stale responses", i.e. responses from peers that have since disconnected. I'll try to fix it this week.
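One way such stale responses can be purged is sketched below. All types here are illustrative stand-ins, not the actual Substrate syncing code:

```rust
// Hypothetical sketch: drop queued sync responses from peers that have
// disconnected, both eagerly on disconnect and defensively on dequeue.
use std::collections::{HashSet, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PeerId(u64);

struct PendingResponses {
    connected: HashSet<PeerId>,
    queue: VecDeque<(PeerId, Vec<u8>)>,
}

impl PendingResponses {
    fn on_disconnect(&mut self, peer: PeerId) {
        self.connected.remove(&peer);
        // Purge queued responses from the disconnected peer so they
        // are never processed as if the peer were still connected.
        self.queue.retain(|(p, _)| *p != peer);
    }

    fn next(&mut self) -> Option<(PeerId, Vec<u8>)> {
        // Defensive check, in case a disconnect raced with queueing.
        while let Some((peer, resp)) = self.queue.pop_front() {
            if self.connected.contains(&peer) {
                return Some((peer, resp));
            }
        }
        None
    }
}

fn main() {
    let mut pending = PendingResponses {
        connected: [PeerId(1), PeerId(2)].into_iter().collect(),
        queue: VecDeque::from(vec![(PeerId(1), vec![0]), (PeerId(2), vec![1])]),
    };
    pending.on_disconnect(PeerId(1));
    // Only the response from the still-connected peer survives.
    let (peer, _) = pending.next().unwrap();
    assert_eq!(peer, PeerId(2));
    assert!(pending.next().is_none());
}
```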
For the record, this will most likely be the new convention starting with the next breaking rust-libp2p release. libp2p/rust-libp2p#3715 |
Unfortunately, there is no quick fix for this on our side. What we can do is try to have all nodes play nice and stop opening new streams when they realize that the other node is slow. In yamux, we call this the "ACK backlog" (opened but not yet acknowledged streams).

With a recent refactoring in yamux, we can then align our magic number with the magic number for the allowed ACK backlog (256 as per the spec). At that point, I think we should effectively no longer drop streams anywhere, assuming all implementations adhere to the spec (and thus to the 256 ACK backlog limit). Help in getting there would be much appreciated!
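The ACK-backlog idea can be sketched as follows. The type and method names are illustrative, not the actual yamux crate API; 256 is the spec figure quoted above:

```rust
// Hypothetical sketch of client-side ACK-backlog limiting: refuse to
// open new outbound substreams while too many earlier ones are still
// unacknowledged by the remote.
const MAX_ACK_BACKLOG: usize = 256; // limit from the yamux spec

struct OutboundStreamLimiter {
    /// Streams we have opened that the remote has not yet ACKed.
    unacknowledged: usize,
}

impl OutboundStreamLimiter {
    fn new() -> Self {
        Self { unacknowledged: 0 }
    }

    /// Returns true if a new outbound stream may be opened now.
    fn try_open(&mut self) -> bool {
        if self.unacknowledged >= MAX_ACK_BACKLOG {
            // Back off: the remote is slow to accept streams, and
            // opening more would risk them being dropped.
            return false;
        }
        self.unacknowledged += 1;
        true
    }

    /// Call when the remote acknowledges one of our streams.
    fn on_ack(&mut self) {
        self.unacknowledged = self.unacknowledged.saturating_sub(1);
    }
}

fn main() {
    let mut limiter = OutboundStreamLimiter::new();
    for _ in 0..MAX_ACK_BACKLOG {
        assert!(limiter.try_open());
    }
    // The 257th open attempt is refused until an ACK arrives.
    assert!(!limiter.try_open());
    limiter.on_ack();
    assert!(limiter.try_open());
}
```

If every implementation self-limits like this, the receiver's drop path for excess substreams should, in principle, never trigger.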
@vstakhov are we still getting this error with the latest libp2p version bump? I did not see it running a local node for ~4 days with the libp2p backend, and it did not reproduce during versi-net triaging.
Indeed, we don't see these warnings on our Kusama validators in the past 7 days.
After our validators got updated to this version, there is a constant flow of warnings in the module `libp2p_yamux`. The logs are all similar.

A sample warnings rate graph is here: https://grafana.parity-mgmt.parity.io/goto/V_rs4jHVk?orgId=1

CC @dmitry-markin @altonen