Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High warnings rate on Kusama after 0.9.31-rc6 update due to libp2p_yamux #758

Closed
vstakhov opened this issue Nov 1, 2022 · 20 comments
Closed

Comments

@vstakhov
Copy link
Contributor

vstakhov commented Nov 1, 2022

After our validators got updated to this version, there is a constant flow of warnings in the module libp2p_yamux. The logs are all similar:

WARN tokio-runtime-worker libp2p_yamux: dropping (Stream a9bc8821/65) because buffer is full

The sample warnings rate graph is here: https://grafana.parity-mgmt.parity.io/goto/V_rs4jHVk?orgId=1

CC @dmitry-markin @altonen

@dmitry-markin
Copy link
Contributor

Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit of requested but not yet accepted substreams is set to 25 here, and it seems to be the part of a substream backpressure machanism. It's likely that this is happening due to the overall high load of the node, and I don't quite understand how to optimize it now.

@tomaka do you have any ideas on the matter?

@tomaka
Copy link
Contributor

tomaka commented Nov 3, 2022

This is really poor coding, but given the API constraints of libp2p there's no better way to implement it.

The proper fix would be to rewrite the entire networking stack (including libp2p), but an alternative fix is to raise this limit. 25 seems extremely low to me no matter way. The struct seems to have a size of around 40-50 bytes, so limiting to 25 elements limits the memory usage to 1.2 kiB. Even if we set the limit to 250000 it would still be reasonable.

They'd probably accept a PR that makes this constant configurable.

@vstakhov
Copy link
Contributor Author

vstakhov commented Nov 3, 2022

But why has it happened just after upgrade? Was it a coincidence or was it somehow caused by some our change (e.g. paritytech/substrate#12434)?

@tomaka
Copy link
Contributor

tomaka commented Nov 3, 2022

Probably a change that causes more substreams to be opened at the same time, or changes the CPU profile of the node in such a way that there is now a longer delay between two polling of the networking, or something like that.
If anything, paritytech/substrate#12434 should have improved the situation.

@ordian
Copy link
Member

ordian commented Dec 20, 2022

We're observing a lot of these warnings recently on Kusama. We're also obverving occasionally these errors:

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID

Looking at the code the latter seems to be a race condition with peer disconnecting possibly due to the same reason causing the warning.

The reputation changes logs for reputation changes by polkadot seem to be mostly about "Not interested in that parent hash" in bitfield-distribution, but they are not nearly enough to cause a disconnect.

@ordian
Copy link
Member

ordian commented Dec 21, 2022

image

We also higher finality lag. As you can see approval-checking lag is low and disputes-lag is zero. So that means GRANDPA votes are propagated or accounted for slower. Probably this error explains it:

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID

@eskimor
Copy link
Member

eskimor commented Dec 21, 2022

This is about multiplexed streams on a single TCP connnection - right? Can we add some logging to see the protocol name of the stream that gets dropped?

@eskimor
Copy link
Member

eskimor commented Dec 21, 2022

Should already be fixed? Maybe just an uprade to 0.50 is necessary @dmitry-markin ?

@ordian
Copy link
Member

ordian commented Dec 21, 2022

Should already be fixed? Maybe just an uprade to 0.50 is necessary @dmitry-markin ?

The PR you linked was released in libp2p-yamux 0.41.1: https://github.com/libp2p/rust-libp2p/blob/v0.50.0/muxers/yamux/CHANGELOG.md and the recent polkadot release: https://github.com/paritytech/polkadot/blob/v0.9.36/Cargo.lock#L3727 has 0.41.0. So cargo update -p libp2p-yamux should do the trick if the fix is indeed relevant here.

@bkchr
Copy link
Member

bkchr commented Dec 21, 2022

@mxinden is it possible for libp2p to require the latest versions of their deps when doing a new release? This way we would have get these kind of updates automatically.

@ordian
Copy link
Member

ordian commented Dec 21, 2022

The Versi burn-in of paritytech/polkadot#6464 looks promising so far. The libp2p_yamux warnings have disappeared.

However, ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID still seem to be there.
We also observe:

WARN tokio-runtime-worker sync: 💔 Invalid block justification provided by 12D3KooWQREC4ZFVMRzn4sZnK2emxw1zQ6b4UaYQ9XXueNkzzXq5: requested: 0x070a207bc9970053529aaa3e90a7c7ae7509abb1733448583b58901acfb2ec2e got: 0x61c5297308d8a54489bc2d3db79f8d1dd60beefe600a9033dc04f3803c4dcadd
WARN tokio-runtime-worker sync: 💔 Error importing block <snip>: block has an unknown parent
WARN tokio-runtime-worker sync: 💔 Verification failed for block 0xb70f9820ad892689d2fdbfa9e2067348de4cfb96ae37ce880c36d4a93af81f36 received from peer: 12D3KooWLNZ8yhBLSD6dGHAio2RKsnFPNaBtk6cdagv9nxRHVsd1, "UnknownBlock: State already discarded for 0xe2887aa9553bcfa47ec80016f209e1c194244c9c263ea3c5f71c296a104d2150"

@bkchr
Copy link
Member

bkchr commented Dec 21, 2022

@ordian can we get a new issue for this with more logs? Maybe also from the other side that send this invalid justifications.

@mxinden
Copy link

mxinden commented Jan 2, 2023

Sorry for the trouble the initial patch caused here. Great to see that you are using libp2p-yamux >=v0.41.1.

@mxinden is it possible for libp2p to require the latest versions of their deps when doing a new release? This way we would have get these kind of updates automatically.

Given that downstream projects like Substrate need some mechanism to upgrade to patch versions of direct dependencies, why not use the same mechanism for dependencies of dependencies? E.g. why not use dependabot to notify and upgrade to patch releases of all dependencies?

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessary restrict version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyways, is your proposal worth it?

@bkchr
Copy link
Member

bkchr commented Jan 2, 2023

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessary restrict version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyways, is your proposal worth it?

I didn't mean a patch version of libp2p. I meant on every major/minor version release of libp2p you would pin the sub libp2p dependencies with their latest patch release.

@tdimitrov
Copy link
Contributor

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a peer ID of an unknown peer occurred twice on Kusama this week.
Time range of the second occurrence -> 2023-04-08 12:27:01 to 2023-04-08 12:29:34 UTC.

@altonen
Copy link
Contributor

altonen commented Apr 10, 2023

@tdimitrov

This is a known issue. Happens because the response queue in syncing may contain "stale responses", responses from peers who have disconnected. I'll try to fix it this week.

@mxinden
Copy link

mxinden commented Apr 28, 2023

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessary restrict version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyways, is your proposal worth it?

I didn't mean a patch version of libp2p. I meant on every major/minor version release of libp2p you would pin the sub libp2p dependencies with their latest patch release.

For the record, this will most likely be the new convention starting with the next breaking rust-libp2p release. libp2p/rust-libp2p#3715

@thomaseizinger
Copy link

Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit of requested but not yet accepted substreams is set to 25 here, and it seems to be the part of a substream backpressure machanism. It's likely that this is happening due to the overall high load of the node, and I don't quite understand how to optimize it now.

Unfortunately, yamux doesn't have any backpressure mechanism on newly opened streams. Thus, if the current node is slower in accepting new streams than the remote is in opening new ones, our only choice is to drop new streams to avoid a resource exhaustion. There is unfortunately no correct magic number here on when to do that.

What we can do is try and have all nodes play nice and stop opening new streams when they realize that the other node is slow. In yamux, we call this the "ACK backlog" (opened but not yet acknowledged streams). With a recent refactoring in rust-yamux, this is now possible to implement. It is tracked here: libp2p/rust-yamux#150

We can then align our magic number to the magic number for the allowed ACK backlog (256 as per spec). At that point, I think we should effectively no longer drop streams anywhere, assuming all implementations adhere to the spec (and thus to the 256 ACK backlog limit).

Help in getting there would be much appreciated!

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* Move to on_idle with weight check

* fix breaking tests

* fix tests

* Add comment about weight estimation

* Fix weight consumed

* Fix tests
helin6 pushed a commit to boolnetwork/polkadot-sdk that referenced this issue Feb 5, 2024
…tech#758)

* Do not validate transaction cost on non-transactional context

* fmt
@lexnv
Copy link
Contributor

lexnv commented Jul 9, 2024

@vstakhov are we still getting this error with the latest libp2p version bump?

I did not see this running a local node for ~ 4 days with libp2p backend. And did not reproduce during versi-net triaging

@ordian ordian closed this as completed Jul 9, 2024
@github-project-automation github-project-automation bot moved this from Backlog 🗒 to Blocked ⛔️ in Networking Jul 9, 2024
@ordian
Copy link
Member

ordian commented Jul 9, 2024

Indeed, don't see these warnings on our Kusama validators in the past 7 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Blocked ⛔️
Development

No branches or pull requests