High warnings rate on Kusama after 0.9.31-rc6 update due to libp2p_yamux #758
Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit on requested but not yet accepted substreams is set to 25 here, and it seems to be part of a substream backpressure mechanism. It's likely that this is happening due to the overall high load of the node, and I don't quite see how to optimize it right now. @tomaka do you have any ideas on the matter?
This is really poor coding, but given the API constraints of libp2p there's no better way to implement it. The proper fix would be to rewrite the entire networking stack (including libp2p), but an alternative fix is to raise this limit. 25 seems extremely low to me no matter what. The struct seems to have a size of around 40-50 bytes, so limiting to 25 elements caps the memory usage at about 1.2 kiB. Even if we set the limit to 250000 it would still be reasonable. They'd probably accept a PR that makes this constant configurable.
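Making the constant configurable could look roughly like the following sketch. Note that `YamuxConfig`, the field name, and the setter are illustrative stand-ins, not the actual rust-libp2p API:

```rust
// Hypothetical sketch: replacing a hard-coded substream backlog limit
// with a configurable one. All names here are illustrative, not the
// real rust-libp2p types.
#[derive(Clone, Debug)]
pub struct YamuxConfig {
    /// Maximum number of inbound substreams that the remote has
    /// requested but the local node has not yet accepted. Streams
    /// beyond this limit are dropped, producing the warning above.
    max_buffered_inbound_streams: usize,
}

impl Default for YamuxConfig {
    fn default() -> Self {
        // The hard-coded value discussed above was 25; each buffered
        // entry is only ~40-50 bytes, so a far larger default would
        // still be cheap in memory.
        Self { max_buffered_inbound_streams: 25 }
    }
}

impl YamuxConfig {
    pub fn set_max_buffered_inbound_streams(&mut self, n: usize) -> &mut Self {
        self.max_buffered_inbound_streams = n;
        self
    }
}

fn main() {
    let mut cfg = YamuxConfig::default();
    // Raising the limit to 256: at ~50 bytes per entry this is still
    // only ~12.5 kiB of buffered bookkeeping.
    cfg.set_max_buffered_inbound_streams(256);
    assert_eq!(cfg.max_buffered_inbound_streams, 256);
}
```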
But why has it happened just after the upgrade? Was it a coincidence, or was it somehow caused by one of our changes (e.g. paritytech/substrate#12434)?
Probably a change that causes more substreams to be opened at the same time, or one that changes the CPU profile of the node in such a way that there is now a longer delay between two polls of the networking, or something like that.
We're observing a lot of these warnings recently on Kusama. We're also occasionally observing these errors:
Looking at the code, the latter seems to be a race condition with a peer disconnecting, possibly due to the same reason causing the warning. The reputation change logs from polkadot seem to be mostly about "Not interested in that parent hash" in bitfield-distribution, but those are nowhere near enough to cause a disconnect.
This is about multiplexed streams on a single TCP connection, right? Can we add some logging to see the protocol name of the stream that gets dropped?
Should this already be fixed? Maybe just an upgrade to 0.50 is necessary, @dmitry-markin?
The PR you linked was released in libp2p-yamux 0.41.1: https://github.com/libp2p/rust-libp2p/blob/v0.50.0/muxers/yamux/CHANGELOG.md, while the recent polkadot release https://github.com/paritytech/polkadot/blob/v0.9.36/Cargo.lock#L3727 has 0.41.0. So the fix is not included yet.
@mxinden is it possible for libp2p to require the latest versions of its deps when doing a new release? That way we would get these kinds of updates automatically.
The Versi burn-in of paritytech/polkadot#6464 looks promising so far. However, we are now occasionally seeing errors about invalid justifications.
@ordian can we get a new issue for this with more logs? Maybe also from the other side that sent these invalid justifications.
Sorry for the trouble the initial patch caused here. Great to see that you are using
Given that downstream projects like Substrate need some mechanism to upgrade to patch versions of direct dependencies, why not use the same mechanism for dependencies of dependencies? E.g. why not use dependabot to be notified of and upgrade to patch releases of all dependencies? We could release a new patch version.
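For background on why patch releases of transitive dependencies can flow in without manifest changes: Cargo's default (caret) version requirements already admit patch upgrades, and only `Cargo.lock` pins the exact version. A hypothetical manifest entry:

```toml
[dependencies]
# "0.41.0" is a caret requirement: it matches 0.41.0, 0.41.1, ...,
# but not 0.42.0. Running `cargo update -p libp2p-yamux` refreshes
# Cargo.lock to the newest matching release (e.g. 0.41.0 -> 0.41.1)
# without editing this file.
libp2p-yamux = "0.41.0"
```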
I didn't mean a patch version of
This is a known issue. It happens because the response queue in syncing may contain "stale responses", i.e. responses from peers that have since disconnected. I'll try to fix it this week.
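One way such stale responses can be purged is sketched below. All types here are illustrative stand-ins, not the actual Substrate syncing code:

```rust
// Hypothetical sketch: drop queued sync responses from peers that have
// disconnected, both eagerly on disconnect and defensively on dequeue.
use std::collections::{HashSet, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PeerId(u64);

struct PendingResponses {
    connected: HashSet<PeerId>,
    queue: VecDeque<(PeerId, Vec<u8>)>,
}

impl PendingResponses {
    fn on_disconnect(&mut self, peer: PeerId) {
        self.connected.remove(&peer);
        // Purge queued responses from the disconnected peer so they
        // are never processed as if the peer were still connected.
        self.queue.retain(|(p, _)| *p != peer);
    }

    fn next(&mut self) -> Option<(PeerId, Vec<u8>)> {
        // Defensive check, in case a disconnect raced with queueing.
        while let Some((peer, resp)) = self.queue.pop_front() {
            if self.connected.contains(&peer) {
                return Some((peer, resp));
            }
        }
        None
    }
}

fn main() {
    let mut pending = PendingResponses {
        connected: [PeerId(1), PeerId(2)].into_iter().collect(),
        queue: VecDeque::from(vec![(PeerId(1), vec![0]), (PeerId(2), vec![1])]),
    };
    pending.on_disconnect(PeerId(1));
    // Only the response from the still-connected peer survives.
    let (peer, _) = pending.next().unwrap();
    assert_eq!(peer, PeerId(2));
    assert!(pending.next().is_none());
}
```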
For the record, this will most likely be the new convention starting with the next breaking rust-libp2p release. libp2p/rust-libp2p#3715 |
Unfortunately, there is no quick fix for this on our side. What we can do is try to have all nodes play nice and stop opening new streams when they realize that the other node is slow. In yamux, we call this the "ACK backlog" (opened but not yet acknowledged streams).

With a recent refactoring in yamux, we can then align our magic number with the magic number for the allowed ACK backlog (256 as per the spec). At that point, I think we should effectively no longer drop streams anywhere, assuming all implementations adhere to the spec (and thus to the 256 ACK backlog limit). Help in getting there would be much appreciated!
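The ACK-backlog idea can be sketched as follows. The type and method names are illustrative, not the actual yamux crate API; 256 is the spec figure quoted above:

```rust
// Hypothetical sketch of client-side ACK-backlog limiting: refuse to
// open new outbound substreams while too many earlier ones are still
// unacknowledged by the remote.
const MAX_ACK_BACKLOG: usize = 256; // limit from the yamux spec

struct OutboundStreamLimiter {
    /// Streams we have opened that the remote has not yet ACKed.
    unacknowledged: usize,
}

impl OutboundStreamLimiter {
    fn new() -> Self {
        Self { unacknowledged: 0 }
    }

    /// Returns true if a new outbound stream may be opened now.
    fn try_open(&mut self) -> bool {
        if self.unacknowledged >= MAX_ACK_BACKLOG {
            // Back off: the remote is slow to accept streams, and
            // opening more would risk them being dropped.
            return false;
        }
        self.unacknowledged += 1;
        true
    }

    /// Call when the remote acknowledges one of our streams.
    fn on_ack(&mut self) {
        self.unacknowledged = self.unacknowledged.saturating_sub(1);
    }
}

fn main() {
    let mut limiter = OutboundStreamLimiter::new();
    for _ in 0..MAX_ACK_BACKLOG {
        assert!(limiter.try_open());
    }
    // The 257th open attempt is refused until an ACK arrives.
    assert!(!limiter.try_open());
    limiter.on_ack();
    assert!(limiter.try_open());
}
```

If every implementation self-limits like this, the receiver's drop path for excess substreams should, in principle, never trigger.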
@vstakhov are we still getting this error with the latest libp2p version bump? I did not see it running a local node for ~4 days with the libp2p backend, and it did not reproduce during versi-net triaging.
Indeed, we don't see these warnings on our Kusama validators in the past 7 days.
After our validators got updated to this version, there is a constant flow of warnings in the module `libp2p_yamux`. The logs are all similar.

A sample warnings rate graph is here: https://grafana.parity-mgmt.parity.io/goto/V_rs4jHVk?orgId=1

CC @dmitry-markin @altonen