Kademlia bootstrap gets stuck forever in some cases #5432

Closed
nazar-pc opened this issue May 30, 2024 · 6 comments

@nazar-pc
Contributor

We have been getting user reports (and I have reproduced it myself a few times) that sync in our protocol gets stuck forever. The implementation is based on libp2p, and after many attempts I found that Kademlia bootstrap itself sometimes gets stuck.

Specifically, the logs look like this:

2024-05-30T12:34:46.857429Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=1
2024-05-30T12:34:58.093234Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=2
2024-05-30T12:35:10.464213Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=3
2024-05-30T12:35:22.352168Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=4
2024-05-30T12:35:46.703529Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=5
2024-05-30T12:36:00.706340Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=6

2024-05-30T12:36:24.450606Z ERROR Consensus: yamux::connection: 8dbe1bae: socket error: decode error: i/o error: unexpected end of file
2024-05-30T12:45:37.098341Z  INFO ...

Successful bootstrapping takes ~3 minutes and looks like this:

2024-05-30T12:38:59.643710Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=0
2024-05-30T12:39:10.653836Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=1
2024-05-30T12:39:21.742561Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=2
2024-05-30T12:39:42.694996Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=3
2024-05-30T12:39:55.236701Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=4
2024-05-30T12:40:07.208918Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=5
2024-05-30T12:40:28.087114Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=6
2024-05-30T12:40:49.569203Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=7
2024-05-30T12:41:01.281595Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=8
2024-05-30T12:41:11.808284Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=9
2024-05-30T12:41:32.355007Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=10
2024-05-30T12:41:52.687556Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=11
2024-05-30T12:42:03.053404Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=12
2024-05-30T12:42:13.888330Z  INFO Consensus: subspace_networking::node_runner: Kademlia bootstrapping... bootstrap_step=13
2024-05-30T12:42:13.888377Z  INFO Consensus: subspace_networking::node_runner: Bootstrap finished

bootstrap_step above corresponds to KademliaEvent::OutboundQueryProgressed events, and when KademliaEvent::OutboundQueryProgressed.step.last == true, "Bootstrap finished" is printed.
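To make the logging logic above concrete, here is a minimal, dependency-free sketch of it. It does not use libp2p itself: `Step`, `BootstrapTracker` and `BootstrapStatus` are hypothetical stand-ins, and only the `last` flag is modeled on the `step.last` field mentioned above. Each progress event bumps a counter, and bootstrap counts as finished only once an event with `last == true` arrives. In the stuck case that final event never shows up, so the tracker stays in progress indefinitely.

```rust
/// Hypothetical stand-in for the `step` carried by a query progress event.
/// Only the `last` flag comes from the description above; `count` is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Step {
    pub count: usize,
    pub last: bool,
}

/// What the tracker knows about bootstrap after the events seen so far.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BootstrapStatus {
    InProgress { steps_seen: usize },
    Finished { steps_seen: usize },
}

/// Counts progress events and remembers whether the final one was seen.
#[derive(Debug, Default)]
pub struct BootstrapTracker {
    steps_seen: usize,
    finished: bool,
}

impl BootstrapTracker {
    /// Record one progress event and return the updated status.
    pub fn on_progress(&mut self, step: Step) -> BootstrapStatus {
        self.steps_seen += 1;
        if step.last {
            // Only an event flagged as last marks bootstrap as finished.
            self.finished = true;
        }
        self.status()
    }

    /// Current status without consuming a new event.
    pub fn status(&self) -> BootstrapStatus {
        if self.finished {
            BootstrapStatus::Finished { steps_seen: self.steps_seen }
        } else {
            BootstrapStatus::InProgress { steps_seen: self.steps_seen }
        }
    }
}
```

In a real event loop, `on_progress` would be called from the branch that handles the progress event, and a status that stays `InProgress` for too long is the symptom reported here.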

I believe the Yamux error is not related to this; we see these errors periodically and they don't seem to break any protocols fundamentally.

Now I'm wondering whether this is related to #5418 in some way, due to the underlying lookups done in both cases.

libp2p-kad 0.45.3 (latest at the moment of writing), disjoint_query_paths: true

@nazar-pc
Contributor Author

nazar-pc commented May 31, 2024

Here is the RUST_LOG=trace output of an app that is trying to do Kademlia bootstrapping (and nothing else):
benchmark-6-share.log.zip
And here is a gdb backtrace of the process, which was stuck for all intents and purposes for hours and no longer prints any logs:
benchmark-6.backtrace.log.zip

Can be reproduced after multiple attempts with https://github.com/subspace/subspace/tree/dsn-sync-getting-stuck-wip by running:

cargo run --example benchmark -- --bootstrap-nodes=/dns/bootstrap-0.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWK7NuL4S6aEdy5gELnvhCGo6EyrWVARnBy7W4AJTVkaF1 --bootstrap-nodes=/dns/bootstrap-1.gemini-3h.subspace.network/tcp/30533/p2p/12D3KooWQK33n2raSXzjH8JyqbFtczBmbwZiK9Tpicdw3rveJesj --protocol-version=0c121c75f4ef450f40619e1fca9d1e8e7fbabc42c895bc4790801e85d5a91c34 --out-peers=100 --pending-out-peers=100 simple --max-pieces=1 --start-with=256 --retries=0

@nazar-pc
Contributor Author

Tried to race bootstrapping with a simple tokio timer future and was not able to reproduce this after countless attempts, while it doesn't take too many attempts to reproduce otherwise 🤔
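For readers unfamiliar with the pattern, racing an operation against a timer looks roughly like the sketch below. This is not the code used in the experiment above: to stay dependency-free it swaps the tokio timer future for a standard-library thread plus a channel with a receive timeout, and `run_with_deadline` and `Outcome` are hypothetical names. In async tokio code, `tokio::time::timeout` plays the same role.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Result of racing a piece of work against a deadline.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome<T> {
    /// The work finished before the deadline.
    Completed(T),
    /// The deadline elapsed first.
    TimedOut,
    /// The worker ended without producing a value (for example, it panicked).
    Failed,
}

/// Run `work` on a separate thread and wait at most `deadline` for its result.
pub fn run_with_deadline<T, F>(work: F, deadline: Duration) -> Outcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a send error is ignored.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Outcome::Completed(value),
        Err(RecvTimeoutError::Timeout) => Outcome::TimedOut,
        Err(RecvTimeoutError::Disconnected) => Outcome::Failed,
    }
}
```

One difference from the async version: on timeout the worker thread here keeps running in the background, whereas dropping a timed-out future cancels it. Either way, the race only bounds how long the caller waits; it does not explain why the underlying operation stalls.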

@stormshield-frb
Contributor

Hi @nazar-pc. Sorry for the issue you encountered; I think it's my fault :/

I had opened a PR to fix it (#5349), but since I have not been very active on GitHub these last weeks, it has not moved forward much.

@nazar-pc
Contributor Author

nazar-pc commented Jun 5, 2024

Interesting, it does look related. I hope it will be merged soon, subscribed. Thanks!

@stormshield-frb
Contributor

Should be resolved by #5349, which was merged today. If you agree, @nazar-pc, I think we can close this issue.

@nazar-pc
Contributor Author

Let's close. I'll let you know if the issue remains after we upgrade to the fixed version at Subspace.
