-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bitswap peer connection race #432
Comments
Thx a lot this is pretty clear, while working on #423 I was looking at changing this part of the code (backpressure free prefix-sum-like state versioning that would more aggressively use the |
@Jorropo I would also suggest: This seems like not the best behavior -- it should probably return an error to the caller to give the caller the opportunity to retry. |
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
I think this is fine because libp2p's .Connect call does not return before the .Connected callback has returned. This new code also have similar dedup logic and state transitions. Fixes #432
* feat(connecteventmanager): block Connected() until accepted Ref: #432 Minimal attempt at solving #432 * fix(connecteventmanager): less complex channel signalling * fix(connecteventmanager): handle change queue edge cases and closure * fix(connecteventmanager): add test to confirm sync Connected() call flow
Per FIL Slack (https://filecoinproject.slack.com/archives/C03FFEVK30F/p1692628181061129?thread_ts=1692232716.286059&cid=C03FFEVK30F ), Kubo/Boxo maintainers will discuss/estimate next steps during 2023-08-22 standup and respond here by EOD 2023-08-22 as we understand this is important to fix for getting range requests landed for Rhea in Lassie. |
2023-08-22 maintainer conversation: General notes:
Next steps for @Jorropo for moving this forward:
Before getting to this though @Jorropo is tied up with:
@Jorropo doesn't expect he will have #327 completed until the end of the week, which means the estimate/decision/date won't come until week of 2023-08-28. Per slack message, Bedrock is unblocked for the short term. |
The previous maintainer conversation comment was updated to fill in more details and be clear that further work here won't get picked up until week of 2023-08-28. |
Want to note this is NOT a true statement. We use stock bitswap. We do not disable broadcasts. This has never been possible. The only replacement code we offer is to the FindProvidersAsync call, substituting IPNI results. |
In fact, to clarify our own logs show Bitswap continues to broadcast, meaning this error should effect ANY client. The problem is while it broadcasts want haves, it never switches to requesting a WANT-BLOCK. |
@hannahhoward wouldn't at-worst after 30s the wantlist rebroadcast happens and unblock the session, or after 2 RTTs ?
This would be wasting a roundtrip, but AFAIT is a quick win. |
Listen, I was surprised as heck that rebroadcast doesn't do anything. But that's what the logs show on our side. I believe the fundamental problem is that the PeerManager.SendWants fails silently when there is no peer queue. It should error and the calling code (SessionWantSender in this case) should figure out what to do with that. You can see the chain of rebroadcasts at the end of the log for this test run.
|
Ok intresting, what I was trying to get at is that something is left in a bad invalid state. |
What are the chances of this being looked at in the near future? We've had to pin Lassie to the commit I got into |
Hi @rvagg - we missed on answering your question sooner. If I understand correctly, the initial proposed fix was merged (#435 ) but then reverted (#444 ). This hasn't been a priority for Boxo maintainers to directly address, but we understand it puts Lassie in a tough spot. If there's a different approach that accounts for the feedback from the reverted attempt, I'll make sure maintainers give it review. Is #452 such an attempt? If so, please mark it out of draft when it's ready for review. Thanks. |
I believe #452 is just a proposal for how this issue might be addressed, not a proper fix, it needs to be taken further. Here's my understanding of the state of this issue:
Not sure what else is to be done about this with the limited people-bandwidth we have and the current state of prioritisation across our orgs. 🤷 |
Bitswap relies on "being connected" to mean that state has been fully propagated. In other words, this means that a call to |
I can also confirm that this bug exists and that it has some far-reaching implications. I'm currently on a private libp2p network with 600 peers, and after actual file synchronization speed tests, I've seen significant rate increases with #452. In the course of testing, I discovered an interesting phenomenon:
From a quick look at boxo's code, I realize that |
* feat(connecteventmanager): block Connected() until accepted Ref: #432 Minimal attempt at solving #432 * fix(connecteventmanager): less complex channel signalling * fix(connecteventmanager): handle change queue edge cases and closure * fix(connecteventmanager): add test to confirm sync Connected() call flow changelog: put the 435 fix in the right version fix(connecteventmanager): clean up tests for new synchronous flow
* feat(connecteventmanager): block Connected() until accepted Ref: #432 Minimal attempt at solving #432 * fix(connecteventmanager): less complex channel signalling * fix(connecteventmanager): handle change queue edge cases and closure * fix(connecteventmanager): add test to confirm sync Connected() call flow changelog: put the 435 fix in the right version fix(connecteventmanager): clean up tests for new synchronous flow
Experienced via Lassie's CI suite which has suddenly become flaky with some new workloads. Using the testing/sws-logging branch here (started by @hannahhoward, continued by me) to diagnose with additional debug printing. It evidences in an inability to communicate with a peer that we know has the block, and bitswap has been told has the block.
See logs below, which come from here using 715365a on the testing/sws-logging branch. Here's what I believe happens:
session#findMorePeers
->ProviderQueryManager#FindProvidersAsync
ProviderQueryManager#findProviderWorker
does its thing, processing the queue of providers to connect to, performs apqm.network.ConnectTo()
which is goes through (bitswap/network/ipfs_impl.go)bsnet#ConnectTo()
to just do a basichost.Connect()
Connected()
notification which is received by (bitswap/network/ipfs_impl.go)connectEventManager#Connected()
and it does ac.setState(p, stateResponsive)
on that peer.ProviderQueryManager#FindProvidersAsync
has returned, theConnect()
returned, we even got aConnected()
, so theSession
invokessessionWantSender#Update(peer, nil, have, nil)
, i.e. thepeer
has the cidhave
, this is a "change".sessionWantSender
we go fromonChange()
toprocessUpdates()
toupdateWantBlockPresence()
towantInfo#setPeerBlockPresence()
towantInfo#calculateBestPeer()
- we see these in the logsprocessUpdates()
,sessionWantSender
moves on tosendNextWants()
, which compiles them and callssendWants()
which in turn callsPeerManager#SendWants()
.PeerManager
doesn't have apeerQueues
entry for this peer .. yet ..connectEventManager
, theworker()
(goroutine) loop finally picks up a change, and sees thatp
is nowstateResponsive
so it ends up callingBitswap#PeerConnected()
which callsClient#PeerConnected()
which callsPeerManager#Connected()
which callsPeerManager#getOrCreate()
that finally sets up apeerQueues
entry for that peer; so it's now ready to talk ("hey, what'd I miss?").I'm a bit fuzzy on what the logs show beyond here and why it doesn't manage to rectify the situation with retries; I guess because it had one chance to take the peer from
FindProvidersAsync
and ask it for the block, and it failed that one chance and there isn't really another.The async disconnect in
connectEventManager
seems to be the main problem here, perhaps a call toConnected()
shouldn't return until the state change has been fully propagated.Relevant logs
The text was updated successfully, but these errors were encountered: