generated from ipfs/ipfs-repository-template
-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bitswap/server: wantlist overflows fails in a toxic maner preventing any data transfer #527
Closed
Comments
Jorropo
added
need/triage
Needs initial labeling and prioritization
P1
High: Likely tackled by core team if no one steps up
kind/bug
A bug in existing code (including security flaws)
and removed
need/triage
Needs initial labeling and prioritization
labels
Dec 19, 2023
Jorropo
changed the title
Bitswap server fails in a toxic maner when the client exceed 1024 entries in the wantlist by a *lot* preventing any data transfer
bitswap/server: wantlist overflows fails in a toxic maner preventing any data transfer
Dec 19, 2023
I don't think it was written down anywhere, but me and @aschmahmann had verbal conversation
|
gammazero
added a commit
that referenced
this issue
Jun 21, 2024
bitswap wantlist overflow handling now selects newer entries when truncating wantlist. This fix prevents the retained portion of the wantlist from filling up with CIDs that the server does not have. Fixes #527
gammazero
added a commit
that referenced
this issue
Jun 21, 2024
bitswap wantlist overflow handling now selects newer entries when truncating wantlist. This fix prevents the retained portion of the wantlist from filling up with CIDs that the server does not have. Fixes #527
gammazero
added a commit
that referenced
this issue
Jun 23, 2024
wantlist overflow handling now cancels existing entries to make room for newer requests. This fix prevents the wantlist from filling up with CIDs that the server does not have. Fixes #527
gammazero
added a commit
that referenced
this issue
Jun 23, 2024
wantlist overflow handling now cancels existing entries to make room for newer requests. This fix prevents the wantlist from filling up with CIDs that the server does not have. Fixes #527
This was referenced Jun 24, 2024
gammazero
added a commit
that referenced
this issue
Jul 3, 2024
wantlist overflow handling now cancels existing entries to make room for newer requests. This fix prevents the wantlist from filling up with CIDs that the server does not have. Fixes #527
wenyue
pushed a commit
to wenyue/boxo
that referenced
this issue
Oct 17, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This is a bug I introduced in 9cb5cb5 when fixing CVE-2023-25568.
Here is a screenshot of our gateway's wantlists:
You can see that many gateways instances have more than 1024 entries.
This cause issues with:
boxo/bitswap/server/internal/decision/engine.go
Lines 669 to 677 in 9cb5cb5
boxo/bitswap/server/internal/decision/engine.go
Line 765 in 9cb5cb5
boxo/bitswap/server/internal/decision/engine.go
Line 851 in 9cb5cb5
Theses three sections of code are needed to fix CVE-2023-25568, they make the server ignore any new queries overflowing 1024. Previously the server would remember infinitely many CIDs eventually OOMing.
However the truncation code always prefer keeping existing entries over new ones.
This means we can get in a situation where we get stuck:
DONT_HAVE
.The point of this feature is for
-1
scalling, if the server is also downloading the same blocks, it might get them after having already sentDONT_HAVE
, then the server can either send the block or the aHAVE
message overriding the previousDONT_HAVE
.Eventually reaching 0
(note: the client can send CANCEL or a message with the full flag and theses 1024 "stuck" CIDs will be cleaned out properly, but the client isn't smart enough to realize this is happening)
I see three possible solutions:
Truncate left instead of truncating right.
This means newer entries would override older ones, instead of having older ones stick around.
My original thinking when writing the fix is that the server currently does not apply back-pressure, so we should let existing entries first so we can make some progress and clear them out when we send responses. If we always cancel what is already in flight a client that is too fast would never get any blocks because everything would be canceled before being sent.
This could be fixed by only canceling what is already queued but not entries that are actively being worked on.
Completely rewrite the server and architecture it request response
If we handled messages in a request response manner we could apply back-pressure when the client is sending too many queries.
We could still have a global
CID → [PeerID]
wantlist map however this would be limited to CIDs we don't have, for-1
scalling. CIDs we already have would be completely skip this flow and the queue and be handled purely in the message handler.We would still need a limit on how many
-1
tracked CIDs we have, but that means this bug would only impact blocks we don't have which SGTM.Make the client aware and handle rotation itself by truncating itself and handle canceling overwrote entries.
This is nice because it does not require updating the servers, only the client, so fixed client can download from buggy servers.
Sounds PITA to code and I don't really want to do it tbh. Also this means we commit to having broke the protocol.
The text was updated successfully, but these errors were encountered: