
Data Transfer Stall #7972

Closed
Stebalien opened this issue Mar 11, 2021 · 28 comments
Assignees
Labels
kind/bug A bug in existing code (including security flaws) · P1 High: Likely tackled by core team if no one steps up

Comments

@Stebalien
Member

Stebalien commented Mar 11, 2021

Update (May 28th): We have deployed #8082 to monitor Bitswap with the latest fix ipfs/go-bitswap#477. We are still running into the 'this request was not processed in time' error, but this is likely a false positive caused by an incorrect timeout setting in the monitoring tool, which we still need to calibrate for a GW node in production. We need to make this setting a normal IPFS config option that can be modified through the standard CLI, so we can better calibrate the timeout without restarting the node. (There's an issue about it that might be picked up next week: #8157.)


Version information:

go-ipfs v0.8.0

Description:

We saw a bitswap stall on some of the preload nodes where they were unable to download any content, even from connected nodes (ipfs/js-ipfs#3577).

This looks related to #5183 but that issue was mostly about specific content. That is, a specific request would fail, and we'd get into a stuck state where we couldn't download that specific content.

The preload nodes did not appear to be spinning, stuck, etc. but restarting them fixed the issue.

Possible causes:

  • Deadlock somewhere.
  • Full wantlist? Maybe we have some cases where wants are getting stuck in the wantlist and never clearing, building up over time?
@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Mar 11, 2021
@aschmahmann aschmahmann added P1 High: Likely tackled by core team if no one steps up need/analysis Needs further analysis before proceeding and removed need/triage Needs initial labeling and prioritization labels Apr 5, 2021
@BigLep
Contributor

BigLep commented Apr 5, 2021

@aschmahmann is going to provide more context/info.

@aschmahmann
Contributor

There was a report that a node was not sending data over bitswap despite having the data locally. The user was able to run ipfs block stat --offline <cid> and get a result, and yet the data seemed not to transfer over Bitswap.

I investigated further using vole https://github.com/aschmahmann/vole to directly send the node a Bitswap request such as:

vole.exe bitswap check <cid> <peer-multiaddr> and got the response {"Found":false,"Responded":false,"Error":null}, which means we got neither a response nor an error indicating whether the other peer has the data. IIRC the peer was still able to respond to requests from other peers about other CIDs, but I don't recall at the moment what other types of probing I tried from my machine 😟.

@Stebalien
Member Author

Matt noted that disabling QUIC seemed to help. This issue could be related to libp2p/go-libp2p-quic-transport#57 (where we could be failing to time out somewhere in bitswap, backing everything up).

@schomatis schomatis self-assigned this Apr 12, 2021
@schomatis
Contributor

Although I have a vague understanding of the high-level issue, I can't find enough information to translate it into an actionable item.

  • Is the idea (at least for the moment) to do a speculative static analysis based on the limited information, or are there tests or dynamic analysis where I could get more information?
    • Static: I don't have much experience with the bitswap code, so I'm going to need pointers to a couple of entry points if the analysis should go that way.
    • Dynamic: can I test this locally?
      • If yes, can I run a single ipfs daemon node or do I need to set up a more involved local network?
      • If no, do I need access to the gateways?

@aschmahmann
Contributor

aschmahmann commented Apr 13, 2021

Dynamic: can I test this locally?

I do not know of a way to reproduce this issue. My recollection from some comments @gmasgras made is that the situation we saw was:

  • Gateway A fetches CID X from the network (or our cluster, I forget)
  • Gateway B is peered to A
  • B tries to fetch X and fails even though A has the data
    • Running ipfs block stat --offline X would give us the data
    • Running vole bitswap check X Aaddress gives {"Found":false,"Responded":false,"Error":null}

Simulating this situation and running the check against the gateway once every 30 minutes would at least let us know whether this continues to happen, and could enable more live debugging on the node. cc @coryschwartz who's started on some scripts that might help here.
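For illustration, a periodic check along those lines could be as simple as shelling out to vole on a timer and flagging {"Responded":false} results. The sketch below only assumes the output format quoted in this thread; the gateway address, CID, and interval are placeholders, not anything that exists in production:

// Rough sketch of a periodic stall check: run `vole bitswap check` for a
// known CID against each gateway and flag responses that time out.
package main

import (
	"encoding/json"
	"log"
	"os/exec"
	"time"
)

// voleResult mirrors the JSON printed by `vole bitswap check`,
// e.g. {"Found":false,"Responded":false,"Error":null}.
type voleResult struct {
	Found     bool
	Responded bool
	Error     *string
}

func main() {
	gateways := []string{
		// placeholder multiaddr; substitute the real gateway list
		"/dns/gateway.example.net/tcp/4001/p2p/QmPeerID",
	}
	cid := "QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc" // "hello world" CID

	for range time.Tick(30 * time.Minute) {
		for _, gw := range gateways {
			out, err := exec.Command("vole", "bitswap", "check", cid, gw).Output()
			if err != nil {
				log.Printf("%s: vole failed: %v", gw, err)
				continue
			}
			var res voleResult
			if err := json.Unmarshal(out, &res); err != nil {
				log.Printf("%s: unparseable output %q: %v", gw, out, err)
				continue
			}
			if !res.Responded {
				log.Printf("%s: possible bitswap stall (no response within vole's timeout)", gw)
			}
		}
	}
}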

Static: I don't have much experience with bitswap code so I'm going to need some pointers to a couple of entry places if the analysis should go that way.

Ya, this is harder. The place I'd investigate is tracking how an inbound message gets put into a queue and then gets dropped, or stuck forever or for a very long time (IIRC I've waited over 30 minutes on a response). This means starting with the message handling code and tracing how a message enters and leaves the queue.
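One generic way to do that kind of tracing (this is not go-bitswap's actual structure; every name below is made up for illustration) is to record when each message enters and leaves the queue and periodically report anything that has been sitting there too long:

// Generic instrumentation sketch for spotting messages that enter a queue
// but never leave it within a reasonable time.
package queuewatch

import (
	"log"
	"sync"
	"time"
)

// Watcher records when messages are enqueued and reports the ones that
// have not been dequeued after a given duration.
type Watcher struct {
	mu      sync.Mutex
	pending map[string]time.Time // message ID -> time it entered the queue
}

func NewWatcher() *Watcher {
	return &Watcher{pending: make(map[string]time.Time)}
}

// Enqueued is called when a message enters the queue.
func (w *Watcher) Enqueued(id string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending[id] = time.Now()
}

// Dequeued is called when the message has been handled (or dropped).
func (w *Watcher) Dequeued(id string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	delete(w.pending, id)
}

// ReportStuck logs every message that has been waiting longer than max.
func (w *Watcher) ReportStuck(max time.Duration) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for id, t := range w.pending {
		if age := time.Since(t); age > max {
			log.Printf("message %s stuck in queue for %s", id, age)
		}
	}
}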

@coryschwartz

When I added some content to an IPFS node and tried to read it through the gateways, some of the content would return an HTTP 504 after several minutes.

I set up a second IPFS node with routing disabled and peered it only to the gateways.

What I found is that some of the gateways responded with an error some of the time. Using vole and a list of CIDs I had added earlier, I could check the status for each gateway. Sometimes, running this test results in successes for all CIDs on all gateways. Other times, a gateway will stall for several minutes before finally throwing an error.

I ran this rudimentary script to check each gateway against a set of CIDs:

GATEWAYS=(
        "/dns/gateway-bank1-am6.dwebops.net/tcp/4001/p2p/12D3KooWEfQkss14RPkFoGCby5ZRo5F2UYEMEfSAbYUdjeAE6Dk3"
        "/dns/gateway-bank1-ams1.dwebops.net/tcp/4001/p2p/QmXAdzKCUZTg9SLLRK6nrfyhJmK1S43n2y5EBeGQ7zqhfW"
        "/dns/gateway-bank1-dc13.dwebops.net/tcp/4001/p2p/12D3KooWJfq7u5cYc65HAZT3mE3irDH5KYSUdFxXTdf2KTB7FQgf"
        "/dns/gateway-bank1-mrs1.dwebops.net/tcp/4001/p2p/QmYZmw1PQXtKVeAjRfn8TnKYAurSdezRd1kdDjenzxTF6y"
        "/dns/gateway-bank1-nrt1.dwebops.net/tcp/4001/p2p/QmQ41LAQUdH6JQCeSR1qMT8qdPAkZ9XgkvqYHs9HiAdh5Z"
        "/dns/gateway-bank1-ny5.dwebops.net/tcp/4001/p2p/12D3KooWDUGVYVaJKmGR3eJCbgihEYgYAfpnZqJYzhzeaJrLxq6j"
        "/dns/gateway-bank1-sg1.dwebops.net/tcp/4001/p2p/12D3KooWGoaeFe2h4uhTEAhaN6yTccUMn5uizxvyRaWkd2x8ZYHP"
        "/dns/gateway-bank1-sjc1.dwebops.net/tcp/4001/p2p/QmXYHU1M8t86gRKvUmE42smybPKXqTPEUPDUaCVtLmS6Bw"
        "/dns/gateway-bank2-ams1.dwebops.net/tcp/4001/p2p/QmRzfM4kLwk6AjnZVmFra9RdMguWyjU4j8tctNz6dzxjxc"
        "/dns/gateway-bank2-dc13.dwebops.net/tcp/4001/p2p/12D3KooWAAW7p19cyLb39w5V5AuvX4c5gTPtQv8RuZo2tCKeFpZ7"
        "/dns/gateway-bank2-mrs1.dwebops.net/tcp/4001/p2p/QmSPz3WfZ1xCq6PCFQj3xFHAPBRUudbogcDPSMtwkQzxGC"
        "/dns/gateway-bank2-nrt1.dwebops.net/tcp/4001/p2p/QmXygeCqKjpcbT9E7jgSbAvcDnPaJp9gQDzEVFZ6zzjVc2"
        "/dns/gateway-bank2-ny5.dwebops.net/tcp/4001/p2p/12D3KooWDBprpmGqQay9f513WvQ9cZfXAtW1TiqPP2fpuEGnZsqh"
        "/dns/gateway-bank2-sg1.dwebops.net/tcp/4001/p2p/12D3KooWDK1MAZuySwEv49Zmm7JVgsUgjGwZVB8Kdzdwr7mLcNL2"
        "/dns/gateway-bank2-sjc1.dwebops.net/tcp/4001/p2p/QmVLb8bi8oAmLwrTdjGurtVGN7FaJvhXP5UM3s56uyqfxL"
)

# Check every CID in pinned.txt against every gateway in parallel; each
# gateway's output is also written to a file named after its multiaddr
# (with slashes replaced by underscores).
for gw in "${GATEWAYS[@]}"
do
        (
        echo "---------------"
        echo "checking ${gw}"
        for cid in $(cat pinned.txt)
        do
                echo -n "${cid} "
                ./vole bitswap check "$cid" "$gw" 2>&1
                echo "return status: $?"
        done
        ) | tee "$(sed 's/\//_/g' <<<"${gw}.tst")" &
done

wait

The problem doesn't seem to persist. If I have trouble hitting a CID from one of the gateways once and then try again later, the problem has resolved itself.

This might not always be the case, but when I ran this, I only saw errors on one specific gateway node, and I captured a profile of the node running there in case it's useful:

https://gist.github.com/coryschwartz/3eaae387933f16ee7b455ffd3eb6d122

One gateway server does take on more load than the rest, and it is not uncommon for the nginx response time to exhaust the full 10-minute timeout. For all gateways, the P99 nginx response time is 10 minutes (the nginx timeout), as reported by the metric location:nginx_http_response_time_ms:99p.

[chart: load averages for the gateways]

Of course, load correlates with the other metrics (throughput, etc.).

I still need to look further to understand exactly what's going on here, but I wonder if adding some load shedding to the gateways would make the service more performant for users. Nginx could be configured to try the local node first and, after a short timeout, try another node.

@schomatis
Contributor

@coryschwartz I need more information to reproduce this. Could you please expand on:

I set up a second IPFS node with routing disabled and peered it only to the gateways.

and share the pinned.txt file from your script?

@schomatis
Contributor

@aschmahmann I need more precision as to what a 'bitswap stall' means. Is it the {"Found":false,"Responded":false,"Error":null} response, meaning we are connected to the peer and receiving messages, but no message reports either having or not having the requested CID (which I assume is expected by the upcoming bitswap spec, although I would want to know within which time window)?

In that case, I can consistently get that response from the GW gateway-bank1-sjc1.dwebops.net with the "hello world" CID:

(with ipfs-shipyard/vole#1)

vole bitswap check QmQPeNsJPyVWPFDVHb77w8G42Fvo15z4bG2X8D2GhfbSXc /dns/gateway-bank1-sjc1.dwebops.net/tcp/4001/p2p/QmXYHU1M8t86gRKvUmE42smybPKXqTPEUPDUaCVtLmS6Bw

@aschmahmann
Contributor

I need more precision as to what a 'bitswap stall' means.

I guess what we're looking for is a node not responding to Bitswap requests for a long time. If the node is under severe load, then there may not be much to be done. However, if we can get through multiple rounds of wantlist messages before hearing an answer to our request, then (unless the load is on the disk) load probably isn't the issue.

Vole responding with {"Responded":false} is just a way of indicating that we hit a timeout without receiving any response at all, even though we're already connected to the peer.

It looks like that request is resolving now and I didn't have much time to see the error, but generally seeing that is a bad sign. Vole right now (without your PR) is a little impatient; I suspect adding a bit more debuggability/configurability around timeouts could be useful.

@gmasgras

I still need to look farther to understand exactly what's going on here, but I wonder if just adding some load shedding to the gateways would make the service more performent for users

We currently do load shedding by limiting the number of concurrent connections and requests so that we're not oversubscribed on CPU (load < threads).

If the node is under severe load then there may not be much to be done

The node is under heavier load compared to the others, but it's not overloaded.

@schomatis
Contributor

@aschmahmann 👍 if you agree:

The problem is the lack of response from the GW (either HAVE or DONT_HAVE) within a time window we still can't be sure of; we are temporarily agreeing on an arbitrary 10 seconds. (This is reflected in the output of the vole tool as {"Responded":false}.)

We still can't trigger this consistently. The previous tests on GW gateway-bank1-sjc1.dwebops.net are now passing (i.e., the GW responds). (Speculation: what changed is that the GW now has the CID and responds with a HAVE but was failing to report the DONT_HAVE before. I will continue to look at the code.)

@aschmahmann
Contributor

(Speculation: what changed is that the GW now has the CID and responds with a HAVE but was failing to report the DONT_HAVE before. I will continue to look at the code.)

This is possible, however when I was initially testing this the node that wasn't responding did actually have the CID already.

@obo20

obo20 commented Apr 14, 2021

I wanted to hop in here because I saw a lot of the focus getting thrown at Gateway nodes.

When we encountered this issue it wasn't just gateway nodes that were running into issues. When a storage node (a node that's storing our IPFS content long term) ran into this issue, it wouldn't serve content to any other node.

So for example:

  1. Storage Node A runs into this bug
  2. Pinata adds content to Storage Node A
  3. Gateway Node B, Storage Node C, and Storage Node D are all trying to get this content from Storage Node A
  4. None of these nodes can retrieve the content.
  5. All future content added to Storage Node A cannot be found until it is reset

As far as I'm able to tell, the issue lies with the node that has the content needing to be retrieved, not on the nodes looking to retrieve that data.

@aschmahmann
Contributor

aschmahmann commented Apr 14, 2021

As far as I'm able to tell, the issue lies with the node that has the content needing to be retrieved, not on the nodes looking to retrieve that data.

👍, yep in the gateway example above the issue was also on the node holding the data (in that case one of the gateways was holding it and another should've been retrieving it from the first gateway node).

All future content added to Storage Node A cannot be found until it is reset

This is interesting: if we're talking about the same bug, then this should make it easier to diagnose, since once it occurs it should be sticky. Perhaps 10 seconds is too low a threshold for "when is it stalled", since within an hour the issue went from detectable to non-detectable on the gateway node above.

@obo20 a few clarifying questions:

  1. How are you determining when the bug is occurring on your nodes?
  2. How often is it happening?
  3. Has this been happening even after upgrading to v0.8.0 (I recall there were some QUIC fixes for another bug you ran into)? This is mostly a check to make sure we're talking about the same bug.

@obo20

obo20 commented Apr 14, 2021

How are you determining when the bug is occurring on your nodes?

Previously we would notice that content wasn't replicating from certain nodes or wasn't able to be pulled from the gateways after upload

How often is it happening?

This was happening seemingly randomly, at first every couple of days, but as we slowly transitioned our nodes over to disabling QUIC, we noticed it didn't happen anymore. Now that we have QUIC disabled on all host nodes, we haven't run into this at all. (This may be pure coincidence, though.)

Has this been happening even after upgrading to v0.8.0 (I recall there were some QUIC fixes for another bug you ran into)? This is mostly a check to make sure we're talking about the same bug.

Yes, it happened a few times since 0.8.0 (we forgot to disable QUIC on a few nodes and have since changed that).

@RubenKelevra
Contributor

Just an idea: Maybe it's not a network issue, but a database one?

I had some issues a while ago where a node could not access some CIDs despite clearly being in possession of them, after running the GC somehow messed up the database.

After a restart the issue was gone for a while, then reappeared after some hours.

#7844

@obo20

obo20 commented Apr 17, 2021

I'm skeptical about this being related to GC. When we hit this, we hadn't run GC since restarting the node.

Unless you mean the node had been GC'd at some point in its life.

@obo20

obo20 commented Apr 19, 2021

Just ran into this again. This was a node that had accidentally been overlooked when we turned off QUIC for our network.

I ran the go-ipfs/bin/.collect-profiles script on this node and can share if needed.

Something else to consider, since this seems to be related to QUIC: almost all of our nodes have multi-terabyte datastores and are storing hundreds of thousands to millions of root CIDs. I'm not sure if this would have any effect on things, but it seemed potentially useful to point out.

@Stebalien
Member Author

I ran the go-ipfs/bin/.collect-profiles script on this node and can share if needed.

Please do. I want to see if we're stuck opening and/or reading from a stream somewhere.

@Stebalien
Member Author

@obo20 gave us a stack trace with the culprit:

goroutine 366 [select, 952 minutes]:
github.com/lucas-clemente/quic-go.(*outgoingBidiStreamsMap).OpenStreamSync(0xc02a9689a0, 0x22464e0, 0xc0018217a0, 0x0, 0x0, 0x0, 0x0)
	github.com/lucas-clemente/quic-go@v0.19.3/streams_map_outgoing_bidi.go:91 +0x245
github.com/lucas-clemente/quic-go.(*streamsMap).OpenStreamSync(0xc04c389c80, 0x22464e0, 0xc0018217a0, 0xfc0000c0262848e8, 0xb60248, 0x0, 0xc026284888)
	github.com/lucas-clemente/quic-go@v0.19.3/streams_map.go:112 +0x47
github.com/lucas-clemente/quic-go.(*session).OpenStreamSync(0xc01def4000, 0x22464e0, 0xc0018217a0, 0x1290854, 0xc000103e38, 0xc000f5c000, 0xc000227e30)
	github.com/lucas-clemente/quic-go@v0.19.3/session.go:1673 +0x48
github.com/libp2p/go-libp2p-quic-transport.(*conn).OpenStream(0xc01da0a480, 0x22464e0, 0xc0018217a0, 0x0, 0xc0243de5f8, 0xc000103e38, 0x20be368)
	github.com/libp2p/go-libp2p-quic-transport@v0.10.0/conn.go:41 +0x4b
github.com/libp2p/go-libp2p-swarm.(*Conn).NewStream(0xc01da0a500, 0x22464e0, 0xc0018217a0, 0xc01da0a500, 0xc0262849a8, 0x2, 0x2)
	github.com/libp2p/go-libp2p-swarm@v0.4.0/swarm_conn.go:180 +0x4c
github.com/libp2p/go-libp2p-swarm.(*Swarm).NewStream(0xc000103e00, 0x22464e0, 0xc0018217a0, 0xc000227e30, 0x22, 0xc000103e38, 0x20be368, 0xc026284a40, 0x12908ff)
	github.com/libp2p/go-libp2p-swarm@v0.4.0/swarm.go:354 +0x19b
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).NewStream(0xc0000ee580, 0x22464e0, 0xc0018217a0, 0xc000227e30, 0x22, 0xc000f65e00, 0x4, 0x4, 0x0, 0x0, ...)
	github.com/libp2p/go-libp2p@v0.13.0/p2p/host/basic/basic_host.go:601 +0xab
github.com/libp2p/go-libp2p/p2p/host/routed.(*RoutedHost).NewStream(0xc0016d2100, 0x22464e0, 0xc0018217a0, 0xc000227e30, 0x22, 0xc000f65e00, 0x4, 0x4, 0xc00004a800, 0xc026284c90, ...)
	github.com/libp2p/go-libp2p@v0.13.0/p2p/host/routed/routed.go:185 +0xfd
github.com/ipfs/go-bitswap/network.(*impl).newStreamToPeer(...)
	github.com/ipfs/go-bitswap@v0.3.3/network/ipfs_impl.go:329
github.com/ipfs/go-bitswap/network.(*impl).SendMessage(0xc000f3d4a0, 0x22464e0, 0xc0018217a0, 0xc000227e30, 0x22, 0x2264ca0, 0xc084968cf0, 0x8f2f12c0, 0xc08f2f12c0)
	github.com/ipfs/go-bitswap@v0.3.3/network/ipfs_impl.go:315 +0x9f
github.com/ipfs/go-bitswap.(*Bitswap).sendBlocks(0xc000201300, 0x22464e0, 0xc0018217a0, 0xc084968d80)
	github.com/ipfs/go-bitswap@v0.3.3/workers.go:116 +0xb6
github.com/ipfs/go-bitswap.(*Bitswap).taskWorker(0xc000201300, 0x22464e0, 0xc0018217a0, 0x0)
	github.com/ipfs/go-bitswap@v0.3.3/workers.go:62 +0x425
github.com/ipfs/go-bitswap.(*Bitswap).startWorkers.func1(0x225b200, 0xc0016cf920)
	github.com/ipfs/go-bitswap@v0.3.3/workers.go:25 +0x45
github.com/jbenet/goprocess.(*process).Go.func1(0xc001821da0, 0xc0016cf920, 0xc0016cf980)
	github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:134 +0x3c
created by github.com/jbenet/goprocess.(*process).Go
	github.com/jbenet/goprocess@v0.1.4/impl-mutex.go:133 +0x1fe

This is happening because QUIC blocks forever if it can't open a stream (e.g., too many open streams). So:

  1. We need to pass a context when opening streams with a relatively small timeout.
  2. QUIC should really have an internal timeout so this kind of thing can't happen.

@schomatis: the first fix is to make newStreamToPeer take a context with a short timeout (e.g., 5s?).
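A minimal sketch of that first fix, assuming the bitswap network wrapper holds a libp2p host and its supported protocol IDs (the struct and field names below are illustrative, not the actual go-bitswap code; only host.Host.NewStream's context-taking signature is relied on):

// Sketch: bound how long bitswap waits for the transport (e.g. QUIC) to
// open a new stream, instead of blocking forever.
package bsnet

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p-core/host"
	"github.com/libp2p/go-libp2p-core/network"
	"github.com/libp2p/go-libp2p-core/peer"
	"github.com/libp2p/go-libp2p-core/protocol"
)

// streamOpenTimeout is the suggested short timeout for opening a stream.
const streamOpenTimeout = 5 * time.Second

// impl stands in for go-bitswap's network implementation; the field names
// here are assumptions for the sketch.
type impl struct {
	host      host.Host
	protocols []protocol.ID
}

// newStreamToPeer opens a bitswap stream to p, giving up if the stream
// cannot be opened within streamOpenTimeout. The derived context only
// bounds the attempt to open the stream.
func (n *impl) newStreamToPeer(ctx context.Context, p peer.ID) (network.Stream, error) {
	ctx, cancel := context.WithTimeout(ctx, streamOpenTimeout)
	defer cancel()
	return n.host.NewStream(ctx, p, n.protocols...)
}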

@Stebalien
Member Author

Full stacks attached: goroutines.stacks.gz.

Take a look for blocked NewStream calls.

@Stebalien
Member Author

Note: there could be multiple issues here, but we can at least fix this one and move on.

@aschmahmann
Contributor

This should be fixed once we get go-ipfs to use a newer version of go-bitswap with ipfs/go-bitswap#477 included.

@aschmahmann aschmahmann added status/blocked Unable to be worked further until needs are met and removed need/analysis Needs further analysis before proceeding labels May 3, 2021

@BigLep BigLep removed the status/blocked Unable to be worked further until needs are met label May 17, 2021
@PierreJeanjacquot

Hi there, if this information can help track down the root issue: the problem described in ipfs/js-ipfs#3577 is currently happening on node2.preload.ipfs.io.

@Stebalien
Member Author

It's running the old version of go-ipfs (before the fixes). This'll be a good chance to check.

@aschmahmann
Contributor

Closing for now, we'll reopen if we see continued reports.
