possible bitswap stall issue #5183

Closed
whyrusleeping opened this issue Jul 3, 2018 · 14 comments
Labels
kind/bug A bug in existing code (including security flaws) topic/bitswap Topic bitswap

Comments

@whyrusleeping
Member

In IRC, @fiatjaf reported that one of his ipfs nodes running 0.4.15 on a VPS was stalling while trying to list a particular directory. I confirmed that all the data was accessible, and even fetched it all to my local node. He then connected that VPS peer to my node, and it still couldn't fetch the data. The peer's wantlist showed a single hash that my node definitely had. If the nodes were actually connected successfully, then this implies a possible bug in bitswap.

Further questions I have here are around whether or not the fetch was using sessions. Getting a stack dump of any node in this position would be nice too.

@fiatjaf

fiatjaf commented Jul 3, 2018

Actually, the wantlist output I pasted was from after I had restarted the node and the problem was solved. I don't know how it was before, or how many nodes were in the wantlist.

@skliarie

skliarie commented Jul 5, 2018

I have the same story with a number of hashes. If I use ipfs-go (with Chrome's ipfs-companion), the hash does not download. If I disable ipfs-companion (thus enabling ipfs-js), the download works fast.
How can I debug the issue or provide you with the necessary logs?

@Stebalien
Member

@skliarie

  • Are there any hashes listed in ipfs bitswap wantlist? Does your other machine have these hashes?
  • What's the output of ipfs swarm peers --streams?
  • What's the peer ID of the peer with the given hash?
  • Follow the instructions here. That'll help us determine if there are any odd deadlocks.

Also, try to reproduce with the latest release candidate running on both machines: https://dist.ipfs.io/go-ipfs/v0.4.16-rc1
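
The first two questions above can be answered in one pass. What follows is only a minimal sketch of a helper that runs `ipfs bitswap wantlist` and `ipfs swarm peers --streams` (the two commands named in this comment) and prints a combined report. The `runner` type and the `collect` and `execRunner` functions are hypothetical names introduced for illustration; they are not part of go-ipfs.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runner executes a command and returns its output.
// It is injectable so the report logic can be exercised without a live node.
type runner func(name string, args ...string) (string, error)

// execRunner runs the command for real and returns stdout and stderr combined.
func execRunner(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

// collect runs the two diagnostic commands requested in the thread and
// returns a single text report with each command line followed by its output.
func collect(run runner) string {
	cmds := [][]string{
		{"ipfs", "bitswap", "wantlist"},
		{"ipfs", "swarm", "peers", "--streams"},
	}
	var b strings.Builder
	for _, c := range cmds {
		fmt.Fprintf(&b, "$ %s\n", strings.Join(c, " "))
		out, err := run(c[0], c[1:]...)
		if err != nil {
			// Record the failure but keep going so the report stays complete.
			fmt.Fprintf(&b, "error: %v\n", err)
		}
		b.WriteString(out)
		if !strings.HasSuffix(out, "\n") {
			b.WriteString("\n")
		}
	}
	return b.String()
}

func main() {
	fmt.Print(collect(execRunner))
}
```

Running it on both machines and attaching the two reports gives a side-by-side view of each node's wantlist and open streams.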

@skliarie

skliarie commented Jul 6, 2018

  1. (I ran ipfs get in a separate window):

ipfs@ipfs1:~$ ipfs bitswap wantlist
QmbwEsezethaQhtrUosVCQFccn7Ze6KSENo2hnXv3aXfKP

  2. Output of "ipfs swarm peers --streams":
    ipfs_swarm_peers_streams.gz

  3. No idea how to get the ID of the peer with the given hash. You can find it yourself, the hash is publicly available (e.g. ipfs-js can fetch it quickly).

  4. Find attached. Using ipfs 0.4.15
    ipfs.sysinfo.gz
    ipfs.stacks.gz
    ipfs.heap.gz
    ipfs.gz
    ipfs.cpuprof.gz

  5. Will try ipfs 0.4.16-rc1 shortly.

@skliarie

skliarie commented Jul 6, 2018

Tested on 0.4.16-rc1 (amd64), same problem.
Attached debug data:
ipfs.sysinfo.gz
ipfs.stacks.gz
ipfs.heap.gz
ipfs.cpuprof.gz

@Stebalien
Member

> No idea how to get ID of the peer with the given hash. You can find it yourself, the hash is publicly available (e.g. ipfs-js can fetch it quickly).

Ah... So, js-ipfs doesn't use the DHT to find or announce content (last time I checked). I'm guessing:

  1. Nobody has announced that they have the hash in question to the DHT. A quick ipfs dht findprovs QmbwEsezethaQhtrUosVCQFccn7Ze6KSENo2hnXv3aXfKP yields no results.
  2. Your js-ipfs node is getting automatically connected to the node with the content in question.

Looking at the debug info, I don't see any obvious deadlocks/issues. Without knowing which node has the hash, it's a bit difficult to tell where the issue is.

@skliarie

skliarie commented Jul 7, 2018

Something strange is going on. Here is another interesting hash:
QmbeSiN8d7wxfonK5ahikVnwYmhJw14gfW4uNjrm8UEjW3

ipfs-js finds it pretty quickly, but ipfs-go has problems with it. Somehow, my public node QmTtggHgG1tjAHrHfBDBLPmUvn5BwNRpZY4qMJRXnQ7bQj (0.4.16-rc1) managed to download it in the past (e.g. "ipfs get" works), but getting it from another node (also 0.4.16-rc1) does not:

$ ipfs dht findprovs QmbeSiN8d7wxfonK5ahikVnwYmhJw14gfW4uNjrm8UEjW3
Error: routing service is not a DHT

What is going on? How could it be that ipfs-go and ipfs-js have different (incompatible?) routing services?

@Stebalien
Member

That's a new bug, fixed in #5200. Could very well have caused the issue. To work around it, you can disable IPNS over pubsub.

However, that also wouldn't (as far as I know) be responsible for the original bug.

@Stebalien
Member

One potential cause is a peer restart. That is, if one of the peers restarts but the other sees the new connection before seeing the old connection close, it won't re-send the wantlist.

We can fix this by either:

  1. Keeping state per stream (sending the entire wantlist every time we open a new stream).
  2. Using some form of bitswap "session" ID.
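
The restart race and the first proposed fix can be illustrated with a toy model. This is only a sketch under stated assumptions: the `remote` and `sender` types, their methods, and the `QmExample` placeholder hash are all hypothetical and do not reflect bitswap's actual code. It assumes the sender sends its wantlist only when it goes from zero to one tracked connection, which is one way the described behaviour could arise.

```go
package main

import "fmt"

// remote models the other peer's process. It forgets every want it has
// received whenever it restarts.
type remote struct {
	wants map[string]bool
}

func (r *remote) restart() { r.wants = map[string]bool{} }

func (r *remote) receive(wantlist []string) {
	for _, c := range wantlist {
		r.wants[c] = true
	}
}

// sender models the local side. It counts the connections it believes it
// has to the remote peer.
type sender struct {
	wantlist      []string
	conns         int
	resendPerConn bool // option 1: send the full wantlist on every new connection
}

func (s *sender) onConnect(r *remote) {
	s.conns++
	// Without the fix, the wantlist goes out only for the first connection.
	if s.conns == 1 || s.resendPerConn {
		r.receive(s.wantlist)
	}
}

func (s *sender) onDisconnect() {
	if s.conns > 0 {
		s.conns--
	}
}

// simulateRestart replays the problematic ordering and reports whether the
// restarted remote ends up knowing about the want.
func simulateRestart(resendPerConn bool) bool {
	r := &remote{wants: map[string]bool{}}
	s := &sender{wantlist: []string{"QmExample"}, resendPerConn: resendPerConn}

	s.onConnect(r)   // initial connection: wantlist is sent
	r.restart()      // remote restarts and loses its state
	s.onConnect(r)   // new connection is observed first
	s.onDisconnect() // the old connection's close is noticed late

	return r.wants["QmExample"]
}

func main() {
	fmt.Println("first-connection-only:", simulateRestart(false)) // false
	fmt.Println("resend per connection:", simulateRestart(true))  // true
}
```

In the first run the restarted peer never learns of the want, because the sender still counted the stale connection when the new one arrived. The second option (a session ID) would instead let the sender detect that the remote's state was reset, at the cost of a protocol change.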

@ninkisa

ninkisa commented Feb 19, 2019

Hello, is there a fix for this? We have a similar problem running ipfs in a private network.

At one point ipfs just stops downloading
We are using ipfs-go and ipfs is running in a docker container
[machine02]$ docker exec ipfs_container ipfs version
ipfs version 0.4.18

Result from "ipfs bitswap wantlist" and ipfs swarm peers --streams
ipfs_bitswap.txt

Debug logs
ipfs_stacks.zip

Could the use of the QUIC protocol be the cause?

Thanks in advance for the support

Stebalien added the kind/bug and topic/bitswap labels on Feb 19, 2019
@Stebalien
Member

QUIC shouldn't affect this. We're going to cut a new release ASAP, probably by the end of the week, with a completely refactored bitswap, so let's see what that does for this.

@Stebalien
Member

New information: @mattober has run into this issue. He has two nodes: A gateway and a "host" (storing the data).

The gateway shows two connections to the host, one IPv4, one IPv6. The IPv4 connection has an open DHT stream and the IPv6 connection has an open DHT stream (!?) and an open bitswap stream.

The host shows one connection to the gateway (IPv4). This connection has an open DHT stream and an open relay stream (!?) and no bitswap stream.

@Stebalien
Member

@hannahhoward
Contributor

I feel like there is no specific actionable information on a current version of bitswap to work with here, given that it has been almost completely rewritten since 0.4.15 and the only current potential problem referenced is peers not resending wantlists on reconnect. My belief is that we've addressed this as best we can with the periodic wantlist rebroadcast. Beyond that, there's really no further improvement to make short of error correction in the protocol. So I am inclined to close this issue, with the understanding for anyone following it that we are still pursuing avenues to address potential stalls on an ongoing basis as we identify potential issues in current code.
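
To make the idea of a periodic wantlist rebroadcast concrete, here is a minimal sketch of how such a loop could recover from a lost want. It is not bitswap's implementation: `peerState`, `rebroadcast`, `demoRecovers`, the 5 ms interval and the `QmExample` placeholder hash are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// peerState is a hypothetical stand-in for a remote peer's view of our wantlist.
type peerState struct {
	mu    sync.Mutex
	wants map[string]bool
}

func (p *peerState) receive(wantlist []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, c := range wantlist {
		p.wants[c] = true
	}
}

func (p *peerState) has(c string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.wants[c]
}

// rebroadcast resends the full wantlist to every peer on each tick until
// ctx is cancelled.
func rebroadcast(ctx context.Context, interval time.Duration, wantlist []string, peers []*peerState) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range peers {
				p.receive(wantlist)
			}
		}
	}
}

// demoRecovers starts with a peer that has no record of our want, as if the
// original message was lost, and reports whether the rebroadcast loop
// repairs that within two seconds.
func demoRecovers() bool {
	p := &peerState{wants: map[string]bool{}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go rebroadcast(ctx, 5*time.Millisecond, []string{"QmExample"}, []*peerState{p})

	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if p.has("QmExample") {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

func main() {
	fmt.Println("want recovered by rebroadcast:", demoRecovers())
}
```

The trade-off in this kind of design is between recovery latency and redundant traffic: a shorter interval repairs a lost want sooner but resends unchanged wantlists more often.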
