WNFS data root updates can take > 60s with **some** accounts #489

Open
5 tasks
expede opened this issue Apr 11, 2021 · 19 comments

Comments

@expede
Member

expede commented Apr 11, 2021

Reproduced By

  • Brooke
  • Philipp

Environment

OSes: Arch, macOS
Browsers: Firefox, Safari

Ruled Out

  • IPFS Cluster
  • Web API
  • DNS
  • SDK
  • Application code
@expede
Member Author

expede commented Apr 11, 2021

The issue occurs across all browsers and OSes, but only with some accounts. For those accounts it happens consistently.

This looks like something happening on the server-side, though we still need to validate that the request is actually sent and not blocked by a service worker or similar.


From our conversation, we had a hypothesis that it could be UCAN validation when the UCAN is deeply nested. The benchmarks suggest that's not the case, so we can rule it out:

    RSA-based UCAN checks
      One   Level:     OK (1.77s)
         53 μs ± 4.1 μs
      Two   Levels:    OK (4.70s)
         72 μs ± 3.8 μs
      Three Levels:    OK (3.93s)
        118 μs ± 3.8 μs

@matheus23
Member

Hypothesis: This is related to general file system size.

I have this hunch because there should be lots of history entries, due to the many publish calls (which were made automatically in flatmate whenever any change occurred).

Not sure how best to test this. Here are a couple of ideas:

  1. Create a new account, spam publish calls and changes, and log how long the PATCH requests take (see the sketch after this list). The hypothesis is confirmed if we see a steady increase in PATCH request times.
  2. Log ipfs request performance server-side.
  3. Copy the matheus23-drive file system (by grabbing the DNSLink CID) and read key over to another account and test whether it still happens. The reasoning: this is very likely something happening on the server, and the only account we can currently reproduce it with lives on production, so ideally we'd reproduce it somewhere else. It also double-checks that it's nothing related to that particular account.
  4. Try pulling the account data and looking at the actual size? We have no idea how big it actually is: IPFS only gives us the inflated 533 Petabytes number.
  5. Override the user's file system with an empty one (ideally keeping a backup of the old one locally).
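
A minimal sketch of what the measurement in idea 1 could look like, assuming a webnative-style fs with write() and publish() (the exact method names may differ in the SDK version we're on):

async function measurePublishTimes(fs, iterations = 50) {
  const timings = []
  for (let i = 0; i < iterations; i++) {
    // Make a small change so each publish produces a new data root
    await fs.write(`private/perf-test/${i}.txt`, `change ${i}`)

    const start = performance.now()
    await fs.publish() // the data root PATCH happens as part of this
    const elapsed = performance.now() - start

    timings.push(elapsed)
    console.log(`publish #${i}: ${Math.round(elapsed)}ms`)
  }
  return timings
}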

This looks like something happening on the server-side, though we still need to validate that the request is actually sent and not blocked by a service worker or similar.

I just tested that quickly by bypassing everything service-worker related in Chromium; no change. It seems unrelated to service workers.

However, it could definitely still be a client-side issue! It could be js-ipfs on the client behaving weirdly, causing changes to be uploaded to the server very slowly. If js-ipfs were uploading changes very slowly, the server would be waiting for the client's CIDs to finish replicating, which would stall the PATCH request.
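
One cheap way to check that from the browser console would be to watch the node's bitswap counters while a PATCH is stalled. A sketch from memory of the js-ipfs core API (field names may differ slightly in the version we ship):

async function watchBitswap(ipfs, intervalMs = 5000) {
  // If dataSent barely moves while the server's peer stays connected,
  // the browser node is reachable but serving blocks very slowly.
  setInterval(async () => {
    const stat = await ipfs.bitswap.stat()
    const peers = await ipfs.swarm.peers()
    console.log({
      blocksSent: stat.blocksSent.toString(),
      dataSent: stat.dataSent.toString(),
      wantlistLength: stat.wantlist.length,
      connectedPeers: peers.length
    })
  }, intervalMs)
}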

@matheus23
Member

Try pulling the account data and looking at the actual size? We have no idea how big it actually is: IPFS only gives us the inflated 533 Petabytes number.

Turns out this is actually fairly easy:

# My laptop's repo size before:
$ ipfs repo stat
NumObjects: 2191
RepoSize:   88415581
StorageMax: 10000000000
RepoPath:   /home/philipp/.ipfs
Version:    fs-repo@10
# Just fetch my DNSLink:
$ ipfs swarm connect /dns4/node.runfission.com/tcp/4001/ipfs/QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw
connect QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw success
$ time ipfs pin add bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq
pinned bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq recursively

real	5m43,197s
user	0m0,416s
sys	0m0,078s
# Repo size after
$ ipfs repo stat
NumObjects: 29066
RepoSize:   487033005
StorageMax: 10000000000
RepoPath:   /home/philipp/.ipfs
Version:    fs-repo@10

So the change is 487033005 − 88415581 = 398617424 bytes ≈ 398.6 MB.

@matheus23
Member

I've listed the history for the public and private file systems:

list = await (await fs.get("private/")).history.list(10000)
publicList = await (await fs.get("public/")).history.list(10000)
list.length
// -> 244
publicList.length
// -> 1810

Listing the public history took quite a while!


I think I know how I got so many public history entries: I've uploaded lots of fission.apps via the dashboard. To do that, the dashboard first adds all files to be uploaded to the user's public file system one-by-one. This causes a history entry for every single file uploaded.
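
Just to illustrate the grouping idea (fs.write and fs.batch here are hypothetical placeholders, not the actual SDK surface):

// What the dashboard does today, roughly: each file added on its own
// ends up as its own history entry
for (const file of appFiles) {
  await fs.write(`public/Apps/myapp/${file.path}`, file.content)
}

// Grouped: stage all the writes, then commit a single revision
// (and thus a single history entry) for the whole upload
await fs.batch(tx => {
  for (const file of appFiles) {
    tx.write(`public/Apps/myapp/${file.path}`, file.content)
  }
})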

@matheus23
Member

One more thing I tested: I changed the matheus23-drive WNFS and pinned the new DNSLink to test how fast updates would be:

$ time ipfs pin add bafybeiah26dwevhoogdaefmrpxi5o24e2nb7kz37rxjo63rh2jq7vfooii
pinned bafybeiah26dwevhoogdaefmrpxi5o24e2nb7kz37rxjo63rh2jq7vfooii recursively

real	0m3,986s
user	0m0,322s
sys	0m0,074s

Answer: Fast 🤷‍♂️

@expede
Member Author

expede commented Apr 13, 2021

@matheus23 knowing these numbers is awesome 👏 Thanks for designing an experiment & spot testing this!

5m43s ≈ 343s for 398.6 MB
So, depending on how you want to look at it: about 1.16 MB/s, or 0.86 s/MB.

That's not great, but not terrible either; it works out to very roughly 10 Mbps. (That said, my fiber connection is more like 280 Mbps, so YMMV 😛 #fiberbrag)
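
For reference, the arithmetic behind those figures:

const megabytes = 398.6
const seconds = 5 * 60 + 43              // 343 s
const mbPerSecond = megabytes / seconds  // ≈ 1.16 MB/s
const secondsPerMb = seconds / megabytes // ≈ 0.86 s/MB
const megabits = mbPerSecond * 8         // ≈ 9.3 Mbps, i.e. "very roughly 10 Mbps"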

I think I know how I got so many public history entries: I've uploaded lots of fission.apps via the dashboard. To do that, the dashboard first adds all files to be uploaded to the user's public file system one-by-one. This causes a history entry for every single file uploaded.

This isn't necessarily a bad default, but yes, grouping the files makes sense, and is the plan 👍

real 0m3,986s
user 0m0,322s
sys 0m0,074s

Answer: Fast 🤷‍♂️

Yeah, this is more in line with what I would expect: only syncing the changes.

I wonder if our hosted nodes are having a hard time fetching the data from the browser for some reason? Does upload get throttled? Do they have a hard time connecting to the browser's node? We can see the upload timings in the logs with streaming updates (we set them to be pretty granular while we sort out a few things).

@expede
Member Author

expede commented Apr 13, 2021

Listing the public history took quite a while!

What was the order of magnitude? Seconds? Minutes? Hours?

@expede
Member Author

expede commented Apr 13, 2021

$ ipfs swarm connect /dns4/node.runfission.com/tcp/4001/ipfs/QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw
connect QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw success
$ time ipfs pin add bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq

I ran this just to see. It's been going for... quite some time (maybe half an hour so far?). I'll post if/when it ends 😛 ipfs repo stat indicates that something is happening

@expede
Member Author

expede commented Apr 13, 2021

[Screenshot: 2021-04-12, 7:09 PM]

ipfs repo also agrees on ~393.5MB

Okay, so the same experiment, similar results for the diff. 52 minutes to download the ~400MB on my fancy fast internet (130kB/s... YIKES).

Fetching the updated version which contains the first took all of one second and some change, as expected. So it's fast on diffs ⚡

@expede
Member Author

expede commented Apr 13, 2021

More odd data: looked at CPU and network on AWS during those times. Essentially flat lines of low utilization.

@matheus23
Member

More odd data: looked at CPU and network on AWS during those times. Essentially flat lines of low utilization.

This might be an indication that the issue is on the js-ipfs side, not the go-ipfs side on the server; i.e. all the "bitswapping" is being slow in the browser.

I talked this through with James today. We don't really see why small file changes should cause lots of CIDs to be requested.

What was the order of magnitude? Seconds? Minutes? Hours?

Roughly a little more than 30 minutes, but I haven't timed it.

@expede
Member Author

expede commented Apr 13, 2021

cc @walkah we need to figure out why go-ipfs is running so slowly... sometimes? Can we tweak anything on AWS? Why does network out spike much higher sometimes? Lots of questions!

@expede
Member Author

expede commented Apr 16, 2021

Just referencing here in case it turns out to be related: ipfs/js-ipfs#2197

expede changed the title from "WNFS data root updates can take > 60s with _some accounts._" to "WNFS data root updates can take > 60s with **some** accounts" on Apr 16, 2021
@matheus23
Member

Just referencing here in case it turns out to be related

Yeah, I found that too while searching, but it turns out to be a stale issue: there's a merged PR linked in it that landed in js-ipfs 0.42, and we're on js-ipfs 0.52.2.

@expede
Member Author

expede commented Apr 19, 2021

Another related instance from this weekend:

  • Attempted to upload a ~68MB video to both staging and production Drive. It never finished (I waited a couple of hours).
  • Refreshed the page and uploaded a ~6.5MB still image; it finished in a couple of seconds.
    • Tested: able to view the image in a different browser 👍
  • Tried the video again, and it never finished 👎

Is part of the bottleneck here the browser trying to load the large asset from the file system? I need to look closer.


I was able to reproduce it just now:

[Screenshot: 2021-04-19, 10:18 AM]

@expede
Member Author

expede commented Apr 21, 2021

The plot thickens. Via the CLI (i.e. not in the browser: go-ipfs, sparse DAG structure, fiber connection), a FreeBSD ISO (~1.3 GB) added & uploaded in 8m13s 🤷‍♀️

@matheus23
Member

matheus23 commented Apr 22, 2021

Another related instance from this weekend:

I'm not quite sure this is the same issue. Is the video upload stuck on the PATCH request for the data root?

Just to be clear, this is not the originally affected account, right? And it also only happens with bigger changes, right? The original issue was already triggered by small changes, so I think it's a separate issue.

@expede
Member Author

expede commented Apr 22, 2021

I'm not quite sure this is the same issue. Is the video upload stuck on the PATCH request for the data root?

I was previously able to reproduce the speed issue with the method above; now it's working quickly again. They could absolutely be separate issues, but I'm listing it here because the symptoms overlap / seem highly correlated.

@matheus23
Member

I had some code lying around because I've been looking at similar issues.
I've looked at how many hashes the matheus23-drive DNSLink DAG links to in total, using this Node script:

const Ipfs = require('ipfs')

// Walk the DAG from a root CID and collect every distinct linked hash.
async function dagHashes(ipfs, dagHash) {
    let hashes = new Set()
    hashes.add(dagHash)

    async function lookup(hash) {
        const links = await ipfs.object.links(hash, { timeout: 10000 })

        for (const link of links) {
            const h = link.Hash.toString()
            if (hashes.has(h)) {
                continue // already visited; skip to the next link
            }
            console.log("Found hash", h)
            hashes.add(h)
            await lookup(h)
        }
    }

    await lookup(dagHash)

    return hashes
}

async function test4() {
    const ipfs = await Ipfs.create()
    // Connect directly to the Fission production node so blocks resolve quickly
    await ipfs.swarm.connect("/dns4/production-ipfs-cluster-eu-north-1-node0.runfission.com/tcp/4001/p2p/12D3KooWDTUTdVJfW7Rwb6kKhceEwevTatPXnavPwkfZp2A6r1Fn")
    console.log("connected")

    let hashes = await dagHashes(ipfs, "bafybeien345pu3ygj7h35icegiydak6havc3ikfg7lma2zauisbh4mg5ty")
    console.log("Hashes", hashes.size)
}

Output contains:

Hashes 6521
