WNFS data root updates can take > 60s with **some** accounts #489

Open
5 tasks
expede opened this issue Apr 11, 2021 · 19 comments

Comments

@expede
Member

expede commented Apr 11, 2021

Reproduced By

  • Brooke
  • Philipp

Environment

OSes: Arch, macOS
Browsers: Firefox, Safari

Ruled Out

  • IPFS Cluster
  • Web API
  • DNS
  • SDK
  • Application code
@expede
Member Author

expede commented Apr 11, 2021

The issue occurs across all browsers and OSes, but only with some accounts. For those accounts it happens consistently.

This looks like something happening on the server-side, though we still need to validate that the request is actually sent and not blocked by a service worker or similar.


From our conversation, we had a hypothesis that it could be UCAN validation when the UCAN is deeply nested. The benchmarks suggest that's not the case, so we can rule it out:

    RSA-based UCAN checks
      One   Level:     OK (1.77s)
         53 μs ± 4.1 μs
      Two   Levels:    OK (4.70s)
         72 μs ± 3.8 μs
      Three Levels:    OK (3.93s)
        118 μs ± 3.8 μs

@matheus23
Member

Hypothesis: This is related to general file system size.

I have this hunch because there should be lots of history entries, due to the many publish calls (which were made automatically in flatmate whenever any change occurred).

Not sure how best to test this. Here are a couple of ideas:

  1. Create a new account, spam publish calls and changes, and log how long the PATCH requests take (see the sketch after this list). The hypothesis is confirmed if we see a steady increase in PATCH request times.
  2. Log ipfs request performance server-side.
  3. Copy the matheus23-drive file system (by grabbing the DNSLink CID) and read key over to another account and test whether it still happens. The reasoning: this is very likely something happening on the server, and the only account we can currently reproduce it with lives on production, so ideally we'd reproduce it somewhere else. It also double-checks that it's nothing related to that particular account.
  4. Try pulling the account data and looking at the actual size? We have no idea how big it actually is: IPFS only gives us the inflated 533 Petabytes number.
  5. Override the user's file system with an empty one (ideally keeping a backup of the old one locally).
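
A minimal sketch of what the measurement in idea 1 could look like, assuming a webnative-style fs with write() and publish() (the exact method names may differ in the SDK version we're on):

async function measurePublishTimes(fs, iterations = 50) {
  const timings = []
  for (let i = 0; i < iterations; i++) {
    // Make a small change so each publish produces a new data root
    await fs.write(`private/perf-test/${i}.txt`, `change ${i}`)

    const start = performance.now()
    await fs.publish() // the data root PATCH happens as part of this
    const elapsed = performance.now() - start

    timings.push(elapsed)
    console.log(`publish #${i}: ${Math.round(elapsed)}ms`)
  }
  return timings
}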

This looks like something happening on the server-side, though we still need to validate that the request is actually sent and not blocked by a service worker or similar.

I just tested that quickly by bypassing everything service-worker related in Chromium; no change. It seems unrelated to service workers.

However, it could definitely still be a client-side issue! It could be js-ipfs on the client behaving weirdly, causing changes to be uploaded to the server very slowly. If js-ipfs were uploading changes very slowly, the server would be waiting for the client's CIDs to finish replicating, which would stall the PATCH request.
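
One cheap way to check that from the browser console would be to watch the node's bitswap counters while a PATCH is stalled. A sketch from memory of the js-ipfs core API (field names may differ slightly in the version we ship):

async function watchBitswap(ipfs, intervalMs = 5000) {
  // If dataSent barely moves while the server's peer stays connected,
  // the browser node is reachable but serving blocks very slowly.
  setInterval(async () => {
    const stat = await ipfs.bitswap.stat()
    const peers = await ipfs.swarm.peers()
    console.log({
      blocksSent: stat.blocksSent.toString(),
      dataSent: stat.dataSent.toString(),
      wantlistLength: stat.wantlist.length,
      connectedPeers: peers.length
    })
  }, intervalMs)
}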

@matheus23
Member

Try pulling the account data and looking at the actual size? We have no idea how big it actually is: IPFS only gives us the inflated 533 Petabytes number.

Turns out this is actually fairly easy:

# My laptop's repo size before:
$ ipfs repo stat
NumObjects: 2191
RepoSize:   88415581
StorageMax: 10000000000
RepoPath:   /home/philipp/.ipfs
Version:    fs-repo@10
# Just fetch my DNSLink:
$ ipfs swarm connect /dns4/node.runfission.com/tcp/4001/ipfs/QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw
connect QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw success
$ time ipfs pin add bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq
pinned bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq recursively

real	5m43,197s
user	0m0,416s
sys	0m0,078s
# Repo size after
$ ipfs repo stat
NumObjects: 29066
RepoSize:   487033005
StorageMax: 10000000000
RepoPath:   /home/philipp/.ipfs
Version:    fs-repo@10

So the change is 487033005 − 88415581 = 398617424 bytes ≈ 398.6 MB.

@matheus23
Member

I've listed the history for the public and private file systems:

list = await (await fs.get("private/")).history.list(10000)
publicList = await (await fs.get("public/")).history.list(10000)
list.length
// -> 244
publicList.length
// -> 1810

Listing the public history took quite a while!


I think I know how I got so many public history entries: I've uploaded lots of fission.apps via the dashboard. To do that, the dashboard first adds all files to be uploaded to the user's public file system one-by-one. This causes a history entry for every single file uploaded.
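
Just to illustrate the grouping idea (fs.write and fs.batch here are hypothetical placeholders, not the actual SDK surface):

// What the dashboard does today, roughly: each file added on its own
// ends up as its own history entry
for (const file of appFiles) {
  await fs.write(`public/Apps/myapp/${file.path}`, file.content)
}

// Grouped: stage all the writes, then commit a single revision
// (and thus a single history entry) for the whole upload
await fs.batch(tx => {
  for (const file of appFiles) {
    tx.write(`public/Apps/myapp/${file.path}`, file.content)
  }
})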

@matheus23
Member

One more thing I tested: I changed the matheus23-drive WNFS and pinned the new DNSLink to test how fast updates would be:

$ time ipfs pin add bafybeiah26dwevhoogdaefmrpxi5o24e2nb7kz37rxjo63rh2jq7vfooii
pinned bafybeiah26dwevhoogdaefmrpxi5o24e2nb7kz37rxjo63rh2jq7vfooii recursively

real	0m3,986s
user	0m0,322s
sys	0m0,074s

Answer: Fast 🤷‍♂️

@expede
Member Author

expede commented Apr 13, 2021

@matheus23 knowing these numbers is awesome 👏 Thanks for designing an experiment & spot testing this!

5m43s ≈ 343s for 398.6 MB
So, depending on how you want to look at it: about 1.16 MB/s, or 0.86 s/MB.

That's not great, but not terrible either; it works out to very roughly 10 Mbps. (That said, my fiber connection is more like 280 Mbps, so YMMV 😛 #fiberbrag)
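
For reference, the arithmetic behind those figures:

const megabytes = 398.6
const seconds = 5 * 60 + 43              // 343 s
const mbPerSecond = megabytes / seconds  // ≈ 1.16 MB/s
const secondsPerMb = seconds / megabytes // ≈ 0.86 s/MB
const megabits = mbPerSecond * 8         // ≈ 9.3 Mbps, i.e. "very roughly 10 Mbps"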

I think I know how I got so many public history entries: I've uploaded lots of fission.apps via the dashboard. To do that, the dashboard first adds all files to be uploaded to the user's public file system one-by-one. This causes a history entry for every single file uploaded.

This isn't necessarily a bad default, but yes, grouping the files makes sense, and is the plan 👍

real 0m3,986s
user 0m0,322s
sys 0m0,074s

Answer: Fast 🤷‍♂️

Yeah, this is more in line with what I would expect: only syncing the changes.

I wonder if our hosted nodes are having a hard time fetching the data from the browser for some reason? Does upload get throttled? Do they have a hard time connecting to the browser's node? We can see the upload timings in the logs with streaming updates (we set them to be pretty granular while we sort out a few things).

@expede
Member Author

expede commented Apr 13, 2021

Listing the public history took quite a while!

What was the order of magnitude? Seconds? Minutes? Hours?

@expede
Member Author

expede commented Apr 13, 2021

$ ipfs swarm connect /dns4/node.runfission.com/tcp/4001/ipfs/QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw
connect QmVLEz2SxoNiFnuyLpbXsH6SvjPTrHNMU88vCQZyhgBzgw success
$ time ipfs pin add bafybeihtsvlsojwo6gi6ea5jo73uwaqla7rcbjmhniul5i7cxkle4c66fq

I ran this just to see. It's been going for... quite some time (maybe half an hour so far?). I'll post if/when it ends 😛 ipfs repo stat indicates that something is happening

@expede
Member Author

expede commented Apr 13, 2021

[Screenshot: 2021-04-12, 7:09 PM]

ipfs repo also agrees on ~393.5MB

Okay, so the same experiment, similar results for the diff. 52 minutes to download the ~400MB on my fancy fast internet (130kB/s... YIKES).

Fetching the updated version which contains the first took all of one second and some change, as expected. So it's fast on diffs ⚡

@expede
Member Author

expede commented Apr 13, 2021

More odd data: looked at CPU and network on AWS during those times. Essentially flat lines of low utilization.

@matheus23
Member

More odd data: looked at CPU and network on AWS during those times. Essentially flat lines of low utilization.

This might be an indication that the issue is on the js-ipfs side, not the go-ipfs side on the server; i.e. all the "bitswapping" is being slow in the browser.

I talked this through with James today. We don't really see why small file changes should cause lots of CIDs to be requested.

What was the order of magnitude? Seconds? Minutes? Hours?

Roughly a little more than 30 minutes, but I haven't timed it.

@expede
Member Author

expede commented Apr 13, 2021

cc @walkah we need to figure out why go-ipfs is running so slowly... sometimes? Can we tweak anything on AWS? Why does network out spike much higher sometimes? Lots of questions!

@expede
Member Author

expede commented Apr 16, 2021

Just referencing here in case it turns out to be related: ipfs/js-ipfs#2197

expede changed the title from "WNFS data root updates can take > 60s with _some accounts._" to "WNFS data root updates can take > 60s with **some** accounts" on Apr 16, 2021
@matheus23
Member

Just referencing here in case it turns out to be related

Yeah, I found that too while searching, but it turns out to be a stale issue: there's a merged PR linked in it that landed in js-ipfs 0.42, and we're on js-ipfs 0.52.2.

@expede
Member Author

expede commented Apr 19, 2021

Another related instance from this weekend:

  • Attempted to upload a ~68MB video to both staging and production Drive. It never finished (I waited a couple of hours).
  • Refreshed the page and uploaded a ~6.5MB still image; it finished in a couple of seconds.
    • Tested: able to view the image in a different browser 👍
  • Tried the video again, and it never finished 👎

Is part of the bottleneck here the browser trying to load the large asset from the file system? I need to look closer.


I was able to reproduce it just now:

[Screenshot: 2021-04-19, 10:18 AM]

@expede
Member Author

expede commented Apr 21, 2021

The plot thickens. Via the CLI (i.e. not in the browser: go-ipfs, sparse DAG structure, fiber connection), a FreeBSD ISO (~1.3 GB) added & uploaded in 8m13s 🤷‍♀️

@matheus23
Member

matheus23 commented Apr 22, 2021

Another related instance from this weekend:

I'm not quite sure this is the same issue. Is the video upload stuck on the PATCH request for the data root?

Just to be clear, this is not the originally affected account, right? And it also only happens with bigger changes, right? The original issue was already triggered by small changes, so I think it's a separate issue.

@expede
Member Author

expede commented Apr 22, 2021

I'm not quite sure this is the same issue. Is the video upload stuck on the PATCH request for the data root?

I was previously able to reproduce the speed issue with the method above; now it's working quickly again. They could absolutely be separate issues, but I'm listing it here because the symptoms overlap / seem highly correlated.

@matheus23
Member

I had some code lying around because I've been looking at similar issues.
I've looked at how many hashes the matheus23-drive DNSLink DAG links to in total, using this Node script:

const Ipfs = require('ipfs')

// Walk the DAG from a root CID and collect every distinct linked hash.
async function dagHashes(ipfs, dagHash) {
    let hashes = new Set()
    hashes.add(dagHash)

    async function lookup(hash) {
        const links = await ipfs.object.links(hash, { timeout: 10000 })

        for (const link of links) {
            const h = link.Hash.toString()
            if (hashes.has(h)) {
                continue // already visited; skip to the next link
            }
            console.log("Found hash", h)
            hashes.add(h)
            await lookup(h)
        }
    }

    await lookup(dagHash)

    return hashes
}

async function test4() {
    const ipfs = await Ipfs.create()
    // Connect directly to the Fission production node so blocks resolve quickly
    await ipfs.swarm.connect("/dns4/production-ipfs-cluster-eu-north-1-node0.runfission.com/tcp/4001/p2p/12D3KooWDTUTdVJfW7Rwb6kKhceEwevTatPXnavPwkfZp2A6r1Fn")
    console.log("connected")

    let hashes = await dagHashes(ipfs, "bafybeien345pu3ygj7h35icegiydak6havc3ikfg7lma2zauisbh4mg5ty")
    console.log("Hashes", hashes.size)
}

Output contains:

Hashes 6521
