This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

fix: performance improvements #107

Merged

vasco-santos merged 20 commits into master from fix/performance on May 8, 2019

Conversation

@jacobheun (Contributor) commented Apr 24, 2019

This PR is aimed at improving performance of the dht.

Things being reviewed:

  • Timeouts of individual RPC calls for queries
  • Starting queries with k (20) peers instead of α (3), per S/Kademlia
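For illustration, here is a minimal sketch of seeding a query from k routing-table peers, reusing the helpers visible in the diff below (`utils.convertBuffer`, `routingTable.closestPeers`, and the constants in `src/constants.js`); the `seedQuery` wrapper itself is hypothetical, not the PR's actual code:

```js
// Sketch only: seed the query from the K closest known peers rather than ALPHA.
const waterfall = require('async/waterfall')
const c = require('./constants')
const utils = require('./utils')

function seedQuery (dht, key, callback) {
  waterfall([
    (cb) => utils.convertBuffer(key, cb),
    // Previously c.ALPHA (3) peers were used here; per S/Kademlia the query
    // now starts from the c.K (20) closest peers in the routing table.
    (id, cb) => cb(null, dht.routingTable.closestPeers(id, c.K))
  ], callback)
}
```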

Refs:

@ghost ghost assigned jacobheun Apr 24, 2019
@ghost ghost added the status/in-progress In progress label Apr 24, 2019
```diff
@@ -321,7 +321,7 @@ class KadDHT extends EventEmitter {
     waterfall([
       (cb) => utils.convertBuffer(key, cb),
       (id, cb) => {
         const rtp = this.routingTable.closestPeers(id, c.ALPHA)
```
dirkmc (Contributor):

Seeing as this code is the same for every query, maybe it should be part of the Query itself?

jacobheun (Contributor, Author):

Agreed

@jacobheun (Contributor, Author) commented Apr 24, 2019

Here are some metrics for provide against the network. All calls provide the same key and are initiated after random walk has run (with a 10 second timeout). They also leverage the updated Query, starting with k instead of α peers.

| RPC Timeout | Total Query Time | Peers Queried | Errors | Failure Rate |
|-------------|------------------|---------------|--------|--------------|
| 1s          | 33.1s            | 409           | 357    | 87.3%        |
| 2.5s        | 96.5s            | 750           | 674    | 89.8%        |
| 5s          | 111.95s          | 717           | 528    | 73.6%        |
| 10s*        | 161.7s           | 420           | 265    | 63.1%        |
| 20s         | 114.9s           | 278           | 114    | 41.0%        |
| 30s         | 83.1s            | 193           | 45     | 23.3%        |
| 30s         | 112.8s           | 257           | 62     | 24.1%        |
| 60s         | 125.7s           | 200           | 44     | 22.0%        |

While times are still high, this is a significant improvement over starting with α peers, as that had times of over 5 minutes.

Notable metrics

  • Low timeouts lead to high error rates which result in us querying a lot of extra peers.
  • The margin of difference in failure rates between 30s and 1 minute appears to be low, which would indicate that if peers don't respond within 30s, they're not likely to respond within 1 minute.

* I believe the spike in time for 10s was due to a CPU issue on that run.

@dirkmc (Contributor) commented Apr 24, 2019

Very interesting findings!

I'm surprised that so many peers would take 30s to respond, do you have any idea why that might be? It shouldn't take long to get the closest peers from the k-table, so I assume that must be network latency, right?

@jacobheun (Contributor, Author):

> I'm surprised that so many peers would take 30s to respond, do you have any idea why that might be? It shouldn't take long to get the closest peers from the k-table, so I assume that must be network latency, right?

I'm going to run some more tests. I'm thinking this may be the actual connect that's taking so long, as I was previously applying the timeout to the entire queryFunc instead of just the RPC call. Switch will time out the dial after 10s by default, so we really shouldn't see long RPC calls.
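As a rough sketch of that distinction (with hypothetical `dialToPeer` and `sendRpc` helpers standing in for the Switch dial and the DHT RPC call, not the actual query code), the timeout would wrap only the RPC exchange:

```js
// Sketch only: scope the timeout to the RPC round trip, not the dial.
const timeout = require('async/timeout')

function makeQueryFn (dialToPeer, sendRpc, rpcTimeoutMs) {
  return (peer, callback) => {
    // The dial is still governed by the Switch's own timeout (10s by default)...
    dialToPeer(peer, (err, conn) => {
      if (err) return callback(err)
      // ...while only the RPC exchange is bounded by rpcTimeoutMs.
      timeout((cb) => sendRpc(conn, cb), rpcTimeoutMs)(callback)
    })
  }
}
```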

@vasco-santos (Member):

Great work with those metrics @jacobheun! This way, we have a more concrete basis for defining a proper timeout value for the queries.

@jacobheun (Contributor, Author):

I started seeing some queries running for 10+ minutes (I ended up cancelling them). The error rate of the last was only 24.23%, but it had queried 1077 peers, which is insane.

I was also surprised to see that when I went to reprovide the same id, without restarting my node, the query times didn't really improve. I expected shorter query times since we should have the closest peers from the last query in our routing table.

I'm going to look at adding a more verbose test suite around getClosestPeers to see if I can find any issues we haven't caught yet.

@dirkmc (Contributor) commented Apr 25, 2019

In theory it should stop once there are no more closer peers to be queried than the K closest it already knows about, so it's very unlikely it would get to > 1,000 peers if that check is working correctly. I wonder if there's a bug there. Are you able to repro that test? If so it would be great to examine the DHT keys of the peers that it queries, and the peers waiting to be queried in each queue to ensure that it's examining them in the correct order.

@jacobheun (Contributor, Author):

I ran into the issue a few times on the network but didn't get a dump of the queries. I'll be working on a test suite today to try to reproduce it before I make any further tweaks.

One issue I see is Run.continueQuerying. Currently, when a worker is about to start the next item in its path, it calls Run.continueQuerying, which checks for closer peers across every worker. We should just be checking for closer peers in that path. Otherwise, a path could be left with only peers in its queue that are further away than what has already been queried; if any other path has a closer peer, we'll keep querying those further peers, which could perpetually increase the radius of the query whenever other paths find closer peers. The paths all end at the same time, when they should really finish as soon as continueQuerying returns false for them, and them alone.
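A sketch of the intended per-path check (the names are illustrative, not the actual Run API):

```js
// Sketch only: a worker decides whether to continue based on its own path's
// queue, not on every path in the run.
function pathShouldContinue (pathQueue, kClosestSoFar, isCloserThanSome) {
  // Stop this path as soon as none of its queued peers can improve on the
  // K closest results seen so far; other paths make this decision themselves.
  return pathQueue.some((peer) => isCloserThanSome(peer, kClosestSoFar))
}
```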


```js
// If we've queried enough peers, stop the queue
if (!continueQuerying) {
  this.stop()
```
jacobheun (Contributor, Author):

Adding this.stop here improves the consistency of query speed quite a bit. Total query times are usually around or under 80 seconds. While this isn't great, it's better than it currently is. The side effect is that we might lose closer peers from the other, parallel responses in this queue, since it runs at a concurrency of 3.

I lowered the dial timeout (to 5s) on Switch for the js-ipfs instance I am running locally when testing against the live network. This also contributed to the improved speed.

One thing the Kademlia papers aren't particularly clear on (I may have missed it) is the behavior of the disjoint paths in terms of completing the query. Currently, we wait for all paths to finish before the query is finished. Should we? The more paths we allow to finish, the closer we get to the actual closest peers, but the query takes significantly longer. If we stop after a single disjoint path has completed, the query time is dramatically reduced, but it's also less accurate. A long-running peer might mitigate the accuracy problem as its routing table improves.

The big problem I am seeing now is slow/dead peers. My assumption is that the slow paths are hitting nodes that either aren't pruning their routing tables effectively, or the network is just that much in flux (nodes joining the network and then going offline regularly). To remediate this, we're either going to have to get really aggressive with the time we allow for connections and queries, or we'll have to be sloppier about what constitutes "closest". If we want to tighten down timeouts more, I think we'll want to add timeout options to Switch.dial. I've been changing this globally for testing, but DHT expectations vs global dial expectations are probably different.

A note: I played around with a concurrency of 6 instead of 3, but I actually observed a negative impact on query times.

@dirkmc @vasco-santos thoughts? I'm tempted to try things with the "sloppy" and fast approach and see how fetching performs on the network with that.

dirkmc (Contributor):

Excellent work @jacobheun, these are solid improvements, taking orders of magnitude off the query time.

I agree that we should add timeout options to Switch.dial(). I think the main reason that we see so many slow/dead peers is because nodes that are behind NATs are advertising their peer ids to the network. I believe this is the main problem in the go DHT and a strong impetus for using hole punching techniques. Ideally nodes would have a mechanism to be able to check if they're behind a NAT before advertising their peer ids, and only do so if they are reachable through a relay.

I believe it's correct to wait for all disjoint paths to finish before returning. A streaming API should mitigate this problem.

I would be interested in seeing how well the sloppy approach works, but it sounds like it would need to be run in a simulated network to understand how it behaves in the wild (unless there is already some research out there?).

jacobheun (Contributor, Author):

Something I realized, which is likely contributing to some of the workers getting "blocked", is peer addresses. If we encounter a peer that is offline and/or has a lot of bad addresses associated with it, the Switch timeout applies per dial attempt, not per connection attempt. With the default dial timeout of 10 seconds, a peer with 20 unique addresses could take us 200 seconds to work through. We don't currently do parallel dials in the Switch on the same transport, as we can't yet abort dials properly once a connection succeeds. We will probably need to mitigate this with a temporary timeout on the entire connection attempt until the async iterators work is completed for transports, which will include abort logic so we can go back to using parallel dials.
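A minimal sketch of that temporary mitigation (with a hypothetical `dialToPeer` standing in for the Switch dial; note this only stops waiting on the dial, it does not abort it):

```js
// Sketch only: bound the whole connection attempt (all addresses) with one
// timer, rather than relying on the per-address dial timeout.
function dialWithOverallTimeout (dialToPeer, peer, overallMs, callback) {
  let finished = false
  const timer = setTimeout(() => {
    if (finished) return
    finished = true
    callback(new Error('dial timed out'))
  }, overallMs)

  dialToPeer(peer, (err, conn) => {
    if (finished) return
    finished = true
    clearTimeout(timer)
    callback(err, conn)
  })
}
```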

jacobheun (Contributor, Author):

> because nodes that are behind NATs are advertising their peer ids to the network

Agreed. The NAT manager is on my list for this quarter; I may look at reprioritizing things a bit, as this will help with performance in general.

I will look at adding an option to time out the entire Switch.dial. I think this should help lower the high end of the queries.

If we go with the sloppy worker stop approach, and allow all disjoint paths to finish, we should still see reasonably close results. From https://github.com/libp2p/research-dht/issues/6, we could look at some other techniques to supplement the sloppy approach:

> On a successful get, we re-publish the value to nodes on the lookup path. Subsequent gets to hot keys will find the result without. This is part of the original Kademlia algorithm that doesn't appear to be implemented.
>
> Sloppy provides. Analogous to re-publishing along the lookup path on get, provide doesn't need to go all the way to the "correct" node. If there are a lot of providers for a given key, new providers should be added to nodes on the lookup path. This is one of the fundamental innovations in Coral[3].

vasco-santos (Member):

> Total query times are usually around or under 80 seconds

Not ideal, but a great improvement already! 👏

> If we want to tighten down timeouts more, I think we'll want to add timeout options to Switch.dial.

I believe this makes perfect sense for the DHT dials. As we will end up connecting with so many peers in the wild, the probability of failure will be super high and we should get to the next peer fast. So, I am totally in favor of this.

Regarding the disjoint paths, I also agree with @dirkmc that ideally we should wait for all of them. However, we can experiment with that change; I think it should be in a new PR, after this one.

The sloppy approach could also be interesting to experiment with. Do you prefer to continue it here?

jacobheun (Contributor, Author):

I added the timeout to the query function. It seemed unnecessary to change the Switch dial interface at this point. Once we have parallel dial aborts working, that should allow us to potentially remove the timeout here. During the Switch async refactor it might make sense to add the option to override timeouts per dial.

> The sloppy approach could also be interesting to experiment with. Do you prefer to continue it here?

Right now this is already sloppy: we're still running all disjoint paths, but we let workers end earlier than they normally would. One of the new tests validates the possibility of missing closer peers.

I want to look at adding a basic simulation test that we can look at running regularly when we attempt performance updates. But other than that, I think this is in a state worth trying out, and the next improvements will likely come from Switch and NAT improvements.

```diff
       }, cb)
     }
-  ], (err) => callback(err))
+  ], (err) => {
+    if (errors.length) {
```
jacobheun (Contributor, Author):

This should help mitigate #105. While the provide will still technically fail, we'll at least have attempted to provide to all peers.

@jacobheun (Contributor, Author) commented Apr 26, 2019

I put together a basic simulation script in the tests folder that can be run via `npm run sim`. It simulates network calls with simple latency timeouts. You can see the constants in the file to get a better idea of the options.
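The core idea of the simulated query looks roughly like this (the constant names and the shape of `peer` are illustrative, not the actual sim code):

```js
// Sketch only: replace real network calls with a latency delay, and make a
// fraction of peers never respond at all.
const DEAD_PEER_RATE = 0.3   // 30% offline / super slow peers
const LATENCY_MIN_MS = 100
const LATENCY_MAX_MS = 2000

function simulatedQuery (peer, callback) {
  // Dead peers simply never call back; the query's RPC timeout handles them.
  if (Math.random() < DEAD_PEER_RATE) return

  const latency = LATENCY_MIN_MS + Math.random() * (LATENCY_MAX_MS - LATENCY_MIN_MS)
  setTimeout(() => callback(null, peer.closerPeers), latency)
}
```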

Here's sample output for a 10k node network assuming 30% offline/super slow peers

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs...
Found 14 of the top 20 peers in 51375 ms
Found 12 of the top 20 peers in 44222 ms
Found 13 of the top 20 peers in 65698 ms
```

I pulled the sim code into the master branch to compare: the difference is significant and looks in line with the query times I was seeing on the live network.

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs...
Found 11 of the top 20 peers in 555810 ms
Found 12 of the top 20 peers in 584240 ms
Found 12 of the top 20 peers in 383383 ms
```

@dirkmc (Contributor) commented Apr 27, 2019

Wow great performance improvements! It's very useful to have a simulator as well.

Something else that may help with a sloppy approach would be to implement libp2p/go-libp2p-kad-dht#323

@jacobheun (Contributor, Author):

I added the ability to override concurrency via DHT options, as this will be nice for people to be able to configure, and I bumped the default up to 6. With a concurrency of 6 and the sloppy worker behavior, query times were lower and more consistent, and the results of the queries were more consistent as well. I updated the simulation to do a final check of the common closest peers found by each run.
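A hedged example of how that override might be passed when constructing the DHT (the option name `concurrency` and the constructor shape are assumptions based on this thread, not confirmed API):

```js
// Sketch only: configuring the DHT with the options discussed here.
const KadDHT = require('libp2p-kad-dht')

const dht = new KadDHT(libp2pSwitch, { // `libp2pSwitch` is a placeholder
  kBucketSize: 20, // K: peers per k-bucket and per query result set
  concurrency: 6   // workers per disjoint path (assumed option name)
})
```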

Concurrency of 6, "sloppy" workers off

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs with concurrency 6...
Found 13 of the top 20 peers in 90033 ms
Found 13 of the top 20 peers in 75473 ms
Found 13 of the top 20 peers in 99298 ms
All runs found 12 common peers
```

Concurrency of 6, "sloppy" workers on

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs with concurrency 6...
Found 14 of the top 20 peers in 49932 ms
Found 14 of the top 20 peers in 54169 ms
Found 14 of the top 20 peers in 51878 ms
All runs found 14 common peers
```

While there is randomness to the simulation network between runs (perhaps an enhancement would be to generate networks prior to doing a run so you could leverage the same network to test against), the results across runs are fairly consistent.

I want to look at CPU usage with these settings against js-ipfs, but I think these will make a big impact until we can resolve the network latency issues and dead/slow peer timeouts, which are now the major inhibitors to query times.

> Something else that may help with a sloppy approach would be to implement libp2p/go-libp2p-kad-dht#323

Republish will be really important, even without us being sloppy. Offline nodes that remain in peers' routing tables will make it improbable that we get the actual 20 closest peers, and the actual peers we do get are going to vary a lot as the network changes.

@dirkmc (Contributor) commented Apr 29, 2019

Nice 👍 The timing looks a lot better. Which is the piece of code that implements "sloppiness"?

@jacobheun (Contributor, Author):

It's the added stop at:

```js
// No peer we're querying is closer stop the queue
// This will cause queries that may potentially result in
// closer nodes to be ended, but it reduces overall query time
if (!continueQuerying) {
  this.stop()
```

@dirkmc (Contributor) commented Apr 29, 2019

This is looking pretty good, let me know when you'd like a code review

@jacobheun jacobheun marked this pull request as ready for review April 29, 2019 13:47
@jacobheun jacobheun requested review from dirkmc and vasco-santos and removed request for dirkmc April 29, 2019 13:47
@jacobheun (Contributor, Author):

A review would be good. The performance tests I'll do this week on js-ipfs shouldn't change much here aside from adjusting config options.

@jacobheun jacobheun changed the title [WIP] fix: performance improvements fix: performance improvements Apr 29, 2019
@dirkmc (Contributor) left a review:

A couple of nits, but otherwise LGTM 👍

@vasco-santos (Member) left a review:

Great work Jacob! 👏

The changes look great and the queue codebase is way better structured now. Also the performance improvements are super good, as well as the simulator!

Just detected an option missing in docs.

src/index.js (outdated):

```diff
 /**
  * Number of closest peers to return on kBucket search, default 20
  *
  * @type {number}
  */
-this.ncp = options.ncp || c.K
+this.ncp = options.ncp || this.kBucketSize
```
vasco-santos (Member):

Can you add docs for options.ncp?

jacobheun (Contributor, Author):

So, looking through the code, we actually only use dht.ncp in one spot. I think we should just get rid of it altogether and replace that occurrence with dht.kBucketSize. While it could potentially provide finer-grained control over the number of results we return, it's not doing that, and I think it adds more confusion than it's worth right now. Thoughts?

dirkmc (Contributor):

Agree 👍

vasco-santos (Member):

Sounds good!

jacobheun (Contributor, Author):

`ncp` is now gone in favor of `kBucketSize`.

@dirkmc (Contributor) commented May 8, 2019

@jacobheun with these changes do you think we can remove the shallow getClosestPeers() behaviour on PUT?

@jacobheun (Contributor, Author):

> @jacobheun with these changes do you think we can remove the shallow getClosestPeers() behaviour on PUT?

We should be able to. It's still going to be relatively slow until we can improve the connection latency. I think we can do some comparative tests against this and a follow-up PR for removing shallow to see what the impacts are. Shallow is definitely going to be much less accurate.

@vasco-santos (Member) left a review:

LGTM

@vasco-santos vasco-santos merged commit ddf80fe into master May 8, 2019
@ghost ghost removed the status/in-progress In progress label May 8, 2019
@vasco-santos vasco-santos deleted the fix/performance branch May 8, 2019 10:18
@vasco-santos (Member):

🚢 @0.4.14
