This repository has been archived by the owner on Jul 21, 2023. It is now read-only.

fix: performance improvements #107

Merged

vasco-santos merged 20 commits into master from fix/performance on May 8, 2019

Conversation

@jacobheun (Contributor) commented Apr 24, 2019

This PR is aimed at improving performance of the dht.

Things being reviewed:

  • Timeouts of individual RPC calls for queries
  • Starting queries with k (20) peers instead of α (3), per S/Kademlia
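For illustration, here is a minimal sketch of seeding a query from k routing-table peers, reusing the helpers visible in the diff below (`utils.convertBuffer`, `routingTable.closestPeers`, and the constants in `src/constants.js`); the `seedQuery` wrapper itself is hypothetical, not the PR's actual code:

```js
// Sketch only: seed the query from the K closest known peers rather than ALPHA.
const waterfall = require('async/waterfall')
const c = require('./constants')
const utils = require('./utils')

function seedQuery (dht, key, callback) {
  waterfall([
    (cb) => utils.convertBuffer(key, cb),
    // Previously c.ALPHA (3) peers were used here; per S/Kademlia the query
    // now starts from the c.K (20) closest peers in the routing table.
    (id, cb) => cb(null, dht.routingTable.closestPeers(id, c.K))
  ], callback)
}
```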

Refs:

@ghost ghost assigned jacobheun Apr 24, 2019
@ghost ghost added the status/in-progress In progress label Apr 24, 2019
```diff
@@ -321,7 +321,7 @@ class KadDHT extends EventEmitter {
     waterfall([
       (cb) => utils.convertBuffer(key, cb),
       (id, cb) => {
         const rtp = this.routingTable.closestPeers(id, c.ALPHA)
```
dirkmc (Contributor):

Seeing as this code is the same for every query, maybe it should be part of the Query itself?

jacobheun (Contributor, Author):

Agreed

@jacobheun (Contributor, Author) commented Apr 24, 2019

Here are some metrics for provide against the network. All calls provide the same key and are initiated after random walk has run (with a 10 second timeout). They also leverage the updated Query, starting with k instead of α peers.

| RPC Timeout | Total Query Time | Peers Queried | Errors | Failure Rate |
|-------------|------------------|---------------|--------|--------------|
| 1s          | 33.1s            | 409           | 357    | 87.3%        |
| 2.5s        | 96.5s            | 750           | 674    | 89.8%        |
| 5s          | 111.95s          | 717           | 528    | 73.6%        |
| 10s*        | 161.7s           | 420           | 265    | 63.1%        |
| 20s         | 114.9s           | 278           | 114    | 41.0%        |
| 30s         | 83.1s            | 193           | 45     | 23.3%        |
| 30s         | 112.8s           | 257           | 62     | 24.1%        |
| 60s         | 125.7s           | 200           | 44     | 22.0%        |

While times are still high, this is a significant improvement over starting with α peers, as that had times of over 5 minutes.

Notable metrics

  • Low timeouts lead to high error rates which result in us querying a lot of extra peers.
  • The margin of difference in failure rates between 30s and 1 minute appears to be low, which would indicate that if peers don't respond within 30s, they're not likely to respond within 1 minute.

* I believe the spike in time for 10s was due to a CPU issue on that run.

@dirkmc (Contributor) commented Apr 24, 2019

Very interesting findings!

I'm surprised that so many peers would take 30s to respond, do you have any idea why that might be? It shouldn't take long to get the closest peers from the k-table, so I assume that must be network latency, right?

@jacobheun (Contributor, Author):

> I'm surprised that so many peers would take 30s to respond, do you have any idea why that might be? It shouldn't take long to get the closest peers from the k-table, so I assume that must be network latency, right?

I'm going to run some more tests. I'm thinking this may be the actual connect that's taking so long, as I was previously applying the timeout to the entire queryFunc instead of just the RPC call. Switch will time out the dial after 10s by default, so we really shouldn't see long RPC calls.
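As a rough sketch of that distinction (with hypothetical `dialToPeer` and `sendRpc` helpers standing in for the Switch dial and the DHT RPC call, not the actual query code), the timeout would wrap only the RPC exchange:

```js
// Sketch only: scope the timeout to the RPC round trip, not the dial.
const timeout = require('async/timeout')

function makeQueryFn (dialToPeer, sendRpc, rpcTimeoutMs) {
  return (peer, callback) => {
    // The dial is still governed by the Switch's own timeout (10s by default)...
    dialToPeer(peer, (err, conn) => {
      if (err) return callback(err)
      // ...while only the RPC exchange is bounded by rpcTimeoutMs.
      timeout((cb) => sendRpc(conn, cb), rpcTimeoutMs)(callback)
    })
  }
}
```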

@vasco-santos (Member):

Great work with those metrics @jacobheun! This way, we have a more concrete basis for defining a proper timeout value for the queries.

@jacobheun (Contributor, Author):

I started seeing some queries running for 10+ minutes (I ended up cancelling them). The error rate of the last was only 24.23%, but it had queried 1077 peers, which is insane.

I was also surprised to see that when I went to reprovide the same id, without restarting my node, the query times didn't really improve. I expected shorter query times since we should have the closest peers from the last query in our routing table.

I'm going to look at adding a more verbose test suite around getClosestPeers to see if I can find any issues we haven't caught yet.

@dirkmc (Contributor) commented Apr 25, 2019

In theory it should stop once there are no more closer peers to be queried than the K closest it already knows about, so it's very unlikely it would get to > 1,000 peers if that check is working correctly. I wonder if there's a bug there. Are you able to repro that test? If so it would be great to examine the DHT keys of the peers that it queries, and the peers waiting to be queried in each queue to ensure that it's examining them in the correct order.

@jacobheun (Contributor, Author):

I ran into the issue a few times on the network but didn't get a dump of the queries. I'll be working on a test suite today to try to reproduce it before I make any further tweaks.

One issue I see is Run.continueQuerying. Currently, when a worker is about to start the next item in its path, it calls Run.continueQuerying, which checks for closer peers across every worker. We should just be checking for closer peers in that path. Otherwise, a path could be left with only peers in its queue that are further away than what has already been queried; if any other path has a closer peer, we'll keep querying those further peers, which could perpetually increase the radius of the query whenever other paths find closer peers. The paths all end at the same time, when they should really finish as soon as continueQuerying returns false for them, and them alone.
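A sketch of the intended per-path check (the names are illustrative, not the actual Run API):

```js
// Sketch only: a worker decides whether to continue based on its own path's
// queue, not on every path in the run.
function pathShouldContinue (pathQueue, kClosestSoFar, isCloserThanSome) {
  // Stop this path as soon as none of its queued peers can improve on the
  // K closest results seen so far; other paths make this decision themselves.
  return pathQueue.some((peer) => isCloserThanSome(peer, kClosestSoFar))
}
```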


```js
// If we've queried enough peers, stop the queue
if (!continueQuerying) {
  this.stop()
```
jacobheun (Contributor, Author):

Adding this.stop here improves the consistency of query speed quite a bit. Total query times are usually around or under 80 seconds. While this isn't great, it's better than it currently is. The side effect is that we might lose closer peers from the other, parallel responses in this queue, since it runs at a concurrency of 3.

I lowered the dial timeout (to 5s) on Switch for the js-ipfs instance I am running locally when testing against the live network. This also contributed to the improved speed.

One thing the Kademlia papers aren't particularly clear on (I may have missed it) is the behavior of the disjoint paths in terms of completing the query. Currently, we wait for all paths to finish before the query is finished. Should we? The more paths we allow to finish, the closer we get to the actual closest peers, but the query takes significantly longer. If we stop after a single disjoint path has completed, the query time is dramatically reduced, but it's also less accurate. A long-running peer might mitigate the accuracy problem as its routing table improves.

The big problem I am seeing now is slow/dead peers. My assumption is that the slow paths are hitting nodes that either aren't pruning their routing tables effectively, or the network is just that much in flux (nodes joining the network and then going offline regularly). To remediate this, we're either going to have to get really aggressive with the time we allow for connections and queries, or we'll have to be sloppier about what constitutes "closest". If we want to tighten down timeouts more, I think we'll want to add timeout options to Switch.dial. I've been changing this globally for testing, but DHT expectations vs global dial expectations are probably different.

A note: I played around with a concurrency of 6 instead of 3, but I actually observed a negative impact on query times.

@dirkmc @vasco-santos thoughts? I'm tempted to try things with the "sloppy" and fast approach and see how fetching performs on the network with that.

dirkmc (Contributor):

Excellent work @jacobheun, these are solid improvements, taking orders of magnitude off the query time.

I agree that we should add timeout options to Switch.dial(). I think the main reason that we see so many slow/dead peers is because nodes that are behind NATs are advertising their peer ids to the network. I believe this is the main problem in the go DHT and a strong impetus for using hole punching techniques. Ideally nodes would have a mechanism to be able to check if they're behind a NAT before advertising their peer ids, and only do so if they are reachable through a relay.

I believe it's correct to wait for all disjoint paths to finish before returning. A streaming API should mitigate this problem.

I would be interested in seeing how well the sloppy approach works, but it sounds like it would need to be run in a simulated network to understand how it behaves in the wild (unless there is already some research out there?).

jacobheun (Contributor, Author):

Something I realized, which is likely contributing to some of the workers getting "blocked", is peer addresses. If we encounter a peer that is offline and/or has a lot of bad addresses associated with it, the Switch timeout applies per dial attempt, not per connection attempt. With the default dial timeout of 10 seconds, a peer with 20 unique addresses could take us 200 seconds to work through. We don't currently do parallel dials in the Switch on the same transport, as we can't yet abort dials properly once a connection succeeds. We will probably need to mitigate this with a temporary timeout on the entire connection attempt until the async iterators work is completed for transports, which will include abort logic so we can go back to using parallel dials.
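A minimal sketch of that temporary mitigation (with a hypothetical `dialToPeer` standing in for the Switch dial; note this only stops waiting on the dial, it does not abort it):

```js
// Sketch only: bound the whole connection attempt (all addresses) with one
// timer, rather than relying on the per-address dial timeout.
function dialWithOverallTimeout (dialToPeer, peer, overallMs, callback) {
  let finished = false
  const timer = setTimeout(() => {
    if (finished) return
    finished = true
    callback(new Error('dial timed out'))
  }, overallMs)

  dialToPeer(peer, (err, conn) => {
    if (finished) return
    finished = true
    clearTimeout(timer)
    callback(err, conn)
  })
}
```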

jacobheun (Contributor, Author):

> because nodes that are behind NATs are advertising their peer ids to the network

Agreed. The NAT manager is on my list for this quarter; I may look at reprioritizing things a bit, as this will help with performance in general.

I will look at adding an option to time out the entire Switch.dial. I think this should help lower the high end of the queries.

If we go with the sloppy worker stop approach, and allow all disjoint paths to finish, we should still see reasonably close results. From https://github.com/libp2p/research-dht/issues/6, we could look at some other techniques to supplement the sloppy approach:

> On a successful get, we re-publish the value to nodes on the lookup path. Subsequent gets to hot keys will find the result without. This is part of the original Kademlia algorithm that doesn't appear to be implemented.
>
> Sloppy provides. Analogous to re-publishing along the lookup path on get, provide doesn't need to go all the way to the "correct" node. If there are a lot of providers for a given key, new providers should be added to nodes on the lookup path. This is one of the fundamental innovations in Coral[3].

vasco-santos (Member):

> Total query times are usually around or under 80 seconds

Not ideal, but a great improvement already! 👏

> If we want to tighten down timeouts more, I think we'll want to add timeout options to Switch.dial.

I believe this makes perfect sense for the DHT dials. As we will end up connecting with so many peers in the wild, the probability of failure will be super high and we should get to the next peer fast. So, I am totally in favor of this.

Regarding the disjoint paths, I also agree with @dirkmc that ideally we should wait for all of them. However, we can experiment with that change; I think it should be in a new PR, after this one.

The sloppy approach could also be interesting to experiment with. Do you prefer to continue it here?

jacobheun (Contributor, Author):

I added the timeout to the query function. It seemed unnecessary to change the Switch dial interface at this point. Once we have parallel dial aborts working, that should allow us to potentially remove the timeout here. During the Switch async refactor it might make sense to add the option to override timeouts per dial.

> The sloppy approach could also be interesting to experiment with. Do you prefer to continue it here?

Right now this is already sloppy: we're still running all disjoint paths, but we let workers end earlier than they normally would. One of the new tests validates the possibility of missing closer peers.

I want to look at adding a basic simulation test that we can look at running regularly when we attempt performance updates. But other than that, I think this is in a state worth trying out, and the next improvements will likely come from Switch and NAT improvements.

```diff
       }, cb)
     }
-  ], (err) => callback(err))
+  ], (err) => {
+    if (errors.length) {
```
jacobheun (Contributor, Author):

This should help mitigate #105. While the provide will still technically fail, we'll at least have attempted to provide to all peers.

@jacobheun (Contributor, Author) commented Apr 26, 2019

I put together a basic simulation script in the tests folder that can be run via `npm run sim`. It simulates network calls with simple latency timeouts. You can see the constants in the file to get a better idea of the options.
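The core idea of the simulated query looks roughly like this (the constant names and the shape of `peer` are illustrative, not the actual sim code):

```js
// Sketch only: replace real network calls with a latency delay, and make a
// fraction of peers never respond at all.
const DEAD_PEER_RATE = 0.3   // 30% offline / super slow peers
const LATENCY_MIN_MS = 100
const LATENCY_MAX_MS = 2000

function simulatedQuery (peer, callback) {
  // Dead peers simply never call back; the query's RPC timeout handles them.
  if (Math.random() < DEAD_PEER_RATE) return

  const latency = LATENCY_MIN_MS + Math.random() * (LATENCY_MAX_MS - LATENCY_MIN_MS)
  setTimeout(() => callback(null, peer.closerPeers), latency)
}
```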

Here's sample output for a 10k node network assuming 30% offline/super slow peers

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs...
Found 14 of the top 20 peers in 51375 ms
Found 12 of the top 20 peers in 44222 ms
Found 13 of the top 20 peers in 65698 ms
```

I pulled the sim code into the master branch to compare: the difference is significant and looks in line with the query times I was seeing on the live network.

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs...
Found 11 of the top 20 peers in 555810 ms
Found 12 of the top 20 peers in 584240 ms
Found 12 of the top 20 peers in 383383 ms
```

@dirkmc (Contributor) commented Apr 27, 2019

Wow great performance improvements! It's very useful to have a simulator as well.

Something else that may help with a sloppy approach would be to implement libp2p/go-libp2p-kad-dht#323

@jacobheun (Contributor, Author):

I added the ability to override concurrency via DHT options, as this will be nice for people to be able to configure, and I bumped the default up to 6. With a concurrency of 6 and the sloppy worker behavior, query times were lower and more consistent, and the results of the queries were more consistent as well. I updated the simulation to do a final check of the common closest peers found by each run.
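A hedged example of how that override might be passed when constructing the DHT (the option name `concurrency` and the constructor shape are assumptions based on this thread, not confirmed API):

```js
// Sketch only: configuring the DHT with the options discussed here.
const KadDHT = require('libp2p-kad-dht')

const dht = new KadDHT(libp2pSwitch, { // `libp2pSwitch` is a placeholder
  kBucketSize: 20, // K: peers per k-bucket and per query result set
  concurrency: 6   // workers per disjoint path (assumed option name)
})
```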

Concurrency of 6, "sloppy" workers off

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs with concurrency 6...
Found 13 of the top 20 peers in 90033 ms
Found 13 of the top 20 peers in 75473 ms
Found 13 of the top 20 peers in 99298 ms
All runs found 12 common peers
```

Concurrency of 6, "sloppy" workers on

```
Starting setup...
Total Nodes=10000, Dead Nodes=3000, Max Siblings per Peer=500
Starting 3 runs with concurrency 6...
Found 14 of the top 20 peers in 49932 ms
Found 14 of the top 20 peers in 54169 ms
Found 14 of the top 20 peers in 51878 ms
All runs found 14 common peers
```

While there is randomness to the simulation network between runs (perhaps an enhancement would be to generate networks prior to doing a run so you could leverage the same network to test against), the results across runs are fairly consistent.

I want to look at CPU usage with these settings against js-ipfs, but I think these will make a big impact until we can resolve the network latency issues and dead/slow peer timeouts, which are now the major inhibitors to query times.

> Something else that may help with a sloppy approach would be to implement libp2p/go-libp2p-kad-dht#323

Republish will be really important, even without us being sloppy. Offline nodes that remain in peers' routing tables will make it improbable that we get the actual 20 closest peers, and the actual peers we do get are going to vary a lot as the network changes.

@dirkmc (Contributor) commented Apr 29, 2019

Nice 👍 The timing looks a lot better. Which is the piece of code that implements "sloppiness"?

@jacobheun (Contributor, Author):

It's the added stop at:

```js
// No peer we're querying is closer stop the queue
// This will cause queries that may potentially result in
// closer nodes to be ended, but it reduces overall query time
if (!continueQuerying) {
  this.stop()
```

@dirkmc (Contributor) commented Apr 29, 2019

This is looking pretty good, let me know when you'd like a code review

@jacobheun jacobheun marked this pull request as ready for review April 29, 2019 13:47
@jacobheun jacobheun requested review from dirkmc and vasco-santos and removed request for dirkmc April 29, 2019 13:47
@jacobheun (Contributor, Author):

A review would be good. The performance tests I'll do this week on js-ipfs shouldn't change much here aside from adjusting config options.

@jacobheun jacobheun changed the title [WIP] fix: performance improvements fix: performance improvements Apr 29, 2019
@dirkmc (Contributor) left a review:

A couple of nits, but otherwise LGTM 👍

@vasco-santos (Member) left a review:

Great work Jacob! 👏

The changes look great and the queue codebase is way better structured now. Also the performance improvements are super good, as well as the simulator!

Just detected an option missing in docs.

src/index.js (outdated):

```diff
 /**
  * Number of closest peers to return on kBucket search, default 20
  *
  * @type {number}
  */
-this.ncp = options.ncp || c.K
+this.ncp = options.ncp || this.kBucketSize
```
vasco-santos (Member):

Can you add docs for options.ncp?

jacobheun (Contributor, Author):

So, looking through the code, we actually only use dht.ncp in one spot. I think we should just get rid of it altogether and replace that occurrence with dht.kBucketSize. While it could potentially provide finer-grained control over the number of results we return, it's not doing that, and I think it adds more confusion than it's worth right now. Thoughts?

dirkmc (Contributor):

Agree 👍

vasco-santos (Member):

Sounds good!

jacobheun (Contributor, Author):

`ncp` is now gone in favor of `kBucketSize`.

@dirkmc (Contributor) commented May 8, 2019

@jacobheun with these changes do you think we can remove the shallow getClosestPeers() behaviour on PUT?

@jacobheun (Contributor, Author):

> @jacobheun with these changes do you think we can remove the shallow getClosestPeers() behaviour on PUT?

We should be able to. It's still going to be relatively slow until we can improve the connection latency. I think we can do some comparative tests against this and a follow-up PR for removing shallow to see what the impacts are. Shallow is definitely going to be much less accurate.

@vasco-santos (Member) left a review:

LGTM

@vasco-santos vasco-santos merged commit ddf80fe into master May 8, 2019
@ghost ghost removed the status/in-progress In progress label May 8, 2019
@vasco-santos vasco-santos deleted the fix/performance branch May 8, 2019 10:18
@vasco-santos (Member):

🚢 @0.4.14
