
swarm: implement smart dialing logic #2260

Merged — 25 commits merged into libp2p:master from smart-dialing on Jun 4, 2023

Conversation

@sukunrt (Member) commented Apr 24, 2023

We consider private, public IPv4, public IPv6, and relay addresses separately.

Within each group, if a QUIC address is present, we delay TCP addresses:

- private: 30 ms delay
- public IPv4: 300 ms delay
- public IPv6: 300 ms delay
- relay: 300 ms delay

If a quic-v1 address is present, we don't dial the quic or webtransport address on the same (IP, port) combination. If a TCP address is present, we don't dial the ws or wss address on the same (IP, port) combination. If both direct and relay addresses are present, all relay addresses are delayed by an additional 500 ms. So if there is a QUIC relay address and a TCP relay address, the QUIC relay address is delayed by 500 ms and the TCP relay address by 800 ms.

All delays are set to 0 for a holepunch request.
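
For illustration, here is a minimal Go sketch of that delay schedule, assuming a hypothetical `addrInfo` classification (the real logic lives in `p2p/net/swarm/dial_ranker.go` and operates on multiaddrs):

```go
package dialsketch

import "time"

// Hypothetical address classification; the real ranker inspects
// multiaddr components.
type addrInfo struct {
	isTCP     bool // TCP-based transport (tcp, ws, wss)
	isRelay   bool // circuit relay address
	isPrivate bool // private network address
}

// dialDelay sketches the schedule above: within a group that also has a
// QUIC address, TCP addresses wait 30ms (private) or 300ms (public and
// relay), and relay addresses wait an extra 500ms whenever direct
// addresses are present. All delays are zero for hole punching.
func dialDelay(a addrInfo, groupHasQUIC, haveDirect, holePunch bool) time.Duration {
	if holePunch {
		return 0
	}
	var d time.Duration
	if groupHasQUIC && a.isTCP {
		if a.isPrivate {
			d += 30 * time.Millisecond
		} else {
			d += 300 * time.Millisecond
		}
	}
	if a.isRelay && haveDirect {
		d += 500 * time.Millisecond
	}
	return d
}
```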

closes: #1785

@sukunrt marked this pull request as draft on April 24, 2023.
@sukunrt force-pushed the smart-dialing branch 4 times, most recently from ea7f3b4 to 67dfaba on April 25, 2023.
@sukunrt (Member Author) commented Apr 25, 2023

Some results from a 1-hour simultaneous run of two kubo nodes on the same machine.

Total dial cancellations:
old: 4100
new: 1700

[screenshot: dial cancellations graph]

kubo2 runs the old code; kubo runs the new code.

This is the Prometheus query:
`sum by (job, transport) (increase(libp2p_swarm_dial_errors_total{error="canceled"}[$__rate_interval]))`

@sukunrt marked this pull request as ready for review on April 25, 2023.
@marten-seemann (Contributor) commented:

> Total dial cancellations:
> old: 4100
> new: 1700

Impressive numbers! Two questions:

1. Do you have any idea what the reason for the remaining cancellations is?
2. Do you have any numbers on connection establishment latency? How much are we adding?

@sukunrt (Member Author) commented Apr 25, 2023

> Do you have any idea what the reason for the remaining cancellations is?

For some reason there are many quic-draft29 cancellations: nodes are reporting a lot of quic addresses but not as many quic-v1 addresses. I'm still debugging what is causing this.

> Do you have any numbers on connection establishment latency? How much are we adding?

I'll have to measure this. The handshake latency metric currently measures latency from the time of dialing, so I'll have to instrument this number separately.

@marten-seemann (Contributor) commented:

> > Do you have any idea what the reason for the remaining cancellations is?
>
> For some reason there are many quic-draft29 cancellations: nodes are reporting a lot of quic addresses but not as many quic-v1 addresses. I'm still debugging what is causing this.

Are you dialing quic-v1 and quic-draft29 in parallel? If we have a v1 address, we should never dial draft-29.

@marten-seemann (Contributor) left a comment:

A few thoughts:

1. Should we prioritize WebTransport over TCP (in cases where we don't have QUIC)?
2. Do I understand correctly that we're dialing IPv6 and IPv4 QUIC addresses in parallel?
3. What happens if a node gives us multiple QUIC IP addresses (of the same address family)? Should we just randomly pick one and dial it?

@MarcoPolo (Collaborator) left a comment:

A couple of nits, but this looks great!

@@ -342,3 +358,206 @@ func TestDialWorkerLoopConcurrentFailureStress(t *testing.T) {
close(reqch)
worker.wg.Wait()
}

func TestDialWorkerLoopRanking(t *testing.T) {
Collaborator:

I always appreciate more tests in this part of the codebase, thanks!

A feature request from me would be some sort of generative test here; see testing/quick for the tooling. If we could randomly generate test cases and verify that they do what we expect, I'd be much more confident in rolling this out and making future changes.
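
For illustration, a minimal testing/quick property test might look like the sketch below; `rankByPort` is a toy stand-in for the real ranker, not code from this PR:

```go
package swarm

import (
	"sort"
	"testing"
	"testing/quick"
)

// rankByPort is a toy ranker: it orders candidate ports ascending.
func rankByPort(ports []uint16) []uint16 {
	out := append([]uint16(nil), ports...)
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func TestRankByPortQuick(t *testing.T) {
	// Property: ranking neither drops nor duplicates candidates,
	// and the output is sorted.
	property := func(ports []uint16) bool {
		ranked := rankByPort(ports)
		sorted := sort.SliceIsSorted(ranked, func(i, j int) bool { return ranked[i] < ranked[j] })
		return len(ranked) == len(ports) && sorted
	}
	if err := quick.Check(property, nil); err != nil {
		t.Error(err)
	}
}
```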

Member Author:

Added one randomized test using testing/quick. Is this what you had in mind?

@BigLep (Contributor) commented Apr 28, 2023

Thanks for the work here and for the numbers.

To put the number of cancellations in context, how many total connections were established during this same 1 hour window?

@sukunrt (Member Author) commented Apr 29, 2023

@marten-seemann:

> Should we prioritize WebTransport over TCP (in cases where we don't have QUIC)?

This is the strategy in the current PR.

> Do I understand correctly that we're dialing IPv6 and IPv4 QUIC addresses in parallel?

Yes. I now think we should change this and dial all IPv4 addresses 300 ms after IPv6. The PR dials them in parallel because my ISP doesn't support IPv6, so I didn't understand how to model that; running kubo in a cloud environment helped here.

> What happens if a node gives us multiple QUIC IP addresses (of the same address family)? Should we just randomly pick one and dial it?

Excellent idea. I did some experiments and found that if a peer shares one address on port 4001 and another on a different port, the 4001 address is more likely to be the correct one. So the strategy I've used is to sort the addresses by port number: nodes are likely to dial out from a much higher port than the one they choose to listen on.

Some more numbers, from kubo on a t2.micro AWS instance with both IPv4 and IPv6 support.

happy eyeballs (public == private | quic > tcp | ipv6 > ipv4):

This strategy is essentially what @marten-seemann suggests, the only difference being that we prioritise IPv6 over IPv4.

- We first use QUIC addresses and then TCP addresses.
- Within a transport group we rank IPv6 over IPv4.
- The first address of the group is dialed immediately and the rest are dialed after 300 ms.
- The TCP group is dialed 300 ms after the last QUIC dial.
- Example: for quic1, quic2, quic3, tcp1, tcp2, tcp3 the delays are quic1: 0, quic2: 300, quic3: 300, tcp1: 600, tcp2: 900, tcp3: 900.
- Public and private addresses are dialed in parallel using the same logic (a Go sketch of this schedule follows below).
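
A minimal sketch of that schedule, assuming the candidate list is already sorted QUIC-first (hypothetical `isQUIC` classifier; this is not the PR's code):

```go
package dialsketch

import "time"

// happyEyeballsDelays assigns the delays from the example above: the
// first QUIC address dials at 0, remaining QUIC addresses at 300ms,
// the first TCP address 300ms after the last QUIC dial, and remaining
// TCP addresses 300ms after that.
func happyEyeballsDelays(addrs []string, isQUIC func(string) bool) map[string]time.Duration {
	const step = 300 * time.Millisecond
	delays := make(map[string]time.Duration)
	var lastQUIC, tcpBase time.Duration
	firstQUIC, firstTCP := true, true
	for _, a := range addrs {
		switch {
		case isQUIC(a) && firstQUIC:
			firstQUIC = false
			delays[a] = 0 // first QUIC address dials immediately
		case isQUIC(a):
			delays[a] = step // remaining QUIC addresses wait 300ms
			lastQUIC = step
		case firstTCP:
			firstTCP = false
			tcpBase = lastQUIC + step // 300ms after the last QUIC dial
			delays[a] = tcpBase
		default:
			delays[a] = tcpBase + step
		}
	}
	return delays
}
```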

PR (ip4 == ip6 == private | quic > tcp):
The strategy of this PR: all TCP addresses are delayed by 300 ms.

master: no delay.

single-dial (public == private | quic > tcp | ipv6 > ipv4):
Same as happy eyeballs, but we dial one address at a time and wait 300 ms for a result.

All latency numbers are in milliseconds. Successes is the number of successful outgoing dials that resulted in a connection.

| Strategy | Cancellations | Successes | Cancel Fraction | Latency (p50) | Latency (p80) | Latency (p90) | Latency (p95) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| master | 1950 | 1600 | 0.54 | 90 | 200 | 240 | 310 |
| happy eyeballs | 510 | 1550 | 0.24 | 94 | 219 | 360 | 650 |
| PR | 1050 | 1997 | 0.34 | 93 | 200 | 270 | 538 |
| single-dial | 520 | 1450 | 0.26 | 95 | 212 | 340 | 600 |

I'm still debugging why the happy eyeballs latency numbers are worse than the single-dial latency numbers.

@BigLep:

> To put the number of cancellations in context, how many total connections were established during this same 1 hour window?

I'm sorry, I somehow deleted the Prometheus data for that run 🤦‍♂️. But the numbers above are more representative and reproducible: the previous numbers were obtained on my dev machine with an ISP that doesn't support IPv6, and I think simultaneous kubo runs aren't very comparable.

@sukunrt (Member Author) commented Apr 29, 2023

@marten-seemann:

> Do you have any idea what the reason for the remaining cancellations is?

Some cancellations happen because the user cancels the dials and no successful connection is made. Some happen because a TCP dial succeeds and we cancel the QUIC dial, and some because we had multiple QUIC dials and had to cancel one of them.

In the graphs below, the first run is master (no delay), the second run is happy eyeballs, and the third run is this PR's strategy, where all QUIC addresses are dialed together.

quic- means we cancelled a QUIC dial and there was no successful connection.
quic-tcp means we cancelled a QUIC dial and there was a successful TCP connection.

[screenshot]

Here you can see there's not much impact on cancellations where there was no successful connection.

[screenshot]

Here we can see that tcp-quic (TCP cancelled, QUIC succeeded) is reduced considerably for both strategies, as expected.

[screenshot]

The happy eyeballs strategy (the middle one) considerably reduces quic-quic and quicv1-quicv1 cancellations.

[screenshot]

None of the strategies has much of an impact when the successful connection was over TCP, as expected.

@p-shahi mentioned this pull request May 1, 2023.
@sukunrt force-pushed the smart-dialing branch 3 times, most recently from f0aa41d to 9e44071 on May 7, 2023.
@marten-seemann (Contributor) left a comment:

I still need to actually understand what the dial worker loop is doing. I have to admit I'm pretty lost...


// Clock is a clock that can create timers that trigger at some
// instant rather than some duration
type Clock interface {
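
For context, a hypothetical reconstruction of the instant-based clock interface under discussion (a sketch, not necessarily the exact code in the diff):

```go
package dialsketch

import "time"

// InstantTimer fires at an absolute point in time rather than after a
// duration, which makes it deterministic to drive from a mock clock.
type InstantTimer interface {
	Reset(when time.Time) bool
	Stop() bool
	Ch() <-chan time.Time
}

// Clock hands out instant-based timers; a real implementation wraps
// time.Timer, while a test implementation fires timers explicitly.
type Clock interface {
	Now() time.Time
	InstantTimer(when time.Time) InstantTimer
}
```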
Contributor:

Do we need to introduce a new interface here? We're using https://github.com/benbjohnson/clock elsewhere in the codebase; would it be possible to just reuse that one?

Member Author:

We don't need a new interface. In this specific case I'm using InstantTimer, which benbjohnson/clock doesn't provide, but I could use standard timers instead. I didn't use benbjohnson/clock because a timer with a negative duration isn't fired immediately, and I thought we were going to use our own implementation going forward.

I see that benbjohnson/clock#50 is merged, so I don't have any objections to using benbjohnson/clock.

@MarcoPolo what do you think?


@MarcoPolo (Collaborator) commented May 22, 2023:

If you are setting a timer based on an instant in time rather than some duration, you should use this clock (which is the case in this diff). The benbjohnson clock will be flaky for this use case because you have two goroutines that are both trying to use the value returned from Now().

Here's a simple example: that library has an internal ticker, and you have your timer handler logic. Your handler wants to reset the timer for 1 minute from the moment it was called (now), and after the library has finished notifying all timers, it'll advance the clock (let's call the advanced clock time the future). If your handler goroutine calls Reset before the ticker finishes calling all timers and advancing the clock, you're fine, because the timer is registered for now+1min. But if the ticker has already advanced to the future, you're out of luck, because you've just registered the timer for future+1min.

This isn't really a problem with the benbjohnson clock; it's a problem with trying to mock the timer interface at all, since it only accepts a duration, not a time. That's why this Clock interface lets you define timers that trigger at some point in time rather than after some duration. (See the sketch after this comment for a concrete version of the race.)

Does that make sense? If so, I think we should include this reasoning as a comment in the codebase for when it comes up again in the future, since it's not super obvious.

Another added bonus is that this mock clock can be implemented in about 100 LoC :)
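
A minimal sketch of that race, using the real benbjohnson/clock API but an invented handler (illustration only):

```go
package clocksketch

import (
	"time"

	"github.com/benbjohnson/clock"
)

// raceSketch shows why duration-based mock timers are awkward when the
// handler really wants an absolute deadline.
func raceSketch() {
	mock := clock.NewMock()
	timer := mock.Timer(time.Minute)

	go func() {
		<-timer.C
		// The handler wants to fire one minute after the moment it ran.
		// If the main goroutine has already advanced the mock clock,
		// this registers for future+1min rather than now+1min.
		timer.Reset(time.Minute)
	}()

	mock.Add(2 * time.Minute) // may race with the Reset above
}
```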

Member Author:

Thanks @MarcoPolo. I didn't realise this case would be flaky. We should add this comment.

@marten-seemann (Contributor) commented:

> > The fix was released in benbjohnson/clock@v1.3.4 (release).
>
> Continuing the discussion here: timer.Reset is not handled in v1.3.4, so I've raised benbjohnson/clock#55. We can keep the current implementation for now; I'll change it to use benbjohnson/clock when that's merged.

Sounds good to me.

@MarcoPolo (Collaborator) commented:

> > > The fix was released in benbjohnson/clock@v1.3.4 (release).
> >
> > Continuing the discussion here: timer.Reset is not handled in v1.3.4, so I've raised benbjohnson/clock#55. We can keep the current implementation for now; I'll change it to use benbjohnson/clock when that's merged.
>
> Sounds good to me.

Making sure that you both saw my comment here: https://github.com/libp2p/go-libp2p/pull/2260/files/241fd6a912e8ec50e9dadd16e092b4de22885a42#r1201284744

@MarcoPolo (Collaborator) commented:

Before merge:

- Document this change in the CHANGELOG.md file (for once I didn't forget about this).
- Document how to disable this and why you would want to.

@sukunrt (Member Author) commented May 23, 2023:

Thanks @MarcoPolo. I've added an entry, and I've made the default dial ranker and the no-delay ranker public so that we can point to the godoc for the logic.

@marten-seemann (Contributor) left a comment:

This looks pretty good. A few suggestions for the metrics.

"refId": "C"
}
],
"title": "Dial Ranking Delay",
Contributor:

Can we put the 2 new dashboards in a new row?

Help: "Number of addresses dialed per peer",
// to count histograms with integral values accurately the bucket needs to be
// very narrow around the integer value
Buckets: []float64{0, 0.99, 1, 1.99, 2, 2.99, 3, 3.99, 4, 4.99, 5},
Contributor:

Not sure if a histogram is the right abstraction here. We can probably also improve the graph:

[screenshot]

What about using a counter here (with labels 1, 2, 3, 4, 5, more) and incrementing the respective counter directly?

We could then display this as a pie chart, which would allow us to easily see that X% of connections succeed on the first attempt, Y% on the second one, and so on. That would be more meaningful than percentiles, wouldn't it?
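
A sketch of that suggestion with prometheus/client_golang (hypothetical metric name, not what was merged):

```go
package metricsketch

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// dialsPerConn counts connections by how many addresses were dialed
// before one succeeded, bucketed into "1".."5" and "more".
var dialsPerConn = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "libp2p",
		Name:      "swarm_dials_per_connection_total",
		Help:      "Connections by number of addresses dialed",
	},
	[]string{"num_dials"},
)

func recordDials(n int) {
	label := "more"
	if n <= 5 {
		label = strconv.Itoa(n)
	}
	dialsPerConn.WithLabelValues(label).Inc()
}
```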

Member Author:

This sounds like a good idea. I'll try it.

@marten-seemann (Contributor) left a comment:

This is great!
[screenshot]

Thanks @sukunrt!

@marten-seemann merged commit 6f27081 into libp2p:master on Jun 4, 2023.
@BigLep (Contributor) commented Jun 7, 2023

A few things from looking at https://github.com/libp2p/go-libp2p/blob/master/CHANGELOG.md#smart-dialing-

1. I don't think we're really selling the positive impact. We speak to how there's no/low negative impact, but can we also summarize the positive impact?
2. There are snapshots of various dashboards in this PR. Are those shareable links?
3. What's the methodology we're using for metric collection in this PR? If I understand correctly, we've spun up a Kubo node with this version of go-libp2p. What is the usage pattern of that Kubo node? What peers is it dialing? Are we triggering anything on that Kubo node to force dialing of other nodes?
4. The table in #2260 (comment) was useful earlier. Do we have the latest numbers on the cancellation rate and latency impact of the old code vs. the new code? (If that's in a dashboard, that's great.)

If we don't want to embed that kind of info in the changelog itself, we could give a summary here and link to that comment.

@marten-seemann (Contributor) commented:

Thanks Steve, I agree. I've made some changes in #2342.

sukunrt added a commit that referenced this pull request Jun 12, 2023
sukunrt added a commit that referenced this pull request Jun 12, 2023
marten-seemann pushed a commit that referenced this pull request Jun 15, 2023
gts2030 pushed a commit to superblock-dev/go-libp2p that referenced this pull request May 23, 2024
Successfully merging this pull request may close these issues: proposal: use Happy Eyeballs-like logic for dialing peers.