Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky TestGlobalPreferenceV4 on Ubuntu #1611

Closed
marten-seemann opened this issue Jun 23, 2022 · 10 comments · Fixed by #1673
Closed

flaky TestGlobalPreferenceV4 on Ubuntu #1611

marten-seemann opened this issue Jun 23, 2022 · 10 comments · Fixed by #1673
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@marten-seemann
Copy link
Contributor

https://github.com/libp2p/go-libp2p/runs/7028901565?check_suite_focus=true

  === RUN   TestGlobalPreferenceV4
      transport_test.go:140: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/127.0.0.1/tcp/0 over /ip4/10.1.1.88/tcp/0
      transport_test.go:142: when listening on /ip4/127.0.0.1/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/10.1.1.88/tcp/0
      transport_test.go:180: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60340. expected to dial from port [45733]
      transport_test.go:193: dialed /ip4/127.0.0.1/tcp/46587 from 127.0.0.1:60344. expected to dial from port [45733]
      transport_test.go:145: when listening on /ip4/10.1.1.88/tcp/0, should prefer /ip4/0.0.0.0/tcp/0 over /ip4/127.0.0.1/tcp/0
@marten-seemann marten-seemann added the kind/bug A bug in existing code (including security flaws) label Jun 23, 2022
@schomatis
Copy link

Assigning myself.

@schomatis
Copy link

Working on this.

@schomatis
Copy link

We're hitting the global/fallback dialer

con, err = fallbackDialer.DialContext(ctx, network, raddr)

which doesn't have the reuse-port settings like the local variable

d := net.Dialer{
LocalAddr: laddr,
Control: reuseport.Control,
}

I'm thinking we can either:

  1. Retry using another local/ephemeral Dialer.
  2. Keep using the fallbackDialer but with the Control set (and maybe unset later if needed; not sure how's the concurrency here).

@Stebalien Could you expand on the rationale of the fallbackDialer added in 0b6d56b, please? That would give me more context to know which way to go.

@Stebalien
Copy link
Member

We hit the fallback dialer if dialing with reuseport fails for some reason. We do this because reuseport doesn't always work:

  • It has some issues on some operating systems.
  • If there was a previous connection that's now in the TIME_WAIT state, re-creating the same connection with the same source port will fail (so we'll use a random source port with the fallback dialer).

The first thing to investigate would be why the first dial is failing.

@Stebalien
Copy link
Member

The first thing to investigate would be why the first dial is failing.

Maybe the other side is dialing us automatically for some reason? Or it could be some form of spurious failure.

@schomatis
Copy link

func testPrefer(t *testing.T, listen, prefer, avoid ma.Multiaddr) {
var trA Transport
var trB Transport
listenerA, err := trA.Listen(listen)
if err != nil {
t.Fatal(err)
}
defer listenerA.Close()
listenerB1, err := trB.Listen(avoid)
if err != nil {
t.Fatal(err)
}
defer listenerB1.Close()
dialOne(t, &trB, listenerA, listenerB1.Addr().(*net.TCPAddr).Port)

What I'm seeing locally on my 5.4.0-99-generic (which may not be the same issue as on the CI server) is the random port assigned to listener B with (in the case of this test) the interface address can be a port already used by the system, say, in my host, the containerd (root) process port: 192.168.26.140:42315. Then when dialing from that transport as the source, trying to bind to the same port, but now with the all-zeros address (0.0.0.0:42315), it fails with EADDRINUSE.

From my limited understanding of address/port reuse rules this isn't what I would have expected, so I might be misinterpreting some part of the failed test. Will double check to confirm the above.

@Stebalien Does any of this make sense to you?

@Stebalien
Copy link
Member

There shouldn't be a conflict between the host and the container. However, there may be a conflict between 127.0.0.1 and the external address?

Are you sure the container isn't just auto-forwarding ports?

@schomatis
Copy link

@Stebalien Sorry, the containerd was just a random example of an open port I'm hitting in my local setting, but this can happen with any open port (in a certain range I imagine). Forgetting containers altogether, this is what I'm seeing (for a random port in the normal range, say, 41071):

Everything works if the port is not bound by the root process.

Bind to wildcard 0.0.0.0:41071 (sudo nc -l 41071) and the listen just fails:

listen tcp4 192.168.26.140:41071: bind: address already in use

Bind to localhost 127.0.0.1:41071 (sudo nc -l 127.0.0.1 41071) and the listen succeeds but then the dial from that port (which is done from the 0.0.0.0 address) will fail to bind:

tcp4 0.0.0.0:41071->127.0.0.1:46327: bind: address already in use

This was tested forcing the listener on the 41071 port, but if I leave the random 0, with enough tries, I can see that it gets assigned 41071. (The listener in this test uses the interface address, here 192.168.26.140.)

Stebalien added a commit that referenced this issue Aug 12, 2022
This address may already be in-use (on that other address) somewhere
else.

Thanks to @schomatis for figuring this out.

fixes #1611
@Stebalien
Copy link
Member

Yeah, that makes sense. So this looks like an actual bug. Basically, we shouldn't do:

// One. Always dial from the single port we're listening on.
laddr := &net.TCPAddr{
IP: unspec,
Port: port,
}
return (*singleDialer)(laddr)

or:

return reuseDial(ctx, randAddr(d.unspecified), network, addr)

Instead, if we "don't know", we should just not set a local address.

I've filed a patch here because, well, I wrote this code and it was... overcomplicated.

@Stebalien
Copy link
Member

#1673

Stebalien added a commit that referenced this issue Aug 16, 2022
This address may already be in-use (on that other address) somewhere
else.

Thanks to @schomatis for figuring this out.

fixes #1611
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws)
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants