Windows style WSL SO_REUSEADDR behaviour makes RPC over TCP unreliable #3909

Closed
f948lan opened this issue Mar 10, 2019 · 3 comments

f948lan commented Mar 10, 2019

Your Windows build number:

  • Version 10.0.18346.1 (but all versions will be affected)

What you're doing and what's happening:

  • Using software that relies on Sun RPC communications over TCP, using clnttcp_create from libc.
  • If clnttcp_create is called without passing a pre-bound socket to use, it calls bindresvport to find an unused privileged (<1024) source port to use.
  • Sometimes bindresvport returns a port on which the subsequent connect fails with EADDRINUSE.
  • clnttcp_create then fails and returns an error.

What's wrong / what should be happening instead:

  • bindresvport relies on UNIX-style behaviour when calling bind on a socket without SO_REUSEADDR set: bind fails with EADDRINUSE if a connection in the TIME_WAIT state is still associated with that local port.
  • Because WSL follows the Windows IP stack behaviour, bind succeeds even when TIME_WAIT connections still exist on that port.
  • As a result, bindresvport can select a port on which connect then fails, as described above.
  • Cygwin suffered from the same issue, and its version of clnttcp_create was patched to work around it. See http://cygwin.1069669.n5.nabble.com/Re-RPC-clnt-create-adress-already-in-use-td138927.html
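
The difference described in the list above can be probed with a small C sketch. This is not code from the issue: the function names (`loopback_addr`, `probe_bind_after_close`) are hypothetical, and it uses unprivileged loopback ports rather than the reserved ports bindresvport picks. It opens a connection, closes the client side first so that the client's local port is left on a connection that is winding down (TIME_WAIT once the peer's FIN arrives), and then tries to bind a fresh socket to that same port without SO_REUSEADDR.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Build a 127.0.0.1 address for the given port (host byte order). */
static struct sockaddr_in loopback_addr(unsigned short port)
{
    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);
    return a;
}

/*
 * Hypothetical probe. Returns:
 *   0      if the final bind() succeeded (the behaviour reported for WSL),
 *   errno  of the final bind() if it failed (EADDRINUSE is the UNIX-style result),
 *   -1     if the setup itself failed.
 * SO_REUSEADDR is never set on any socket here.
 */
int probe_bind_after_close(void)
{
    struct sockaddr_in srv = loopback_addr(0);
    struct sockaddr_in cli = loopback_addr(0);
    socklen_t len;
    int lfd, cfd, afd = -1, pfd = -1, result = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || cfd < 0)
        goto out;

    /* Listener on a kernel-chosen loopback port. */
    len = sizeof srv;
    if (bind(lfd, (struct sockaddr *)&srv, sizeof srv) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&srv, &len) < 0)
        goto out;

    /* Client binds explicitly so we learn which local port it used. */
    len = sizeof cli;
    if (bind(cfd, (struct sockaddr *)&cli, sizeof cli) < 0 ||
        getsockname(cfd, (struct sockaddr *)&cli, &len) < 0 ||
        connect(cfd, (struct sockaddr *)&srv, sizeof srv) < 0)
        goto out;
    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto out;

    /* Client closes first, so its local port is the one left lingering. */
    close(cfd);
    cfd = -1;
    close(afd);
    afd = -1;

    /* Fresh socket, same local address and port, no SO_REUSEADDR. */
    pfd = socket(AF_INET, SOCK_STREAM, 0);
    if (pfd < 0)
        goto out;
    result = (bind(pfd, (struct sockaddr *)&cli, sizeof cli) == 0) ? 0 : errno;

out:
    if (lfd >= 0) close(lfd);
    if (cfd >= 0) close(cfd);
    if (afd >= 0) close(afd);
    if (pfd >= 0) close(pfd);
    return result;
}
```

Calling `probe_bind_after_close()` from a small `main` and printing `strerror` of a non-zero result should be enough to compare a native Linux machine with WSL. Going by the description above, the expectation is EADDRINUSE on Linux and 0 on WSL, but I have not run this on WSL myself.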

I realise there are already open issues regarding WSL behaviour with SO_REUSEADDR set (e.g. #2915), but I don't see any covering the difference without SO_REUSEADDR, hence this one. Hope that's the right thing to do.

(For now I'm working around this by LD_PRELOADing a wrapper for clnttcp_create that retries bindresvport to get another port if connect fails with this error. This was somewhat simpler than the Cygwin approach and 'works for me'.)
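
For anyone wanting to try something similar, here is a minimal sketch of the retry idea only. It is not the actual wrapper: the names `port_picker`, `pick_any_loopback_port` and `connect_with_port_retry` are made up for illustration, and the LD_PRELOAD and RPC plumbing is left out. The picker has the same shape as bindresvport, so bindresvport itself can be passed in when running with enough privilege to bind reserved ports.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Same shape as bindresvport(int, struct sockaddr_in *). */
typedef int (*port_picker)(int fd, struct sockaddr_in *sin);

/* Stand-in picker for trying the sketch without root: any loopback port. */
int pick_any_loopback_port(int fd, struct sockaddr_in *sin)
{
    struct sockaddr_in local;
    (void)sin; /* unused; kept so the signature matches bindresvport */
    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    local.sin_port = 0; /* let the kernel choose */
    return bind(fd, (struct sockaddr *)&local, sizeof local);
}

/*
 * Hypothetical helper: bind a fresh socket with `pick`, then connect to
 * `remote`. If either step fails with EADDRINUSE, throw the socket away and
 * try again with a new one, up to `max_attempts` times.
 * Returns a connected descriptor, or -1 with errno set.
 */
int connect_with_port_retry(const struct sockaddr_in *remote,
                            port_picker pick, int max_attempts)
{
    int saved = EINVAL; /* reported if max_attempts <= 0 */

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        if (pick(fd, NULL) == 0 &&
            connect(fd, (const struct sockaddr *)remote, sizeof *remote) == 0)
            return fd; /* bound and connected */

        saved = errno;
        close(fd);
        if (saved != EADDRINUSE)
            break; /* only the port clash is worth retrying */
    }
    errno = saved;
    return -1;
}
```

With root, `connect_with_port_retry(&server, bindresvport, 5)` would be the reserved-port variant. As far as I understand the glibc implementation, clnttcp_create skips its own bindresvport and connect when it is handed an already connected descriptor through its sockp argument, so the result could be passed in that way, but treat that as an assumption to check against your libc.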

@therealkenc
Collaborator

> Hope that's the right thing to do.

Meh, can't exactly fault you for it. I strongly suspect (but can't prove) this will go chirping crickets without a repro test case with CLI steps, an strace(1) log, and the like per the template.

But consider: (a) the underlying SO_REUSEADDR divergence appears to be well understood, (b) #2915 exists, and, (c) #2915 went dark a year ago with the remarks: "We have talked to the Windows networking team that supports these options previously to see if we can reconcile the differences, but, given the legacy of these options, that has had been a difficult conversation"

Well, you be the judge. Either way, great Cygwin reference and hat-tip on your workaround.

@f948lan
Author

f948lan commented Mar 13, 2019

Yes, I can't imagine a simple solution here (changing Windows behaviour and potentially breaking native applications is clearly not an option, and I imagine wrapping the socket implementation for WSL is likely a big effort that would also carry performance implications).

But at least it's captured and might save others some head scratching. I suspect as WSL applications grow, some differences like this may have to be addressed by multi-platform code on the Linux side, but clearly I'm no authority on that :)

@therealkenc
Collaborator

> Yes, I can't imagine a simple solution here

The standard approach for such problems: a compatibility flag on the socket, implemented in the Windows TCP/IP driver. WSL would set the flag internally using WSK, and Cygwin would set it externally inside its socket() wrapper. Such a flag would have no impact on existing Winsock programs dating back to the beginning of time.

> But at least it's captured and might save others some head scratching.

Yes, and thanks again. I am just not sure there are two actionable items here: this and #2915. Or put another way, I can't see #2915 being fixed without your framing of the problem being fixed along with it.
