Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul UDP server #141

Merged
merged 3 commits into from
May 29, 2024
Merged

Overhaul UDP server #141

merged 3 commits into from
May 29, 2024

Conversation

jtackaberry
Copy link
Contributor

@jtackaberry jtackaberry commented Aug 14, 2023

Following up from #140, the UDP server implementation has two fatal flaws:

  1. UDP buffers were too small, resulting in truncated data being forwarded to upstreams for datagrams exceeding 1024 bytes (the buffer size)
  2. A goroutine was created per-packet that called the proxy handler but never terminated

1 is an easy enough fix. 2 is addressed by this PR in the following way:

  • UDP "connections" are tracked in a map called udpConns (keyed on downstream client addr:port) for the given UDP server port
  • When a new packet is received by a UDP server, the udpConns map is checked for that remote addr:port.
    • If it doesn't exist:
      • a new packetConn is created for that downstream and a goroutine is created to call the long-running proxy handler
      • packetConn has been updated to include additional fields to allow for communication with the UDP server
      • Notably, there's a channel for packetConn to read new packets from the server for this particular downstream, and there's a channel for it to communicate back to the UDP server that the connection has been closed.
      • (This latter case allows the UDP server loop to both add and remove elements from the udpConns map, which avoids the need to acquire a mutex for each packet.)
    • If it does exist:
      • the packet is send to the packetConn's packet channel
  • packetConn now has the concept of an idle timeout. If no new packets are received from the downstream within a period of time (currently hardcoded) then the connection is considered closed. Any pending Read() call is immediately returned with io.EOF, which in turn causes the upstream connections to be closed, which allows the handler (and the per-connection goroutine which called it) to terminate.

I've tested this with HTTP/3 (with QUIC-enabled nginx and curl). Problem 1 stopped the current master branch dead in its tracks, since it couldn't get past the initial handshake (due to the first packet being 1200 bytes, which got truncated by the 1024 byte buffer).

Even after fixing that in master, the QUIC handshake didn't complete because each new packet from the downstream would be forwarded to the upstream from a different source port, so the QUIC server side saw them as unrelated packets.

With this PR, curl is able to fetch via caddy to the nginx upstream, both short requests as well as bulk downloads. Performance isn't brilliant, it must be said: with all 3 things (curl, caddy, nginx) running on the same box, caddy runs around 200-220% CPU in order to bottleneck nginx, which has a single thread bottleneck so runs at 98%+. The nginx bottleneck caps effective throughput to about 70-80MB/s. Disappointing results from QUIC there, but not Caddy's fault. :)

But it works, which is an already an improvement.

Pending problems:

  1. Channel buffer lengths are hardcoded and need some further consideration
  2. UDP connection idle timeout is currently hardcoded, but should probably be configurable per server
  3. Existing UDP connections are borked by config reloads

The last problem is the biggest obstacle right now, and I'm not really sure how to fix it. Here's what happens:

  • Handler.proxy() is diligently doing io.Copy() for upstream->downstream and downstream->upstream (in separate goroutines)
  • When config is reloaded, thanks to UDP upstreams changes are not detected on config reload #132, the server UDP socket is closed, which allows the UDP server loop to terminate
  • But because the original UDP server socket is closed, the io.Copy() invocations both immediately abort with "use of closed network connection"
  • Consequently the QUIC transfer abruptly stalls out and the client needs to reconnect

This isn't a problem for TCP, because we get a separate socket representing the TCP connection that's independent from the listen socket. This isn't the case for UDP, where incoming data from all downstreams is always received over the same socket, so any in-flight handlers proxying data between upstream/downstream depend on this socket continuing to exist.

This is a consequence of how config reloading is done in general within Caddy, so I think I'll need to depend on your suggestions at this point. It's the main thing keeping this PR in draft status -- although even with that problem, the code in the PR still significantly improves UDP proxying behavior with caddy-l4.

@jtackaberry jtackaberry changed the title Overall UDP server Overhaul UDP server Aug 14, 2023
layer4/server.go Outdated Show resolved Hide resolved
@francislavoie
Copy link
Collaborator

The nginx bottleneck caps effective throughput to about 70-80MB/s. Disappointing results from QUIC there, but not Caddy's fault. :)

You could try running a second Caddy instance instead of nginx for QUIC/H3, I'm curious if it would be less bottlenecked 🤔

@jtackaberry
Copy link
Contributor Author

jtackaberry commented Aug 14, 2023

You could try running a second Caddy instance instead of nginx for QUIC/H3, I'm curious if it would be less bottlenecked

About 125MB/s bypassing caddy-l4, directly between curl and Caddy. There Caddy burns 230% CPU, but that's over loopback so in practice it just moves the bottleneck to curl which itself is single threaded.

For comparison, passing --http2 to curl instead of --http3 reaches about 850MB/s, and now Caddy is bottlenecked, with 1 thread capping out at 100%. There, curl runs at about 75% CPU.

Back to HTTP/3, when doing curl -> caddy-l4 -> caddy, caddy-l4 is driven to about 250% CPU. Wondering how much performance is still on the table, I wrote the simplest possible UDP proxy in Go with no unnecessary allocations, and that runs about 130%. Then I rewrote it in C (using level-triggered epoll) and that runs around 90%.

I think the unfortunate reality is that UDP proxying is just going to be inherently slow until the industry catches up to HTTP/3. For bulk data transfers, I imagine UDP hasn't seen nearly as much optimization effort as TCP has. For example, sendfile(2) works with TCP but not with UDP, which means, unlike with TCP, there's no avoiding kernel<->userspace copying for HTTP/3 proxies using standard syscalls. (AFAICT.)

As HTTP/3 gains increasing adoption I'm sure this will get more attention, but how long it will take is anyone's guess.

@mholt
Copy link
Owner

mholt commented Aug 15, 2023

This is fascinating progress!! Sorry I haven't replied yet -- been a very busy weekend with family, and today I caught up on a bunch of work things that also came in over the weekend (which is unusual). So I'm hoping to look at this more this week, but I want you to know this sounds really great and I'm excited to look at it!

@mholt
Copy link
Owner

mholt commented Aug 17, 2023

I should mention that Caddy has two open PRs which I believe should address the "biggest obstacle" (problem 3):

Will be circling back to this in a bit. Releasing Caddy v2.7.4 now...

@WeidiDeng
Copy link
Contributor

@mholt Problem 3 will be fixed by my pr. As it allows udp listeners to expose the underlying socket at the cost of managing the life cycle of udp socket by oneself. (Trivia if using listenerPool with a custom network key and constructor like the new ListenQUIC() does.)

Applying SO_REUSEPORT will correctly terminate udp socket io loops as unlike tcp, udp communications happen over a single socket (We close tcp listeners only, not connections accpeted by it).

@mholt
Copy link
Owner

mholt commented Aug 18, 2023

@WeidiDeng Gotcha. So, to clarify, both our PRs will be correct (once mine is fixed a little bit with regards to unixgram) and both should be merged (when they're ready)?

@mholt
Copy link
Owner

mholt commented Oct 17, 2023

@jtackaberry I've just pushed a commit to Caddy upstream that uses SO_REUSEPORT for UDP sockets (with @WeidiDeng's help) -- do you think that'll fix the last problem you mentioned?

@mholt
Copy link
Owner

mholt commented Dec 13, 2023

@jtackaberry With that last problem addressed, is there anything else left before I merge this?

@jtackaberry
Copy link
Contributor Author

@jtackaberry With that last problem addressed, is there anything else left before I merge this?

Sorry for leaving this hanging. I have a bit of free time coming up after the week is over so I'm aiming to test things out again next week (or possibly the weekend).

@mholt
Copy link
Owner

mholt commented Dec 13, 2023

No worries, I also left for quite a gap of time (we had a baby).

Anyway, yeah, whenever you have a chance, I'll be happy to merge it once you're happpy with things. 💯

@mholt
Copy link
Owner

mholt commented Feb 28, 2024

@jtackaberry Hey, I keep losing track of things, lol. Any interest in working together to finish this up to your satisfaction?

@jtackaberry
Copy link
Contributor Author

jtackaberry commented Feb 28, 2024

@jtackaberry Hey, I keep losing track of things, lol. Any interest in working together to finish this up to your satisfaction?

@mholt You and me both! Still interested in moving this along, just got caught up in a couple other projects in the meantime. I do tend to chase a lot of squirrels. :)

I will at least take a couple hours within the next week to kick the tires and get a refreshed look of the lay of the land in light of Caddy's (not-so-)recent changes.

@mholt
Copy link
Owner

mholt commented Feb 28, 2024

Ha, sounds good. Looking forward to it 😄

@jtackaberry
Copy link
Contributor Author

Moving up on the to-do list but free time has been an unexpected scarcity. Sorry for being such a flake. Other obligations have ended up decimating my weekends through March. Meanwhile I just wanted to report in that I'm still alive and invested. :)

@francoismichel
Copy link

Hi !

The PR looks great and would enable to write nice QUIC plugins ! Thanks for that !
What are the next steps before we can think of merging it ?
It is totally fine if the it requires more thought and time. Though, it might still be useful to look at least integrate the UDP buffer changes to be able stat developing UDP/QUIC plugins. The current 1024 UDP buffer value is the hard blocker as no QUIC packet can pass through the proxy.

What do you think ?

Cheers and thanks a lot for the work !

@mholt
Copy link
Owner

mholt commented Apr 11, 2024

This would be cool indeed, as @francoismichel 's work would allow this UDP proxy to be used as a HTTP/3 proxy that multiplexes at the SNI layer (no decryption needed).

@mholt
Copy link
Owner

mholt commented May 7, 2024

Given where this PR is at, I think it's looking pretty good (without having tested it as thoroughly as @jtackaberry) -- and I might be inclined to just go ahead and merge it soon, with the possibility for follow-up patches later if needed.

@lxhao61 lxhao61 mentioned this pull request May 26, 2024
@lxhao61
Copy link

lxhao61 commented May 26, 2024

@mholt This PR is very useful and I hope it will be merged soon.

Copy link
Owner

@mholt mholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtackaberry I'll merge what you've got so far, I think it's probably best to get it out there at this point and have people try it and iterate on it. And you can always revisit it when you have time! 👍

Thanks for the great contribution!

@mholt mholt merged commit 6a8be7c into mholt:master May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants