
Replace listen distributor task with multithreaded SO_REUSEPORT task. #410

Closed
XAMPPRocky opened this issue Oct 4, 2021 · 13 comments · Fixed by #543
Assignees
Labels
area/networking (Related to networking I/O) · area/performance (Anything to do with Quilkin being slow, or making it go faster) · kind/cleanup (Refactoring code, fixing up documentation, etc) · kind/feature (New feature or request)

Comments


XAMPPRocky commented Oct 4, 2021

Currently all of our traffic goes through a distributor task, which distributes all messages in the UDP buffer amongst the workers. Under heavy workloads this is likely to be one of the main bottlenecks in the program. While reading this blog post I learned about SO_REUSEPORT, a socket option designed specifically to address this bottleneck in network applications.

Using SO_REUSEPORT and SO_REUSEADDR we can eliminate the distributor task entirely and make each worker solely responsible for its own socket. This has the potential for serious performance improvements, as seen in the blog post, where the reused-port server continues to scale linearly past 300,000 as the number of clients increases, while the listen-distributor server struggles to reach that.

@XAMPPRocky XAMPPRocky added kind/feature New feature or request area/performance Anything to do with Quilkin being slow, or making it go faster. kind/cleanup Refactoring code, fixing up documentation, etc labels Oct 4, 2021

iffyio commented Oct 4, 2021

This sounds worth exploring indeed! I guess the idea would be to give each worker its own socket with this enabled?


XAMPPRocky commented Oct 4, 2021

Yeah, the exact code they used is here; I've included the socket setup inline for convenience. One thing to figure out would be whether there's an equivalent option for Windows. macOS does have it, though its behaviour is slightly different. Linux also behaves specially in some respects, but that seems mostly related to its TCP implementation, which isn't relevant for us. I've included some good sources on it.

    use std::net::SocketAddr;
    use tokio::net::TcpListener;

    // Create a raw socket in the address family matching `addr`.
    // (This is the blog post's TCP example; the equivalent for us
    // would use Type::DGRAM and a UdpSocket.)
    let sock = socket2::Socket::new(
        match addr {
            SocketAddr::V4(_) => socket2::Domain::IPV4,
            SocketAddr::V6(_) => socket2::Domain::IPV6,
        },
        socket2::Type::STREAM,
        None,
    )
    .unwrap();

    // Allow multiple sockets to bind the same address and port.
    sock.set_reuse_address(true).unwrap();
    sock.set_reuse_port(true).unwrap();
    // Tokio requires the socket to be non-blocking before the handoff.
    sock.set_nonblocking(true).unwrap();
    sock.bind(&addr.into()).unwrap();
    sock.listen(8192).unwrap();

    let incoming =
        tokio_stream::wrappers::TcpListenerStream::new(TcpListener::from_std(sock.into()).unwrap());


markmandel commented Oct 4, 2021

Very cool! Excited to see the results!

One thing to figure out would be if there's an equivalent option for Windows. macOS does have it, though its behaviour is slightly different.

My thought would be: given that most of our high-load workloads will (I expect) happen on Linux, as long as the system works for a single connection (client side) in a reasonably performant way on Windows and Mac, I expect that will be fine.

@XAMPPRocky

as long as the system works for a single connection (client side) in a reasonably performant way, on Win and Mac I expect that will be fine.

Yeah, my concern with Windows is more that if there isn't a good equivalent, we'd have to maintain a workaround just for Windows, which could be awkward. If it's not quite as performant, that's less of an issue.

@markmandel

Was just reading about this some more, noted that for Tokio's UDP Socket there exists:
https://docs.rs/tokio/1.14.0/tokio/net/struct.UdpSocket.html#method.from_std

Creates new UdpSocket from a previously bound std::net::UdpSocket.

This function is intended to be used to wrap a UDP socket from the standard library in the Tokio equivalent. The conversion assumes nothing about the underlying socket; it is left up to the user to set it in non-blocking mode.

This can be used in conjunction with socket2’s Socket interface to configure a socket before it’s handed off, such as setting options like reuse_address or binding to multiple addresses.

Mostly just writing this here in case I come back around looking for it again.

@markmandel

Yeah, my concern with Windows is more that if there isn't a good equivalent, we'd have to maintain a workaround just for Windows, which could be awkward. If it's not quite as performant, that's less of an issue.

I think socket2 handles this for us, to a degree:

https://docs.rs/socket2/latest/socket2/struct.Socket.html

This type simply wraps an instance of a file descriptor (c_int) on Unix and an instance of SOCKET on Windows. This is the main type exported by this crate and is intended to mirror the raw semantics of sockets on platforms as closely as possible.

And it looks like Windows supports SO_REUSEADDR:
https://docs.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse

But we might require some different settings for each OS, which we should be able to conditionally check and respond to:
https://stackoverflow.com/questions/13637121/so-reuseport-is-not-defined-on-windows-7

But this definitely looks very doable, even with the current architecture.

@XAMPPRocky

Yeah, when I built a small proof of concept just using the options, essentially a single worker was handling everything while the other workers sat idle. If the main worker failed, one of the other sockets would start receiving the traffic, so it's not the worst behaviour.

@markmandel

That's unfortunate 😞 From my reading, I had thought that SO_REUSEADDR on Windows would work the same as SO_REUSEPORT, but I was never quite sure.

I also found https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ?rq=1 quite interesting for differences across platforms.

I ended up going down the rabbit hole, it's super interesting stuff.

@markmandel

So I want to take a stab at this, primarily because in my tests I'm seeing read being slower than write, and that upsets me 😄 (also because this is interesting).

For example: you can see the difference at the 99th percentile between read and write when running a demo with Xonotic (screencap from our example).

[screenshot: read/write latency percentiles from the Xonotic demo]

The first thought I had was to look at our existing benchmarks and see if we could capture not just overall throughput, but also split the data out by read and write. Then we can do some comparisons and/or narrow down each area individually.

Then I can step into attempting to fit this into our current architecture (which I actually don't think will be too hard, but famous last words 😄).

Sound good?

@XAMPPRocky

SGTM

markmandel added a commit to markmandel/quilkin that referenced this issue Feb 1, 2022
Wanted to be able to highlight if we had bottlenecks in performance on
read vs write operations on the proxy.

This adds an extra benchmark to throughput.rs called "readwrite" that follows a similar pattern to the overall throughput benchmark, with both direct and proxied traffic utilised as extra comparison values.

Work on googleforgames#410
XAMPPRocky pushed a commit that referenced this issue Feb 3, 2022
@markmandel

Started work on the implementation for local packet reception. Will provide some benchmarks when I've got something working.

@XAMPPRocky XAMPPRocky assigned markmandel and unassigned XAMPPRocky Apr 14, 2022
@markmandel

Making progress! I seem to have the basics working, but I'm now running into some kind of race condition in the unit tests, around packet reception and sending, that wasn't happening before. Looking into it.

https://github.com/markmandel/quilkin/tree/wip/reuse-port if anyone wants to take a peek.

@markmandel

Got it working nicely on my end; I'll start pulling out PRs and submitting. The code is way cleaner, and we can remove a bunch of channel and worker code along the way (oh, and I need to do the Windows build!).

With the single-client benchmarks we see a few µs shaved off, but I would expect better results with multiple clients.

e.g. (throughput benchmark)

Current:

[benchmark screenshot]

With SO_REUSEPORT we see:

[benchmark screenshot]

On readwrite, similarly:

Before:

[benchmark screenshot]

With SO_REUSEPORT:

[benchmark screenshot]

markmandel added a commit to markmandel/quilkin that referenced this issue Jun 10, 2022
Implemented the use of SO_REUSEPORT for *nix systems and SO_REUSEADDR
for Windows systems.

This removes a lot of the code needed for channel coordination that was
previously in place, and simplifies much of the architecture, as well as
improving performance.

Closes googleforgames#410
XAMPPRocky added a commit that referenced this issue Jun 21, 2022
Co-authored-by: XAMPPRocky <4464295+XAMPPRocky@users.noreply.github.com>