
Replace listen distributor task with multithreaded SO_REUSEPORT task. #410

Closed
XAMPPRocky opened this issue Oct 4, 2021 · 13 comments · Fixed by #543
Assignees
Labels
area/networking (Related to networking I/O) · area/performance (Anything to do with Quilkin being slow, or making it go faster) · kind/cleanup (Refactoring code, fixing up documentation, etc) · kind/feature (New feature or request)

Comments


XAMPPRocky commented Oct 4, 2021

Currently all of our traffic goes through a distributor task, which distributes all messages in the UDP buffer amongst the workers. Under heavy workloads this is likely to be one of the main bottlenecks in the program. While reading this blog post I learned about SO_REUSEPORT, a socket option designed specifically to address this bottleneck in network applications.

Using SO_REUSEPORT and SO_REUSEADDR we can eliminate the distributor task entirely and make each worker solely responsible for its own socket. This has the potential for serious performance improvements, as seen in the blog post, where the reused-port server continues to scale linearly past 300,000 as the number of clients increases, while the listen-distributor server struggles to reach that.

@XAMPPRocky XAMPPRocky added kind/feature New feature or request area/performance Anything to do with Quilkin being slow, or making it go faster. kind/cleanup Refactoring code, fixing up documentation, etc labels Oct 4, 2021

iffyio commented Oct 4, 2021

This sounds worth exploring indeed! I guess the idea would be to give each worker its own socket with this enabled?


XAMPPRocky commented Oct 4, 2021

Yeah, the exact code they used is here; I've included the socket setup inline for convenience. One thing to figure out would be whether there's an equivalent option for Windows. macOS does have it, though its behaviour is slightly different. Linux also behaves specially in some respects, but that seems mostly related to its TCP implementation, which isn't relevant for us. I've included some good sources on it.

    use std::net::SocketAddr;
    use tokio::net::TcpListener;

    // Create a raw socket in the address family matching `addr`.
    // (This is the blog post's TCP example; the equivalent for us
    // would use Type::DGRAM and a UdpSocket.)
    let sock = socket2::Socket::new(
        match addr {
            SocketAddr::V4(_) => socket2::Domain::IPV4,
            SocketAddr::V6(_) => socket2::Domain::IPV6,
        },
        socket2::Type::STREAM,
        None,
    )
    .unwrap();

    // Allow multiple sockets to bind the same address and port.
    sock.set_reuse_address(true).unwrap();
    sock.set_reuse_port(true).unwrap();
    // Tokio requires the socket to be non-blocking before the handoff.
    sock.set_nonblocking(true).unwrap();
    sock.bind(&addr.into()).unwrap();
    sock.listen(8192).unwrap();

    let incoming =
        tokio_stream::wrappers::TcpListenerStream::new(TcpListener::from_std(sock.into()).unwrap());


markmandel commented Oct 4, 2021

Very cool! Excited to see the results!

One thing to figure out would be if there's an equivalent option for Windows. macOS does have it, though its behaviour is slightly different.

My thought would be: given that most of our high-load workloads will (I expect) happen on Linux, as long as the system works for a single connection (client side) in a reasonably performant way on Windows and Mac, I expect that will be fine.

@XAMPPRocky

as long as the system works for a single connection (client side) in a reasonably performant way, on Win and Mac I expect that will be fine.

Yeah, my concern with Windows is more that if there isn't a good equivalent, we'd have to maintain a workaround just for Windows, which could be awkward. If it's not quite as performant, that's less of an issue.

@markmandel

Was just reading about this some more, noted that for Tokio's UDP Socket there exists:
https://docs.rs/tokio/1.14.0/tokio/net/struct.UdpSocket.html#method.from_std

Creates new UdpSocket from a previously bound std::net::UdpSocket.

This function is intended to be used to wrap a UDP socket from the standard library in the Tokio equivalent. The conversion assumes nothing about the underlying socket; it is left up to the user to set it in non-blocking mode.

This can be used in conjunction with socket2’s Socket interface to configure a socket before it’s handed off, such as setting options like reuse_address or binding to multiple addresses.

Mostly just writing this here in case I come back around looking for it again.

@markmandel

Yeah, my concern with Windows is more that if there isn't a good equivalent, we'd have to maintain a workaround just for Windows, which could be awkward. If it's not quite as performant, that's less of an issue.

I think socket2 handles this for us, to a degree:

https://docs.rs/socket2/latest/socket2/struct.Socket.html

This type simply wraps an instance of a file descriptor (c_int) on Unix and an instance of SOCKET on Windows. This is the main type exported by this crate and is intended to mirror the raw semantics of sockets on platforms as closely as possible.

And it looks like Windows supports SO_REUSEADDR:
https://docs.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse

But we might require some different settings for each OS, which we should be able to conditionally check and respond to:
https://stackoverflow.com/questions/13637121/so-reuseport-is-not-defined-on-windows-7

But this definitely looks very doable, even with the current architecture.

@XAMPPRocky

Yeah, when I built a small proof of concept just using the options, essentially a single worker was handling everything while the other workers sat idle. If the main worker failed, one of the other sockets would start receiving the traffic, so it's not the worst behaviour.

@markmandel

That's unfortunate 😞 From my reading, I had thought that SO_REUSEADDR on Windows would work the same as SO_REUSEPORT, but I was never quite sure.

I also found https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ?rq=1 quite interesting for differences across platforms.

I ended up going down the rabbit hole, it's super interesting stuff.

@markmandel

So I want to take a stab at this, primarily because in my tests I'm seeing read being slower than write, and that upsets me 😄 (also because this is interesting).

For example: you can see the difference at the 99th percentile between read and write when running a demo with Xonotic (screencap from our example).

[screenshot: read/write latency percentiles from the Xonotic demo]

The first thought I had was to look at our existing benchmarks and see if we could capture not just overall throughput, but also split the data out by read and write. Then we can do some comparisons and/or narrow down each area individually.

Then I can step into attempting to fit this into our current architecture (which I actually don't think will be too hard, but famous last words 😄).

Sound good?

@XAMPPRocky

SGTM

markmandel added a commit to markmandel/quilkin that referenced this issue Feb 1, 2022
Wanted to be able to highlight if we had bottlenecks in performance on
read vs write operations on the proxy.

This adds an extra benchmark to throughput.rs called "readwrite" that follows a similar pattern to the overall throughput benchmark, with both direct and proxied traffic utilised as extra comparison values.

Work on googleforgames#410
XAMPPRocky pushed a commit that referenced this issue Feb 3, 2022
@markmandel

Started work on the implementation for local packet reception. Will provide some benchmarks when I've got something working.

@XAMPPRocky XAMPPRocky assigned markmandel and unassigned XAMPPRocky Apr 14, 2022
@markmandel

Making progress! I seem to have the basics working, but I'm now running into some kind of race condition in the unit tests, around packet reception and sending, that wasn't happening before. Looking into it.

https://github.com/markmandel/quilkin/tree/wip/reuse-port if anyone wants to take a peek.

@markmandel

Got it working nicely on my end; I'll start pulling out PRs and submitting. The code is way cleaner, and we can remove a bunch of channel and worker code along the way (oh, and I need to do the Windows build!).

With the single-client benchmarks we see a few µs shaved off, but I would expect better results with multiple clients.

e.g. (throughput benchmark)

Current:

[benchmark screenshot]

With SO_REUSEPORT we see:

[benchmark screenshot]

On readwrite, similarly:

Before:

[benchmark screenshot]

With SO_REUSEPORT:

[benchmark screenshot]

markmandel added a commit to markmandel/quilkin that referenced this issue Jun 10, 2022
Implemented the use of SO_REUSEPORT for *nix systems and SO_REUSEADDR
for Windows systems.

This removes a lot of the code needed for channel coordination that was
previously in place, and simplifies much of the architecture, as well as
improving performance.

Closes googleforgames#410
XAMPPRocky added a commit that referenced this issue Jun 21, 2022
Co-authored-by: XAMPPRocky <4464295+XAMPPRocky@users.noreply.github.com>