
std.Thread.Pool: process tree cooperation via a new jobserver protocol #20372

Closed · 6 commits

Conversation

@andrewrk (Member) commented Jun 21, 2024

I'm tentatively calling this new standard "Jobserver V2" and it uses the environment variable JOBSERVERV2. The idea is to change the environment variable name with any ABI-breaking changes that need to be made after the protocol starts being used.

In summary, the root process in a process tree discovers that it is the root because the JOBSERVERV2 environment variable is absent. In this case it spawns a thread with the same lifetime as its std.Thread.Pool, which acts as a daemon, accepting up to N simultaneous connections where N is the thread pool size. It also writes 1 byte to each connection, so that the blocking read() on the other side of the socket returns.

When a thread pool finds the JOBSERVERV2 environment variable, instead of acting as the host it uses that path to obtain thread tokens before doing work: it connects to the daemon and reads 1 byte. Then it proceeds to do as much work as possible before closing the connection and going back to sleep. Crucially, when a process dies, the operating system closes all the connections it had open, so its tokens are released automatically.
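As a rough sketch of the client side, assuming JOBSERVERV2 holds a UNIX domain socket path (the function names here are illustrative, not this PR's actual API):

    const std = @import("std");

    // Connect to the host and block until it writes the 1-byte token.
    // Holding the open connection is what holds the token; closing the
    // stream (or the process dying) releases it.
    fn acquireToken(socket_path: []const u8) !std.net.Stream {
        const stream = try std.net.connectUnixSocket(socket_path);
        errdefer stream.close();
        var token: [1]u8 = undefined;
        if (try stream.read(&token) != 1) return error.JobserverClosed;
        return stream;
    }

    pub fn main() !void {
        const path = std.posix.getenv("JOBSERVERV2") orelse {
            // No JOBSERVERV2: we are the root process and would host instead.
            return;
        };
        const conn = try acquireToken(path);
        defer conn.close(); // release the token when done working
        // ... do as much work as possible while holding the token ...
    }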

closes #20274

Related:

Testing

Simple Example

demo: https://asciinema.org/a/lMKZ0YZceqBLsww419YxFxjxT
source: https://gist.github.com/andrewrk/985de049693100d30e15ab3940bf601f

Real-world example: my music player project

It's fun to look at, but performance-wise it seems to be a wash: 66 seconds to build before, 66 seconds to build after. But since it does not spawn as many concurrent processes, it uses fewer resources during that time, so maybe that's a tiny win after all.

demo: https://asciinema.org/a/dDfUwIH78FQSegQakp3y5WXDC
source: https://codeberg.org/andrewrk/player

Behavior tests

before: 1m5s, 889 MiB peak RSS
after: 1m4s, 893 MiB peak RSS
demo: https://asciinema.org/a/MEkXcmpBkPejfc0BngYsjRLDk

In the demo you can see that it does not give more thread tokens out when it should. Perhaps a symptom of the implicit global token problem.

Justification for choosing JOBSERVERV2

Point 1: it is project-neutral, so other projects will hopefully consider adopting it.

Point 2: it implies a versioning scheme; if we have to break the ABI in the future, the 2 can become a 3.

Point 3: it is unique:

(two screenshots, presumably code-search results demonstrating that the name is not already in use)

Merge Checklist

  • Call shutdown on the read socket to stop blocked workers on deinit
    • can a sockfd which has been shut down be reused with connect?
  • Solve the implicit global token problem
    • waitAndWork is implemented poorly - after it starts waiting, it won't see any newly queued work
    • spawn N threads instead of N - 1, but have one of the workers be special
  • Performance testing
  • Fix the dependencies on posix that I added in various places
  • Implement Windows support
  • Should we use libdispatch on macOS?
  • Get feedback from third party projects such as GNU make, ninja, Meson, etc.

@andrewrk requested a review from kprotty as a code owner on June 21, 2024
@joshtriplett commented

👍 for embedding the version in the environment variable name. I think this will make it easier to work with for consumers of the protocol.

Suggestion: Please define in the protocol that the byte written must be 0, to allow for the possibility of future expansion.
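A hypothetical client-side check implementing that suggestion (the error names are made up for illustration):

    const std = @import("std");

    // Require the version-0 token byte; any other value is reserved for
    // future protocol extensions.
    fn readToken(stream: std.net.Stream) !void {
        var token: [1]u8 = undefined;
        if (try stream.read(&token) != 1) return error.JobserverClosed;
        if (token[0] != 0) return error.UnsupportedToken;
    }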

Comment on lines 360 to 362:

    if (connection) |fd| {
        std.posix.close(fd);
        connection = null;

Commits:

  • The host accepts N simultaneous connections and writes 1 byte to each. Clients connect and read 1 byte in order to obtain a thread token.
  • std.Thread.Pool now lazily spawns threads only when the work queue is non-empty. I think that was a bad idea and will revert it shortly.
  • There is now a std.zig.initThreadPool wrapper (see the sketch after this list) that deals with:
    • Resolving a zig cache directory into a UNIX domain socket address.
    • Creating the "tmp" directory in .zig-cache, but only if the listen failed due to ENOENT.
    • Deciding whether to connect to an existing jobserver or become the host for child processes.
  • std.Thread.Pool: back to spawning all threads during initialization because it's overall simpler. This scheme requires init to be passed a pointer to the struct.
  • std.process.Child: implement integration with the thread pool jobserver. The environment variable is called `JOBSERVERV2`. The API works by assigning a thread pool to the child process.
  • build runner: store the thread pool in std.Build.Graph so that it can be passed to child processes during the make phase.
  • Fix not allocating +1 pollfds in previous commit.
  • TCP cannot be used with UNIX domain sockets.
  • The main thread has an implicit thread token, which makes loitering illegal.
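A sketch of the host-or-client decision described in the initThreadPool commit above; the function and type names, and the exact error mapping for a missing "tmp" directory, are assumptions:

    const std = @import("std");

    const Role = union(enum) {
        host: std.net.Server,
        client: []const u8, // socket path of an existing jobserver
    };

    fn initJobserver(cache_dir: std.fs.Dir, sock_path: []const u8) !Role {
        if (std.posix.getenv("JOBSERVERV2")) |path| {
            // A host already exists higher in the process tree: be a client.
            return .{ .client = path };
        }
        // We are the root: listen on a socket inside the zig cache.
        const addr = try std.net.Address.initUnix(sock_path);
        const server = addr.listen(.{}) catch |err| switch (err) {
            error.FileNotFound => blk: {
                // ".zig-cache/tmp" does not exist yet; create it and retry.
                try cache_dir.makePath("tmp");
                break :blk try addr.listen(.{});
            },
            else => return err,
        };
        return .{ .host = server };
    }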
@andrewrk (Member, Author) commented Jul 1, 2024

I'm stuck.

I am observing that when one thread blocks on connect() and another thread then calls shutdown() on the same sockfd, the first thread continues to block in the connect syscall: https://gist.github.com/andrewrk/bf46cff35d34692558563f75bc8d0e62

Any ideas on how to solve this? The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().

I need to understand

  1. under what conditions a unix domain socket starts causing connect() to block, and
  2. if such a connect() syscall is blocking, how to unblock it

Commit: …and increase kernel backlog to the maximum number. Without increasing the kernel backlog to the maximum, I observed connect() block indefinitely, even when another thread calls shutdown() on that socket file descriptor.
@andrewrk (Member, Author) commented Jul 1, 2024

I was able to prevent blocking on connect() by increasing the kernel backlog to the maximum allowed integer. I'm not sure whether this is an acceptable solution yet.
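A minimal sketch of that workaround on the host side, assuming the same UNIX domain socket setup (the socket path is illustrative):

    const std = @import("std");

    pub fn main() !void {
        const addr = try std.net.Address.initUnix(".zig-cache/tmp/jobserver");
        var server = try addr.listen(.{
            // With the default (small) backlog, connect() on a UNIX domain
            // socket blocks once the backlog fills; a maximal backlog avoids
            // that, at the cost of letting the accept queue grow.
            .kernel_backlog = std.math.maxInt(u31),
        });
        defer server.deinit();
        // ... accept up to N connections and write the token byte ...
    }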

@matklad (Contributor) commented Jul 3, 2024

The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().

I think I've seen this (checks the calendar) 7 years ago when the original jobserver protocol was implemented by Cargo!

    /// # Platform-specific behavior
    ///
    /// On Windows this function behaves pretty normally as expected, but on
    /// Unix the implementation is... a little heinous. As mentioned above
    /// we're forced into blocking I/O for token acquisition, namely a blocking
    /// call to `read`. We must be able to unblock this, however, to tear down
    /// the helper thread gracefully!
    ///
    /// Essentially what happens is that we'll send a signal to the helper
    /// thread spawned and rely on `EINTR` being returned to wake up the helper
    /// thread. This involves installing a global `SIGUSR1` handler that does
    /// nothing along with sending signals to that thread. This may cause
    /// odd behavior in some applications, so it's recommended to review and
    /// test thoroughly before using this.

https://github.com/rust-lang/jobserver-rs/blob/438acebbb25e5b76c216097d30b48929e4a1e540/src/lib.rs#L529-L538

Not a good solution for sure.

But I think at that point Cargo was constrained by blocking IO (flipping the FD passed by make to cargo to non-blocking made the underlying file description non-blocking for make, which it obviously wasn't prepared for). We are not thusly constrained, so we could put the socket and some other notification mechanism in a poll?
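One way to read that suggestion, sketched under the assumption that the blocking token read (not the connect) is the operation being torn down: poll() on both the jobserver socket and a dedicated shutdown pipe, so the main thread can wake a worker by writing to or closing the pipe:

    const std = @import("std");

    fn waitForTokenOrShutdown(
        sock_fd: std.posix.fd_t,
        shutdown_fd: std.posix.fd_t,
    ) !enum { token, shutdown } {
        var fds = [2]std.posix.pollfd{
            .{ .fd = sock_fd, .events = std.posix.POLL.IN, .revents = 0 },
            .{ .fd = shutdown_fd, .events = std.posix.POLL.IN, .revents = 0 },
        };
        while (true) {
            _ = try std.posix.poll(&fds, -1);
            if (fds[1].revents != 0) return .shutdown;
            if (fds[0].revents != 0) return .token; // token byte is readable
        }
    }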

@joshtriplett commented

@matklad Unfortunately, putting a file descriptor into non-blocking mode doesn't make connect asynchronous for UNIX sockets. For other kinds of sockets, like TCP, connect will return EINPROGRESS and then you can use poll to see when the connection finishes. For a non-blocking UNIX socket, connect will fail with EAGAIN and give up, and you can't poll for completion because there's nothing in progress. I haven't found any reliable, portable way to do a non-blocking connect for a UNIX socket.

(Switching to TCP seems non-viable, as closing a socket can result in long timeouts rather than quick reliable error indications, and keepalive behavior is unreliable and more effort; we really do need something like a UNIX socket or named pipe here.)

@andrewrk (Member, Author) commented

Giving up on this for now. Might come back to it later. Turned out to be more frustrating than I anticipated.

@andrewrk closed this Jul 21, 2024