
std.Thread.Pool: process tree cooperation via a new jobserver protocol #20372

Closed · 6 commits

Conversation

@andrewrk (Member) commented Jun 21, 2024

I'm tentatively calling this new standard "Jobserver V2" and it uses the environment variable JOBSERVERV2. The idea is to change the environment variable name with any ABI-breaking changes that need to be made after the protocol starts being used.

In summary, the root process in a process tree discovers that it is the root because the JOBSERVERV2 environment variable is absent. In this case it spawns a thread with the same lifetime as its std.Thread.Pool, which acts as a daemon, accepting up to N simultaneous connections where N is the thread pool size. It also writes 1 byte to each connection, so that the blocking read() on the other side of the socket returns.

When a thread pool finds the JOBSERVERV2 environment variable, instead of acting as the host it uses that path to obtain thread tokens before doing work: it connects to the daemon and reads 1 byte. Then it proceeds to do as much work as possible before closing the connection and going back to sleep. Crucially, when a process dies, the operating system closes all the connections it had open, so its tokens are released automatically.
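As a rough sketch of the client side, assuming JOBSERVERV2 holds a UNIX domain socket path (the function names here are illustrative, not this PR's actual API):

    const std = @import("std");

    // Connect to the host and block until it writes the 1-byte token.
    // Holding the open connection is what holds the token; closing the
    // stream (or the process dying) releases it.
    fn acquireToken(socket_path: []const u8) !std.net.Stream {
        const stream = try std.net.connectUnixSocket(socket_path);
        errdefer stream.close();
        var token: [1]u8 = undefined;
        if (try stream.read(&token) != 1) return error.JobserverClosed;
        return stream;
    }

    pub fn main() !void {
        const path = std.posix.getenv("JOBSERVERV2") orelse {
            // No JOBSERVERV2: we are the root process and would host instead.
            return;
        };
        const conn = try acquireToken(path);
        defer conn.close(); // release the token when done working
        // ... do as much work as possible while holding the token ...
    }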

closes #20274

Related:

Testing

Simple Example

demo: https://asciinema.org/a/lMKZ0YZceqBLsww419YxFxjxT
source: https://gist.github.com/andrewrk/985de049693100d30e15ab3940bf601f

Real-world example: my music player project

It's fun to look at, but performance-wise it seems to be a wash: 66 seconds to build before, 66 seconds to build after. But since it does not spawn as many concurrent processes, it uses fewer resources during that time, so maybe that's a tiny win after all.

demo: https://asciinema.org/a/dDfUwIH78FQSegQakp3y5WXDC
source: https://codeberg.org/andrewrk/player

Behavior tests

before: 1m5s, 889 MiB peak RSS
after: 1m4s, 893 MiB peak RSS
demo: https://asciinema.org/a/MEkXcmpBkPejfc0BngYsjRLDk

In the demo you can see that it does not give more thread tokens out when it should. Perhaps a symptom of the implicit global token problem.

Justification for choosing JOBSERVERV2

Point 1: it is project-neutral, so other projects will hopefully consider adopting it.

Point 2: it implies a versioning scheme; if we have to break the ABI in the future, the 2 can become a 3.

Point 3: it is unique:

(two screenshots, presumably code-search results demonstrating that the name is not already in use)

Merge Checklist

  • Call shutdown on the read socket to stop blocked workers on deinit
    • can a sockfd which has been shut down be reused with connect?
  • Solve the implicit global token problem
    • waitAndWork is implemented poorly - after it starts waiting, it won't see any newly queued work
    • spawn N threads instead of N - 1, but have one of the workers be special
  • Performance testing
  • Fix the dependencies on posix that I added in various places
  • Implement Windows support
  • Should we use libdispatch on macOS?
  • Get feedback from third party projects such as GNU make, ninja, Meson, etc.

@andrewrk requested a review from kprotty as a code owner on June 21, 2024
@joshtriplett commented

👍 for embedding the version in the environment variable name. I think this will make it easier to work with for consumers of the protocol.

Suggestion: Please define in the protocol that the byte written must be 0, to allow for the possibility of future expansion.
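A hypothetical client-side check implementing that suggestion (the error names are made up for illustration):

    const std = @import("std");

    // Require the version-0 token byte; any other value is reserved for
    // future protocol extensions.
    fn readToken(stream: std.net.Stream) !void {
        var token: [1]u8 = undefined;
        if (try stream.read(&token) != 1) return error.JobserverClosed;
        if (token[0] != 0) return error.UnsupportedToken;
    }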

Comment on lines 360 to 362:

    if (connection) |fd| {
        std.posix.close(fd);
        connection = null;

Commits:

  • The host accepts N simultaneous connections and writes 1 byte to each. Clients connect and read 1 byte in order to obtain a thread token.
  • std.Thread.Pool now lazily spawns threads only when the work queue is non-empty. I think that was a bad idea and will revert it shortly.
  • There is now a std.zig.initThreadPool wrapper (see the sketch after this list) that deals with:
    • Resolving a zig cache directory into a UNIX domain socket address.
    • Creating the "tmp" directory in .zig-cache, but only if the listen failed due to ENOENT.
    • Deciding whether to connect to an existing jobserver or become the host for child processes.
  • std.Thread.Pool: back to spawning all threads during initialization because it's overall simpler. This scheme requires init to be passed a pointer to the struct.
  • std.process.Child: implement integration with the thread pool jobserver. The environment variable is called `JOBSERVERV2`. The API works by assigning a thread pool to the child process.
  • build runner: store the thread pool in std.Build.Graph so that it can be passed to child processes during the make phase.
  • Fix not allocating +1 pollfds in previous commit.
  • TCP cannot be used with UNIX domain sockets.
  • The main thread has an implicit thread token, which makes loitering illegal.
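A sketch of the host-or-client decision described in the initThreadPool commit above; the function and type names, and the exact error mapping for a missing "tmp" directory, are assumptions:

    const std = @import("std");

    const Role = union(enum) {
        host: std.net.Server,
        client: []const u8, // socket path of an existing jobserver
    };

    fn initJobserver(cache_dir: std.fs.Dir, sock_path: []const u8) !Role {
        if (std.posix.getenv("JOBSERVERV2")) |path| {
            // A host already exists higher in the process tree: be a client.
            return .{ .client = path };
        }
        // We are the root: listen on a socket inside the zig cache.
        const addr = try std.net.Address.initUnix(sock_path);
        const server = addr.listen(.{}) catch |err| switch (err) {
            error.FileNotFound => blk: {
                // ".zig-cache/tmp" does not exist yet; create it and retry.
                try cache_dir.makePath("tmp");
                break :blk try addr.listen(.{});
            },
            else => return err,
        };
        return .{ .host = server };
    }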
@andrewrk (Member, Author) commented Jul 1, 2024

I'm stuck.

I am observing that when one thread blocks on connect() and another thread then calls shutdown() on the same sockfd, the first thread continues to block in the connect syscall: https://gist.github.com/andrewrk/bf46cff35d34692558563f75bc8d0e62

Any ideas on how to solve this? The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().

I need to understand

  1. under what conditions a unix domain socket starts causing connect() to block, and
  2. if such a connect() syscall is blocking, how to unblock it

Commit: …and increase kernel backlog to the maximum number. Without increasing the kernel backlog to the maximum, I observed connect() block indefinitely, even when another thread calls shutdown() on that socket file descriptor.
@andrewrk (Member, Author) commented Jul 1, 2024

I was able to prevent blocking on connect() by increasing the kernel backlog to the maximum allowed integer. I'm not sure whether this is an acceptable solution yet.
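A minimal sketch of that workaround on the host side, assuming the same UNIX domain socket setup (the socket path is illustrative):

    const std = @import("std");

    pub fn main() !void {
        const addr = try std.net.Address.initUnix(".zig-cache/tmp/jobserver");
        var server = try addr.listen(.{
            // With the default (small) backlog, connect() on a UNIX domain
            // socket blocks once the backlog fills; a maximal backlog avoids
            // that, at the cost of letting the accept queue grow.
            .kernel_backlog = std.math.maxInt(u31),
        });
        defer server.deinit();
        // ... accept up to N connections and write the token byte ...
    }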

@matklad (Contributor) commented Jul 3, 2024

The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect().

I think I've seen this (checks the calendar) 7 years ago when the original jobserver protocol was implemented by Cargo!

    /// # Platform-specific behavior
    ///
    /// On Windows this function behaves pretty normally as expected, but on
    /// Unix the implementation is... a little heinous. As mentioned above
    /// we're forced into blocking I/O for token acquisition, namely a blocking
    /// call to `read`. We must be able to unblock this, however, to tear down
    /// the helper thread gracefully!
    ///
    /// Essentially what happens is that we'll send a signal to the helper
    /// thread spawned and rely on `EINTR` being returned to wake up the helper
    /// thread. This involves installing a global `SIGUSR1` handler that does
    /// nothing along with sending signals to that thread. This may cause
    /// odd behavior in some applications, so it's recommended to review and
    /// test thoroughly before using this.

https://github.com/rust-lang/jobserver-rs/blob/438acebbb25e5b76c216097d30b48929e4a1e540/src/lib.rs#L529-L538

Not a good solution for sure.

But I think at that point Cargo was constrained by blocking IO (flipping the FD passed by make to cargo to non-blocking made the underlying file description non-blocking for make, which it obviously wasn't prepared for). We are not thusly constrained, so we could put the socket and some other notification mechanism in a poll?
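One way to read that suggestion, sketched under the assumption that the blocking token read (not the connect) is the operation being torn down: poll() on both the jobserver socket and a dedicated shutdown pipe, so the main thread can wake a worker by writing to or closing the pipe:

    const std = @import("std");

    fn waitForTokenOrShutdown(
        sock_fd: std.posix.fd_t,
        shutdown_fd: std.posix.fd_t,
    ) !enum { token, shutdown } {
        var fds = [2]std.posix.pollfd{
            .{ .fd = sock_fd, .events = std.posix.POLL.IN, .revents = 0 },
            .{ .fd = shutdown_fd, .events = std.posix.POLL.IN, .revents = 0 },
        };
        while (true) {
            _ = try std.posix.poll(&fds, -1);
            if (fds[1].revents != 0) return .shutdown;
            if (fds[0].revents != 0) return .token; // token byte is readable
        }
    }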

@joshtriplett commented

@matklad Unfortunately, putting a file descriptor into non-blocking mode doesn't make connect asynchronous for UNIX sockets. For other kinds of sockets, like TCP, connect will return EINPROGRESS and then you can use poll to see when the connection finishes. For a non-blocking UNIX socket, connect will fail with EAGAIN and give up, and you can't poll for completion because there's nothing in progress. I haven't found any reliable, portable way to do a non-blocking connect for a UNIX socket.

(Switching to TCP seems non-viable, as closing a socket can result in long timeouts rather than quick reliable error indications, and keepalive behavior is unreliable and more effort; we really do need something like a UNIX socket or named pipe here.)

@andrewrk (Member, Author) commented

Giving up on this for now. Might come back to it later. Turned out to be more frustrating than I anticipated.

@andrewrk closed this Jul 21, 2024