-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std.Thread.Pool: process tree cooperation via a new jobserver protocol #20372
Conversation
👍 for embedding the version in the environment variable name. I think this will make it easier to work with for consumers of the protocol. Suggestion: Please define in the protocol that the byte written must be |
lib/std/Thread/Pool.zig
Outdated
if (connection) |fd| { | ||
std.posix.close(fd); | ||
connection = null; |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
The host accepts N simultaneous connections and writes 1 byte to them each. Clients connect and read 1 byte in order to obtain a thread token. std.Thread.Pool now lazily spawns threads only when the work queue is non-empty. I think that was a bad idea and will revert it shortly. There is now a std.zig.initThreadPool wrapper that deals with: * Resolving a zig cache directory into a UNIX domain socket address. * Creating the "tmp" directory in .zig-cache but only if the listen failed due to ENOENT. * Deciding to connect to an existing jobserver, or become the host for child processes.
std.Thread.Pool: back to spawning all threads in initialization because it's overall simpler. This scheme requires init to be passed a pointer to the struct. std.process.Child: implement integration with thread pool jobserver. The environment variable is called `JOBSERVERV2`. The API works based on assigning a thread pool to the child process. build runner: store the thread pool in std.Build.Graph so that it can be passed to child processes during the make phase. Fix not allocating +1 pollfds in previous commit.
TCP cannot be used with UNIX domain sockets
The main thread has an implicit thread token which makes loitering illegal.
I'm stuck. I am observing one thread blocking on connect(), and then another thread calls shutdown() on the same sockfd, the first thread continues to block on the connect syscall: https://gist.github.com/andrewrk/bf46cff35d34692558563f75bc8d0e62 Any ideas on how to solve this? The problem is on deinitialization; the main thread needs a way to wake up the worker threads that are blocking on connect(). I need to understand
|
and increase kernel backlog to the maximum number. Without increasing the kernel backlog to the maximum number, I observed connect() to block indefinitely, even when another thread calls shutdown() on that socket file descriptor.
I was able to prevent blocking on connect() by increasing the kernel backlog to the maximum allowed integer. I'm not sure whether this is an acceptable solution yet. |
I think I've seen this (checks the calendar) 7 years ago when the original jobserver protocol was implemented by Cargo!
Not a good solution for sure. But I think at that point Cargo was constrained by blocking IO (flipping the FD passed by make to cargo to non-blocking made the underlying file description non-blocking for make, which it obviously wasn't prepared for). We are not thusly constrained, so we could put the socket and some other notification mechanism in a poll? |
@matklad Unfortunately, putting a file descriptor into non-blocking mode doesn't make (Switching to TCP seems non-viable, as closing a socket can result in long timeouts rather than quick reliable error indications, and keepalive behavior is unreliable and more effort; we really do need something like a UNIX socket or named pipe here.) |
Giving up on this for now. Might come back to it later. Turned out to be more frustrating than I anticipated. |
I'm tentatively calling this new standard "Jobserver V2" and it uses the environment variable
JOBSERVERV2
. The idea is to change the environment variable name with any ABI-breaking changes that need to be made after the protocol starts being used.In summary, the root process in a process tree finds out that it is the root process because of the lack of
JOBSERVERV2
environment variable. In this case it spawns a thread with the same lifetime as itsstd.Thread.Pool
, which acts as a daemon, accepting up to N simultaneous connections where N is the thread pool size. It also writes 1 byte to each connection, in order to ensure blocking read() on the other side of the socket.When a thread pool finds the
JOBSERVERV2
environment variable, instead of being a host, it uses that path to obtain thread tokens before doing work. Before starting work, it connects to the daemon and reads 1 byte. Then it proceeds to do as much work as possible before closing the connection and going back to sleep. Crucially, when a process dies, the operating system closes all the connections it had open.closes #20274
Related:
Testing
Simple Example
demo: https://asciinema.org/a/lMKZ0YZceqBLsww419YxFxjxT
source: https://gist.github.com/andrewrk/985de049693100d30e15ab3940bf601f
real world example: my music player project
It's fun to look at but performance wise it seems to be a wash. 66 seconds to build before, 66 seconds to build after. But, I suppose since it does not spawn as many concurrent processes, it is using fewer resources during that time, so maybe that's a tiny win after all.
demo: https://asciinema.org/a/dDfUwIH78FQSegQakp3y5WXDC
source: https://codeberg.org/andrewrk/player
behavior tests
before: 1m5s, 889 MiB peak RSS
after: 1m4s, 893 MiB peak RSS
demo: https://asciinema.org/a/MEkXcmpBkPejfc0BngYsjRLDk
In the demo you can see that it does not give more thread tokens out when it should. Perhaps a symptom of the implicit global token problem.
Justification for choosing
JOBSERVERV2
Point 1: it is project-neutral, so other projects hopefully will consider it
Point 2: it implies a versioning scheme; if we have to break the ABI in the future, the 2 can become a 3.
Point 3: it is unique:
Merge Checklist
shutdown
on the read socket to stop blocked workers on deinitshutdown
be reused withconnect
?waitAndWork
is implemented poorly - after it starts waiting, it won't see any newly queued work