
Introduce a ThreadPool and parallel execution of some of the compilation work items #7462

Merged: 10 commits merged into master from parallel-c-objects on Dec 21, 2020

Conversation

andrewrk (Member) commented Dec 16, 2020

This is the result of a live coding stream; it needs to be cleaned up before merging.

Merge checklist:

  • Make std.Progress thread-safe
  • Parallelize more kinds of work items (this will be follow-up work)
  • Initialize the ThreadPool in main.zig and pass it into Compilation explicitly, not relying on threadlocal variables
  • Proper shutdown of the ThreadPool
  • Audit the ThreadPool code and maybe put it in the std lib, along with making WaitGroup support blocking I/O use cases
  • Do "introduce a thread-local Cryptographically Secure Pseudo Random Number Generator for general use" (#6704) as a prerequisite for this one
  • Troubleshoot the aarch64 deadlock (I worked around it; will address in a follow-up issue)

Edit: forgot to mention, thanks to @kprotty for the ThreadPool implementation :D

kprotty (Member) commented Dec 16, 2020

Some API changes to consider (a rough sketch of this shape follows after the lists below):

  • Having rayon-style usage where there is a global thread pool for convenience, along with the ability to create custom thread pools
  • Passing scheduling hints to spawn()/schedule(), which could do things like use LIFO scheduling to improve throughput, or bind the Runnable to certain OS threads to support threadlocal access
  • Having the thread pool be more customizable for fork-join style workloads:
    • separate init() and start() functions (similar to std.event.Loop); start() has an option to use the caller thread as a worker thread
    • shutdown() + deinit()/join() to wait for all threads to exit, or an option to have them all exit once all submitted work is completed (wouldn't work for a global thread pool)

Some implementation changes to consider:

  • Dynamic number of threads; they add themselves intrusively to the Pool on spawn (no need to allocate a thread array)
  • Batch scheduling, since the Runnables already act like intrusive linked lists
  • Work-stealing run queues instead of a centralized run queue (would complicate the implementation more than the other suggestions)

Edit: added specifics on some points
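
A rough, hypothetical sketch of the API shape described above, written against the std of that era; every name here (ScheduleHints, PoolSketch, and so on) is illustrative and none of it is code from this PR. It only shows the split between init()/start(), a spawn() that accepts scheduling hints, and shutdown()/deinit():

const std = @import("std");

pub const ScheduleHints = struct {
    /// Prefer LIFO scheduling of this task to improve throughput.
    lifo: bool = false,
};

pub const Runnable = struct {
    runFn: fn (*Runnable) void,
    /// Intrusive link; an intrusive list is what makes batch scheduling cheap.
    next: ?*Runnable = null,
};

pub const PoolSketch = struct {
    allocator: *std.mem.Allocator,
    started: bool = false,

    pub fn init(allocator: *std.mem.Allocator) PoolSketch {
        return PoolSketch{ .allocator = allocator };
    }

    /// start() is separate from init(); use_caller_thread mirrors the idea of
    /// recruiting the calling thread as one of the workers.
    pub fn start(self: *PoolSketch, use_caller_thread: bool) void {
        _ = use_caller_thread;
        self.started = true;
    }

    pub fn spawn(self: *PoolSketch, hints: ScheduleHints, runnable: *Runnable) void {
        _ = self;
        _ = hints;
        // Placeholder: a real pool would push the runnable onto a run queue
        // (or onto a specific worker's queue when a hint pins it to a thread).
        runnable.runFn(runnable);
    }

    pub fn shutdown(self: *PoolSketch) void {
        self.started = false;
    }

    pub fn deinit(self: *PoolSketch) void {
        self.* = undefined;
    }
};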

break :blk nodes;
};

while (idle_nodes.popFirst()) |node|
marler8997 (Contributor) commented Dec 16, 2020

OK, here's a subtle bug that's hard to see. These idle_nodes are singly linked list nodes which are allocated on the stacks of each worker (see the runWorker function). However, by this line we have set is_shutdown to true and have released the mutex, which means all workers are now free to exit. So all these idle_nodes on each worker's stack can be freed at any time, including before we get to this loop that accesses them. Also note that it's not just the next pointer that becomes invalid: the AutoResetEvent itself is also allocated on the stack and could be invalidated here too. To fix this issue, and also to handle the case where a thread has to exit (i.e. via an abort), the AutoResetEvent should probably be allocated on the heap.
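
For reference, here is a minimal sketch of the lifetime pattern under discussion, with illustrative names rather than the exact code from this PR, and assuming the std.Mutex/std.AutoResetEvent APIs of that era. The idle node and its event live on the worker's stack, so they are only valid while the worker has not yet returned:

const std = @import("std");

const IdleNode = struct {
    event: std.AutoResetEvent = std.AutoResetEvent{},
    next: ?*IdleNode = null,
};

const IdlePoolSketch = struct {
    lock: std.Mutex = std.Mutex{},
    is_shutdown: bool = false,
    idle_queue: ?*IdleNode = null,

    fn runWorkerSketch(self: *IdlePoolSketch) void {
        var idle_node = IdleNode{}; // lives on this worker thread's stack
        while (true) {
            const held = self.lock.acquire();
            if (self.is_shutdown) {
                held.release();
                // After this return, &idle_node dangles for anyone who still
                // holds a pointer to it.
                return;
            }
            // Publish the stack node on the shared idle list, then sleep on
            // its event until schedule()/shutdown() wakes this worker.
            idle_node.next = self.idle_queue;
            self.idle_queue = &idle_node;
            held.release();
            idle_node.event.wait();
        }
    }

    fn shutdownSketch(self: *IdlePoolSketch) void {
        const idle_nodes = blk: {
            const held = self.lock.acquire();
            defer held.release();
            self.is_shutdown = true;
            const nodes = self.idle_queue;
            self.idle_queue = null;
            break :blk nodes;
        };
        // The nodes are still valid here only because each owning worker is
        // blocked in event.wait() and cannot return before set() fires. Note
        // that next must be read before set(), because the worker may return
        // (invalidating its node) immediately after being woken.
        var it = idle_nodes;
        while (it) |node| {
            it = node.next;
            node.event.set();
        }
    }
};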

kprotty (Member) replied:

> So all these idle_nodes on the stack of each worker can be freed at any time

The worker thread that owns each node in the idle_queue is blocked on its node's AutoResetEvent, so it cannot exit until unblocked by .set(). This means that even when the lock is released, the blocking threads keep the nodes alive; they remain valid until the line below wakes them, which kicks off the shutdown chain: deallocate the Node from the stack, re-acquire the pool lock, see the shutdown flag, release the lock and return, then get joined by the main pool thread.

marler8997 (Contributor) commented Dec 16, 2020

> The worker thread owner for each node in the idle_node list is blocked on its node's AutoResetEvent so it cannot exit until unblocked by .set()

A worker can exit at any time from an abort. In such a case, we don't want to corrupt the shared idle_queue just because a single worker had to exit for some reason.

Also, if shutdown gets called after a worker's AutoResetEvent has been set, but before that worker has checked whether shutdown has been called on line 114, then the worker can exit before the shutdown thread has accessed its idle_node.

If you don't want a worker exit/abort to result in corruption of the idle_queue (and I think it shouldn't), then you pretty much have to allocate this metadata on the heap; there's really no other choice. Relying on the stack of each worker to keep our metadata intact is always going to be brittle.

kprotty (Member) replied:

> A worker can exit at any time from an abort

The worker wouldn't be able to call abort() while it's still in the idle_queue, as it would be blocked on AutoResetEvent.wait(). It is only ever unblocked once it has been removed from the idle_queue (with the lock held), as shown in schedule() or shutdown(). If another thread were to call abort(), however, then I'd imagine that would kill all threads in the process rather than produce a race.

> checked if shutdown has been called on line 114, then the worker can exit before the shutdown thread has accessed its idle_node.

A worker's idle_node doesn't exist while it's checking for shutdown. It only exists before and during its time in the idle_queue, while blocked on the AutoResetEvent.

> Relying on the stack of each worker to keep our metadata intact is always going to be brittle.

Why is this the case? The stack data always remains valid in the current code, as explained above.

marler8997 (Contributor) replied:

@kprotty ah, sorry, I misread the code. I was under the false impression that the thread removed itself from the idle_queue.

So the two cases I mentioned should not be possible. However, there are still other cases where a worker thread can get killed, for example by OOM or manually by a user. With the code as it exists, that would result in a use-after-free, and the only way I can think of to solve it is to allocate this data on the heap.

kprotty (Member) replied:

@marler8997 That is indeed a possibility, but I am on the side that a user messing with the internal state of code is already undefined behavior, and we shouldn't have to introduce runtime overhead to protect against it.

marler8997 (Contributor) replied:

OOM isn't the user intervening (it's the Out of Memory Killer). I've gotten OOM errors with the Zig compiler; this is a real use case. By using references to the stacks of other threads, errors in one thread cascade to other threads in the form of use-after-free, which makes them extremely difficult to debug/triage/reproduce. Memory corruption can cause almost anything to happen. We can avoid these errors by putting shared data on the heap instead of the stack. This way, if/when a thread gets killed, its errors stay local to it instead of also causing use-after-free errors on other threads.

kprotty (Member) commented Dec 17, 2020

Doesn't OOM kill the entire process rather than a given thread? If all threads die, then there's nothing to guard against. In the current code, a thread being randomly killed would only be an issue if it's killed while its idle_node is in the idle_queue. That currently can't happen without killing the entire process, since the user has no access to the thread's id to send it a SIGKILL/SIGTERM, and the node is only vulnerable while the thread is sleeping on the AutoResetEvent (which the user also doesn't have access to).

Edit: made the response specific to this case rather than to generic panics.

marler8997 (Contributor) replied:

After thinking on this more and playing with another thread pool implementation, I've decided I'm more OK with this solution.

As to why: first, this stack pointer to the worker thread's AutoResetEvent is stored in the idle queue for a very short time, at least in terms of the number of instructions executed on the worker thread. The stack reference is put into the shared idle queue just before the worker calls wait, then removed by another thread before it wakes up, so the worker only executes a few instructions during this window. The chances of it getting killed and us losing its stack are low, so I think it's OK in this case to make an exception to the general avoidance of cross-referencing stack memory between threads.

Second, I realize the alternative solution comes with its own issues. If we stored all the AutoResetEvents in an array on the heap, then false sharing becomes a performance concern. That could be addressed by spacing the AutoResetEvents far enough apart to ensure they land on separate cache lines, but we avoid it altogether by storing them on each thread's stack instead.
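
For what it's worth, a minimal sketch of the "space them out" variant mentioned above, assuming a 64-byte cache line: padding each heap-stored event to its own cache line avoids the false sharing at the cost of extra memory.

const std = @import("std");

const cache_line = 64; // assumed cache-line size

const PaddedEvent = struct {
    event: std.AutoResetEvent align(cache_line) = std.AutoResetEvent{},
};

comptime {
    // Because the struct's alignment is cache_line, its size rounds up to a
    // multiple of cache_line, so adjacent array elements never share a line.
    std.debug.assert(@alignOf(PaddedEvent) == cache_line);
    std.debug.assert(@sizeOf(PaddedEvent) % cache_line == 0);
}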

@@ -1366,6 +1375,21 @@ pub fn getAllErrorsAlloc(self: *Compilation) !AllErrors {
};
}

const WaitGroup = struct {
kprotty (Member) commented Dec 16, 2020

There might be a race here (edit: cc @marler8997)

  • WaitGroup.done() decrements the counter to 0, gets preempted
  • WaitGroup.wait() sees the 0 counter, exits, and deallocates itself if it is on the stack
  • WaitGroup.done() gets rescheduled; having seen the decrement to 0, it tries to set the event, but the event memory was already deallocated

A Member replied:

Feels like you'd need some sort of futex wait comparison/validation mechanic to resolve this. Wrapping all the state checking in a Mutex would be the cross-platform equivalent. Might be a good reason to introduce parking_lot in the stdlib.
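
For reference, a minimal sketch of that lock-based approach, assuming the std.Mutex and std.AutoResetEvent APIs of that era (the names begin/done/wait are illustrative). Checking the counter and publishing the event under a single lock removes the window in which done() can touch an event that wait() has already abandoned:

const std = @import("std");

const LockedWaitGroup = struct {
    lock: std.Mutex = std.Mutex{},
    counter: usize = 0,
    event: ?*std.AutoResetEvent = null,

    fn begin(self: *LockedWaitGroup) void {
        const held = self.lock.acquire();
        defer held.release();
        self.counter += 1;
    }

    fn done(self: *LockedWaitGroup) void {
        const maybe_event = blk: {
            const held = self.lock.acquire();
            defer held.release();
            self.counter -= 1;
            if (self.counter == 0) break :blk self.event;
            break :blk null;
        };
        // Waking outside the lock is fine: wait() only returns after this
        // set() fires, so the waiter's stack-allocated event is still alive.
        if (maybe_event) |event| event.set();
    }

    fn wait(self: *LockedWaitGroup) void {
        var event = std.AutoResetEvent{};
        {
            const held = self.lock.acquire();
            defer held.release();
            if (self.counter == 0) return;
            self.event = &event;
        }
        event.wait();
    }
};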

marler8997 (Contributor) replied:

Yeah I haven't reviewed all this code, but based on your comment it does look like a race.

P.S. It seems like reviewing multi-threaded scheduler code takes about 10 times longer than it does to review normal code. It's going to take me some time to really review all this code, even though it's less than 200 lines!

kprotty (Member) commented Dec 16, 2020

Yeah, I plan to do a rewrite of this later today with the goal of simplifying the review process, at the possible expense of concurrent perf.

Edit: FWIW, this was written while racing Andrew during the stream, so I'm partially surprised it even worked for demo purposes given the previous bugs.

andrewrk force-pushed the parallel-c-objects branch 2 times, most recently from ee98c9c to 160f800, on December 18, 2020 at 06:12
var event = std.AutoResetEvent{};
@atomicStore(?*std.AutoResetEvent, &self.event, &event, .SeqCst);
if (@atomicLoad(usize, &self.counter, .SeqCst) != 0)
event.wait();
kprotty (Member) commented:

rip, another race on my part:

  • stop() decrements to 0, gets preempted
  • wait() loads the counter, sees 0, exits and invalidates the event memory
  • stop() gets scheduled, swaps in null and sees a valid event, tries to set it and accesses invalid memory

The fix is to ensure that stop() no longer has access to the event pointer when wait() returns:

// Create and publish the event preemptively before checking the counter.
// If we check before setting the event, then:
// - we could see non-zero, get preempted
// - stop() could set to zero, see no event, continue
// - we get rescheduled, and sleep on the event after seeing non-zero above (deadlock)
var event = std.AutoResetEvent{};
@atomicStore(?*std.AutoResetEvent, &self.event, &event, .SeqCst);

// if the counter is non-zero, then the last stop() will see our event stored above and wake us up
if (@atomicLoad(usize, &self.counter, .SeqCst) != 0)
    return event.wait();
    
// If the counter is zero, then we need to make sure we take out our event.
// stop() could set zero, get preempted, we see 0, return, stop() reschedules and tries to wake invalid event.
//
// We address this by trying to remove the event from access to stop()
// If we failed to remove it, that means that stop() has it and will try to wake us up.
// If so, we wait for the wake up which is the indicator that stop() is no longer referencing this event.
if (@atomicRmw(?*std.AutoResetEvent, &self.event, .Xchg, null, .SeqCst) == null)
    event.wait();

A Member replied:

Given it's not a hot path in Compilation, we could probably switch back to the lock-based version for auditing purposes.

andrewrk marked this pull request as ready for review on December 19, 2020 at 04:22
while (true) {
const held = self.lock.acquire();

if (self.run_queue.popFirst()) |run_node| {
marler8997 (Contributor) commented:

Shouldn't you check is_running before checking for the next task? Couldn't shutdown be called before all tasks are completed? For example, if an error occurs we may want to shut down all the threads.
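
A small sketch of that ordering, assuming the lock and queue shapes visible in the snippet above (the surrounding names are illustrative, not the PR's exact code): the running flag is checked under the lock before popping, so a shutdown requested mid-stream stops workers even while tasks remain queued.

const std = @import("std");

const Runnable = struct {
    runFn: fn (*Runnable) void,
};
const RunQueue = std.SinglyLinkedList(Runnable);

fn workerLoopSketch(lock: *std.Mutex, is_running: *bool, run_queue: *RunQueue) void {
    while (true) {
        const held = lock.acquire();

        // Check the flag before popping, so shutdown() can stop workers even
        // when the queue has not been drained (for example after an error).
        if (!is_running.*) {
            held.release();
            return;
        }

        if (run_queue.popFirst()) |run_node| {
            held.release();
            run_node.data.runFn(&run_node.data);
            continue;
        }

        held.release();
        return; // a real pool would park the worker here until woken
    }
}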

Commit messages from the subsequent pushes include:

  • If this errdefer did get run it would constitute a race condition, so I deleted the dead code for clarity.
  • We generally get away with atomic primitives; however, a lock is required around the refresh function since it traverses the Node graph, and we need to be sure no references to Nodes remain after end() is called.
  • And enable it for Drone CI. I hate to do this, but I need to make progress on other fronts.
andrewrk merged commit 4918605 into master on Dec 21, 2020
andrewrk deleted the parallel-c-objects branch on December 21, 2020
mikdusan added the "release notes" label (This PR should be mentioned in the release notes.) on Jan 24, 2021
andrewrk added this to the 0.9.0 milestone on Apr 18, 2024