
Introduce a ThreadPool and parallel execution of some of the compilation work items #7462

Merged: 10 commits merged into master from parallel-c-objects on Dec 21, 2020

Conversation

andrewrk (Member) commented Dec 16, 2020

This is the result of a live coding stream; it needs to be cleaned up before merging.

Merge checklist:

  • Make std.Progress thread-safe
  • Parallelize more kinds of work items (this will be follow-up work)
  • Initialize the ThreadPool in main.zig and pass it into Compilation explicitly, not relying on threadlocal variables
  • Proper shutdown of the ThreadPool
  • Audit the ThreadPool code and maybe put it in the std lib, along with making WaitGroup support blocking I/O use cases
  • Do "introduce a thread-local Cryptographically Secure Pseudo Random Number Generator for general use" (#6704) as a prerequisite for this one
  • Troubleshoot the aarch64 deadlock (I worked around it; will address in a follow-up issue)

Edit: forgot to mention, thanks to @kprotty for the ThreadPool implementation :D

kprotty (Member) commented Dec 16, 2020

Some API changes to consider (a rough sketch of this shape follows after the lists below):

  • Having rayon-style usage where there is a global thread pool for convenience, along with the ability to create custom thread pools
  • Passing scheduling hints to spawn()/schedule(), which could do things like use LIFO scheduling to improve throughput, or bind the Runnable to certain OS threads to support threadlocal access
  • Having the thread pool be more customizable for fork-join style workloads:
    • separate init() and start() functions (similar to std.event.Loop); start() has an option to use the caller thread as a worker thread
    • shutdown() + deinit()/join() to wait for all threads to exit, or an option to have them all exit once all submitted work is completed (wouldn't work for a global thread pool)

Some implementation changes to consider:

  • Dynamic number of threads; they add themselves intrusively to the Pool on spawn (no need to allocate a thread array)
  • Batch scheduling, since the Runnables already act like intrusive linked lists
  • Work-stealing run queues instead of a centralized run queue (would complicate the implementation more than the other suggestions)

Edit: added specifics on some points
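
A rough, hypothetical sketch of the API shape described above, written against the std of that era; every name here (ScheduleHints, PoolSketch, and so on) is illustrative and none of it is code from this PR. It only shows the split between init()/start(), a spawn() that accepts scheduling hints, and shutdown()/deinit():

const std = @import("std");

pub const ScheduleHints = struct {
    /// Prefer LIFO scheduling of this task to improve throughput.
    lifo: bool = false,
};

pub const Runnable = struct {
    runFn: fn (*Runnable) void,
    /// Intrusive link; an intrusive list is what makes batch scheduling cheap.
    next: ?*Runnable = null,
};

pub const PoolSketch = struct {
    allocator: *std.mem.Allocator,
    started: bool = false,

    pub fn init(allocator: *std.mem.Allocator) PoolSketch {
        return PoolSketch{ .allocator = allocator };
    }

    /// start() is separate from init(); use_caller_thread mirrors the idea of
    /// recruiting the calling thread as one of the workers.
    pub fn start(self: *PoolSketch, use_caller_thread: bool) void {
        _ = use_caller_thread;
        self.started = true;
    }

    pub fn spawn(self: *PoolSketch, hints: ScheduleHints, runnable: *Runnable) void {
        _ = self;
        _ = hints;
        // Placeholder: a real pool would push the runnable onto a run queue
        // (or onto a specific worker's queue when a hint pins it to a thread).
        runnable.runFn(runnable);
    }

    pub fn shutdown(self: *PoolSketch) void {
        self.started = false;
    }

    pub fn deinit(self: *PoolSketch) void {
        self.* = undefined;
    }
};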

break :blk nodes;
};

while (idle_nodes.popFirst()) |node|
marler8997 (Contributor) commented Dec 16, 2020

OK, here's a subtle bug that's hard to see. These idle_nodes are singly linked list nodes which are allocated on the stacks of each worker (see the runWorker function). However, by this line we have set is_shutdown to true and have released the mutex, which means all workers are now free to exit. So all these idle_nodes on each worker's stack can be freed at any time, including before we get to this loop that accesses them. Also note that it's not just the next pointer that becomes invalid: the AutoResetEvent itself is also allocated on the stack and could be invalidated here too. To fix this issue, and also to handle the case where a thread has to exit (i.e. via an abort), the AutoResetEvent should probably be allocated on the heap.
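
For reference, here is a minimal sketch of the lifetime pattern under discussion, with illustrative names rather than the exact code from this PR, and assuming the std.Mutex/std.AutoResetEvent APIs of that era. The idle node and its event live on the worker's stack, so they are only valid while the worker has not yet returned:

const std = @import("std");

const IdleNode = struct {
    event: std.AutoResetEvent = std.AutoResetEvent{},
    next: ?*IdleNode = null,
};

const IdlePoolSketch = struct {
    lock: std.Mutex = std.Mutex{},
    is_shutdown: bool = false,
    idle_queue: ?*IdleNode = null,

    fn runWorkerSketch(self: *IdlePoolSketch) void {
        var idle_node = IdleNode{}; // lives on this worker thread's stack
        while (true) {
            const held = self.lock.acquire();
            if (self.is_shutdown) {
                held.release();
                // After this return, &idle_node dangles for anyone who still
                // holds a pointer to it.
                return;
            }
            // Publish the stack node on the shared idle list, then sleep on
            // its event until schedule()/shutdown() wakes this worker.
            idle_node.next = self.idle_queue;
            self.idle_queue = &idle_node;
            held.release();
            idle_node.event.wait();
        }
    }

    fn shutdownSketch(self: *IdlePoolSketch) void {
        const idle_nodes = blk: {
            const held = self.lock.acquire();
            defer held.release();
            self.is_shutdown = true;
            const nodes = self.idle_queue;
            self.idle_queue = null;
            break :blk nodes;
        };
        // The nodes are still valid here only because each owning worker is
        // blocked in event.wait() and cannot return before set() fires. Note
        // that next must be read before set(), because the worker may return
        // (invalidating its node) immediately after being woken.
        var it = idle_nodes;
        while (it) |node| {
            it = node.next;
            node.event.set();
        }
    }
};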

kprotty (Member) replied:

> So all these idle_nodes on the stack of each worker can be freed at any time

The worker thread that owns each node in the idle_queue is blocked on its node's AutoResetEvent, so it cannot exit until unblocked by .set(). This means that even when the lock is released, the blocking threads keep the nodes alive; they remain valid until the line below wakes them, which kicks off the shutdown chain: deallocate the Node from the stack, re-acquire the pool lock, see the shutdown flag, release the lock and return, then get joined by the main pool thread.

marler8997 (Contributor) commented Dec 16, 2020

> The worker thread owner for each node in the idle_node list is blocked on its node's AutoResetEvent so it cannot exit until unblocked by .set()

A worker can exit at any time from an abort. In such a case, we don't want to corrupt the shared idle_queue just because a single worker had to exit for some reason.

Also, if shutdown gets called after a worker's AutoResetEvent has been set, but before that worker has checked whether shutdown has been called on line 114, then the worker can exit before the shutdown thread has accessed its idle_node.

If you don't want a worker exit/abort to result in corruption of the idle_queue (and I think it shouldn't), then you pretty much have to allocate this metadata on the heap; there's really no other choice. Relying on the stack of each worker to keep our metadata intact is always going to be brittle.

kprotty (Member) replied:

> A worker can exit at any time from an abort

The worker wouldn't be able to call abort() while it's still in the idle_queue, as it would be blocked on AutoResetEvent.wait(). It is only ever unblocked once it has been removed from the idle_queue (with the lock held), as shown in schedule() or shutdown(). If another thread were to call abort(), however, then I'd imagine that would kill all threads in the process rather than produce a race.

> checked if shutdown has been called on line 114, then the worker can exit before the shutdown thread has accessed its idle_node.

A worker's idle_node doesn't exist while it's checking for shutdown. It only exists before and during its time in the idle_queue, while blocked on the AutoResetEvent.

> Relying on the stack of each worker to keep our metadata intact is always going to be brittle.

Why is this the case? The stack data always remains valid in the current code, as explained above.

marler8997 (Contributor) replied:

@kprotty ah, sorry, I misread the code. I was under the false impression that the thread removed itself from the idle_queue.

So the two cases I mentioned should not be possible. However, there are still other cases where a worker thread can get killed, for example by OOM or manually by a user. With the code as it exists, that would result in a use-after-free, and the only way I can think of to solve it is to allocate this data on the heap.

kprotty (Member) replied:

@marler8997 That is indeed a possibility, but I am on the side that a user messing with the internal state of code is already undefined behavior, and we shouldn't have to introduce runtime overhead to protect against it.

marler8997 (Contributor) replied:

OOM isn't the user intervening (it's the Out of Memory Killer). I've gotten OOM errors with the Zig compiler; this is a real use case. By using references to the stacks of other threads, errors in one thread cascade to other threads in the form of use-after-free, which makes them extremely difficult to debug/triage/reproduce. Memory corruption can cause almost anything to happen. We can avoid these errors by putting shared data on the heap instead of the stack. This way, if/when a thread gets killed, its errors stay local to it instead of also causing use-after-free errors on other threads.

kprotty (Member) commented Dec 17, 2020

Doesn't OOM kill the entire process rather than a given thread? If all threads die, then there's nothing to guard against. In the current code, a thread being randomly killed would only be an issue if it's killed while its idle_node is in the idle_queue. That currently can't happen without killing the entire process, since the user has no access to the thread's id to send it a SIGKILL/SIGTERM, and the node is only vulnerable while the thread is sleeping on the AutoResetEvent (which the user also doesn't have access to).

Edit: made the response specific to this case rather than to generic panics.

marler8997 (Contributor) replied:

After thinking on this more and playing with another thread pool implementation, I've decided I'm more OK with this solution.

As to why: first, this stack pointer to the worker thread's AutoResetEvent is stored in the idle queue for a very short time, at least in terms of the number of instructions executed on the worker thread. The stack reference is put into the shared idle queue just before the worker calls wait, then removed by another thread before it wakes up, so the worker only executes a few instructions during this window. The chances of it getting killed and us losing its stack are low, so I think it's OK in this case to make an exception to the general avoidance of cross-referencing stack memory between threads.

Second, I realize the alternative solution comes with its own issues. If we stored all the AutoResetEvents in an array on the heap, then false sharing becomes a performance concern. That could be addressed by spacing the AutoResetEvents far enough apart to ensure they land on separate cache lines, but we avoid it altogether by storing them on each thread's stack instead.
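
For what it's worth, a minimal sketch of the "space them out" variant mentioned above, assuming a 64-byte cache line: padding each heap-stored event to its own cache line avoids the false sharing at the cost of extra memory.

const std = @import("std");

const cache_line = 64; // assumed cache-line size

const PaddedEvent = struct {
    event: std.AutoResetEvent align(cache_line) = std.AutoResetEvent{},
};

comptime {
    // Because the struct's alignment is cache_line, its size rounds up to a
    // multiple of cache_line, so adjacent array elements never share a line.
    std.debug.assert(@alignOf(PaddedEvent) == cache_line);
    std.debug.assert(@sizeOf(PaddedEvent) % cache_line == 0);
}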

@@ -1366,6 +1375,21 @@ pub fn getAllErrorsAlloc(self: *Compilation) !AllErrors {
};
}

const WaitGroup = struct {
kprotty (Member) commented Dec 16, 2020

There might be a race here (edit: cc @marler8997)

  • WaitGroup.done() decrements the counter to 0, gets preempted
  • WaitGroup.wait() sees the 0 counter, exits, and deallocates itself if it is on the stack
  • WaitGroup.done() gets rescheduled; having seen the decrement to 0, it tries to set the event, but the event memory was already deallocated

A Member replied:

Feels like you'd need some sort of futex wait comparison/validation mechanic to resolve this. Wrapping all the state checking in a Mutex would be the cross-platform equivalent. Might be a good reason to introduce parking_lot in the stdlib.
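
For reference, a minimal sketch of that lock-based approach, assuming the std.Mutex and std.AutoResetEvent APIs of that era (the names begin/done/wait are illustrative). Checking the counter and publishing the event under a single lock removes the window in which done() can touch an event that wait() has already abandoned:

const std = @import("std");

const LockedWaitGroup = struct {
    lock: std.Mutex = std.Mutex{},
    counter: usize = 0,
    event: ?*std.AutoResetEvent = null,

    fn begin(self: *LockedWaitGroup) void {
        const held = self.lock.acquire();
        defer held.release();
        self.counter += 1;
    }

    fn done(self: *LockedWaitGroup) void {
        const maybe_event = blk: {
            const held = self.lock.acquire();
            defer held.release();
            self.counter -= 1;
            if (self.counter == 0) break :blk self.event;
            break :blk null;
        };
        // Waking outside the lock is fine: wait() only returns after this
        // set() fires, so the waiter's stack-allocated event is still alive.
        if (maybe_event) |event| event.set();
    }

    fn wait(self: *LockedWaitGroup) void {
        var event = std.AutoResetEvent{};
        {
            const held = self.lock.acquire();
            defer held.release();
            if (self.counter == 0) return;
            self.event = &event;
        }
        event.wait();
    }
};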

marler8997 (Contributor) replied:

Yeah I haven't reviewed all this code, but based on your comment it does look like a race.

P.S. It seems like reviewing multi-threaded scheduler code takes about 10 times longer than it does to review normal code. It's going to take me some time to really review all this code, even though it's less than 200 lines!

kprotty (Member) commented Dec 16, 2020

Yeah, I plan to do a rewrite of this later today with the goal of simplifying the review process, at the possible expense of concurrent perf.

Edit: FWIW, this was written while racing Andrew during the stream, so I'm partially surprised it even worked for demo purposes given the previous bugs.

andrewrk force-pushed the parallel-c-objects branch 2 times, most recently from ee98c9c to 160f800, on December 18, 2020 at 06:12
var event = std.AutoResetEvent{};
@atomicStore(?*std.AutoResetEvent, &self.event, &event, .SeqCst);
if (@atomicLoad(usize, &self.counter, .SeqCst) != 0)
event.wait();
kprotty (Member) commented:

rip, another race on my part:

  • stop() decrements to 0, gets preempted
  • wait() loads the counter, sees 0, exits and invalidates the event memory
  • stop() gets scheduled, swaps in null and sees a valid event, tries to set it and accesses invalid memory

The fix is to ensure that stop() no longer has access to the event pointer when wait() returns:

// Create and publish the event preemptively before checking the counter.
// If we check before setting the event, then:
// - we could see non-zero, get preempted
// - stop() could set to zero, see no event, continue
// - we get rescheduled, and sleep on the event after seeing non-zero above (deadlock)
var event = std.AutoResetEvent{};
@atomicStore(?*std.AutoResetEvent, &self.event, &event, .SeqCst);

// if the counter is non-zero, then the last stop() will see our event stored above and wake us up
if (@atomicLoad(usize, &self.counter, .SeqCst) != 0)
    return event.wait();
    
// If the counter is zero, then we need to make sure we take out our event.
// stop() could set zero, get preempted, we see 0, return, stop() reschedules and tries to wake invalid event.
//
// We address this by trying to remove the event from access to stop()
// If we failed to remove it, that means that stop() has it and will try to wake us up.
// If so, we wait for the wake up which is the indicator that stop() is no longer referencing this event.
if (@atomicRmw(?*std.AutoResetEvent, &self.event, .Xchg, null, .SeqCst) == null)
    event.wait();

A Member replied:

Given it's not a hot path in Compilation, we could probably switch back to the lock-based version for auditing purposes.

andrewrk marked this pull request as ready for review on December 19, 2020 at 04:22
while (true) {
const held = self.lock.acquire();

if (self.run_queue.popFirst()) |run_node| {
marler8997 (Contributor) commented:

Shouldn't you check is_running before checking for the next task? Couldn't shutdown be called before all tasks are completed? For example, if an error occurs we may want to shut down all the threads.
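
A small sketch of that ordering, assuming the lock and queue shapes visible in the snippet above (the surrounding names are illustrative, not the PR's exact code): the running flag is checked under the lock before popping, so a shutdown requested mid-stream stops workers even while tasks remain queued.

const std = @import("std");

const Runnable = struct {
    runFn: fn (*Runnable) void,
};
const RunQueue = std.SinglyLinkedList(Runnable);

fn workerLoopSketch(lock: *std.Mutex, is_running: *bool, run_queue: *RunQueue) void {
    while (true) {
        const held = lock.acquire();

        // Check the flag before popping, so shutdown() can stop workers even
        // when the queue has not been drained (for example after an error).
        if (!is_running.*) {
            held.release();
            return;
        }

        if (run_queue.popFirst()) |run_node| {
            held.release();
            run_node.data.runFn(&run_node.data);
            continue;
        }

        held.release();
        return; // a real pool would park the worker here until woken
    }
}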

Commit messages from the subsequent pushes include:

  • If this errdefer did get run it would constitute a race condition, so I deleted the dead code for clarity.
  • We generally get away with atomic primitives; however, a lock is required around the refresh function since it traverses the Node graph, and we need to be sure no references to Nodes remain after end() is called.
  • And enable it for Drone CI. I hate to do this, but I need to make progress on other fronts.
andrewrk merged commit 4918605 into master on Dec 21, 2020
andrewrk deleted the parallel-c-objects branch on December 21, 2020
mikdusan added the "release notes" label (This PR should be mentioned in the release notes.) on Jan 24, 2021
andrewrk added this to the 0.9.0 milestone on Apr 18, 2024