
Improve scheduler performance by reducing lock contention #111

Merged: 4 commits merged into master on Jul 26, 2022

Conversation

@psalz (Member) commented on Mar 30, 2022

This PR is based on #109, so that one should be merged first.

This improves the performance of the scheduler in two ways:

  • Address a long-standing issue (we've had a TODO note about this for years...) where the scheduler thread would hold the lock on the event queue until a task was fully built and flushed, effectively causing it to operate in lockstep with the main thread. This has been fixed by double-buffering the event queue, with the scheduler thread only locking the mutex to swap the two buffers (see the sketch after this list).
  • Reduce contention on the task_manager's mutex by passing a task pointer to the scheduler directly, instead of having it repeatedly call task_manager::get_task while the main thread is trying to acquire the lock to produce new tasks.
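
As a sketch of the first point, here is the general shape of a swap-based double-buffered event queue. The types and names (scheduler_event, event_queue) are illustrative, not the actual scheduler implementation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

struct scheduler_event { /* e.g. a pointer to the newly created task, a shutdown flag, ... */ };

class event_queue {
  public:
	// Main thread: enqueue an event and wake the scheduler thread.
	void push(scheduler_event evt) {
		{
			std::lock_guard lock{m_mutex};
			m_pending.push(std::move(evt));
		}
		m_cv.notify_one();
	}

	// Scheduler thread: block until events are available, then take the whole
	// batch by swapping queues. The mutex is held only for the swap, so the
	// main thread is never blocked while commands are being generated.
	std::queue<scheduler_event> pop_all() {
		std::unique_lock lock{m_mutex};
		m_cv.wait(lock, [this] { return !m_pending.empty(); });
		std::queue<scheduler_event> batch;
		std::swap(batch, m_pending);
		return batch; // processed outside the lock
	}

  private:
	std::mutex m_mutex;
	std::condition_variable m_cv;
	std::queue<scheduler_event> m_pending;
};
```

The key property is that push only holds the mutex for the enqueue itself, and pop_all hands the scheduler an entire batch it can process while the main thread keeps submitting.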

Additionally, I've reduced the number of tasks generated in the "soup" topology benchmark so that the timings are of about the same order of magnitude as for the other topologies (so they fit better onto a single chart).

Here is a comparison between the original "lockstep" scheduler and the double-buffered "decoupled" one, but still with frequent calls to get_task. There are two bars for each benchmark in each group (same colors), with the right bar representing the decoupled scheduler. The two "ST *" groups refer to the single-threaded baseline implementation that generates commands in the main thread, circumventing the scheduler altogether (numbers in these groups should not be affected by the changes).

[Chart: original "lockstep" scheduler vs. double-buffered "decoupled" scheduler, per benchmark group]

We can see that this is an easy win overall, with numbers improving across the board, in particular for the "soup" and "wave_sim" topologies. Additionally, jitter is greatly reduced.

Unfortunately, the numbers are not quite as clear for the second change. Here we compare the decoupled scheduler against the variant that additionally passes the task pointer around (the left bar here corresponds to the right bar above):

[Chart: decoupled scheduler vs. decoupled scheduler with task pointer passing]

There are again considerable improvements for "soup" and "wave_sim", which now also outperform the single-threaded variant in most cases. Jitter is again reduced across the board. However, with 4 nodes some numbers are worse, in particular for the "Immediate" group. While I was able to replicate this several times, I've also encountered an instance where performance was better everywhere, with the "Immediate" group outperforming "ST Immediate" for most topologies. I therefore attribute this "worsening" to overall jitter in our benchmarking setup, and given the significance of the "soup" and "wave_sim" results, I would still consider this an improvement overall.

@fknorr (Contributor) left a comment:

Great improvement, thanks! Even if there is not always a benefit over the single-threaded setup, this is an important requirement for any parallel CDAG generation we might want to do in the future.

Could you add another bar chart for the TDAG times as a baseline to see what the best-case improvement of the separate scheduler thread would be?

Resolved review threads on include/scheduler.h, src/scheduler.cc, and include/graph_generator.h.
@psalz (Member, Author) commented on Mar 31, 2022

> Could you add another bar chart for the TDAG times as a baseline to see what the best-case improvement of the separate scheduler thread would be?

Yes! I've plotted the single-threaded TDAG+CDAG times against the dedicated scheduler thread. With perfect overlap, the scheduler times would equal the single-threaded CDAG times. For 4 nodes we are not that far off!

[Chart: single-threaded TDAG+CDAG times vs. dedicated scheduler thread]

@fknorr (Contributor) commented on Apr 2, 2022

Thanks! LGTM up to the in_flight_count remark. Some notes for future investigation into scheduler-thread performance:

It's remarkable how much the results differ between topologies. It seems intuitive that the benefit should be maximal when TDAG and CDAG generation are most similar in execution time (soup in both graphs, wave_sim in the single-node case). I assume that the positive result for wave_sim in the 4-node example is simply due to jitter, because in no other scenario can the TDAG generation latency be hidden completely.

To me it seems like the 1-node soup should be the perfect candidate for parallelization, since TDAG and CDAG generation can be almost fully overlapped and the total runtime is long enough to avoid any constant overhead that might be visible in faster runs (e.g. chain). Yet we only save about 23% instead of the theoretical 45%. My guess is that since each individual task is very quick to compute, the scheduler has to block on its queue after every task, so we add an OS scheduler round-trip to each iteration.
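
As a rough sanity check on the "theoretical 45%" figure (assuming the single-threaded time is simply the sum of TDAG and CDAG generation time, and a perfectly overlapping scheduler thread hides the shorter of the two):

$$\text{saving} = 1 - \frac{\max(T_{\mathrm{TDAG}},\, T_{\mathrm{CDAG}})}{T_{\mathrm{TDAG}} + T_{\mathrm{CDAG}}}$$

This approaches 50% as the two times become equal; a value around 45% would correspond to the measured single-node soup timings being close but not identical.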

Also, it is notable that for 1-node soup, TDAG generation is more expensive than CDAG generation. Both generate mostly identical graphs, but the CDAG is also serialized (and contains the infamous debug_label), so I would expect it to be the other way round. Possible causes that come to mind:

  • Allocations for command group type erasure and range mappers (see the sketch after this list)
  • task_mutex, locked and unlocked for every task – but it is not contended here, so locking should never need a syscall
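
A minimal, hypothetical sketch of the first point: type-erasing the command-group function (and, similarly, the range mappers) generally costs at least one heap allocation per task once the captured state exceeds the wrapper's small-buffer optimization. The types here are illustrative stand-ins, not the Celerity implementation:

```cpp
#include <functional>
#include <utility>

struct handler; // opaque here; stands in for the command group handler

// A command group is submitted as an arbitrary callable and has to be stored
// in type-erased form so the task object can outlive the submission call.
// A wrapper like std::function heap-allocates as soon as the captured state
// (accessors, range mappers, kernel functors, ...) exceeds its small-buffer
// size, adding at least one allocation per task.
using command_group_storage = std::function<void(handler&)>;

template <typename CGF>
command_group_storage erase_command_group(CGF&& cgf) {
	return command_group_storage{std::forward<CGF>(cgf)}; // may allocate
}
```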

@psalz force-pushed the decouple-scheduler branch from 2270306 to 6459896 on July 19, 2022
@psalz (Member, Author) commented on Jul 19, 2022

I've rebased onto master, which now includes the task ring buffer. Because of this, the second optimization (passing around a task pointer) is moot, as task_manager::get_task is now lock-free. Nevertheless, I think it's still a good thing to do.
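
For context, a minimal sketch of why a ring-buffer-backed lookup can be lock-free. The names and layout here are illustrative assumptions, not the actual task_ring_buffer implementation (slot reuse and horizon-based reclamation are omitted):

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

struct task { /* command group, geometry, dependencies, ... */ };
using task_id = std::size_t;

// Tasks live in a fixed-size ring indexed by task id. The producing (main)
// thread publishes a new slot with a release store on the "next id" counter;
// the scheduler thread only needs an acquire load to look a task up, so no
// mutex is involved on the get_task path.
class task_ring_buffer_sketch {
  public:
	static constexpr std::size_t capacity = 1024;

	// Single producer: store the task, then publish it.
	void put(task_id tid, std::unique_ptr<task> tsk) {
		assert(tid == m_next.load(std::memory_order_relaxed));
		m_slots[tid % capacity] = std::move(tsk);
		m_next.store(tid + 1, std::memory_order_release);
	}

	// Lock-free lookup of an already-published task.
	const task* get_task(task_id tid) const {
		[[maybe_unused]] const task_id published = m_next.load(std::memory_order_acquire);
		assert(tid < published); // callers must only ask for published tasks
		return m_slots[tid % capacity].get();
	}

  private:
	std::array<std::unique_ptr<task>, capacity> m_slots;
	std::atomic<task_id> m_next{0};
};
```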

With the ring buffer, times are lower in general, and the results after decoupling the scheduler are a lot less dramatic. Improvements are most noticeable for tasks where computing the command graph isn't cheap, e.g. wave_sim.

[Two charts: benchmark results after rebasing onto the task ring buffer]

Note that besides decreasing the number of tasks for the "soup" benchmark (as described above), I've now also changed the benchmarks to use the task-manager callback to notify the scheduler of newly created tasks. As a result, horizons are now also built by the benchmark (previously they were silently ignored).
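
For illustration, a sketch of the callback-based notification pattern described here; the class and function names (register_task_callback, notify_task_created) are hypothetical stand-ins, not the actual Celerity API:

```cpp
#include <functional>
#include <utility>

struct task; // opaque here

class scheduler_like {
  public:
	void notify_task_created(const task* tsk) { (void)tsk; /* push an event onto the scheduler's queue */ }
};

class task_manager_like {
  public:
	// Registered once by the benchmark. Afterwards, every task the task
	// manager creates is reported through this callback (including
	// automatically generated horizon tasks), which a manual
	// "notify after submit" scheme would silently miss.
	void register_task_callback(std::function<void(const task*)> cb) { m_cb = std::move(cb); }

	void task_created(const task* tsk) {
		if(m_cb) m_cb(tsk);
	}

  private:
	std::function<void(const task*)> m_cb;
};

// Benchmark wiring (sketch):
//   task_manager_like tm;
//   scheduler_like schdlr;
//   tm.register_task_callback([&](const task* tsk) { schdlr.notify_task_created(tsk); });
```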

@psalz requested review from PeterTh and fknorr on July 19, 2022
@psalz force-pushed the decouple-scheduler branch from 6459896 to 4819834 on July 19, 2022
@fknorr (Contributor) left a comment:

LGTM! Some day we should do a pass over both the task and command graph code, replacing DAG node pointers with references where they are guaranteed to be non-null.

@PeterTh (Contributor) left a comment:

LGTM, other than the comment on going a bit further with the task_id -> task refactoring (could also be a separate PR).

Resolved review threads on include/scheduler.h and include/task_manager.h.
@psalz force-pushed the decouple-scheduler branch from 4819834 to 9a6a200 on July 25, 2022
Resolved review threads on src/task_manager.cc.
psalz added 4 commits on July 26, 2022:

  • This brings the timings more in line with the other benchmarks (better for fitting them on the same plot).
  • Previously we manually informed the scheduler of newly created tasks. This meant that horizons, which are generated automatically, were not considered. Benchmark times are worse as now more work is being done.
  • Don't hold the mutex while generating commands. Swap between two sets of event queues to minimize the number of times the scheduler thread needs to lock the mutex.
@psalz force-pushed the decouple-scheduler branch from 6c9cb68 to 29451b9 on July 26, 2022
@psalz merged commit 60a5b74 into master on Jul 26, 2022
@psalz deleted the decouple-scheduler branch on July 26, 2022