
Improve scheduler performance by reducing lock contention #111

Merged: 4 commits merged into master on Jul 26, 2022

Conversation

@psalz (Member) commented on Mar 30, 2022

This PR is based on #109, so that one should be merged first.

This improves the performance of the scheduler in two ways:

  • Address a long-standing issue (we've had a TODO note about this for years...) where the scheduler thread would hold the lock on the event queue until a task was fully built and flushed, effectively causing it to operate in lockstep with the main thread. This has been fixed by double-buffering the event queue, with the scheduler thread only locking the mutex to swap the two buffers (see the sketch after this list).
  • Reduce contention on the task_manager's mutex by passing a task pointer to the scheduler directly, instead of having it repeatedly call task_manager::get_task while the main thread is trying to acquire the lock to produce new tasks.
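
As a sketch of the first point, here is the general shape of a swap-based double-buffered event queue. The types and names (scheduler_event, event_queue) are illustrative, not the actual scheduler implementation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

struct scheduler_event { /* e.g. a pointer to the newly created task, a shutdown flag, ... */ };

class event_queue {
  public:
	// Main thread: enqueue an event and wake the scheduler thread.
	void push(scheduler_event evt) {
		{
			std::lock_guard lock{m_mutex};
			m_pending.push(std::move(evt));
		}
		m_cv.notify_one();
	}

	// Scheduler thread: block until events are available, then take the whole
	// batch by swapping queues. The mutex is held only for the swap, so the
	// main thread is never blocked while commands are being generated.
	std::queue<scheduler_event> pop_all() {
		std::unique_lock lock{m_mutex};
		m_cv.wait(lock, [this] { return !m_pending.empty(); });
		std::queue<scheduler_event> batch;
		std::swap(batch, m_pending);
		return batch; // processed outside the lock
	}

  private:
	std::mutex m_mutex;
	std::condition_variable m_cv;
	std::queue<scheduler_event> m_pending;
};
```

The key property is that push only holds the mutex for the enqueue itself, and pop_all hands the scheduler an entire batch it can process while the main thread keeps submitting.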

Additionally, I've reduced the number of tasks generated in the "soup" topology benchmark so that the timings are of about the same order of magnitude as for the other topologies (so they fit better onto a single chart).

Here is a comparison between the original "lockstep" scheduler and the double-buffered "decoupled" one, but still with frequent calls to get_task. There are two bars for each benchmark in each group (same colors), with the right bar representing the decoupled scheduler. The two "ST *" groups refer to the single-threaded baseline implementation that generates commands in the main thread, circumventing the scheduler altogether (numbers in these groups should not be affected by the changes).

[Chart: original "lockstep" scheduler vs. double-buffered "decoupled" scheduler, per benchmark group]

We can see that this is an easy win overall, with numbers improving across the board, in particular for the "soup" and "wave_sim" topologies. Additionally, jitter is greatly reduced.

Unfortunately, the numbers are not quite as clear for the second change. Here we compare the decoupled scheduler against the variant that additionally passes the task pointer around (the left bar here corresponds to the right bar above):

[Chart: decoupled scheduler vs. decoupled scheduler with task pointer passing]

There are again considerable improvements for "soup" and "wave_sim", which now also outperform the single-threaded variant in most cases. Jitter is again reduced across the board. However, with 4 nodes some numbers are worse, in particular for the "Immediate" group. While I was able to replicate this several times, I've also encountered an instance where performance was better everywhere, with the "Immediate" group outperforming "ST Immediate" for most topologies. I therefore attribute this "worsening" to overall jitter in our benchmarking setup, and given the significance of the "soup" and "wave_sim" results, I would still consider this an improvement overall.

@fknorr (Contributor) left a comment:

Great improvement, thanks! Even if there is not always a benefit over the single-threaded setup, this is an important requirement for any parallel CDAG generation we might want to do in the future.

Could you add another bar chart for the TDAG times as a baseline to see what the best-case improvement of the separate scheduler thread would be?

Resolved review threads on include/scheduler.h, src/scheduler.cc, and include/graph_generator.h.
@psalz (Member, Author) commented on Mar 31, 2022

> Could you add another bar chart for the TDAG times as a baseline to see what the best-case improvement of the separate scheduler thread would be?

Yes! I've plotted the single-threaded TDAG+CDAG times against the dedicated scheduler thread. With perfect overlap, the scheduler times would equal the single-threaded CDAG times. For 4 nodes we are not that far off!

[Chart: single-threaded TDAG+CDAG times vs. dedicated scheduler thread]

@fknorr (Contributor) commented on Apr 2, 2022

Thanks! LGTM up to the in_flight_count remark. Some notes for future investigation into scheduler-thread performance:

It's remarkable how much the results differ between topologies. It seems intuitive that the benefit should be maximal when TDAG and CDAG generation are most similar in execution time (soup in both graphs, wave_sim in the single-node case). I assume that the positive result for wave_sim in the 4-node example is simply due to jitter, because in no other scenario can the TDAG generation latency be hidden completely.

To me it seems like the 1-node soup should be the perfect candidate for parallelization, since TDAG and CDAG generation can be almost fully overlapped and the total runtime is long enough to avoid any constant overhead that might be visible in faster runs (e.g. chain). Yet we only save about 23% instead of the theoretical 45%. My guess is that since each individual task is very quick to compute, the scheduler has to block on its queue after every task, so we add an OS scheduler round-trip to each iteration.
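
As a rough sanity check on the "theoretical 45%" figure (assuming the single-threaded time is simply the sum of TDAG and CDAG generation time, and a perfectly overlapping scheduler thread hides the shorter of the two):

$$\text{saving} = 1 - \frac{\max(T_{\mathrm{TDAG}},\, T_{\mathrm{CDAG}})}{T_{\mathrm{TDAG}} + T_{\mathrm{CDAG}}}$$

This approaches 50% as the two times become equal; a value around 45% would correspond to the measured single-node soup timings being close but not identical.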

Also, it is notable that for 1-node soup, TDAG generation is more expensive than CDAG generation. Both generate mostly identical graphs, but the CDAG is also serialized (and contains the infamous debug_label), so I would expect it to be the other way round. Possible causes that come to mind:

  • Allocations for command group type erasure and range mappers (see the sketch after this list)
  • task_mutex, locked and unlocked for every task – but it is not contended here, so locking should never need a syscall
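
A minimal, hypothetical sketch of the first point: type-erasing the command-group function (and, similarly, the range mappers) generally costs at least one heap allocation per task once the captured state exceeds the wrapper's small-buffer optimization. The types here are illustrative stand-ins, not the Celerity implementation:

```cpp
#include <functional>
#include <utility>

struct handler; // opaque here; stands in for the command group handler

// A command group is submitted as an arbitrary callable and has to be stored
// in type-erased form so the task object can outlive the submission call.
// A wrapper like std::function heap-allocates as soon as the captured state
// (accessors, range mappers, kernel functors, ...) exceeds its small-buffer
// size, adding at least one allocation per task.
using command_group_storage = std::function<void(handler&)>;

template <typename CGF>
command_group_storage erase_command_group(CGF&& cgf) {
	return command_group_storage{std::forward<CGF>(cgf)}; // may allocate
}
```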

@psalz force-pushed the decouple-scheduler branch from 2270306 to 6459896 on July 19, 2022
@psalz (Member, Author) commented on Jul 19, 2022

I've rebased onto master, which now includes the task ring buffer. Because of this, the second optimization (passing around a task pointer) is moot, as task_manager::get_task is now lock-free. Nevertheless, I think it's still a good thing to do.
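
For context, a minimal sketch of why a ring-buffer-backed lookup can be lock-free. The names and layout here are illustrative assumptions, not the actual task_ring_buffer implementation (slot reuse and horizon-based reclamation are omitted):

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

struct task { /* command group, geometry, dependencies, ... */ };
using task_id = std::size_t;

// Tasks live in a fixed-size ring indexed by task id. The producing (main)
// thread publishes a new slot with a release store on the "next id" counter;
// the scheduler thread only needs an acquire load to look a task up, so no
// mutex is involved on the get_task path.
class task_ring_buffer_sketch {
  public:
	static constexpr std::size_t capacity = 1024;

	// Single producer: store the task, then publish it.
	void put(task_id tid, std::unique_ptr<task> tsk) {
		assert(tid == m_next.load(std::memory_order_relaxed));
		m_slots[tid % capacity] = std::move(tsk);
		m_next.store(tid + 1, std::memory_order_release);
	}

	// Lock-free lookup of an already-published task.
	const task* get_task(task_id tid) const {
		[[maybe_unused]] const task_id published = m_next.load(std::memory_order_acquire);
		assert(tid < published); // callers must only ask for published tasks
		return m_slots[tid % capacity].get();
	}

  private:
	std::array<std::unique_ptr<task>, capacity> m_slots;
	std::atomic<task_id> m_next{0};
};
```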

With the ring buffer, times are lower in general, and the results after decoupling the scheduler are a lot less dramatic. Improvements are most noticeable for tasks where computing the command graph isn't cheap, e.g. wave_sim.

[Two charts: benchmark results after rebasing onto the task ring buffer]

Note that besides decreasing the number of tasks for the "soup" benchmark (as described above), I've now also changed the benchmarks to use the task-manager callback to notify the scheduler of newly created tasks. As a result, horizons are now also built by the benchmark (previously they were silently ignored).
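
For illustration, a sketch of the callback-based notification pattern described here; the class and function names (register_task_callback, notify_task_created) are hypothetical stand-ins, not the actual Celerity API:

```cpp
#include <functional>
#include <utility>

struct task; // opaque here

class scheduler_like {
  public:
	void notify_task_created(const task* tsk) { (void)tsk; /* push an event onto the scheduler's queue */ }
};

class task_manager_like {
  public:
	// Registered once by the benchmark. Afterwards, every task the task
	// manager creates is reported through this callback (including
	// automatically generated horizon tasks), which a manual
	// "notify after submit" scheme would silently miss.
	void register_task_callback(std::function<void(const task*)> cb) { m_cb = std::move(cb); }

	void task_created(const task* tsk) {
		if(m_cb) m_cb(tsk);
	}

  private:
	std::function<void(const task*)> m_cb;
};

// Benchmark wiring (sketch):
//   task_manager_like tm;
//   scheduler_like schdlr;
//   tm.register_task_callback([&](const task* tsk) { schdlr.notify_task_created(tsk); });
```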

@psalz requested review from PeterTh and fknorr on July 19, 2022
@psalz force-pushed the decouple-scheduler branch from 6459896 to 4819834 on July 19, 2022
@fknorr (Contributor) left a comment:

LGTM! Some day we should do a pass over both the task and command graph code, replacing DAG node pointers with references where they are guaranteed to be non-null.

@PeterTh (Contributor) left a comment:

LGTM, other than the comment on going a bit further with the task_id -> task refactoring (could also be a separate PR).

Resolved review threads on include/scheduler.h and include/task_manager.h.
@psalz force-pushed the decouple-scheduler branch from 4819834 to 9a6a200 on July 25, 2022
Resolved review threads on src/task_manager.cc.
psalz added 4 commits on July 26, 2022:

  • This brings the timings more in line with the other benchmarks (better for fitting them on the same plot).
  • Previously we manually informed the scheduler of newly created tasks. This meant that horizons, which are generated automatically, were not considered. Benchmark times are worse as now more work is being done.
  • Don't hold the mutex while generating commands. Swap between two sets of event queues to minimize the number of times the scheduler thread needs to lock the mutex.
@psalz force-pushed the decouple-scheduler branch from 6c9cb68 to 29451b9 on July 26, 2022
@psalz merged commit 60a5b74 into master on Jul 26, 2022
@psalz deleted the decouple-scheduler branch on July 26, 2022