Improve scheduler performance by reducing lock contention #111
Conversation
Great improvement, thanks! Even if there is not always a benefit over the single-threaded setup, this is an important requirement for any parallel CDAG generation we might want to do in the future.
Could you add another bar chart for the TDAG times as a baseline to see what the best-case improvement of the separate scheduler thread would be?
Yes! I've plotted the single-threaded TDAG+CDAG times against the dedicated scheduler thread. If we had perfect overlapping, the scheduler times would be equal to the single-threaded CDAG times. For 4 nodes we are not that far off!
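For reference, the best case can be stated explicitly (notation mine, not taken from the benchmarks): the main thread and the scheduler thread form a two-stage pipeline over TDAG and CDAG generation, so

$$T_{\text{single-threaded}} = T_{\text{TDAG}} + T_{\text{CDAG}}, \qquad T_{\text{decoupled}} \geq \max(T_{\text{TDAG}},\, T_{\text{CDAG}}),$$

and with perfect overlap (CDAG generation being the longer stage) the decoupled scheduler's wall time collapses to the single-threaded CDAG time, which is why it serves as the baseline here.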
Thanks! LGTM up to the

It's remarkable how far results differ between topologies. It seems intuitive that the benefit should be maximal when TDAG and CDAG generation are most similar in execution time (

To me it seems like the 1-node soup should be the perfect condition for parallelization, since TDAG and CDAG can be almost fully overlapped and the total runtime is long enough to avoid any constant overhead that might be visible in faster runs (e.g.

Also it is notable that for 1-node soup, TDAG generation is more expensive than CDAG. Both generate mostly identical graphs, but the CDAG is also serialized (and contains the infamous
Force-pushed from 2270306 to 6459896
I've rebased onto master, which now includes the task ring buffer. Due to this the second optimization (passing around a task pointer) is moot, as

With the ring buffer, times are lower in general and the results after decoupling the scheduler are a lot less dramatic. Improvements are most noticeable for tasks where computing the command graph isn't cheap, e.g. wave_sim.

Note that besides decreasing the number of tasks for the "soup" benchmark (as described above), I've now also changed the benchmarks to use the task-manager callback to notify the scheduler of newly created tasks. This has the effect that horizons will now also be built by the benchmark (previously they were silently ignored).
Force-pushed from 6459896 to 4819834
LGTM! Some day we should do a pass over both task and command graph code replacing DAG node pointers with references where they are guaranteed to be non-null.
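A tiny illustration of that suggestion, with hypothetical signatures (not actual Celerity API):

```cpp
// Hypothetical signatures illustrating the pointer-to-reference refactoring.
void generate_commands(const task* tsk);  // caller must remember that tsk must not be null
void generate_commands(const task& tsk);  // the non-null guarantee is encoded in the type
```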
LGTM, other than the comment on going a bit further with the task_id -> task refactoring (could also be a separate PR).
Force-pushed from 4819834 to 9a6a200
This brings the timings more in line with the other benchmarks (better for fitting them on the same plot).
Previously we manually informed the scheduler of newly created tasks. This meant that horizons, which are generated automatically, were not considered. Benchmark times are worse as now more work is being done.
Don't hold mutex while generating commands. Swap between two sets of event queues to minimize the number of times the scheduler thread needs to lock the mutex.
Force-pushed from 6c9cb68 to 29451b9
This PR is based on #109, so that one should be merged first.
This improves the performance of the scheduler in two ways:
- Previously, the scheduler thread would hold a lock on the `events` queue until a task was fully built and flushed, effectively causing it to operate in lockstep with the main thread. This has been fixed by double-buffering the event queue, with the scheduler thread only locking the mutex to swap the two.
- Contention on the `task_manager`'s mutex is reduced by passing a task pointer to the scheduler directly, instead of it repeatedly calling `task_manager::get_task` (while the main thread is trying to acquire the lock for producing new tasks).

Additionally, I've reduced the number of tasks generated in the "soup" topology benchmark so that the timings are in about the same order of magnitude as for the other topologies (= fits better onto a chart).
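To make the two changes above concrete, here is a minimal sketch of the idea, not the actual Celerity code: all type and member names (`scheduler_event`, `decoupled_scheduler_queue`, `m_in_flight`, ...) are invented for illustration, and the event simply carries the task pointer described in the second point.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

class task; // opaque here; only pointers are stored

struct scheduler_event {
    const task* tsk; // task pointer handed over directly, avoiding task_manager::get_task
};

class decoupled_scheduler_queue {
  public:
    // Main thread: called for every newly built task (including horizons).
    void push(const task* tsk) {
        {
            std::lock_guard lock{m_mutex};
            m_in_flight.push_back({tsk});
        }
        m_cv.notify_one();
    }

    // Scheduler thread: blocks until events are available, then swaps the two
    // queues, so the mutex is held only for the swap itself.
    std::deque<scheduler_event>& wait_and_swap() {
        std::unique_lock lock{m_mutex};
        m_cv.wait(lock, [this] { return !m_in_flight.empty(); });
        m_processing.clear();
        std::swap(m_in_flight, m_processing);
        return m_processing; // only ever touched by the scheduler thread
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<scheduler_event> m_in_flight;  // filled by the main thread
    std::deque<scheduler_event> m_processing; // drained by the scheduler thread
};
```

The scheduler thread can then drain a whole batch of events without holding any lock, which is where the (potentially expensive) command generation happens:

```cpp
void generate_commands(const task& tsk); // stand-in for CDAG generation, defined elsewhere

void scheduler_loop(decoupled_scheduler_queue& queue) {
    for(;;) {
        for(const auto& evt : queue.wait_and_swap()) {
            generate_commands(*evt.tsk); // no mutex held while generating commands
        }
    }
}
```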
Here is a comparison between the original "lockstep" scheduler and the double-buffered "decoupled" one, but still with frequent calls to `get_task`. There are two bars for each benchmark in each group (same colors), with the right bar representing the decoupled scheduler. The two "ST *" groups refer to the single-threaded baseline implementation that generates commands in the main thread, circumventing the scheduler altogether (numbers in these groups should not be affected by the changes).

We can see that this is an easy win overall, with numbers improving across the board, in particular for the "soup" and "wave_sim" topologies. Additionally, jitter is greatly reduced.
Unfortunately, the numbers are not quite as clear for the second change. Here we compare the decoupled scheduler against the variant where the task pointer is passed around directly (the left bar here is the right bar above):
There are again considerable improvements for "soup" and "wave_sim", which now also outperform the single-threaded variant in most cases. Jitter is again reduced across the board. However, with 4 nodes there are some numbers that are worse, in particular in the "Immediate" group. While I was able to replicate this several times, I've also encountered an instance where performance was better everywhere, with the "Immediate" group outperforming "ST Immediate" for most topologies. I therefore attribute this "worsening" to overall jitter in our benchmarking setup, and given the significance of the "soup" and "wave_sim" results, I would still consider this an improvement overall.