[IDAG] Switch to Instruction-Graph Scheduling #265
Check-perf-impact results: (877795252c9a57f7b343e4747db6ca4f)
Relative execution time per category: (mean of relative medians)
At last, a negative LoC balance.
Also, that's a beautiful picture, and it (and its source) should probably go in docs.
R.I.P. BM, BTM, et al. LGTM!
Fixes a bug with SYCL instant submission around reductions.
Check-perf-impact results: (f2e639c8a97550e58528a410c1b8586d)
Relative execution time per category: (mean of relative medians)
Edit: We inadvertently disabled mimalloc. All hail the benchmark suite!
Check-perf-impact results: (2908f97f836fd2def14c3429cd4d61ac)
Relative execution time per category: (mean of relative medians)
Switches to the new IDAG-based runtime and drops all newly unused legacy components.

- runtime now manages multiple devices, and the distr_queue API has been updated to reflect the fact.
- buffer_manager, reduction_manager and host_object_manager are now gone. ID assignment for these types is now handled by the runtime directly, and all components interacting with buffers, reductions and host objects track relevant state themselves. lifetime_extending_state is removed.
- scheduler now generates both the command- and the instruction graph in the same thread, and maintains ownership of both structures. The CDAG is pruned at generation time (since commands never leave the scheduler thread), and the IDAG is pruned once the scheduler is notified of epoch completion. Command serialization is gone, and with it, the command is_flushed marker.
- runtime destruction is now delayed until the last buffer / queue / host object is destroyed. ~distr_queue will continue to epoch-synchronize.
- runtime asserts that non-thread-safe functions are only called from the application thread, which will trigger on accidental value-captures of buffers / host objects into host tasks.
- Reductions are now available on all SYCL implementations since hipSYCL has added support.
- log_context, which was only used by worker_job, is removed.
- vendor/ctpl, which was only used by host_queue, is removed.
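The delayed runtime destruction described above can be pictured as plain reference counting. The following is a minimal sketch under that assumption, not the actual Celerity implementation: `runtime_sketch`, `tracked_object`, `notify_object_created` and `notify_object_destroyed` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the runtime (not Celerity code): it counts live
// user-facing objects and only shuts down once the last one has been destroyed.
class runtime_sketch {
  public:
    void notify_object_created() { ++m_live_objects; }

    void notify_object_destroyed() {
        assert(m_live_objects > 0);
        if(--m_live_objects == 0) { m_shut_down = true; } // last object gone: tear down
    }

    bool is_shut_down() const { return m_shut_down; }

  private:
    std::size_t m_live_objects = 0;
    bool m_shut_down = false;
};

// Hypothetical RAII handle standing in for a buffer, queue or host object.
class tracked_object {
  public:
    explicit tracked_object(runtime_sketch& rt) : m_rt(&rt) { m_rt->notify_object_created(); }
    tracked_object(const tracked_object&) = delete;
    tracked_object& operator=(const tracked_object&) = delete;
    ~tracked_object() { m_rt->notify_object_destroyed(); }

  private:
    runtime_sketch* m_rt;
};
```

In this sketch, destroying a queue while a buffer is still alive leaves the runtime running; it only shuts down after the final handle goes out of scope, regardless of destruction order.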
This is the final PR in the IDAG series. It switches to the new IDAG-based runtime and drops all newly unused legacy components.

- `runtime` now manages multiple devices, and the `distr_queue` API has been updated to reflect the fact.
- `buffer_manager`, `reduction_manager` and `host_object_manager` are now gone. ID assignment for these types is now handled by the runtime directly, and all components interacting with buffers, reductions and host objects (graph generators, executor and recorders) track the relevant state themselves as instructed by `notify_*_created` / `_destroyed` introduced in Explicitly manage buffer / host object lifetimes in graph generation #246. As a result, tasks do not need to keep strong references to buffers and host objects around anymore (`lifetime_extending_state`).
- `scheduler` now generates both the command- and the instruction graph in the same thread, and maintains ownership of both structures. The CDAG is pruned at generation time (since commands never leave the scheduler thread), and the IDAG is pruned once the scheduler is notified of epoch completion. Command serialization is gone, and with it, the command `is_flushed` marker.
- `runtime` now uses `live_executor` (replacing `legacy_executor` and `worker_job`) together with a `communicator` and `backend` instance to execute instructions.
- `communicator` together with `receive_arbiter` replaces `buffer_transfer_manager`.
- `backend` implementations replace `legacy_backend`, `host_queue` and `device_queue`.
- `runtime` destruction is now delayed until the last buffer / queue / host object is destroyed. `~distr_queue` will continue to epoch-synchronize.
- `runtime` asserts that non-thread-safe functions are only called from the application thread, which will trigger on accidental value-captures of buffers / host objects into host tasks.
- Reductions are now available on all SYCL implementations since hipSYCL has added support, [...] `#ifdefs` in tests and frontend code.
- `log_context`, which was only used by `worker_job`, is removed.
- `vendor/ctpl`, which was only used by `host_queue`, is removed.

Since one node now addresses multiple GPUs, scheduling becomes more expensive (IDAG generation is maybe ~4x as expensive as CDAG generation). This will be visible in benchmark results.
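To illustrate the IDAG pruning on epoch completion mentioned in the list above, here is a rough sketch. It assumes instructions carry monotonically increasing ids and that the completed epoch itself is kept as the oldest remaining instruction; both are assumptions for illustration, and `idag_sketch`, `instruction_sketch`, `append` and `notify_epoch_completed` are hypothetical names, not the real Celerity types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

using instruction_id = std::uint64_t;

// Hypothetical, heavily simplified instruction record (no dependencies tracked).
struct instruction_sketch {
    instruction_id id;
    bool is_epoch;
};

// Hypothetical instruction graph that retains instructions in generation order.
class idag_sketch {
  public:
    // Appends an instruction and returns its id; ids increase monotonically.
    instruction_id append(const bool is_epoch) {
        const instruction_id id = m_next_id++;
        m_instructions.push_back({id, is_epoch});
        return id;
    }

    // Called once the executor reports that `epoch` has completed. Everything
    // generated before the epoch is dropped; the epoch itself is kept (assumption).
    void notify_epoch_completed(const instruction_id epoch) {
        while(!m_instructions.empty() && m_instructions.front().id < epoch) {
            m_instructions.pop_front();
        }
    }

    std::size_t size() const { return m_instructions.size(); }

    instruction_id oldest() const {
        assert(!m_instructions.empty());
        return m_instructions.front().id;
    }

  private:
    std::deque<instruction_sketch> m_instructions;
    instruction_id m_next_id = 0;
};
```

The point of the sketch is the timing: unlike the CDAG, which can be pruned as soon as commands are generated, the instruction graph here is only trimmed after a completion notification arrives.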