[IDAG] Generate and test Instruction Graph #249
Pull Request Test Coverage Report for Build 8329557282
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
clang-tidy made some suggestions
594d5d5 to 86cab0a
Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)
Relative execution time per category: (mean of relative medians)
3c188ab to 3343177
afb78cf to bb55a77
Tremendous work!
I've looked at everything except `test/instruction_*` in some detail. Posting this for now.
I've looked through half the tests and the test utils now.
The testing infrastructure is really nice, I like how things like `select_unique<>` automatically add appropriate `REQUIRE`s.
I've added an additional test suite |
Whew, that only took me a cool 10 hours to review!
What can I say, exceptional work! I like how well-documented everything is, and how you separated the implementation of the graph generator from its interface, as well as how configurable it is and how well errors are reported (e.g. around the oversubscription hint)!
While it's obviously still very complex, I like how you untangled allocations, last writer/reader management and copy/read/write instructions from each other.
I also really like your evolution of the testing infrastructure and how it integrates with records. It looks pretty much like what I always envisioned (without knowing how to get there) for the command graph!
Unfortunately I didn't have time to go through all tests in detail, although I did sporadically check whether certain edge cases were covered by tests at all (by setting a breakpoint or printing a message).
I have some notes, most of which are typos or pretty minor stuff. I'm approving right away because I'm on holiday next week and don't want to block merging this!
I just hope I can retain some of this mental context for reviewing the instruction executor :')
I've added a check that will issue a warning if Celerity generates excessive numbers of small send commands due to |
New benchmark categories have no prior data to be compared to.
The instruction graph (IDAG) is Celerity's new static local schedule, and will eventually replace the ad-hoc local scheduling around buffer_manager, buffer_transfer_manager and worker_job. This commit only adds the IDAG structure, generator, printing and tests, and does not touch the actual execution model yet to limit the (already enormous) size in LOC of this change.
This is the first in a series of PRs to replace Celerity's ad-hoc local scheduling (`buffer_manager`, `buffer_transfer_manager`, `executor` and `worker_job`) with a per-node static schedule known as the Instruction Graph (IDAG). The IDAG sits at an abstraction level below task- and command graph, and manages memory allocation, memory copies, kernel launches, and MPI operations as individual graph nodes. This moves most tracking overhead ahead in time to the scheduler and improves concurrency between these "µop" instructions at the time of execution.
It also gives Celerity two new key features:
Finally, the extensible, test-friendly scheduling approach will allow us to upstream new features like collective communication or dynamic scheduling (?) with good performance and maintainability in the future.
Instruction Graph (IDAG)
The Instruction Graph is, unlike the task / command graph, not an `intrusive_graph`; its nodes instead reference their predecessors by id. Despite what the visual representation might suggest, it only contains information strictly necessary for execution, and all debug info (dependency origins, buffer ranges, kernel names) is written to the corresponding records when recording is enabled. The graph nodes are passed directly to the executor without an additional "serialized" representation.
The IDAG (and the future `instruction_executor`) has no notion of buffers; the abstraction it operates on is the allocation, which is roughly equivalent to a pointer at execution time. Allocations can either originate from an alloc-instruction (for buffers, reduction scratch-spaces) or from "user space" (buffer host-init pointers, fence `buffer_snapshots`). Kernel instructions contain metadata to perform accessor hydration based on these allocations and statically known offsets into them.
For peer-to-peer MPI data transfers, push-commands can be compiled to their equivalent send-instructions in a rather straightforward manner, but await-push commands are more involved. Since at scheduling time it is not known which boxes the local node will receive from which peers, the IDAG must conservatively provide allocations that can receive the full pushed region en-bloc, and perform some of the tracking previously done by the `buffer_transfer_manager` in a logic called receive arbitration: To allow the receiving side to `MPI_Recv` into an existing allocation without unpacking it from a `frame` as it exists today, each send-instruction is accompanied by a pilot message that is transferred to the receiver as soon as the instruction is generated. The pilot maps a `message_id` (MPI tag) to a `transfer_id` and buffer subrange, providing the receiver with all parameters needed to issue an `MPI_Recv`.
The IDAG re-uses host-buffer allocations as send and receive allocations for all MPI operations. This avoids additional complexity for allocating and pooling staging buffers, but comes at a performance cost if the MPI operations are strided (the worst-case would be sending or receiving a strided column in a 2D-split stencil, for example). This will be addressed by a future graph-allocation method which will also enable directly receiving into device-staging buffers through RDMA.
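To make the role of the pilot message more concrete, here is a minimal sketch of how a receiver might turn a pilot into the parameters of a receive. It is not the code in this PR: `pilot_message`, `subrange_1d`, `recv_params` and `make_recv_params` are hypothetical names, and boxes are reduced to one dimension for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical 1D stand-in for a buffer subrange, counted in elements.
struct subrange_1d {
    std::size_t offset;
    std::size_t size;
};

// Hypothetical pilot: announces which box of which transfer will arrive under which tag.
struct pilot_message {
    int message_id;            // doubles as the MPI tag of the payload message
    std::uint64_t transfer_id; // identifies the await-push this payload belongs to
    subrange_1d box;           // buffer subrange covered by the payload
};

// Everything needed to post a receive directly into the destination allocation.
struct recv_params {
    int source_rank;
    int tag;
    std::size_t dest_offset_bytes; // offset into the receiving allocation
    std::size_t size_bytes;
};

// Returns receive parameters if the pilot matches the expected transfer and its box
// lies entirely within the (conservatively sized) receive allocation.
std::optional<recv_params> make_recv_params(const pilot_message& pilot, int source_rank,
    std::uint64_t expected_transfer, const subrange_1d& allocation, std::size_t elem_size) {
    assert(elem_size > 0);
    if(pilot.transfer_id != expected_transfer) return std::nullopt;
    const bool inside = pilot.box.offset >= allocation.offset
        && pilot.box.offset + pilot.box.size <= allocation.offset + allocation.size;
    if(!inside) return std::nullopt;
    return recv_params{source_rank, pilot.message_id,
        (pilot.box.offset - allocation.offset) * elem_size, pilot.box.size * elem_size};
}
```

In this sketch the returned tag, source rank, byte offset and byte count would be passed on to `MPI_Recv`; a pilot for a different transfer or for a box outside the allocation yields no parameters.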
Reductions in the IDAG are performed in two stages, where an eager node-local reduction (on `execution_command`) precedes the lazy and optional global reduction (on `reduction_command`). Reductions are executed on the host and in host memory, where a scratch space is allocated to copy / receive all partial inputs before applying the reduction operator.
Any resources that are tracked besides DAG allocations and host objects (e.g. reduction functors) can be deleted from the executor's tracking structures through an instruction garbage list that is attached to each horizon and epoch.
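The two reduction stages can be illustrated with a small host-side sketch. This is only a model of the data flow under simplifying assumptions (all partial inputs already sit in a host scratch space, and the global stage sees every node's partial result); `reduce_scratch` and `two_stage_reduce` are hypothetical names, not functions from this PR.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Folds all partial inputs gathered in a host scratch space with the reduction operator.
template <typename T, typename BinaryOp>
T reduce_scratch(const std::vector<T>& scratch, T identity, BinaryOp op) {
    T acc = identity;
    for(const T& partial : scratch) {
        acc = op(acc, partial);
    }
    return acc;
}

// Models both stages: each inner vector holds the per-device partial results of one node.
template <typename T, typename BinaryOp>
T two_stage_reduce(const std::vector<std::vector<T>>& partials_per_node, T identity, BinaryOp op) {
    assert(!partials_per_node.empty());
    // Stage 1 (eager, node-local): every node reduces its own partial results.
    std::vector<T> node_results;
    for(const auto& device_partials : partials_per_node) {
        node_results.push_back(reduce_scratch(device_partials, identity, op));
    }
    // Stage 2 (lazy, global): node results are gathered into a scratch space and reduced again.
    return reduce_scratch(node_results, identity, op);
}
```

The global stage is only needed when a later command actually consumes the reduced value, which is why it is described as lazy and optional above.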
Instruction Graph Generation
The `instruction_graph_generator` (IGGen) is parameterized with an abstract system configuration which defines the number of devices, the number of memories, and system capabilities. This model supports multiple devices sharing the same memory or sharing memory with the host (although that will not initially be used by the runtime) and also systems where it is not possible to memcpy between every pair of memories - this is notably true of our own test system with two Intel Arc GPUs which do not have p2p copy functionality.
The IGGen maintains state about each buffer and host object that is created on the runtime's side and naturally generates instruction nodes in topological order, i.e. a sequential execution of instructions will fulfill all internal dependencies. Multiple allocations are tracked per buffer, but they currently cannot overlap except during a resize operation. Access fronts are tracked for both reads and writes in order to compute true/anti/output-dependencies without walking the instruction graph or visiting past commands.
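As a rough illustration of how access fronts yield dependencies without a graph walk, the following sketch tracks a single allocation as a whole. The real generator tracks regions within allocations; `access_front_sketch`, `dependency` and `dependency_kind` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using instruction_id = std::uint64_t;

enum class dependency_kind { true_dep, anti_dep, output_dep };

struct dependency {
    instruction_id on;
    dependency_kind kind;
};

// Tracks the last writer and all readers since that write for one allocation.
class access_front_sketch {
  public:
    // A read depends on the last writer (true dependency) and joins the read front.
    std::vector<dependency> read(instruction_id reader) {
        std::vector<dependency> deps;
        if(m_last_writer) deps.push_back({*m_last_writer, dependency_kind::true_dep});
        m_readers.push_back(reader);
        return deps;
    }

    // A write must wait for all concurrent readers (anti-dependencies). Without readers
    // it must wait for the previous writer instead (output dependency); with readers,
    // that ordering is already implied because each reader depends on the last writer.
    std::vector<dependency> write(instruction_id writer) {
        // Instructions are generated in topological order, so ids only grow.
        assert(!m_last_writer || writer > *m_last_writer);
        std::vector<dependency> deps;
        if(!m_readers.empty()) {
            for(const instruction_id reader : m_readers) {
                deps.push_back({reader, dependency_kind::anti_dep});
            }
        } else if(m_last_writer) {
            deps.push_back({*m_last_writer, dependency_kind::output_dep});
        }
        m_last_writer = writer;
        m_readers.clear();
        return deps;
    }

  private:
    std::optional<instruction_id> m_last_writer;
    std::vector<instruction_id> m_readers;
};
```

Each new instruction only consults the current front, so the cost of computing its dependencies does not grow with the number of past instructions.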
Whenever there are multiple producers or consumers of data (around copy, send or receive instructions), the IGGen splits these transfers to achieve maximum concurrency - i.e. no transfer instruction should / will ever introduce a synchronization point between producer- or consumer instructions that was not there before.
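The splitting rule can be sketched for the producer side as follows. The example assumes one-dimensional, non-overlapping producer intervals; `interval`, `producer`, `copy_instruction` and `split_copy_by_producer` are hypothetical names and not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using instruction_id = std::uint64_t;

// Half-open element range [begin, end).
struct interval {
    std::size_t begin;
    std::size_t end;
};

// An earlier instruction and the part of the allocation it wrote.
struct producer {
    instruction_id id;
    interval written;
};

// One copy per producer, each depending only on the instruction that produced its data.
struct copy_instruction {
    interval range;
    instruction_id depends_on;
};

// Splits a copy so that no piece waits on more than one producer. A single copy of the
// whole range would instead synchronize with all producers at once.
std::vector<copy_instruction> split_copy_by_producer(interval copy, const std::vector<producer>& producers) {
    assert(copy.begin <= copy.end);
    std::vector<copy_instruction> copies;
    for(const producer& p : producers) {
        const std::size_t begin = std::max(copy.begin, p.written.begin);
        const std::size_t end = std::min(copy.end, p.written.end);
        if(begin < end) copies.push_back({{begin, end}, p.id});
    }
    return copies;
}
```

With two producers writing `[0, 50)` and `[50, 120)`, a copy of `[0, 100)` becomes two copies that can each start as soon as their own producer has finished.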
The `buffer_manager` contains functionality to spill a buffer to the host in order to perform a re-size where the source- and destination allocation would not both fit in device memory. This is missing from the IGGen; recognizing this spilling-condition will be the scope of a future graph-based memory allocator. We also do not currently require the feature in tests or any applications we develop (and it is very slow), so it will disappear at least in the meantime.
The implementation in this PR aims to be complete, correct and maintainable. Performance of both the generator and the generated graph can probably be improved in many ways, but given the high complexity that is required as-is, I did not sacrifice readability (or much time) for performance improvements. I've nevertheless added a DAG benchmark for instruction generation.
How to Review This
This PR only contains the logic to generate, test, and print the IDAG, and does not touch the runtime's execution model yet. This is done to keep the reviews as painless as possible. See the WIP instruction-graph branch if you need more context.
The most relevant resources for understanding the graph structure and generation logic up-front are the `instruction_graph_tests`, which show the expected properties of all (or most) situations the graph must cover. Use the test executables together with render-graphs.py to get a rendering - the IDAG associates a ton of debug information with each node.