[IDAG] Generate and test Instruction Graph #249

fknorr · 2024-02-12T13:55:30Z

This is the first in a series of PRs to replace Celerity's ad-hoc local scheduling (buffer_manager, buffer_transfer_manager, executor and worker_job) with a per-node static schedule known as the Instruction Graph (IDAG). The IDAG sits at an abstraction level below task- and command graph, and manages memory allocation, memory copies, kernel launches, and MPI operations as individual graph nodes. This moves most tracking overhead ahead in time to the scheduler and improves concurrency between these "µop" instructions at the time of execution.

It also gives Celerity two new key features:

Multi-device support: Celerity currently manages one device / GPU per node (MPI rank), which hurts inter-device communication performance on systems with more than one GPU per physical machine. Multi-device support has been attempted before but became increasingly difficult to implement correctly around features such as reductions.
Multiple backing allocations per buffer: Celerity currently has to re-size a buffer's backing allocation to cover every accessed subrange, even if disjoint. The IDAG permits multiple (non-overlapping) allocations instead. In the future, this can become the basis for advanced graph-based memory allocators.

Finally, the extensible, test-friendly scheduling approach will allow us to upstream new features like collective communication or dynamic scheduling (?) with good performance and maintainability in the future.

Instruction Graph (IDAG)

Unlike the task / command graph, the Instruction Graph is not an intrusive_graph; each node references its predecessors by id. Despite what the visual representation might suggest, it only contains information strictly necessary for execution, and all debug info (dependency origins, buffer ranges, kernel names) is written to the corresponding records when recording is enabled. The graph nodes are passed directly to the executor without an additional "serialized" representation.
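To make the "predecessors by id" idea concrete, here is a minimal sketch. The `instruction` struct and `is_topologically_ordered` helper are hypothetical names invented for this illustration, not the types used in the PR; the sketch only shows how id-based dependencies can be validated against a sequential order.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified node type: dependencies are stored as plain ids
// instead of intrusive pointers to other nodes.
using instruction_id = std::uint64_t;

struct instruction {
    instruction_id id;
    std::vector<instruction_id> dependencies; // ids of predecessor instructions
};

// Returns true if every instruction only depends on instructions that appear
// earlier in the sequence, i.e. the sequence is a valid topological order.
inline bool is_topologically_ordered(const std::vector<instruction>& sequence) {
    std::unordered_set<instruction_id> seen;
    for(const auto& instr : sequence) {
        for(const auto dep : instr.dependencies) {
            if(seen.count(dep) == 0) return false;
        }
        seen.insert(instr.id);
    }
    return true;
}
```

Because nodes hold ids rather than pointers, a consumer can process them in the order they were generated and look up predecessors in its own tracking structures.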

The IDAG (and the future instruction_executor) has no notion of buffers; the abstraction it operates on is the allocation, which is roughly equivalent to a pointer at execution time. Allocations can either originate from an alloc-instruction (for buffers, reduction scratch-spaces) or from "user space" (buffer host-init pointers, fence buffer_snapshots). Kernel instructions contain metadata to perform accessor hydration based on these allocations and statically known offsets into them.
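As a rough illustration of hydration from allocations, the following sketch resolves an allocation id plus a static offset to a pointer. `accessor_binding`, `allocation_table` and `hydrate` are assumed names for this example only, and the real metadata is certainly richer (ranges, strides, access modes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using allocation_id = std::uint64_t; // hypothetical id type

// What a kernel instruction might carry per accessor: which allocation to
// use and a statically known byte offset into it.
struct accessor_binding {
    allocation_id allocation;
    std::size_t offset_bytes;
};

// Executor-side table mapping allocation ids to the pointers they resolve to
// at execution time.
using allocation_table = std::unordered_map<allocation_id, std::byte*>;

// "Hydration": turn the static binding into a concrete pointer before launch.
inline std::byte* hydrate(const allocation_table& table, const accessor_binding& binding) {
    const auto it = table.find(binding.allocation);
    assert(it != table.end() && "allocation must exist before the kernel runs");
    return it->second + binding.offset_bytes;
}
```

The point of the split is that everything except the final pointer lookup can be decided at scheduling time.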

For peer-to-peer MPI data transfers, push-commands can be compiled to their equivalent send-instructions in a rather straightforward manner, but await-push commands are more involved. Since at scheduling time it is not known which boxes the local node will receive from which peers, the IDAG must conservatively provide allocations that can receive the full pushed region en-bloc, and perform some of the tracking previously done by the buffer_transfer_manager in a logic called receive arbitration. To allow the receiving side to MPI_Recv into an existing allocation without unpacking it from a frame as it exists today, each send-instruction is accompanied by a pilot message that is transferred to the receiver as soon as the instruction is generated. The pilot maps a message_id (MPI tag) to a transfer_id and buffer subrange, providing the receiver with all parameters needed to issue an MPI_Recv.
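The pilot mechanism can be pictured roughly as follows. This is a sketch under assumptions: the field layout of `pilot_message`, the 1D `box_1d` and the `pilot_registry` class are invented for illustration and reduced to the three pieces of information named above (message id, transfer id, subrange).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

// All names and field layouts here are illustrative assumptions.
using message_id = int; // doubles as the MPI tag
using transfer_id = std::uint64_t;

struct box_1d {
    std::size_t begin, end; // half-open [begin, end), 1D for brevity
};

struct pilot_message {
    message_id id;
    transfer_id transfer;
    box_1d box;
};

// Receiver-side bookkeeping: a pilot may arrive before the matching receive
// has been scheduled, so it is stored until it is looked up by tag.
class pilot_registry {
  public:
    void on_pilot(const pilot_message& pilot) { m_pilots.emplace(pilot.id, pilot); }

    // Returns the parameters for the receive if a pilot for this tag is known.
    std::optional<pilot_message> lookup(const message_id id) const {
        const auto it = m_pilots.find(id);
        if(it == m_pilots.end()) return std::nullopt;
        return it->second;
    }

  private:
    std::unordered_map<message_id, pilot_message> m_pilots;
};
```

With the subrange known up front, the receiver can compute the destination address inside an existing allocation before posting the receive.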

The IDAG re-uses host-buffer allocations as send and receive allocations for all MPI operations. This avoids additional complexity for allocating and pooling staging buffers, but comes at a performance cost if the MPI operations are strided (the worst-case would be sending or receiving a strided column in a 2D-split stencil, for example). This will be addressed by a future graph-allocation method which will also enable directly receiving into device-staging buffers through RDMA.

Reductions in the IDAG are performed in two stages, where an eager node-local reduction (on execution_command) precedes the lazy and optional global reduction (on reduction_command). Reductions are executed on the host and in host memory, where a scratch space is allocated to copy / receive all partial inputs before applying the reduction operator.
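A minimal sketch of the two-stage scheme, assuming the partial results are already gathered in host-side vectors. `reduce_scratch` and `reduce_two_stage` are hypothetical helpers; in the IDAG the two stages are separate instructions, and the global stage only runs when it is needed.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Applies the reduction operator to everything collected in a host scratch space.
template <typename T, typename BinaryOp>
T reduce_scratch(const std::vector<T>& scratch, T identity, BinaryOp op) {
    return std::accumulate(scratch.begin(), scratch.end(), identity, op);
}

// Stage 1 reduces the partial results of each node locally; stage 2 reduces
// the per-node results, which would be received from peers in a real run.
template <typename T, typename BinaryOp>
T reduce_two_stage(const std::vector<std::vector<T>>& partials_per_node, T identity, BinaryOp op) {
    std::vector<T> global_scratch;
    for(const auto& local : partials_per_node) {
        global_scratch.push_back(reduce_scratch(local, identity, op));
    }
    return reduce_scratch(global_scratch, identity, op);
}
```

The result only matches a single flat reduction when the operator is associative and `identity` is its neutral element, which is the usual contract for reductions.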

Any resources that are tracked besides DAG allocations and host objects (e.g. reduction functors) can be deleted from the executor's tracking structures through an instruction garbage list that is attached to each horizon and epoch.

Instruction Graph Generation

The instruction_graph_generator (IGGen) is parameterized with an abstract system configuration which defines the number of devices, the number of memories, and system capabilities. This model supports multiple devices sharing the same memory or sharing memory with the host (although that will not initially be used by the runtime) and also systems where it is not possible to memcpy between every pair of memories - this is notably true of our own test system with two Intel Arc GPUs which do not have p2p copy functionality.
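One way to picture such a system model is a copy-capability matrix between memories. The `system_model` struct and `copy_path` function below are assumptions for illustration; staging through host memory is one plausible fallback when a direct copy is unavailable, not necessarily the strategy the IGGen uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using memory_id = std::size_t; // hypothetical id type

struct system_model {
    std::size_t num_devices;
    std::vector<memory_id> native_memory_of_device; // indexed by device
    std::vector<std::vector<bool>> can_copy;        // can_copy[from][to]
};

// Returns the sequence of memories a copy passes through: direct if the
// system supports it, otherwise staged through `host_memory`.
inline std::vector<memory_id> copy_path(
    const system_model& sys, const memory_id from, const memory_id to, const memory_id host_memory) {
    if(sys.can_copy[from][to]) return {from, to};
    assert(sys.can_copy[from][host_memory] && sys.can_copy[host_memory][to]);
    return {from, host_memory, to};
}
```

For a system with two GPUs lacking p2p copies, `can_copy` would be false between the two device memories, and a device-to-device transfer becomes two copy steps.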

The IGGen maintains state about each buffer and host object that is created on the runtime's side and naturally generates instruction nodes in topological order, i.e. a sequential execution of instructions will fulfill all internal dependencies. Multiple allocations are tracked per buffer, but they currently cannot overlap except during a resize operation. Access fronts are tracked for both reads and writes in order to compute true/anti/output-dependencies without walking the instruction graph or visiting past commands.
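The access-front idea can be sketched for a single piece of memory as follows. `access_front` is a hypothetical class for this example; the real generator tracks fronts per region of each allocation, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using instruction_id = std::uint64_t;

// Tracks the "front" of accesses to one piece of memory: the last writer and
// all readers since that write.
class access_front {
  public:
    // A new reader has a true dependency on the last writer, if any.
    std::vector<instruction_id> add_reader(const instruction_id reader) {
        std::vector<instruction_id> deps;
        if(m_last_writer.has_value()) deps.push_back(*m_last_writer);
        m_readers.push_back(reader);
        return deps;
    }

    // A new writer has anti-dependencies on all readers since the last write,
    // or an output dependency on the last writer if nobody read in between.
    std::vector<instruction_id> add_writer(const instruction_id writer) {
        std::vector<instruction_id> deps;
        if(!m_readers.empty()) {
            deps = m_readers;
        } else if(m_last_writer.has_value()) {
            deps.push_back(*m_last_writer);
        }
        m_last_writer = writer;
        m_readers.clear();
        return deps;
    }

  private:
    std::optional<instruction_id> m_last_writer;
    std::vector<instruction_id> m_readers;
};
```

Since only the front is stored, dependencies for a new instruction follow from this state alone, without revisiting older graph nodes.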

Whenever there are multiple producers or consumers of data (around copy, send or receive instructions), the IGGen splits these transfers to achieve maximum concurrency - i.e. no transfer instruction should / will ever introduce a synchronization point between producer- or consumer instructions that was not there before.
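A simplified 1D sketch of splitting a copy by producer, so that each partial copy only depends on the instruction that wrote its part. The types and `split_copy_by_producer` are invented for illustration; the real logic operates on multi-dimensional regions and also considers consumers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using instruction_id = std::uint64_t;

struct interval {
    std::size_t begin, end; // half-open [begin, end)
};

struct produced_interval {
    interval range;
    instruction_id producer; // instruction that last wrote this range
};

struct copy_instruction {
    interval range;
    instruction_id dependency; // the single producer this copy must wait for
};

// Emits one copy per producer that overlaps the requested range, instead of
// a single copy that would have to wait for all producers.
inline std::vector<copy_instruction> split_copy_by_producer(
    const interval copy_range, const std::vector<produced_interval>& producers) {
    assert(copy_range.begin <= copy_range.end);
    std::vector<copy_instruction> copies;
    for(const auto& p : producers) {
        const auto begin = std::max(copy_range.begin, p.range.begin);
        const auto end = std::min(copy_range.end, p.range.end);
        if(begin < end) copies.push_back({{begin, end}, p.producer});
    }
    return copies;
}
```

With two producers writing disjoint halves, the first half can be copied as soon as its producer finishes, independent of the second.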

The buffer_manager contains functionality to spill a buffer to the host in order to perform a re-size where the source- and destination allocation would not both fit in device memory. This is missing from the IGGen; recognizing this spilling-condition will be the scope of a future graph-based memory allocator. We also do not currently require the feature in tests or any applications we develop (and it is very slow), so it will disappear at least in the meantime.

The implementation in this PR aims to be complete, correct and maintainable. Performance of both the generator and the generated graph can probably be improved in many ways, but given the high complexity that is required as-is, I did not sacrifice readability (or much time) for performance improvements. I've nevertheless added a DAG benchmark for instruction generation.

How to Review This

This PR only contains the logic to generate, test, and print the IDAG, and does not touch the runtime's execution model yet. This is done to keep the reviews as painless as possible. See the WIP instruction-graph branch if you need more context.

The most relevant resources for understanding the graph structure and generation logic up-front are the instruction_graph_tests, which show the expected properties of all (or most) situations the graph must cover. Use the test executables together with render-graphs.py to get a rendering - the IDAG associates a ton of debug information with each node.

May your eyes bleed less from reviewing than my hands did implementing this.

coveralls · 2024-02-12T14:00:27Z

Pull Request Test Coverage Report for Build 8329557282

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

1566 of 1585 (98.8%) changed or added relevant lines in 19 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.7%) to 94.641%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
include/fence.h	2	4	50.0%
src/task.cc	14	16	87.5%
src/print_graph.cc	236	239	98.74%
src/instruction_graph_generator.cc	982	994	98.79%

Files with Coverage Reduction	New Missed Lines	%
src/task.cc	1	87.58%

Totals
Change from base Build 7976676935:	0.7%
Covered Lines:	6574
Relevant Lines:	6772

💛 - Coveralls

github-actions

clang-tidy made some suggestions

include/handler.h

include/host_object.h

include/print_utils.h

test/instruction_graph_p2p_tests.cc

test/instruction_graph_reduction_tests.cc

test/instruction_graph_test_utils.h

github-actions

clang-tidy made some suggestions

src/print_graph.cc

github-actions · 2024-02-13T12:16:15Z

Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology
➕ Added microbenchmark(s): 18 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

command-graph : 1.05x
graph-nodes : 1.00x
grid : 1.00x
instruction-graph : new 🌟
scheduler : 1.07x
system : 1.02x
task-graph : 1.01x

github-actions

clang-tidy made some suggestions

src/instruction_graph_generator.cc

github-actions

clang-tidy made some suggestions

src/instruction_graph_generator.cc

PeterTh

Tremendous work!

I've looked at everything except test/instruction_* in some detail. Posting this for now.

include/command.h

include/dense_map.h

include/distributed_graph_generator.h

include/launcher.h

include/recorders.h

src/instruction_graph_generator.cc

github-actions

clang-tidy made some suggestions

src/instruction_graph_generator.cc

PeterTh

I've looked through half the tests and the test utils now.
The testing infrastructure is really nice, I like how things like select_unique<> automatically add appropriate REQUIREs.

include/instruction_graph_generator.h

src/instruction_graph_generator.cc

test/instruction_graph_memory_tests.cc

test/instruction_graph_test_utils.h

test/instruction_graph_misc_tests.cc

test/instruction_graph_reduction_tests.cc

github-actions

clang-tidy made some suggestions

src/instruction_graph_generator.cc

test/instruction_graph_misc_tests.cc

fknorr · 2024-02-28T16:07:07Z

I've added an additional test suite instruction_graph_grid_tests for grid utilities that are specific to instruction generation. This only contains tests for boxes_edge_connected and connected_subregion_bounding_boxes (both relevant for determining the granularity of receive / split-receive instructions), because they had insufficient coverage from the IGGen tests so far.
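For readers unfamiliar with the term, the sketch below shows one possible reading of edge connectivity for 2D boxes: two boxes are connected if they overlap or share part of an edge, but not if they only meet at a corner. This definition, the `box_2d` type and the `edge_connected` function are assumptions for illustration and may differ from the actual boxes_edge_connected.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct box_2d {
    std::array<std::int64_t, 2> min; // inclusive
    std::array<std::int64_t, 2> max; // exclusive
};

// Assumed semantics: connected if the boxes overlap or touch along an edge;
// touching at a single corner does not count.
inline bool edge_connected(const box_2d& a, const box_2d& b) {
    int touching_dims = 0;
    for(std::size_t d = 0; d < 2; ++d) {
        const auto overlap = std::min(a.max[d], b.max[d]) - std::max(a.min[d], b.min[d]);
        if(overlap < 0) return false;     // a gap separates the boxes in this dimension
        if(overlap == 0) ++touching_dims; // the boxes merely touch in this dimension
    }
    return touching_dims <= 1; // touching in both dimensions means corner contact only
}
```

Grouping boxes by such a relation is what would allow a receive to be issued per connected group rather than per box.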

github-actions

clang-tidy made some suggestions

test/instruction_graph_misc_tests.cc

psalz

Whew, that only took me a cool 10 hours to review!

What can I say, exceptional work! I like how well-documented everything is, and how you separated the implementation of the graph generator from its interface, as well as how configurable it is and how well errors are reported (e.g. around the oversubscription hint)!

While it's obviously still very complex, I like how you untangled allocations, last writer/reader management and copy/read/write instructions from each other.

I also really like your evolution of the testing infrastructure and how it integrates with records. It looks pretty much like what I always envisioned (without knowing how to get there) for the command graph!

Unfortunately I didn't have time to go through all tests in detail, although I did sporadically check whether certain edge cases were covered by tests at all (by setting a breakpoint or printing a message).

I have some notes, most of which are typos or pretty minor stuff. I'm approving right away because I'm on holiday next week and don't want to block merging this!

I just hope I can retain some of this mental context for reviewing the instruction executor :')

include/hint.h

include/launcher.h

include/task.h

include/types.h

include/command.h

src/instruction_graph_generator.cc

test/instruction_graph_memory_tests.cc

test/print_graph_tests.cc

fknorr · 2024-03-04T14:10:41Z

I've added a check that will issue a warning if Celerity generates excessive numbers of small send commands due to split_into_communicator_compatible_boxes. This will happen e.g. when 2d-splitting a stencil on a buffer with x-dimension (dim 1) larger than 2^31 - which I hope is going to be very rare.
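The splitting that leads to many small sends can be illustrated in 1D. This is a sketch, not the code of split_into_communicator_compatible_boxes: `split_to_max_extent` is a hypothetical helper that cuts a range into chunks no longer than a given limit, which in the MPI case would be tied to the range of a 32-bit signed int.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct interval {
    std::size_t begin, end; // half-open [begin, end)
};

// Cuts `range` into consecutive chunks of at most `max_extent` elements each.
inline std::vector<interval> split_to_max_extent(const interval range, const std::size_t max_extent) {
    assert(max_extent > 0);
    std::vector<interval> chunks;
    for(std::size_t begin = range.begin; begin < range.end;) {
        const std::size_t end = begin + std::min(max_extent, range.end - begin);
        chunks.push_back({begin, end});
        begin = end;
    }
    return chunks;
}
```

When the oversized dimension is the fastest-varying one of a strided box, every row may need to be split separately, which is how the number of send commands can grow quickly.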

github-actions

clang-tidy made some suggestions

test/instruction_graph_grid_tests.cc

New benchmark categories have no prior data to be compared to.

github-actions

clang-tidy made some suggestions

src/print_graph.cc

The instruction graph (IDAG) is Celerity's new static local schedule, and will eventually replace the ad-hoc local scheduling around buffer_manager, buffer_transfer_manager and worker_job. This commit only adds the IDAG structure, generator, printing and tests, and does not touch the actual execution model yet to limit the (already enormous) size in LOC of this change.

fknorr requested review from psalz, PeterTh and GagaLP February 12, 2024 13:55

fknorr self-assigned this Feb 12, 2024

github-actions bot reviewed Feb 12, 2024

fknorr added this to the 0.6.0 milestone Feb 12, 2024

fknorr force-pushed the idag-generate branch from 594d5d5 to 86cab0a Compare February 12, 2024 17:28

github-actions bot reviewed Feb 13, 2024

src/print_graph.cc

celerity deleted a comment from github-actions bot Feb 13, 2024

fknorr mentioned this pull request Feb 13, 2024

Explicitly manage buffer / host object lifetimes in graph generation #246

Merged

fknorr force-pushed the idag-generate branch 2 times, most recently from 3c188ab to 3343177 Compare February 20, 2024 14:24

github-actions bot reviewed Feb 20, 2024

src/instruction_graph_generator.cc Outdated

fknorr force-pushed the idag-generate branch from afb78cf to bb55a77 Compare February 20, 2024 16:23

github-actions bot reviewed Feb 21, 2024

src/instruction_graph_generator.cc Outdated

PeterTh reviewed Feb 21, 2024

github-actions bot reviewed Feb 22, 2024

src/instruction_graph_generator.cc Outdated

PeterTh reviewed Feb 23, 2024

PeterTh reviewed Feb 26, 2024

github-actions bot reviewed Feb 28, 2024

test/instruction_graph_misc_tests.cc

psalz approved these changes Mar 1, 2024

fknorr force-pushed the idag-generate branch from 2e03dcc to b2251a8 Compare March 10, 2024 16:01

github-actions bot reviewed Mar 11, 2024

test/instruction_graph_grid_tests.cc

fknorr mentioned this pull request Mar 12, 2024

[IDAG] Communication & Receive Arbitration #252

Merged

psalz approved these changes Mar 14, 2024

CI: Correctly handle new benchmark categories in check-perf-impact.rb

8323eb9

New benchmark categories have no prior data to be compared to.

github-actions bot reviewed Mar 18, 2024

src/print_graph.cc

fknorr force-pushed the idag-generate branch from 133c0da to aa893b9 Compare March 18, 2024 15:30

fknorr and others added 2 commits March 18, 2024 16:45

Update benchmark results for IDAG generation

b9d8703

fknorr force-pushed the idag-generate branch from aa893b9 to b9d8703 Compare March 18, 2024 15:48

fknorr merged commit cc9f3fc into master Mar 18, 2024
12 checks passed

fknorr deleted the idag-generate branch March 18, 2024 15:49
