[IDAG] Generate and test Instruction Graph #249

Merged: 3 commits merged on Mar 18, 2024

@fknorr fknorr commented Feb 12, 2024

This is the first in a series of PRs to replace Celerity's ad-hoc local scheduling (buffer_manager, buffer_transfer_manager, executor and worker_job) with a per-node static schedule known as the Instruction Graph (IDAG). The IDAG sits at an abstraction level below task- and command graph, and manages memory allocation, memory copies, kernel launches, and MPI operations as individual graph nodes. This moves most tracking overhead ahead in time to the scheduler and improves concurrency between these "µop" instructions at the time of execution.

It also gives Celerity two new key features:

  • Multi-device support: Celerity currently manages one device / GPU per node (MPI rank), which hurts inter-device communication performance on machines with more than one GPU. Multi-device support has been attempted before but became increasingly difficult to implement correctly around features such as reductions.
  • Multiple backing allocations per buffer: Celerity currently has to re-size a buffer's backing allocation to cover every accessed subrange, even if disjoint. The IDAG permits multiple (non-overlapping) allocations instead. In the future, this can become the basis for advanced graph-based memory allocators.

Finally, the extensible, test-friendly scheduling approach will allow us to upstream new features like collective communication or dynamic scheduling (?) with good performance and maintainability in the future.

Instruction Graph (IDAG)

The Instruction Graph is, unlike the task / command graph, not an intrusive_graph; instead, each node references its predecessors by id. Despite what the visual representation might suggest, it only contains information strictly necessary for execution, and all debug info (dependency origins, buffer ranges, kernel names) is written to the corresponding records when recording is enabled. The graph nodes are passed directly to the executor without an additional "serialized" representation.
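
To illustrate the id-based structure described above, here is a minimal sketch. The `instruction` struct, `is_topologically_ordered` and `make_example_graph` are hypothetical stand-ins invented for this example and are not the types from this PR; the point is only that nodes carry predecessor ids rather than pointers, and that a generated sequence can be checked for topological order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not Celerity's actual types.
using instruction_id = std::uint64_t;

struct instruction {
    instruction_id id;
    std::string kind;                         // e.g. "alloc", "copy", "kernel"
    std::vector<instruction_id> dependencies; // predecessors, referenced by id only
};

// Returns true if every dependency refers to an instruction that appears
// earlier in the sequence, i.e. sequential execution satisfies all dependencies.
inline bool is_topologically_ordered(const std::vector<instruction>& seq) {
    std::vector<instruction_id> seen;
    for(const auto& instr : seq) {
        for(const auto dep : instr.dependencies) {
            bool found = false;
            for(const auto s : seen) { found = found || (s == dep); }
            if(!found) return false;
        }
        seen.push_back(instr.id);
    }
    return true;
}

// A tiny example graph: allocate, copy into the allocation, run a kernel, free.
inline std::vector<instruction> make_example_graph() {
    return {
        {0, "alloc", {}},
        {1, "copy", {0}},
        {2, "kernel", {0, 1}},
        {3, "free", {2}},
    };
}
```

Swapping the copy and kernel entries of the example graph makes the check fail, because the kernel would then precede one of its predecessors.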

The IDAG (and the future instruction_executor) has no notion of buffers; the abstraction it operates on is the allocation, which is roughly equivalent to a pointer at execution time. Allocations can originate either from an alloc-instruction (for buffers, reduction scratch-spaces) or from "user space" (buffer host-init pointers, fence buffer_snapshots). Kernel instructions contain metadata to perform accessor hydration based on these allocations and statically known offsets into them.

For peer-to-peer MPI data transfers, push-commands can be compiled to their equivalent send-instructions in a rather straightforward manner, but await-push commands are more involved. Since it is not known at scheduling time which boxes the local node will receive from which peers, the IDAG must conservatively provide allocations that can receive the full pushed region en bloc, and perform some of the tracking previously done by the buffer_transfer_manager in a logic called receive arbitration. To allow the receiving side to MPI_Recv into an existing allocation without unpacking it from a frame as it does today, each send-instruction is accompanied by a pilot message that is transferred to the receiver as soon as the instruction is generated. The pilot maps a message_id (MPI tag) to a transfer_id and buffer subrange, providing the receiver with all parameters needed to issue an MPI_Recv.
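
The role of the pilot message can be sketched roughly as follows. All names here (`pilot_message`, `pilot_registry`, `box2`) and the field layout are assumptions made for illustration, not the definitions from this PR, and no MPI calls are made; the sketch only shows how a receiver could look up the transfer and subrange for an incoming (sender, tag) pair.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

// Hypothetical, simplified types for illustration only.
using message_id = int;            // would be used as the MPI tag
using transfer_id = std::uint64_t;
using node_id = std::uint64_t;

struct box2 {
    int min[2];
    int max[2]; // half-open: [min, max)
};

// Sent ahead of the payload: tells the receiver what a given tag will carry.
struct pilot_message {
    message_id id;
    transfer_id transfer;
    box2 box;
};

// Receiver-side bookkeeping: remembers pilots until the matching receive is issued.
class pilot_registry {
  public:
    void on_pilot_received(node_id sender, const pilot_message& pilot) {
        const std::pair<node_id, message_id> key{sender, pilot.id};
        m_pending[key] = pilot;
    }

    // Returns and consumes the pilot for (sender, tag), if one has arrived.
    std::optional<pilot_message> match(node_id sender, message_id tag) {
        const std::pair<node_id, message_id> key{sender, tag};
        const auto it = m_pending.find(key);
        if(it == m_pending.end()) return std::nullopt;
        const pilot_message pilot = it->second;
        m_pending.erase(it);
        return pilot;
    }

  private:
    std::map<std::pair<node_id, message_id>, pilot_message> m_pending;
};
```

With the box known up front, the receiver can compute the destination offset inside an existing allocation before posting the receive, which is what removes the need for an intermediate frame.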

The IDAG re-uses host-buffer allocations as send and receive allocations for all MPI operations. This avoids additional complexity for allocating and pooling staging buffers, but comes at a performance cost if the MPI operations are strided (the worst-case would be sending or receiving a strided column in a 2D-split stencil, for example). This will be addressed by a future graph-allocation method which will also enable directly receiving into device-staging buffers through RDMA.

Reductions in the IDAG are performed in two stages, where an eager node-local reduction (on execution_command) precedes the lazy and optional global reduction (on reduction_command). Reductions are executed on the host and in host memory, where a scratch space is allocated to copy / receive all partial inputs before applying the reduction operator.
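
A minimal sketch of the two stages, assuming an integer sum as the reduction operator. The function names and the use of `std::vector` as the scratch space are hypothetical simplifications; the actual instructions operate on raw allocations and arbitrary reduction functors.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Applies the reduction operator to all partial inputs gathered in a host scratch space.
template <typename T, typename BinaryOp>
T reduce_scratch(const std::vector<T>& scratch, T identity, BinaryOp op) {
    return std::accumulate(scratch.begin(), scratch.end(), identity, op);
}

// Stage 1 (eager, node-local): combine the partial results of all local devices.
inline int local_reduce(const std::vector<int>& per_device_partials) {
    return reduce_scratch(per_device_partials, 0, std::plus<int>{});
}

// Stage 2 (lazy, global): combine the local result with the results received from peers.
inline int global_reduce(int local_result, const std::vector<int>& received_from_peers) {
    std::vector<int> scratch;
    scratch.push_back(local_result);
    scratch.insert(scratch.end(), received_from_peers.begin(), received_from_peers.end());
    return reduce_scratch(scratch, 0, std::plus<int>{});
}
```

The second stage is only needed when a later command actually consumes the globally reduced value, which is why the description calls it lazy and optional.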

Any resources that are tracked besides DAG allocations and host objects (e.g. reduction functors) can be deleted from the executor's tracking structures through an instruction garbage list that is attached to each horizon and epoch.

Instruction Graph Generation

The instruction_graph_generator (IGGen) is parameterized with an abstract system configuration which defines the number of devices, the number of memories, and system capabilities. This model supports multiple devices sharing the same memory or sharing memory with the host (although that will not initially be used by the runtime) and also systems where it is not possible to memcpy between every pair of memories - this is notably true of our own test system with two Intel Arc GPUs which do not have p2p copy functionality.
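
As a rough illustration of such a system model, the sketch below describes a system by its device-to-memory mapping and a copy-capability matrix. `system_info`, `make_system_without_p2p`, `copy_route` and the choice of memory id 0 for the host are all assumptions for this example, not the types or numbering used by the IGGen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical system model for illustration only.
using memory_id = std::size_t;
using device_id = std::size_t;

constexpr memory_id host_memory = 0; // assumed numbering for this sketch

struct system_info {
    std::vector<memory_id> device_native_memory; // indexed by device_id
    std::vector<std::vector<bool>> can_copy;     // can_copy[from][to]
};

// Models N devices with one memory each, where devices can only copy to and
// from the host, but not directly between each other (no peer-to-peer copies).
inline system_info make_system_without_p2p(std::size_t num_devices) {
    system_info sys;
    const std::size_t num_memories = num_devices + 1;
    sys.can_copy.assign(num_memories, std::vector<bool>(num_memories, false));
    for(device_id d = 0; d < num_devices; ++d) {
        sys.device_native_memory.push_back(d + 1);
    }
    for(memory_id m = 0; m < num_memories; ++m) {
        sys.can_copy[m][m] = true;
        sys.can_copy[host_memory][m] = true;
        sys.can_copy[m][host_memory] = true;
    }
    return sys;
}

// Returns the sequence of memories a copy passes through, staging via the
// host whenever a direct copy is not possible.
inline std::vector<memory_id> copy_route(const system_info& sys, memory_id from, memory_id to) {
    if(sys.can_copy[from][to]) return {from, to};
    assert(sys.can_copy[from][host_memory] && sys.can_copy[host_memory][to]);
    return {from, host_memory, to};
}
```

For a two-device system without peer-to-peer copies, a device-to-device transfer in this model becomes two copies with the host as the intermediate hop.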

The IGGen maintains state about each buffer and host object that is created on the runtime's side and naturally generates instruction nodes in topological order, i.e. a sequential execution of instructions will fulfill all internal dependencies. Multiple allocations are tracked per buffer, but they currently cannot overlap except during a resize operation. Access fronts are tracked for both reads and writes in order to compute true/anti/output-dependencies without walking the instruction graph or visiting past commands.
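
The idea of computing dependencies from access fronts rather than by walking the graph can be sketched as follows. This simplified version tracks a 1D allocation at element granularity; `tracked_allocation` and `access_front` are hypothetical names, and a real implementation would presumably track whole regions instead of single elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using instruction_id = std::uint64_t;

// Per-element state: the last writer and all readers since that write.
struct access_front {
    bool has_writer = false;
    instruction_id last_writer = 0;
    std::vector<instruction_id> readers_since_write;
};

class tracked_allocation {
  public:
    explicit tracked_allocation(std::size_t size) : m_fronts(size) {}

    // Reading [begin, end) depends on the last writers (true dependencies).
    std::set<instruction_id> read(instruction_id reader, std::size_t begin, std::size_t end) {
        std::set<instruction_id> deps;
        for(std::size_t i = begin; i < end; ++i) {
            auto& front = m_fronts[i];
            if(front.has_writer) deps.insert(front.last_writer);
            front.readers_since_write.push_back(reader);
        }
        return deps;
    }

    // Writing [begin, end) depends on the readers since the last write
    // (anti-dependencies) or, if there are none, on the last writer (output dependency).
    std::set<instruction_id> write(instruction_id writer, std::size_t begin, std::size_t end) {
        std::set<instruction_id> deps;
        for(std::size_t i = begin; i < end; ++i) {
            auto& front = m_fronts[i];
            if(!front.readers_since_write.empty()) {
                deps.insert(front.readers_since_write.begin(), front.readers_since_write.end());
            } else if(front.has_writer) {
                deps.insert(front.last_writer);
            }
            front.has_writer = true;
            front.last_writer = writer;
            front.readers_since_write.clear();
        }
        return deps;
    }

  private:
    std::vector<access_front> m_fronts;
};
```

Because each access only consults and updates the fronts of the elements it touches, no earlier instruction or command needs to be revisited.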

Whenever there are multiple producers or consumers of data (around copy, send or receive instructions), the IGGen splits these transfers to achieve maximum concurrency - i.e. no transfer instruction should / will ever introduce a synchronization point between producer- or consumer instructions that was not there before.
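
One way to picture this splitting, for the producer side only: given the last writer of each element in the source range, a copy is cut into maximal runs that share one producer, so each resulting copy instruction depends on exactly one producer. The names `copy_chunk` and `split_copy_by_producer` are hypothetical, and the 1D element-wise representation is a simplification for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using instruction_id = std::uint64_t;

// One copy instruction covering [begin, end) that depends on a single producer.
struct copy_chunk {
    std::size_t begin;
    std::size_t end;
    instruction_id producer;
};

// Splits the copy of [begin, end) into maximal runs written by the same producer.
// last_writer[i] is the id of the instruction that last wrote element i.
inline std::vector<copy_chunk> split_copy_by_producer(
    const std::vector<instruction_id>& last_writer, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= last_writer.size());
    std::vector<copy_chunk> chunks;
    for(std::size_t i = begin; i < end; ++i) {
        if(!chunks.empty() && chunks.back().producer == last_writer[i]) {
            chunks.back().end = i + 1; // extend the current run
        } else {
            chunks.push_back({i, i + 1, last_writer[i]}); // start a new run
        }
    }
    return chunks;
}
```

Each chunk can start as soon as its own producer has finished, instead of waiting for the slowest of all producers.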

The buffer_manager contains functionality to spill a buffer to the host in order to perform a re-size where the source- and destination allocation would not both fit in device memory. This is missing from the IGGen; recognizing this spilling-condition will be the scope of a future graph-based memory allocator. We also do not currently require the feature in tests or any applications we develop (and it is very slow), so it will disappear at least in the meantime.

The implementation in this PR aims to be complete, correct and maintainable. Performance of both the generator and the generated graph can probably be improved in many ways, but given the high complexity that is required as-is, I did not sacrifice readability (or much time) for performance improvements. I've nevertheless added a DAG benchmark for instruction generation.

How to Review This

This PR only contains the logic to generate, test, and print the IDAG, and does not touch the runtime's execution model yet. This is done to keep the reviews as painless as possible. See the WIP instruction-graph branch if you need more context.

The most relevant resources for understanding the graph structure and generation logic up-front are the instruction_graph_tests, which show the expected properties of all (or most) situations the graph must cover. Use the test executables together with render-graphs.py to get a rendering - the IDAG associates a ton of debug information with each node.

May your eyes bleed less from reviewing than my hands did implementing this.

@fknorr fknorr requested review from psalz, PeterTh and GagaLP February 12, 2024 13:55
@fknorr fknorr self-assigned this Feb 12, 2024
@coveralls coveralls commented Feb 12, 2024

Pull Request Test Coverage Report for Build 8329557282

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 1566 of 1585 (98.8%) changed or added relevant lines in 19 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.7%) to 94.641%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| include/fence.h | 2 | 4 | 50.0% |
| src/task.cc | 14 | 16 | 87.5% |
| src/print_graph.cc | 236 | 239 | 98.74% |
| src/instruction_graph_generator.cc | 982 | 994 | 98.79% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/task.cc | 1 | 87.58% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 7976676935: | 0.7% |
| Covered Lines: | 6574 |
| Relevant Lines: | 6772 |

💛 - Coveralls

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

include/handler.h
include/host_object.h (outdated)
include/print_utils.h
include/print_utils.h
include/print_utils.h
test/instruction_graph_p2p_tests.cc
test/instruction_graph_reduction_tests.cc
test/instruction_graph_test_utils.h
test/instruction_graph_test_utils.h
test/instruction_graph_test_utils.h
@fknorr fknorr added this to the 0.6.0 milestone Feb 12, 2024
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/print_graph.cc
Check-perf-impact results: (7b849de16ff11660b98988ab0b032db7)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / expanding tree topology, building command graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / contracting tree topology
Added microbenchmark(s): 18 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.05x
  • graph-nodes : 1.00x
  • grid : 1.00x
  • instruction-graph : new 🌟
  • scheduler : 1.07x
  • system : 1.02x
  • task-graph : 1.01x

@celerity celerity deleted a comment from github-actions bot Feb 13, 2024
@celerity celerity deleted a comment from github-actions bot Feb 13, 2024
@fknorr fknorr force-pushed the idag-generate branch 2 times, most recently from 3c188ab to 3343177 Compare February 20, 2024 14:24
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/instruction_graph_generator.cc (outdated)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/instruction_graph_generator.cc (outdated)
@PeterTh PeterTh left a comment

Tremendous work!

I've looked at everything except test/instruction_* in some detail. Posting this for now.

include/command.h
include/dense_map.h (outdated)
include/distributed_graph_generator.h (outdated)
include/launcher.h (outdated)
include/recorders.h (outdated)
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/instruction_graph_generator.cc (outdated)
@PeterTh PeterTh left a comment

I've looked through half the tests and the test utils now.
The testing infrastructure is really nice, I like how things like select_unique<> automatically add appropriate REQUIREs.

include/instruction_graph_generator.h
src/instruction_graph_generator.cc (outdated)
test/instruction_graph_memory_tests.cc
test/instruction_graph_test_utils.h (outdated)
test/instruction_graph_test_utils.h
test/instruction_graph_test_utils.h (outdated)
test/instruction_graph_test_utils.h
test/instruction_graph_misc_tests.cc
test/instruction_graph_misc_tests.cc (outdated)
test/instruction_graph_misc_tests.cc
test/instruction_graph_reduction_tests.cc (outdated)
test/instruction_graph_reduction_tests.cc (outdated)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/instruction_graph_generator.cc
src/instruction_graph_generator.cc
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc
src/instruction_graph_generator.cc
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc
src/instruction_graph_generator.cc (outdated)
test/instruction_graph_misc_tests.cc
@fknorr fknorr commented Feb 28, 2024

I've added an additional test suite instruction_graph_grid_tests for grid utilities that are specific to instruction generation. This only contains tests for boxes_edge_connected and connected_subregion_bounding_boxes (both relevant for determining the granularity of receive / split-receive instructions), because they had insufficient coverage from the IGGen tests so far.
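
To give an idea of what an edge-connectivity check between boxes might look like, here is a 2D sketch. It assumes one plausible interpretation, namely that two boxes are edge-connected if they overlap or touch along an edge of non-zero length, while touching only at a corner does not count. The names `box2d` and `edge_connected_2d` are hypothetical, and the actual semantics of boxes_edge_connected in this PR may differ.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical 2D box with half-open bounds [min, max) in each dimension.
struct box2d {
    int min[2];
    int max[2];
};

// Assumed interpretation: boxes are edge-connected if they overlap or share an
// edge segment of non-zero length. Touching only at a corner does not count.
inline bool edge_connected_2d(const box2d& a, const box2d& b) {
    int overlap[2];
    for(int d = 0; d < 2; ++d) {
        // Length of the shared interval in dimension d; 0 means "touching", < 0 means "gap".
        overlap[d] = std::min(a.max[d], b.max[d]) - std::max(a.min[d], b.min[d]);
    }
    if(overlap[0] < 0 || overlap[1] < 0) return false; // separated by a gap
    return overlap[0] > 0 || overlap[1] > 0;           // exclude corner-only contact
}
```

Under this interpretation, boxes that only meet diagonally would end up in separate connected subregions, each with its own bounding box.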

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

test/instruction_graph_misc_tests.cc
@psalz psalz left a comment

Whew, that only took me a cool 10 hours to review!

What can I say, exceptional work! I like how well-documented everything is, and how you separated the implementation of the graph generator from its interface, as well as how configurable it is and how well errors are reported (e.g. around the oversubscription hint)!

While it's obviously still very complex, I like how you untangled allocations, last writer/reader management and copy/read/write instructions from each other.

I also really like your evolution of the testing infrastructure and how it integrates with records. It looks pretty much like what I always envisioned (without knowing how to get there) for the command graph!

Unfortunately I didn't have time to go through all tests in detail, although I did sporadically check whether certain edge cases were covered by tests at all (by setting a breakpoint or printing a message).

I have some notes, most of which are typos or pretty minor stuff. I'm approving right away because I'm on holiday next week and don't want to block merging this!

I just hope I can retain some of this mental context for reviewing the instruction executor :')

include/hint.h (outdated)
include/launcher.h (outdated)
include/task.h
include/types.h
include/command.h (outdated)
src/instruction_graph_generator.cc
src/instruction_graph_generator.cc (outdated)
src/instruction_graph_generator.cc (outdated)
test/instruction_graph_memory_tests.cc (outdated)
test/print_graph_tests.cc
@fknorr fknorr commented Mar 4, 2024

I've added a check that will issue a warning if Celerity generates excessive numbers of small send commands due to split_into_communicator_compatible_boxes. This will happen e.g. when 2d-splitting a stencil on a buffer with x-dimension (dim 1) larger than 2^31 - which I hope is going to be very rare.
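
The splitting that can lead to many small sends may be pictured with the 1D sketch below. It assumes the limit comes from sizes having to fit a 32-bit signed integer in the communication layer, which is my reading of the 2^31 figure above rather than something stated in the PR. `split_extent` is a hypothetical helper and not the implementation of split_into_communicator_compatible_boxes.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: splits the 1D range [0, extent) into consecutive
// (offset, length) pieces of at most max_extent elements each.
inline std::vector<std::pair<std::uint64_t, std::uint64_t>> split_extent(
    std::uint64_t extent, std::uint64_t max_extent) {
    assert(max_extent > 0);
    std::vector<std::pair<std::uint64_t, std::uint64_t>> pieces;
    for(std::uint64_t offset = 0; offset < extent; offset += max_extent) {
        const std::uint64_t remaining = extent - offset;
        pieces.emplace_back(offset, remaining < max_extent ? remaining : max_extent);
    }
    return pieces;
}
```

A warning like the one described could then be triggered when the number of pieces produced for a single send exceeds some threshold.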

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

test/instruction_graph_grid_tests.cc
New benchmark categories have no prior data to be compared to.
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/print_graph.cc
fknorr and others added 2 commits March 18, 2024 16:45
The instruction graph (IDAG) is Celerity's new static local schedule,
and will eventually replace the ad-hoc local scheduling around
buffer_manager, buffer_transfer_manager and worker_job.

This commit only adds the IDAG structure, generator, printing and tests,
and does not touch the actual execution model yet to limit the (already
enormous) size in LOC of this change.
@fknorr fknorr merged commit cc9f3fc into master Mar 18, 2024
12 checks passed
@fknorr fknorr deleted the idag-generate branch March 18, 2024 15:49