Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IDAG] Switch to Instruction-Graph Scheduling #265

Merged
merged 3 commits into from
Aug 2, 2024
Merged

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Jul 24, 2024

This is the final PR in the IDAG series. It switches to the new IDAG-based runtime and drops all newly unused legacy components.

new runtime architecture

  • runtime now manages multiple devices, and the distr_queue API has been updated to reflect the fact.
  • buffer_manager, reduction_manager and host_object_manager are now gone. ID assignment for these types is now handled by the runtime directly, and all components interacting with buffers, reductions and host objects (graph generators, executor and recorders) track the relevant state themselves as instructed by notify_*_created/_destroyed introduced in Explicitly manage buffer / host object lifetimes in graph generation #246. As a result, tasks do not need to keep strong references to buffers and host objects around anymore (lifetime_extending_state).
  • scheduler now generates both the command- and the instruction graph in the same thread, and maintains ownership of both structures. The CDAG is pruned at generation time (since commands never leave the scheduler thread), and the IDAG is pruned once the scheduler is notified of epoch completion. Command serialization is gone, and with it, the command is_flushed marker.
  • The runtime now uses live_executor (replacing legacy_executor and worker_job) together with a communicator and backend instance to execute instructions. communicator together with receive_arbiter replace buffer_transfer_manager. backend implementations replace legacy_backend, host_queue and device_queue.
  • Runtime destruction is now delayed until the last buffer / queue / host object is destroyed. With this change, we can stop distinguishing between a non-existing and a shut-down runtime. For backwards compatibility, ~distr_queue will continue to epoch-synchronize. runtime asserts that non-thread-safe functions are only called from the application thread, which will trigger onaccidental value-captures of buffers / host objects into host tasks.
  • Reductions are now available on all SYCL implementations since hipSYCL has added support. This allows us to drop a lot of #ifdefs in tests and frontend code.
  • log_context, which was only used by worker_job, is removed.
  • vendor/ctpl, which was only used by host_queue, is removed.

Since one node now addresses multiple GPUs, scheduling becomes more expensive (IDAG generation is maybe ~4x as expensive as CDAG generation). This will be visible in benchmark results.

@fknorr fknorr requested review from psalz and PeterTh July 24, 2024 11:18
@fknorr fknorr self-assigned this Jul 24, 2024
Copy link

Check-perf-impact results: (877795252c9a57f7b343e4747db6ca4f)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: 7 individual benchmarks affected
Added microbenchmark(s): 48 individual benchmarks affected
Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 0.94x
  • graph-nodes : 1.01x
  • grid : 1.04x
  • instruction-graph : 1.02x
  • scheduler : new 🌟
  • system : 3.30x ⚠️
  • task-graph : 1.08x

@coveralls
Copy link

coveralls commented Jul 24, 2024

Pull Request Test Coverage Report for Build 10213951594

Details

  • 368 of 368 (100.0%) changed or added relevant lines in 19 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+1.8%) to 94.7%

Files with Coverage Reduction New Missed Lines %
src/task.cc 1 92.06%
Totals Coverage Status
Change from base Build 10143808743: 1.8%
Covered Lines: 6564
Relevant Lines: 6700

💛 - Coveralls

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 33. Check the log or trigger a new build to see more.

examples/reduction/reduction.cc Show resolved Hide resolved
examples/reduction/reduction.cc Show resolved Hide resolved
examples/reduction/reduction.cc Show resolved Hide resolved
include/buffer.h Show resolved Hide resolved
include/handler.h Show resolved Hide resolved
include/runtime.h Show resolved Hide resolved
include/runtime.h Show resolved Hide resolved
include/runtime.h Show resolved Hide resolved
include/runtime.h Show resolved Hide resolved
include/runtime.h Show resolved Hide resolved
@fknorr fknorr added this to the 0.6.0 milestone Jul 24, 2024
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/scheduler.h Show resolved Hide resolved
include/sycl_wrappers.h Show resolved Hide resolved
include/sycl_wrappers.h Show resolved Hide resolved
src/runtime.cc Show resolved Hide resolved
test/runtime_deprecation_tests.cc Show resolved Hide resolved
test/sycl_tests.cc Show resolved Hide resolved
test/sycl_tests.cc Show resolved Hide resolved
test/task_ring_buffer_tests.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At last, a negative LoC balance.

Also, that's a beautiful picture, and it (and its source) should probably go in docs.

include/distr_queue.h Show resolved Hide resolved
include/fence.h Show resolved Hide resolved
include/log.h Show resolved Hide resolved
include/task_manager.h Show resolved Hide resolved
src/runtime.cc Outdated Show resolved Hide resolved
src/runtime.cc Outdated Show resolved Hide resolved
Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R.I.P. BM, BTM, et al. LGTM!

include/celerity.h Outdated Show resolved Hide resolved
src/config.cc Show resolved Hide resolved
src/legacy_executor.cc Show resolved Hide resolved
src/scheduler.cc Show resolved Hide resolved
test/test_utils.cc Outdated Show resolved Hide resolved
Fixes a bug with SYCL instant submission around reductions.
Copy link

github-actions bot commented Aug 1, 2024

Check-perf-impact results: (f2e639c8a97550e58528a410c1b8586d)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: 8 individual benchmarks affected
Added microbenchmark(s): 48 individual benchmarks affected
Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.12x ⚠️
  • graph-nodes : 1.02x
  • grid : 1.03x
  • instruction-graph : 1.15x ⚠️
  • scheduler : new 🌟
  • system : 3.44x ⚠️
  • task-graph : 1.24x ⚠️

Edit: We inadvertently disabled mimalloc. All hail the benchmark suite!

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/sycl_wrappers.h Show resolved Hide resolved
Copy link

github-actions bot commented Aug 2, 2024

Check-perf-impact results: (2908f97f836fd2def14c3429cd4d61ac)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: 5 individual benchmarks affected
🚀 Significant speedup (<0.80x) in some microbenchmark results: generating large command graphs for N nodes - 1 / chain topology
Added microbenchmark(s): 48 individual benchmarks affected
Removed microbenchmark(s): 48 individual benchmarks affected

Relative execution time per category: (mean of relative medians)

  • command-graph : 0.89x 🚀
  • graph-nodes : 1.03x
  • grid : 1.00x
  • instruction-graph : 0.99x
  • scheduler : new 🌟
  • system : 3.24x ⚠️
  • task-graph : 0.98x

@celerity celerity deleted a comment from github-actions bot Aug 2, 2024
fknorr added 2 commits August 2, 2024 12:17
Switches to the new IDAG-based runtime and drops all newly unused legacy components.

- runtime now manages multiple devices, and the distr_queue API has been updated to reflect the fact.
- buffer_manager, reduction_manager and host_object_manager are now gone. ID assignment for these types is now handled by the runtime directly, and all components interacting with buffers, reductions and host objects track relevant state themselves. lifetime_extending_state is removed.
- scheduler now generates both the command- and the instruction graph in the same thread, and maintains ownership of both structures. The CDAG is pruned at generation time (since commands never leave the scheduler thread), and the IDAG is pruned once the scheduler is notified of epoch completion. Command serialization is gone, and with it, the command is_flushed marker.
- runtime destruction is now delayed until the last buffer / queue / host object is destroyed. ~distr_queue will continue to epoch-synchronize.
- runtime asserts that non-thread-safe functions are only called from the application thread, which will trigger onaccidental value-captures of buffers / host objects into host tasks.
- Reductions are now available on all SYCL implementations since hipSYCL has added support.
- log_context, which was only used by worker_job, is removed.
- vendor/ctpl, which was only used by host_queue, is removed.
@fknorr fknorr merged commit 3f10139 into master Aug 2, 2024
28 of 30 checks passed
@fknorr fknorr deleted the idag-runtime branch August 2, 2024 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants