Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate with Tracy profiler #267

Merged
merged 2 commits into from
Aug 9, 2024
Merged

Integrate with Tracy profiler #267

merged 2 commits into from
Aug 9, 2024

Conversation

fknorr
Copy link
Contributor

@fknorr fknorr commented Aug 2, 2024

This integrates the runtime (particularly scheduler and executor) with Tracy version 0.11 for profiling support. This will allow us to optimize the runtime, and users to find what kernels and other operations their application is spending the most time in.

screenshot

To compile Celerity with Tracy support, pass -DCELERITY_TRACY_SUPPORT=ON (default OFF) to CMake. To enable tracing at runtime, populate the environment variable CELERITY_TRACY=off|fast|full (default off). fast emits the same zones as full, but avoids attaching debug information that takes measurable time to stringify inside Celerity. Leaving the runtime setting to off should have no visible performance impact.

Inside Celerity, tracing is made available through the CELERITY_DETAIL_TRACY_* macros from tracy.h, which expand to Tracy code when support is enabled, or to an empty token list when disabled. Inside live_executor.cc, additional context is maintained to track the asynchronous completion of instructions as Fibers. Also, the arguments formerly passed to the CELERITY_TRACE macro can be captured and copied into zone text to avoid duplicating the debug stringification.

The PR performs four additional, minor changes to the runtime:

  • Move host_config (an MPI dependency) out of config in order to parse environment and enable / disable Tracy before the call to MPI_Init.
  • Auto-enable kernel profiling when CELERITY_TRACY=full
  • Add estimated_global_memory_traffic_bytes attribute to device_kernel_instruction. It estimates the SM <-> DRAM traffic inside a kernel from accessor data types and range mappers, which allows us to print a throughput estimate on device kernel zones, which is relevant for weak-scaling problems.
  • Add backend::init, which triggers any potential backend device initialization on executor thread startup, which would otherwise delay (and distort the execution time of) the first instructions executed per device.

In CI, Tracy support is enabled in (and only in) the SimSYCL release build, so we do get notified about compile errors but not much else. Let me know if you can come up with a better automation setup.

Please experiment with this integration a bit and feel free to push fixups for zone texts and the somewhat arbitrary zone "color scheme" I have come up with.

@fknorr fknorr added this to the 0.6.0 milestone Aug 2, 2024
@fknorr fknorr requested review from psalz, PeterTh and GagaLP August 2, 2024 11:47
@fknorr fknorr self-assigned this Aug 2, 2024
Copy link

github-actions bot commented Aug 2, 2024

Check-perf-impact results: (2908f97f836fd2def14c3429cd4d61ac)

❓ No new benchmark data submitted. ❓
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Copy link

github-actions bot commented Aug 2, 2024

Check-perf-impact results: (5219aaa3cd4bf66c40181c2c7f113381)

✔️ No significant performance change in the microbenchmark set. You are good to go!

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.03x
  • graph-nodes : 0.96x
  • grid : 1.00x
  • instruction-graph : 1.01x
  • scheduler : 1.02x
  • system : 0.99x
  • task-graph : 1.04x

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/print_utils.h Show resolved Hide resolved
src/config.cc Show resolved Hide resolved
@fknorr
Copy link
Contributor Author

fknorr commented Aug 4, 2024

I've bumped Tracy to a post-release version and since wolfpld/tracy#854 was merged. This allows us to explicitly specify the order of threads within the profiler view.

We might want to add documentation on how to run a Celerity application against Tracy.

Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff.

Some inline notes, also, this PR should update CHANGELOG.md!

CMakeLists.txt Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
include/print_utils.h Outdated Show resolved Hide resolved
src/live_executor.cc Show resolved Hide resolved
include/tracy.h Show resolved Hide resolved
src/runtime.cc Show resolved Hide resolved
test/utils_tests.cc Show resolved Hide resolved
@fknorr
Copy link
Contributor Author

fknorr commented Aug 6, 2024

I've also added print_utils_tests.cc becasue some functions in there were not covered yet.

@fknorr fknorr force-pushed the tracy branch 2 times, most recently from 816c6ea to 8493e15 Compare August 6, 2024 12:50
@coveralls
Copy link

coveralls commented Aug 6, 2024

Pull Request Test Coverage Report for Build 10325120468

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 142 of 146 (97.26%) changed or added relevant lines in 16 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 95.087%

Changes Missing Coverage Covered Lines Changed/Added Lines %
include/print_utils.h 28 29 96.55%
src/runtime.cc 30 31 96.77%
src/out_of_order_engine.cc 1 3 33.33%
Totals Coverage Status
Change from base Build 10316923307: 0.02%
Covered Lines: 6624
Relevant Lines: 6722

💛 - Coveralls

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/config.cc Show resolved Hide resolved
src/config.cc Show resolved Hide resolved
test/utils_tests.cc Show resolved Hide resolved
test/utils_tests.cc Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@fknorr fknorr force-pushed the tracy branch 3 times, most recently from 788418c to 111d2b6 Compare August 8, 2024 15:57
Copy link

github-actions bot commented Aug 8, 2024

Check-perf-impact results: (5b2139f73fb4c21b4bcab6e559cd8c5a)

⚠️ Significant slowdown (>1.25x) in some microbenchmark results: benchmark intrusive graph dependency handling with N nodes - 10 / creating nodes
🚀 Significant speedup (<0.80x) in some microbenchmark results: building command- and instruction graphs in a dedicated scheduler thread for N nodes - 1 > immediate submission to a scheduler thread / chain topology

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.03x
  • graph-nodes : 1.07x
  • grid : 1.00x
  • instruction-graph : 1.01x
  • scheduler : 0.99x
  • system : 1.00x
  • task-graph : 1.04x

Copy link
Member

@psalz psalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, ship it!

docs/installation.md Outdated Show resolved Hide resolved
@fknorr fknorr merged commit a9ddc99 into master Aug 9, 2024
5 checks passed
@fknorr fknorr deleted the tracy branch August 9, 2024 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants