Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Measure and report executor starvation time #322

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

psalz
Copy link
Member

@psalz psalz commented Dec 20, 2024

We previously had a rudimentary Tracy zone for executor starvation, however this was only semi-correct, because if the scheduler is not currently producing any instructions (for example if the user is doing something other than submitting tasks to Celerity), then the executor is not really starving.

I've extended the scheduler to report its idle state to the runtime, which in turn feeds it to the executor for measuring starvation. If the total starvation time exceeds a percentage threshold of the time spent doing actual work (i.e., processing instructions), we print a warning indicating that the application might be scheduler-bound.

Since the scheduler looks at task and command queue sizes to decide whether it is busy or idle, the lookahead mechanism can cause the scheduler to remain busy even if it is not actually producing any instructions. I consider this a feature rather than a bug, because it means that applications that attempt to interleave Celerity kernels with other work will produce a starvation warning if the kernels aren't actually being executed. I've opted to include a hint for this in the warning itself.

The executor is considered to be "starving" when it is out of
instructions to process, and the scheduler is currently busy.
This means that phases of the user program that do not interact with the
Celerity API do not count as starvation periods.

If the total starvation time exceeds a percentage threshold of the time
spent doing actual work (i.e., processing instructions), we print a
warning indicating that the application might be scheduler-bound.
@psalz psalz force-pushed the report-executor-starvation branch from cf2be90 to 26b9c3f Compare December 20, 2024 13:48
Copy link

Check-perf-impact results: (b640933924085610d15d7a9597b095a1)

✔️ No significant performance change in the microbenchmark set. You are good to go!

Relative execution time per category: (mean of relative medians)

  • command-graph : 1.03x
  • graph-nodes : 0.98x
  • grid : 1.03x
  • instruction-graph : 1.02x
  • scheduler : 1.00x
  • system : 1.06x
  • task-graph : 0.94x

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Clang-Tidy found issue(s) with the introduced code (1/1)

@@ -287,3 +287,76 @@ TEST_CASE("scheduler(lookahead::automatic) avoids reallocations in the RSim patt
CHECK(iq.count<free_instruction_record>() == num_devices);
});
}

TEST_CASE("scheduler reports idle and busy phases") { //
struct test_delegate : scheduler::delegate {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ cppcoreguidelines-virtual-class-destructor ⚠️
destructor of test_delegate is public and non-virtual

@coveralls
Copy link

Pull Request Test Coverage Report for Build 12432570567

Details

  • 86 of 86 (100.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.05%) to 95.004%

Totals Coverage Status
Change from base Build 12390656669: -0.05%
Covered Lines: 7133
Relevant Lines: 7238

💛 - Coveralls

Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@GagaLP GagaLP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, another simple but very useful debugging tool!

I agree on how starvation time is measured, but I believe that the design can be improved by moving the actual tracking from live_executor to runtime.

Proposal:

  • Move total_starvation_time and total_active_time into class runtime::impl
  • Have on_executor_busy and on_executor_idle in executor_delegate, mirroring scheduler_delegate
  • If either scheduler or executor change between busy/idle states, take a lock and record which of the 4 possible states we are in (each event also means suspending / resuming a thread, so I suppose the overhead is manageable)
  • Remove getters (and dry-run dummy implementations) from executors

This would have the following advantages:

  • executor would not need to concern itself with scheduler state conceptually (they are fully separated atm)
  • executor public API would not expose any query functions, which makes reasoning about its behavior in the multithreaded context easier
  • Cross-thread counter updates would be centralized in runtime (precedence: runtime::impl::m_latest_epoch_reached)
  • idle_state_change would not need to traverse the executor queue, so its reporting would be more accurate in time.

The only downside I see in this is that the Tracy integration will not be able to distinguish between executor starvation and idleness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants