Measure and report executor starvation time #322
base: master
Conversation
The executor is considered to be "starving" when it is out of instructions to process, and the scheduler is currently busy. This means that phases of the user program that do not interact with the Celerity API do not count as starvation periods. If the total starvation time exceeds a percentage threshold of the time spent doing actual work (i.e., processing instructions), we print a warning indicating that the application might be scheduler-bound.
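For concreteness, a minimal sketch of such a threshold check, with assumed names and an assumed 10% cutoff (the identifiers and the actual threshold used in this PR may differ):

```cpp
#include <chrono>
#include <cstdio>

using clock_type = std::chrono::steady_clock;

struct executor_stats { // hypothetical aggregate, for illustration only
	clock_type::duration total_active_time{};     // time spent processing instructions
	clock_type::duration total_starvation_time{}; // time idle while the scheduler was busy
};

void report_if_scheduler_bound(const executor_stats& stats) {
	constexpr double starvation_warning_ratio = 0.10; // assumed threshold, not the PR's actual value
	const double active = std::chrono::duration<double>(stats.total_active_time).count();
	const double starved = std::chrono::duration<double>(stats.total_starvation_time).count();
	if(active > 0.0 && starved / active > starvation_warning_ratio) {
		std::printf("executor was starved for %.2f s out of %.2f s of active time; "
		            "the application might be scheduler-bound\n",
		    starved, active);
	}
}
```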
Force-pushed from cf2be90 to 26b9c3f
Check-perf-impact results: (b640933924085610d15d7a9597b095a1) ✔️ No significant performance change in the microbenchmark set. You are good to go!
Clang-Tidy found issue(s) with the introduced code (1/1)
```diff
@@ -287,3 +287,76 @@ TEST_CASE("scheduler(lookahead::automatic) avoids reallocations in the RSim patt
 	CHECK(iq.count<free_instruction_record>() == num_devices);
 	});
 }
+
+TEST_CASE("scheduler reports idle and busy phases") { //
+	struct test_delegate : scheduler::delegate {
```
destructor of `test_delegate` is public and non-virtual
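For reference, the two standard remedies for this Clang-Tidy finding, sketched against a hypothetical polymorphic base (the real definition of `scheduler::delegate` may differ):

```cpp
// Option 1: a public virtual destructor in the base makes deletion through a
// base pointer safe; derived destructors such as test_delegate's then become
// implicitly virtual.
struct delegate_with_virtual_dtor {
	virtual ~delegate_with_virtual_dtor() = default;
	virtual void on_idle() = 0; // placeholder virtual interface
};

// Option 2: a protected non-virtual destructor forbids deletion through a
// base pointer at compile time, which suits non-owning delegate interfaces.
struct delegate_with_protected_dtor {
  protected:
	~delegate_with_protected_dtor() = default;

  public:
	virtual void on_idle() = 0; // placeholder virtual interface
};
```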
Pull Request Test Coverage Report for Build 12432570567 (Details)
💛 - Coveralls
LGTM
LGTM!
Nice, another simple but very useful debugging tool!
I agree on how starvation time is measured, but I believe that the design can be improved by moving the actual tracking from `live_executor` to `runtime`.
Proposal (a rough sketch follows at the end of this comment):
- Move `total_starvation_time` and `total_active_time` into `class runtime::impl`
- Have `on_executor_busy` and `on_executor_idle` in `executor_delegate`, mirroring `scheduler_delegate`
- If either the scheduler or the executor changes between busy/idle states, take a lock and record which of the 4 possible states we are in (each event also means suspending/resuming a thread, so I suppose the overhead is manageable)
- Remove getters (and dry-run dummy implementations) from executors
This would have the following advantages:
- The executor would not need to concern itself with scheduler state conceptually (they are fully separated atm)
- The executor's public API would not expose any query functions, which makes reasoning about its behavior in the multithreaded context easier
- Cross-thread counter updates would be centralized in the runtime (precedent: `runtime::impl::m_latest_epoch_reached`)
- `idle_state_change` would not need to traverse the executor queue, so its reporting would be more accurate in time
The only downside I see in this is that the Tracy integration will not be able to distinguish between executor starvation and idleness.
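A rough sketch of what this proposal could look like, using the hypothetical hook and member names from the bullets above (actual Celerity types and signatures may differ):

```cpp
#include <chrono>
#include <mutex>

// Sketch only: centralizes busy/idle tracking in the runtime, as proposed.
class runtime_impl_sketch {
	using clock = std::chrono::steady_clock;

	std::mutex m_mutex;
	bool m_scheduler_busy = false;
	bool m_executor_busy = false;
	clock::time_point m_last_transition = clock::now();
	clock::duration m_total_active_time{};
	clock::duration m_total_starvation_time{};

	// Called under m_mutex on every transition: attributes the elapsed
	// interval to whichever of the 4 possible states we were in.
	void account_interval() {
		const auto now = clock::now();
		const auto elapsed = now - m_last_transition;
		m_last_transition = now;
		if(m_executor_busy) {
			m_total_active_time += elapsed; // executor was processing instructions
		} else if(m_scheduler_busy) {
			m_total_starvation_time += elapsed; // executor idle while scheduler busy
		}
		// both idle: the user is busy outside Celerity, neither counter advances
	}

  public:
	// executor_delegate hooks, mirroring their scheduler_delegate counterparts
	void on_executor_busy() { const std::lock_guard lk(m_mutex); account_interval(); m_executor_busy = true; }
	void on_executor_idle() { const std::lock_guard lk(m_mutex); account_interval(); m_executor_busy = false; }
	void on_scheduler_busy() { const std::lock_guard lk(m_mutex); account_interval(); m_scheduler_busy = true; }
	void on_scheduler_idle() { const std::lock_guard lk(m_mutex); account_interval(); m_scheduler_busy = false; }
};
```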
We previously had a rudimentary Tracy zone for executor starvation, but it was only semi-correct: if the scheduler is not currently producing any instructions (for example because the user is doing something other than submitting tasks to Celerity), the executor is not really starving.
I've extended the scheduler to report its idle state to the runtime, which in turn feeds it to the executor for measuring starvation. If the total starvation time exceeds a percentage threshold of the time spent doing actual work (i.e., processing instructions), we print a warning indicating that the application might be scheduler-bound.
Since the scheduler looks at task and command queue sizes to decide whether it is busy or idle, the lookahead mechanism can cause the scheduler to remain busy even if it is not actually producing any instructions. I consider this a feature rather than a bug, because it means that applications that attempt to interleave Celerity kernels with other work will produce a starvation warning if the kernels aren't actually being executed. I've opted to include a hint for this in the warning itself.
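As an illustration of that queue-based busy/idle decision (member names are assumed, not Celerity's actual ones):

```cpp
#include <cstddef>

// Sketch only: how a scheduler might derive its idle state from queue sizes.
struct scheduler_state_sketch {
	std::size_t num_queued_tasks = 0;    // tasks not yet compiled into commands
	std::size_t num_queued_commands = 0; // commands not yet compiled into instructions

	// The scheduler reports "busy" while either queue is non-empty. With
	// lookahead, commands can remain queued without any instructions being
	// emitted, so the scheduler keeps reporting busy, which, as argued above,
	// is the desired behavior for the starvation warning.
	bool is_busy() const { return num_queued_tasks > 0 || num_queued_commands > 0; }
};
```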