Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide scale-set listener metrics #2559

Merged
merged 26 commits into from
Aug 21, 2023
Merged

Provide scale-set listener metrics #2559

merged 26 commits into from
Aug 21, 2023

Conversation

nikola-jokic
Copy link
Contributor

@nikola-jokic nikola-jokic commented May 3, 2023

Context

Fixes https://github.com/github/c2c-actions/issues/6816

Metrics are being introduced to the new autoscaling runner scale set mode. These metrics are separate from those published by the legacy modes.

The following is the list of metrics the controller-manager and listener pods will be emitting.

Owner Metric Type Description
controller-manager pending_ephemeral_runners gauge Number of ephemeral runners in a pending state.
controller-manager running_ephemeral_runners gauge Number of ephemeral runners in a running state.
controller-manager failed_ephemeral_runners gauge Number of ephemeral runners in a failed state.
listener available_jobs gauge Number of jobs with runs-on matching the runner scale set name. Jobs are not yet assigned to the runner scale set.
listener acquired_jobs gauge Number of jobs acquired by the runner scale set.
listener assigned_jobs gauge Number of jobs assigned to the runner scale set.
listener running_jobs gauge Number of jobs running (or about to be run).
listener registered_runners gauge Number of runners registered by the runner scale set.
listener busy_runners gauge Number of registered runners currently running a job.
listener min_runners gauge Minimum number of runners configured for the runner scale set.
listener max_runners gauge Maximum number of runners configured for the runner scale set.
listener desired_runners gauge Number of runners desired (scale up / down target) by the runner scale set.
listener idle_runners gauge Number of registered runners not running a job.
listener started_jobs_total counter Total number of jobs started since the listener became ready (will reset on pod restart).
listener completed_jobs_total counter Total number of jobs completed since the listener became ready (will reset on pod restart).
listener job_queue_duration_seconds histogram Time spent waiting for workflow jobs to get assigned to the runner scale set after queueing (in seconds).
listener job_startup_duration_seconds histogram Time spent waiting for workflow job to get started on the runner owned by the runner scale set (in seconds).
listener job_execution_duration_seconds histogram Time spent executing workflow jobs by the runner scale set (in seconds).

@nikola-jokic nikola-jokic requested review from mumoshu, toast-gear and a team as code owners May 3, 2023 12:46
@@ -170,3 +170,7 @@ template:
# controllerServiceAccount:
# namespace: arc-system
# name: test-arc-gha-runner-scale-set-controller

metrics:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to expose this controller per runner-scale-set?
or we just query the controller and create a service by default of the customer enables service monitor at the controller level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to see the easiest way to set it up
I wanted to create this PR as a POC to have something working and later improve it since it is already a big one
But I'll convert this to draft, base the controller metrics on it to explore the easiest way to combine both of them and create a follow-up PR for controller metrics ☺️

@nikola-jokic nikola-jokic marked this pull request as draft May 8, 2023 12:15
@Link- Link- added the gha-runner-scale-set Related to the gha-runner-scale-set mode label May 9, 2023
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/metrics branch 2 times, most recently from 2d2d70f to 525437f Compare May 23, 2023 10:15
@nikola-jokic nikola-jokic marked this pull request as ready for review May 23, 2023 10:19
# To turn off metrics, specify empty strings for controllerAddr and listenerAddr
metrics:
controllerAddr: ":8080"
listenerAddr: ":8080"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use the fully qualified name?
Ex:

  • controllerManagerAddr
  • autoscalinglistenerAddr

main.go Outdated Show resolved Hide resolved
if err != nil {
log.Error(err, "Github Config URL is invalid", "URL", githubConfigURL)
// stop reconciling on this object
return ctrl.Result{}, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is going to trigger reconcile again, while if the URL is invalid, we can't really do anything about it, so we would be just wasting cycles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we not return and keep the reconcile loop?
basically, the publish metrics is best effort?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is right, but the failure is due to the wrong URL. We should never reach this point, the listener would fail, the autoscaling runner set should fail before this. I left it as a precaution, but we should never reach this point. If we are failing to parse GitHub URL, we have a bigger problem somewhere and just in case, this would serve as an indicator. But I don't think we should ever reach this condition

@nikola-jokic nikola-jokic force-pushed the nikola-jokic/metrics branch from 8745fc2 to 3121b06 Compare May 31, 2023 11:38
@Link-
Copy link
Member

Link- commented Jun 13, 2023

Do we want to add E2E tests to this PR or shall we have another task just for that?

@nikola-jokic nikola-jokic force-pushed the nikola-jokic/metrics branch from 6e1417d to 4afe34d Compare June 16, 2023 12:45
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/metrics branch from 4afe34d to 52a354a Compare June 30, 2023 12:07
@nikola-jokic nikola-jokic force-pushed the nikola-jokic/metrics branch from 185c292 to c46829e Compare July 28, 2023 15:10
@Link-
Copy link
Member

Link- commented Aug 18, 2023

TODOs:

  • gha_registered_runners continues to report on the last value the listener receives
    • This requires a server side change
  • Revisit the labels and make sure the common labels are applied
  • Expose number of listeners gauge with all the common labels applied

Copy link
Member

@Link- Link- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant work @nikola-jokic this is an exciting ship!

@nikola-jokic nikola-jokic merged commit a0a3916 into master Aug 21, 2023
@nikola-jokic nikola-jokic deleted the nikola-jokic/metrics branch August 21, 2023 11:50
nikola-jokic added a commit that referenced this pull request Sep 1, 2023
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
@nikola-jokic nikola-jokic restored the nikola-jokic/metrics branch September 26, 2023 09:51
@nikola-jokic nikola-jokic deleted the nikola-jokic/metrics branch September 26, 2023 09:57
@chlunde
Copy link

chlunde commented Oct 30, 2023

@nikola-jokic any chance of getting job_queue_duration_seconds? This seems like a very good metric to monitor the health of our configuration/service quality? Awesome work ❤️

@nikola-jokic
Copy link
Contributor Author

Hey @chlunde, thank you for your kind words! We are planning to introduce this metric in future releases. While testing we noticed a small issue when the scale set transitions from being active to idle, so we decided not to publish this metric until it is fixed ☺️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gha-runner-scale-set Related to the gha-runner-scale-set mode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants